pdftotext issue


JonnyNoog
07-13-2006, 06:57 PM
Hi,

I am trying to get pdftotext to work with phpdig. I have followed the instructions in the sticky at the top of the forum section and the output I am getting is this:


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: D:\Internet\WWWROOT\anmc\Xpdf\pdftotext.exe
Does parse pdf exist: 1


This is all I get whether I try to re-index the whole site or only a sub-section of it. As you can see from the path, I am unfortunately forced to install on a Windows box.

I have tried out pdftotext via the command line and it appears to work... It makes a text file in the Xpdf dir that contains the expected text from the PDF I gave it.

I've searched the forum repeatedly, but nothing I have found so far has solved my problem. Any help would be greatly appreciated. :)

sandychan
07-13-2006, 07:22 PM
May I know your system configuration?

JonnyNoog
07-13-2006, 08:49 PM
IIS 5 with PHP Version 4.3.1 (CGI I think)

MySQL 3.23.52

Not sure what else is relevant...?

JonnyNoog
07-13-2006, 09:05 PM
Erm... PhpDig v.1.8.8, that's probably relevant, hey :).

Is there a way to edit posts on this forum by the way? Can't seem to see the option... Or am I just having a blonde day?

JonnyNoog
07-14-2006, 02:21 AM
Well, after much stuffing around, I have now installed PHP 5, because the is_executable() function is not available in PHP 4 on Windows, as I found out (it only took me about 4 hours to get that all worked out! :what: ). So I am now getting the output below:


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: D:\Internet\WWWROOT\anmc\Xpdf\pdftotext.exe
Does parse pdf exist: 1
Is parse pdf executable: 1
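
For anyone hitting the same PHP 4 versus PHP 5 difference, here is a minimal sketch of a guarded check. The helper name tool_is_usable() is hypothetical and not part of phpdig. It assumes that on an old Windows build the function may simply be undefined, in which case it falls back to the file-exists test alone.

```php
<?php
// Hypothetical helper, NOT part of phpdig: reports whether an external
// tool such as pdftotext.exe looks usable from PHP.
function tool_is_usable($path)
{
    if (!file_exists($path)) {
        return false; // wrong path, or PHP cannot see the file
    }
    // Assumption: on PHP 4 for Windows is_executable() is not defined,
    // so fall back to the existence check alone.
    if (!function_exists('is_executable')) {
        return true;
    }
    return is_executable($path);
}

// Path taken from the debug output in this thread.
var_dump(tool_is_usable('D:\Internet\WWWROOT\anmc\Xpdf\pdftotext.exe'));
```

A true result only says the file is there and marked executable; it does not prove that the web server account is allowed to launch it.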


Still no PDF indexing action to be seen. Any help much appreciated, I think I'm going to now get as far away from the computer as possible before I smash it with a hammer. :angry:

JonnyNoog
07-14-2006, 10:21 PM
So coming back to my problem with fresh eyes, it looks like the extra lines in robot_functions.php:

echo "<br>Command is: " . $command . "<br>";
echo "Result contains: ";
print_r($result);
echo "<br>Return value is: " . $retval . "<br><br>";

are not being run... That leads me to think that the switch statement in robot_functions.php (switch ($result_test['status'])) is not running and so never sets $usetool to true.
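
To illustrate the suspicion, here is a rough sketch of how a switch on the status could silently skip the external tool. It is illustrative only and is not phpdig's actual code: the function should_use_tool() and its arguments are made up, and only the names $result_test, 'status' and $usetool come from the post.

```php
<?php
// Illustrative only: NOT phpdig's actual code. It mimics a switch on
// $result_test['status'] that enables an external tool per file type.
function should_use_tool($result_test, $index_pdf)
{
    $usetool = false;
    switch ($result_test['status']) {
        case 'PDF':
            // Only reached when the fetched URL was identified as a PDF.
            $usetool = (bool) $index_pdf;
            break;
        default:
            // 'HTML' and anything else: no external tool, so no debug echo.
            break;
    }
    return $usetool;
}

var_dump(should_use_tool(array('status' => 'HTML'), true)); // bool(false)
var_dump(should_use_tool(array('status' => 'PDF'), true));  // bool(true)
```

With a status of HTML, as in the output posted so far, a structure like this would never reach the PDF branch, which would match the missing debug lines.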

Any help, any help at all, would be greatly appreciated at this point. If I can't get PDF indexing working with phpdig then I'll be forced to use some other search engine, and I really like phpdig! :yes: I'm not any kind of PHP guru, and I have a suspicion that my problem stems from the fact that I am forced to set phpdig and pdftotext up on a Windows system with IIS... Perhaps some kind of permission problem with the pdftotext executable and the PHP exec() function, I don't know... :bang:
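
One way to test the exec() suspicion separately from phpdig is a small standalone script along these lines. This is only a sketch: run_tool() is a hypothetical helper, and the harmless echo command stands in for the real pdftotext call.

```php
<?php
// Hypothetical helper: run a command much like the debug lines do and
// return everything exec() reports.
function run_tool($command)
{
    $result = array();
    $retval = -1;
    // 2>&1 folds error messages into $result so failures become visible.
    exec($command . ' 2>&1', $result, $retval);
    return array('output' => $result, 'retval' => $retval);
}

// A harmless command first; swap in the pdftotext command once this works.
$report = run_tool('echo hello');
print_r($report);
```

If even the echo fails or returns a non-zero value, the problem is likely with exec() or the web server account's permissions rather than with pdftotext itself.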

JonnyNoog
07-15-2006, 12:40 AM
Well... Following on down the path from my last post, $result_test['status'] was not being set to 'PDF', so the switch statement was never running the 'PDF' case. So I wanted to see what would happen if I told phpdig to index the full address of a particular PDF:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: D:\Internet\WWWROOT\anmc\Xpdf\pdftotext.exe
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: D:\Internet\WWWROOT\anmc\Xpdf\pdftotext.exe ../admin/temp/67264762.tmp 2>&1
Result contains: Array ( )
Return value is: 0

5:http://XXX/docs/Modified_Form_A_0607.pdf
(time : 00:00:11)

Success! :smoke: It indexed the PDF. So it now seems that pdftotext was in fact working the whole time, and the problem was that phpdig just wasn't finding the PDF files to index in the first place, because I hadn't set it to look for enough links on each level... I think.

But all's well that ends well I guess. Too bad I can't rename this thread to the jonny vs. jonny thread...

How are you doing now jonny?

I'm doing well thanks, jonny...

That's great to hear, jonny. Take care.