can't index pdf using pdftotext


rom
08-06-2004, 08:25 AM
My server is running PHP 4.3.8 on a Linux system, and I am trying to search PDFs using the pdftotext external binary.

I am able to get PhpDig to search HTML files. When run from the command line, pdftotext converts PDFs and places a txt file in the same directory, but I haven't been able to configure PhpDig to index a linked PDF file on my website.

I have followed all the instructions in the thread "External Binaries Problem Checklist", and have inserted the recommended echo statements in spider.php and robot_functions.php. The output when reindexing is shown below.

Thanks very much for any assistance.

SITE : http://www.goeco.com/
Exclude paths :
- cgi-bin/


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
1:http://www.goeco.com/index2.html
(time : 00:00:05)
+ +
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
2:http://www.goeco.com/fr_band.html
(time : 00:00:15)

(the same output as above for various other linked pages, until we get to:)

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
15:http://www.goeco.com/profile.pdf
(time : 00:01:33)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...
Indexing complete ! [Back] to admin interface.

Charter
08-06-2004, 09:47 AM
Hi. The "Is parse pdf executable" check is coming up false, so check that pdftotext has 755 permissions.
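
A minimal sketch of that check, assuming a Linux host. The temporary file is only a stand-in for the real binary (an assumption for illustration); point $binary at the actual pdftotext path to test the real thing. It also shows why the debug line prints nothing after the colon: PHP echoes false as an empty string.

```php
<?php
// Stand-in file for the demo; point $binary at the real pdftotext path instead.
$binary = tempnam(sys_get_temp_dir(), 'pdftotext_demo_');

chmod($binary, 0644);                   // no execute bit set
clearstatcache();
$before = is_executable($binary);

chmod($binary, 0755);                   // rwxr-xr-x
clearstatcache();
$after = is_executable($binary);
$mode  = sprintf('%o', fileperms($binary) & 0777);

echo 'Is parse pdf executable: ', $before, "\n";   // false echoes as an empty string
echo 'Is parse pdf executable: ', $after, "\n";    // true echoes as 1
echo 'Mode: ', $mode, "\n";

unlink($binary);
```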

rom
08-06-2004, 11:25 AM
Hi Charter,

Thanks very much for your quick reply. I had set the permissions correctly, but then moved the file to a new directory, so somehow it was changed to the wrong settings. It is now 755, and this is the output from the echos.

...similar to what was there before except as shown below...

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp
Result contains: Array ( )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:35)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...

Charter
08-06-2004, 11:30 AM
Hi. Now the command:

/home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp

is failing so find:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;

and replace with:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';

and see what error it shows on reindex.
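
The reason for the extra 2>&1: exec() only collects what the command writes to standard output, and error messages normally go to standard error. A rough sketch of the difference, using a made-up failing shell command in place of pdftotext (run_command and the demo command are illustrative names, not part of PhpDig):

```php
<?php
// Runs a shell command and returns its output lines plus exit status.
function run_command($command) {
    $output = array();
    $status = 0;
    exec($command, $output, $status);
    return array('output' => $output, 'status' => $status);
}

// Demo command: writes a message to stderr and exits with status 1.
$failing = 'sh -c "echo Error: demo failure >&2; exit 1"';

$without = run_command($failing);            // stderr is not captured
$with    = run_command($failing . ' 2>&1');  // stderr is folded into stdout

print_r($without['output']);   // empty array, like "Result contains: Array ( )"
print_r($with['output']);      // now holds the error message
echo 'Return value is: ', $with['status'], "\n";
```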

rom
08-06-2004, 04:06 PM
Hi Charter,

Thanks again for responding so quickly. Here is the latest error message. Was I supposed to have created a cork file somewhere?

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/54831932.tmp 2>&1
Result contains: Array ( [0] => Error: Couldn't open file '-cork' )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:34)

No link in temporary table

Charter
08-06-2004, 04:11 PM
Hi. The flag cork is an option that doesn't seem available to you so just set the following in the config file:

define('PHPDIG_OPTION_PDF',''); // two single quotes, no space between

rom
08-06-2004, 04:26 PM
Hi Charter,

Thanks. It is working now!

Have a good weekend.

Rom

rom
08-07-2004, 01:09 PM
I'm working on another website now, and have not been able to get phpdig to index the pdfs on this one either. Have followed all your previous directions, and as an example have received the echos shown below.

Thanks very much for your assistance.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to:
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
38:http://www.cgxenergy.ca/regionalOverview.html
(time : 00:03:33)

rom
08-07-2004, 01:24 PM
Never mind.

Just realized the define('PHPDIG_INDEX_PDF',true); was still set to false.

rom
08-07-2004, 03:50 PM
Still stuck, unfortunately. The HTML pages seem OK, but indexing PDFs has given several error messages. After the last one, spidering appears to stop without going through the other 100 or so links.

Thanks again.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
100:http://www.cgxenergy.ca/affiliated.html
(time : 00:09:18)



Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MBerryNotesMar25_04.pdf
(time : 00:09:24)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/24175282.tmp 2>&1
Result contains: Array ( [0] => Error: Copying of text from this document is not allowed. )
Return value is: 3

102:http://www.cgxenergy.ca/investors/OctagonMar08_04.pdf
(time : 00:09:29)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable

Charter
08-07-2004, 04:03 PM
>> Error: Copying of text from this document is not allowed.

Hi. PhpDig using pdftotext cannot index the PDF if the PDF is set to not allow it.

rom
08-07-2004, 07:19 PM
Hi Charter,

Will phpdig still be able to index the other PDFs? Only some gave the copying error.

Is the "copying of text" a security setting on the PDF?

Is the "copying of text" error the reason that the spidering is dying part way through?

Thanks,

Rom

Charter
08-07-2004, 07:26 PM
Hi. PhpDig can index almost any PDF that allows it, save for PDFs that take so much memory as to cause the script to barf due to lack of memory.

Whoever writes the PDF can set whether the copying of text from the PDF is allowed. I'm not sure about the dying issue after trying to index an index-protected PDF.

How many times do you find PhpDig trying to index an index-protected PDF before it dies?

rom
08-07-2004, 08:50 PM
I receive two "copying of text" errors, then it gets to this point:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

and stops.

You mention a memory issue. The largest PDF is 4.6 M, so maybe we should just delete anything more than 1 M from the site, if that would help.

With the following message, has this PDF been indexed without a problem? I'm not sure what the return value means or whether the array should have something in it.

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MBerryNotesMar25_04.pdf
(time : 00:09:24)

Thanks again. You've been a huge help. I've been tearing my hair out on this one.

Charter
08-08-2004, 06:02 PM
Hi. Contrary to possible intuition, with the exec command a return value of zero means success. The result array is meant to hold the output from the command; in the previous post the command appears to have executed successfully, but the array is empty. Perhaps check your error logs, and as to a possible memory issue, maybe this (http://www.phpdig.net/showthread.php?threadid=534) thread might help.
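
For reference, the three return values seen in this thread (1, 3 and 0) line up with the exit codes listed in the pdftotext man page. A small lookup sketch; the function name is made up for illustration and is not part of PhpDig:

```php
<?php
// Maps a pdftotext exit status to the meaning given in its man page.
function describe_pdftotext_status($status) {
    $meanings = array(
        0  => 'no error',
        1  => 'error opening a PDF file',
        2  => 'error opening an output file',
        3  => 'error related to PDF permissions',
        99 => 'other error',
    );
    return isset($meanings[$status]) ? $meanings[$status] : 'unknown status';
}

// The return values reported in the posts above.
foreach (array(1, 3, 0) as $status) {
    echo $status, ': ', describe_pdftotext_status($status), "\n";
}
```

So the '-cork' failure (1) was pdftotext treating the flag as a file name it could not open, and the "Copying of text" failure (3) is a permissions setting inside the PDF itself.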

rom
08-12-2004, 01:26 PM
Hi Charter,

I read through the memory thread. Looked up my memory_limit, which is 10 M. Tried also this code for memory_get_usage from the php.net site:

<?php
// This is only an example, the numbers below will
// differ depending on your system
echo memory_get_usage() . "\n"; // 36640
$a = str_repeat("Hello", 4242);
echo memory_get_usage() . "\n"; // 57960
unset($a);
echo memory_get_usage() . "\n"; // 36744
?>

My server returned this:
16704 38000 16784

I know now, based on when the spidering ends, that it is getting hung up on one 4.6 M pdf.

From the memory thread, I wasn't sure what else to do, because at the end of the thread Tomas says nothing worked. Is there something that can be done to skip over that one file?

Thanks again.

Rom

Charter
08-15-2004, 04:15 PM
Hi. Did you try something like in this (http://www.phpdig.net/showthread.php?postid=2395#post2395) post?

rom
08-25-2004, 05:05 PM
Hi Charter,

Tried your suggestion above. The indexing just stops part way through. Seems to be when it encounters a 4.6 M file. It doesn't want to skip over it.

Thanks,

rom

Charter
08-25-2004, 08:10 PM
Hi. Did you try this (http://www.phpdig.net/forum/showthread.php?p=2402#post2402) too?

rom
08-26-2004, 10:02 AM
Hi Charter,

Tried that also, but again it stops part way through indexing, when it reaches the 4.6 M file.

Thanks,

Rom

Charter
08-26-2004, 11:21 AM
Assuming you are using 1.8.3, try moving this code:

if (memory_get_usage() + 1000000 > 3000000) {
    return array('tempfile'=>0,'tempfilesize'=>0);
}

to be right after the following in the robot_functions.php file:

// $file_content = @file($uri); /////////////////////////////////////////////////
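
The guard above uses fixed numbers. A hedged sketch of the same idea expressed against the configured memory_limit and the size of the file about to be loaded; both helper names are hypothetical and not part of PhpDig, and the thread does not establish whether a check like this would get past the 4.6 M file:

```php
<?php
// Converts a php.ini shorthand size such as "10M" into bytes.
function ini_size_to_bytes($value) {
    $value  = trim($value);
    $number = (float) $value;
    switch (strtoupper(substr($value, -1))) {
        case 'G': $number *= 1024;   // fall through
        case 'M': $number *= 1024;   // fall through
        case 'K': $number *= 1024;
    }
    return (int) $number;
}

// True when holding $fileSize more bytes would push usage past $limit.
function would_exceed_memory($fileSize, $limit, $used) {
    return $used + $fileSize > $limit;
}

$limit   = ini_size_to_bytes('10M');        // memory_limit reported in this thread
$pdfSize = (int) (4.6 * 1024 * 1024);       // the 4.6 M PDF
$used    = memory_get_usage();              // varies by system

var_dump(would_exceed_memory($pdfSize, $limit, $used));
```

Note that the raw file size is only a lower bound: reading the file and converting it takes more memory than the file itself, so a real guard would want a safety margin on top.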

rom
08-26-2004, 02:55 PM
I'm using 1.8.0. Should I upgrade first?

rom
08-27-2004, 05:11 PM
I tried moving the lines as directed. Still stops indexing part way through at the same spot.