PhpDig.net


rom 08-06-2004 07:25 AM

can't index pdf using pdftotext
 
My server is running php 4.3.8 on a linux system, and I am trying to search pdfs using the pdftotext external binary.

I am able to get phpdig to search html files. When run from the command line, pdftotext converts a pdf and places a txt file in the same directory, but I haven't been able to configure phpdig to index a linked pdf file on my website.

I have followed all the instructions on the thread "External Binaries Problem Checklist", and have inserted the recommended echo statements in spider.php and robot_functions.php. The output when reindexing is shown below.

Thanks very much for any assistance.

SITE : http://www.goeco.com/
Exclude paths :
- cgi-bin/


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
1:http://www.goeco.com/index2.html
(time : 00:00:05)
+ +
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
2:http://www.goeco.com/fr_band.html
(time : 00:00:15)

(the same output as above for various other linked pages, until we get to:)

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
15:http://www.goeco.com/profile.pdf
(time : 00:01:33)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...
Indexing complete ! [Back] to admin interface.

Charter 08-06-2004 08:47 AM

Hi. The "is parse pdf executable" is coming up false so check that pdftotext is set to 755 permission.
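For anyone wanting to run the same check outside of PhpDig, here is a minimal sketch. The function name check_binary is made up for illustration and is not part of PhpDig; it only uses the standard PHP calls file_exists, is_executable and fileperms, and the path is the one from the output above.

```php
<?php
// Hypothetical helper (not part of PhpDig): report whether PHP can run a binary.
function check_binary($path) {
    if (!file_exists($path)) {
        return 'missing';
    }
    if (!is_executable($path)) {
        // Show the permission bits in octal, e.g. 644 instead of the wanted 755.
        return sprintf('not executable (mode %o)', fileperms($path) & 0777);
    }
    return 'ok';
}

// Path taken from the echo output in this thread; adjust for your own server.
echo check_binary('/home5/goeco/HTML/pdftotext'), "\n";
```

If this reports a mode other than 755, resetting the permission on the binary (for example with chmod from the shell or an FTP client) should be the first thing to try.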

rom 08-06-2004 10:25 AM

Hi Charter,

Thanks very much for your quick reply. I had set the permissions correctly, but then moved the file to a new directory, so somehow it was changed to the wrong settings. It is now 755, and this is the output from the echos.

...similar to what was there before except as shown below...

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp
Result contains: Array ( )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:35)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...

Charter 08-06-2004 10:30 AM

Hi. Now the command:
Code:

/home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp
is failing, so find:
PHP Code:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2

and replace with:
PHP Code:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1'

and see what error it shows on reindex.
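The reason for appending 2>&1 is that PHP's exec() only collects standard output, while error messages from a command line tool usually go to standard error. Redirecting one into the other makes the error text show up in the result array. A rough standalone sketch of the idea follows; run_and_capture is a made-up name, not a PhpDig function, and the demo command is just something that is expected to fail.

```php
<?php
// Hypothetical helper (not part of PhpDig): run a command and capture
// both stdout and stderr, plus the exit status.
function run_and_capture($command) {
    $output = array();
    $status = 0;
    // 2>&1 sends stderr to stdout so exec() can collect the error text.
    exec($command . ' 2>&1', $output, $status);
    return array($output, $status);
}

// Demo with a command that should fail on most systems.
list($lines, $status) = run_and_capture('ls /nonexistent-path-for-demo');
print_r($lines);
echo "Return value is: $status\n";
```

Without the redirect, the array would stay empty on failure, which is what the earlier output in this thread showed.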

rom 08-06-2004 03:06 PM

Hi Charter,

Thanks again for responding so quickly. Here is the latest error message. Was I supposed to have created a cork file somewhere?

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/54831932.tmp 2>&1
Result contains: Array ( [0] => Error: Couldn't open file '-cork' )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:34)

No link in temporary table

Charter 08-06-2004 03:11 PM

Hi. The -cork flag doesn't seem to be available in your version of pdftotext, so just set the following in the config file:
PHP Code:

define('PHPDIG_OPTION_PDF',''); // two single quotes, no space between 


rom 08-06-2004 03:26 PM

Hi Charter,

Thanks. It is working now!

Have a good weekend.

Rom

rom 08-07-2004 12:09 PM

I'm working on another website now, and have not been able to get phpdig to index the pdfs on this one either. Have followed all your previous directions, and as an example have received the echos shown below.

Thanks very much for your assistance.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to:
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
38:http://www.cgxenergy.ca/regionalOverview.html
(time : 00:03:33)

rom 08-07-2004 12:24 PM

Never mind.

Just realized that define('PHPDIG_INDEX_PDF',true); was still set to false.

rom 08-07-2004 02:50 PM

Still stuck, unfortunately. The HTML pages seem OK, but indexing PDFs has given several error messages. After the last one, spidering appears to stop without going through the other 100 or so links.

Thanks again.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
100:http://www.cgxenergy.ca/affiliated.html
(time : 00:09:18)



Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MB...esMar25_04.pdf
(time : 00:09:24)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/24175282.tmp 2>&1
Result contains: Array ( [0] => Error: Copying of text from this document is not allowed. )
Return value is: 3

102:http://www.cgxenergy.ca/investors/OctagonMar08_04.pdf
(time : 00:09:29)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable

Charter 08-07-2004 03:03 PM

>> Error: Copying of text from this document is not allowed.

Hi. PhpDig using pdftotext cannot index the PDF if the PDF is set to not allow it.

rom 08-07-2004 06:19 PM

Hi Charter,

Will phpdig still be able to index the other PDFs? Only some gave the copying error.

Is the "copying of text" a security setting on the PDF?

Is the "copying of text" error the reason that the spidering is dying part way through?

Thanks,

Rom

Charter 08-07-2004 06:26 PM

Hi. PhpDig can index almost any PDF that allows it, save for PDFs that take so much memory as to cause the script to barf due to lack of memory.

Whoever creates the PDF can set whether the copying of text from the PDF is allowed. I'm not sure about the dying issue after trying to index an index-protected PDF.

How many times do you find PhpDig trying to index an index-protected PDF before it dies?

rom 08-07-2004 07:50 PM

I receive two "copying of text" errors, then it gets to this point:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

and stops.

You mention a memory issue. The largest PDF is 4.6 MB, so maybe we should just delete anything larger than 1 MB from the site, if that would help.

With the following message, has this PDF been indexed without a problem? I'm not sure what the return value means or whether the array should have something in it.

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MB...esMar25_04.pdf
(time : 00:09:24)

Thanks again. You've been a huge help. I've been tearing my hair out on this one.

Charter 08-08-2004 05:02 PM

Hi. Contrary to possible intuition, with the exec command, a return value of zero means success. The result array is supposed to contain the output from the command, but in the previous post it looks as though the command executed successfully and yet the array is empty. Perhaps check your error logs, and as to a possible memory issue, maybe this thread might help.
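As a rough guide to the return values seen in this thread, the Xpdf man page for pdftotext documents the exit codes used in the sketch below. The function name describe_pdftotext_status is made up for illustration and is not part of PhpDig. Also note that when pdftotext is given only an input file it writes the text to a file rather than to standard output (as noted at the top of the thread, it places a txt file in the same directory), so an empty result array together with a return value of 0 is probably nothing to worry about.

```php
<?php
// Hypothetical helper (not part of PhpDig): translate a pdftotext exit
// status into the meaning listed in the Xpdf pdftotext man page.
function describe_pdftotext_status($status) {
    $meanings = array(
        0  => 'no error',
        1  => 'error opening a PDF file',
        2  => 'error opening an output file',
        3  => 'error related to PDF permissions',
        99 => 'other error',
    );
    return isset($meanings[$status]) ? $meanings[$status] : 'unknown status';
}

// The three return values that appeared in the output in this thread.
foreach (array(0, 1, 3) as $status) {
    echo $status, ': ', describe_pdftotext_status($status), "\n";
}
```

That lines up with the output above: 1 came with "Couldn't open file '-cork'" and 3 came with "Copying of text from this document is not allowed."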


All times are GMT -8.
