08-06-2004, 07:25 AM   #1
rom
Green Mole
 
Join Date: Jan 2004
Posts: 25
Can't index PDF using pdftotext

My server runs PHP 4.3.8 on Linux, and I am trying to make PDFs searchable using the pdftotext external binary.

I can get phpdig to index and search HTML files. When run from the command line, pdftotext converts a PDF and places a .txt file in the same directory, but I haven't been able to configure phpdig to index a PDF file linked from my website.

I have followed all the instructions in the thread "External Binaries Problem Checklist" and inserted the recommended echo statements in spider.php and robot_functions.php. The output from reindexing is shown below.
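
For comparison with the "Does parse pdf exist" and "Is parse pdf executable" lines in the output, the same two checks can be made from a shell. This is only a minimal sketch: `check_binary` is a hypothetical helper name, not part of phpdig or pdftotext, and the result reflects the user running the shell, which may differ from the user PHP runs as on the web server.

```shell
#!/bin/sh
# check_binary is a hypothetical helper (not part of phpdig or pdftotext).
# It reports whether the given path exists and whether the current user
# is allowed to execute it.
check_binary() {
    bin="$1"
    if [ ! -e "$bin" ]; then
        echo "missing"
    elif [ ! -x "$bin" ]; then
        echo "not executable"
    else
        echo "executable"
    fi
}

# Path copied from the "Parse the pdf is set to" line of the spider output.
check_binary /home5/goeco/HTML/pdftotext
```

If this prints "executable" for the shell user while the spider still shows an empty value, the difference may lie in the permissions seen by the web server's user rather than in the file itself.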

Thanks very much for any assistance.

SITE : http://www.goeco.com/
Exclude paths :
- cgi-bin/


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
1:http://www.goeco.com/index2.html
(time : 00:00:05)
+ +
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
2:http://www.goeco.com/fr_band.html
(time : 00:00:15)

(The same output as above repeats for various other linked pages, until we get to:)

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
15:http://www.goeco.com/profile.pdf
(time : 00:01:33)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...
Indexing complete ! [Back] to admin interface.