10-26-2005, 07:19 AM   #1
chrisdgreen
PDF and CATDOC indexing

Having loads of fun with this today.

The only way I can get a .doc file indexed is to spider the doc directly; the spider doesn't find it from internal links.
I have a page (see link below) with a test PDF and a test .doc linked in the body of the text. Neither of these is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php

If I give the spider the full URL of the .doc, I get the following:

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0

4:http://www.sccyp.org.uk/testdoc.doc

Why won't it spider from the internal link?
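
One way to narrow this down, independent of the spider, is to check whether the page's HTML exposes the two files as plain `<a href>` links that resolve to the same URLs that work when spidered directly. The sketch below is a hypothetical diagnostic in Python, not the spider's own link-extraction logic; the sample markup, the extension list, and the names `DocLinkParser` and `find_document_links` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Extensions of interest for this check (an assumption for illustration).
DOC_EXTENSIONS = (".pdf", ".doc")


class DocLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point at .pdf/.doc files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when testing the extension.
        if href and href.lower().split("?")[0].endswith(DOC_EXTENSIONS):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    parser = DocLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup; the real page may link the files differently.
sample = (
    '<p><a href="/testdoc.doc">doc</a> '
    '<a href="../pdftest.pdf">pdf</a> '
    '<a href="contact.php">other</a></p>'
)
page = "http://www.sccyp.org.uk/webpages/about_ourhistory.php"
print(find_document_links(sample, page))
```

If the real page source, fed through the same function, returns no links or links that resolve to different URLs than the ones that index when added directly, the problem is more likely in the markup than in the catdoc or pdftotext setup.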

For the PDF, it appears to see the file if I add it to the spider automatically, but it produces no output:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0

1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table

Again, why won't it spider from the internal link? And when it does spider the file, what happens to the content?
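
One possible explanation for the empty PDF result, offered as an assumption rather than a confirmed diagnosis: when `pdftotext` is given only an input file, it writes the extracted text to a `.txt` file next to the input instead of to stdout, which would fit an empty array with a return value of 0. Passing `-` as the output file sends the text to stdout. A minimal Python sketch of that invocation outside the spider follows; the function names are hypothetical, and whether the spider's configuration allows the extra `-` argument is not something this post confirms.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, binary="/usr/bin/pdftotext"):
    # "-" as the output file asks pdftotext to write the text to stdout.
    return [binary, pdf_path, "-"]


def extract_pdf_lines(pdf_path, binary="/usr/bin/pdftotext"):
    """Run pdftotext and return the extracted text as a list of lines."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found or not executable")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate bytes that are not valid in the locale encoding
        check=True,        # raise if pdftotext exits with a non-zero status
    )
    return result.stdout.splitlines()


# Shows the command only; the temp path is the one from the log above.
print(build_pdftotext_command("../admin/temp/95318772.tmp"))
```

Running the equivalent command by hand on the temp file and comparing the output with and without the trailing `-` should show quickly whether this is what is happening.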

I have been through most of the threads on the board relating to this and can't find an answer.

Any help gratefully received.

Chris