Having loads of fun with this today
The only way I can get a .doc file to index is if I spider the doc directly.
It doesn't find it from internal links. I have a page (see link) with a test pdf and doc included in the body of the text, neither of these are found or indexed by the spider though the page it self is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php
if I include a full url to the spider I get the following
Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0
4:
http://www.sccyp.org.uk/testdoc.doc
Why won't it spider from the internal link?
For pdf it appears to see the file if I add to the spider automatically but produces no output
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0
1:
http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table
Again Why won't it spider from the internal link? and when it does what happens to the content.
I have been through most of the threads on the board relating to this and can't find an answer
Any help gratefully received
Chris