PDF and CATDOC indexing
Having loads of fun with this today :no:
The only way I can get a .doc file to index is to spider the doc directly; the spider doesn't find it from internal links. I have a page (see link) with a test PDF and a test doc linked in the body of the text. Neither of these is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php
If I give the spider the full URL of the doc, I get the following:
Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array (
[0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo
[1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet
[2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed
[3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus,
[4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu.
[5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum
[6] => sociis natoque penatibus et magnis dis parturient montes, nascetur
[7] => ridiculus mus. Quisque
[8] => grav????????????????????????????????????????????????????????????????????
[9] => ????????????????????????
[10] =>
)
Return value is: 0
4:http://www.sccyp.org.uk/testdoc.doc
Why won't it spider from the internal link?
For the PDF, it appears to see the file if I add it to the spider automatically, but it produces no output:
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0
1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table
Again, why won't it spider from the internal link? And when it does, what happens to the content?
I have been through most of the threads on the board relating to this and can't find an answer.
Any help gratefully received.
Chris
In the config file set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period) and then from the admin panel, use a large search depth, set links per to zero, and select the no option.
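As a rough sketch, the three config changes above might look like this. It assumes the constants are set with define() calls in the config file, which is how the PDF settings quoted later in this thread are written; the values are the ones suggested above.

```php
<?php
// Sketch only: the three settings suggested above, written as define() calls.
// Assumption: these constants already exist in the config file and only
// their values need changing.
define('LIMIT_TO_DIRECTORY', false);     // do not restrict the spider to the starting directory
define('PHPDIG_IN_DOMAIN', true);        // as suggested above
define('PHPDIG_PDF_EXTENSION', '.txt');  // note the leading period
```

The search depth, links per, and yes/no choices are made in the admin panel, not in the config file.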
|
Thanks for responding
I made the changes suggested. For the PDF I still get the following:
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0
I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?
Cheers
Chris
For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.
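One thing that may help when reading the debug output: when pdftotext is given only an input file, it writes the text to a .txt file rather than to stdout, so an empty "Result contains" array with a return value of 0 does not by itself mean the extraction failed. To check what pdftotext actually extracts from a given PDF, independently of PhpDig, a small standalone script can pass "-" as the output file so the text goes to stdout. This is a minimal sketch: the function names are made up for illustration and are not part of PhpDig, and it assumes a PHP version with scalar type hints (PHP 7 or later).

```php
<?php
// Hypothetical helper (not part of PhpDig): build a pdftotext command that
// sends the extracted text to stdout ("-") instead of writing a .txt file.
function build_pdftotext_command(string $binary, string $pdfPath): string
{
    // escapeshellarg() quotes each argument so odd file names are safe.
    return escapeshellarg($binary) . ' ' . escapeshellarg($pdfPath) . ' - 2>&1';
}

// Hypothetical helper: run the command and return the output lines
// together with the exit status, mirroring what the debug output reports.
function extract_pdf_text(string $binary, string $pdfPath): array
{
    $lines = [];
    $status = 0;
    exec(build_pdftotext_command($binary, $pdfPath), $lines, $status);
    return [$lines, $status];
}
```

Calling extract_pdf_text('/usr/bin/pdftotext', 'pdftest.pdf') on a local copy of the test PDF and printing the result with print_r() should show whether the file contains any extractable text at all. A PDF made only of scanned images would give an empty array here too.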
|
This is what I have already
define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between
So you have the following; what version of PhpDig are you using?
Code:
define('PHPDIG_INDEX_PDF',true);
I'm using 1.8.7
|
Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?
|