PDF and CATDOC indexing


chrisdgreen
10-26-2005, 08:19 AM
Having loads of fun with this today :no:

The only way I can get a .doc file to index is if I spider the doc directly.
It isn't found from internal links. I have a page (see link below) with a test PDF and a test .doc linked in the body of the text. Neither of these is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php

If I give the spider the full URL of the .doc, I get the following:

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0

4:http://www.sccyp.org.uk/testdoc.doc

Why won't it spider from the internal link?

For the PDF, the spider appears to see the file if I add its URL directly, but it produces no output:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0

1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table

Again, why won't it spider from the internal link? And when it does pick up the file, what happens to the content?

I have been through most of the threads on the board relating to this and can't find an answer.

Any help gratefully received.

Chris

Charter
10-26-2005, 11:47 PM
In the config file set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period) and then from the admin panel, use a large search depth, set links per to zero, and select the no option.
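
For reference, the three config settings named above might look like the following in the config file. This is only a sketch: the constant names and values are the ones given in this post, and the define() form is assumed from the other PhpDig constants quoted later in this thread, so check them against your own config file before editing. The admin-panel choices (search depth, links per, the no option) are made in the admin interface and are not shown here.

```php
// Sketch only: names and values as given in the post above.
// Find the existing lines in your config file and change them there,
// rather than adding duplicates (a constant can only be defined once).
define('LIMIT_TO_DIRECTORY', false);
define('PHPDIG_IN_DOMAIN', true);
define('PHPDIG_PDF_EXTENSION', '.txt'); // includes the leading period
```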

chrisdgreen
10-27-2005, 10:29 AM
Thanks for responding.
I made the suggested changes, but for the PDF I still get the following:

Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0

I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?

Cheers

Chris

Charter
10-27-2005, 10:52 AM
For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.

chrisdgreen
10-27-2005, 10:55 AM
This is what I have already:

define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between

Charter
10-27-2005, 11:48 AM
So you have the following; what version of PhpDig are you using?

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

chrisdgreen
10-28-2005, 05:55 AM
I'm using 1.8.7

Charter
11-01-2005, 03:50 PM
Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?