PhpDig.net

chrisdgreen 10-26-2005 07:19 AM

PDF and CATDOC indexing
 
Having loads of fun with this today :no:

The only way I can get a .doc file to index is to spider the .doc directly.
The spider doesn't find it from internal links. I have a page (see link) with a test PDF and a test .doc included in the body of the text. Neither of these is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php

If I give the spider the full URL of the .doc, I get the following:

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0

4:http://www.sccyp.org.uk/testdoc.doc

Why won't it spider from the internal link?

For the PDF, the spider appears to see the file if I add its URL to the spider directly, but it produces no output:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0

1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table

Again, why won't it spider from the internal link? And when the file is spidered directly, what happens to the content?

I have been through most of the threads on the board relating to this and can't find an answer.

Any help gratefully received.

Chris

Charter 10-26-2005 10:47 PM

In the config file, set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period). Then, from the admin panel, use a large search depth, set links per to zero, and select the no option.
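
For reference, the config-file half of that advice might look like the sketch below. The three constant names and values are taken from the post above. That they live in PhpDig's includes/config.php and are set with define() is an assumption based on the other settings quoted in this thread, so edit the existing lines rather than adding duplicates.

```php
<?php
// Sketch only: edit the existing define() lines in the PhpDig config file
// (assumed to be includes/config.php); do not add second copies.
define('LIMIT_TO_DIRECTORY', false);     // do not restrict spidering to the start directory
define('PHPDIG_IN_DOMAIN', true);        // follow links within the same domain
define('PHPDIG_PDF_EXTENSION', '.txt');  // note the leading period
```

The search depth, links per, and yes/no option are chosen in the admin panel when starting the spider, so they have no line in this fragment.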

chrisdgreen 10-27-2005 09:29 AM

Thanks for responding.
I made the suggested changes, but for the PDF I still get the following:

Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0

I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?

Cheers

Chris
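
A note on the empty Array ( ) above: when pdftotext is given only an input file, it writes the extracted text to a file beside the input (the input name with .txt added, or .pdf swapped for .txt) instead of to stdout, so the captured result can be empty even with a return value of 0. That would fit the PHPDIG_PDF_EXTENSION setting, although how PhpDig 1.8.7 reads that file back and cleans up the temp directory is an assumption here, not something confirmed in this thread. A minimal sketch for checking the binary by hand, outside PhpDig, follows. Passing "-" as the output name sends the text to stdout. The function names are hypothetical and not part of PhpDig.

```php
<?php
// Hypothetical helpers for testing pdftotext by hand; not part of PhpDig.

// Build a command line that makes pdftotext print to stdout.
// The trailing "-" is pdftotext's "write to stdout" output name.
function build_pdftotext_command(string $binary, string $pdfPath): string
{
    return escapeshellcmd($binary) . ' ' . escapeshellarg($pdfPath) . ' - 2>&1';
}

// Run the command and return the captured lines plus the exit status.
function run_pdftotext(string $binary, string $pdfPath): array
{
    $lines = [];
    $status = 0;
    exec(build_pdftotext_command($binary, $pdfPath), $lines, $status);
    return ['lines' => $lines, 'status' => $status];
}
```

Calling print_r(run_pdftotext('/usr/bin/pdftotext', '/path/to/pdftest.pdf')) from a script on the server (the PDF path is a placeholder) should list the extracted lines if both the binary and the PDF are readable. An empty result there would point at the PDF or the binary rather than at PhpDig.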

Charter 10-27-2005 09:52 AM

For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.

chrisdgreen 10-27-2005 09:55 AM

This is what I have already

define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between

Charter 10-27-2005 10:48 AM

So you have the following; what version of PhpDig are you using?
Code:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');


chrisdgreen 10-28-2005 04:55 AM

I'm using 1.8.7

Charter 11-01-2005 02:50 PM

Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?

