PhpDig.net

Old 10-26-2005, 07:19 AM   #1
chrisdgreen
Green Mole
 
Join Date: Jul 2005
Posts: 7
PDF and CATDOC indexing

Having loads of fun with this today

The only way I can get a .doc file to index is to spider the doc directly.
The spider doesn't find it from internal links. I have a page (see link) with a test PDF and a test doc linked in the body text; the spider lists the page itself but neither finds nor indexes either file.
http://www.sccyp.org.uk/webpages/about_ourhistory.php

If I give the spider the full URL of the doc, I get the following:

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0

4:http://www.sccyp.org.uk/testdoc.doc

Why won't it spider from the internal link?

For the PDF, it appears to see the file if I add it to the spider automatically, but it produces no output:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0

1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table

Again, why won't it spider from the internal link? And when it does, what happens to the content?

I have been through most of the threads on the board relating to this and can't find an answer.

Any help gratefully received

Chris
Old 10-26-2005, 10:47 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
In the config file, set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period). Then, from the admin panel, use a large search depth, set links per to zero, and select the no option.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
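As a rough sketch, the config-file part of the advice in reply #2 might look like the lines below. This assumes the three constants are set with define() in the PhpDig config file, in the same way as the PDF settings quoted later in this thread. The admin-panel settings (search depth, links per, and the no option) are chosen in the admin interface and are not shown here.

```php
// Assumption: these constants live in the PhpDig config file as define() calls.
define('LIMIT_TO_DIRECTORY',false);    // do not restrict the spider to the starting directory
define('PHPDIG_IN_DOMAIN',true);       // per reply #2: set to true
define('PHPDIG_PDF_EXTENSION','.txt'); // include the leading period
```

Only the values come from the reply; the comments are an interpretation and may not match the wording in the config file itself.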
Old 10-27-2005, 09:29 AM   #3
chrisdgreen
Green Mole
 
Join Date: Jul 2005
Posts: 7
Thanks for responding.
I made the suggested changes. For the PDF I still get the following:

Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0

I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?

Cheers

Chris
Old 10-27-2005, 09:52 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.
Old 10-27-2005, 09:55 AM   #5
chrisdgreen
Green Mole
 
Join Date: Jul 2005
Posts: 7
This is what I have already:

define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between
Old 10-27-2005, 10:48 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
So you have the following; what version of PhpDig are you using?
Code:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
Old 10-28-2005, 04:55 AM   #7
chrisdgreen
Green Mole
 
Join Date: Jul 2005
Posts: 7
I'm using 1.8.7
Old 11-01-2005, 02:50 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?
All times are GMT -8.