PhpDig.net

10-13-2005, 10:22 AM   #1
ripchen
Green Mole
 
Join Date: Oct 2005
Posts: 5
PDF indexing Problem (pdftotext)

I have an intranet site with linked PDFs in several directories under a directory called policies. I can't get PhpDig to index the PDFs.

Went down the checklist and everything checks out. Not sure where to go from here.

Here is the output from the echo statements.

Thanks

-----------

SITE : http://192.168.13.80/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
1:http://192.168.13.80/policies/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Duplicate of an existing document
2:http://192.168.13.80/policies/index.php
(time : 00:00:16)

No link in temporary table
links found : 2

http://192.168.13.80/policies/
http://192.168.13.80/policies/index.php
Optimizing tables...
Indexing complete !
10-14-2005, 09:03 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
The output looks okay. In the config file, if you set LIMIT_TO_DIRECTORY to false then does it index the PDF files?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
10-17-2005, 03:25 AM   #3
ripchen
Green Mole
 
Join Date: Oct 2005
Posts: 5
Sorry for the delay in responding.

That helped by indexing a few of them, but it did not index all of them.

Any other thoughts?
10-19-2005, 07:49 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Try setting PHPDIG_IN_DOMAIN to true, LIMIT_TO_DIRECTORY to false, both in the config file, and then from the admin panel, use a large search depth, set links per to zero, and choose the no option. You can increase search depth beyond twenty by editing SPIDER_MAX_LIMIT in the config file. Once indexing completes, you should see an 'indexing complete' message onscreen. If, when indexing PDFs, the process seems to die in the middle, it might be a memory issue like in this thread.
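
For reference, the settings named in this post might look like the following in the config file. This is only a sketch: it assumes the options are plain define() constants (as PHPDIG_PARSE_PDF appears to be in robot_functions.php), and the value 40 is an example depth, not a recommendation.

```php
<?php
// Sketch of the relevant config entries; assumes they are define() constants.
define('PHPDIG_IN_DOMAIN', true);     // per the post above: set to true
define('LIMIT_TO_DIRECTORY', false);  // per the post above: set to false
define('SPIDER_MAX_LIMIT', 40);       // example value only; allows a search depth beyond twenty
```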
10-20-2005, 04:11 AM   #5
ripchen
Green Mole
 
Join Date: Oct 2005
Posts: 5
Made those changes - back to square one. The indexing finishes but skips all the PDFs.

Before I made the last suggested config changes, the site was locked after indexing a few of the PDFs and I had to stop the spider in the admin panel.
10-20-2005, 07:05 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Try double checking that PHPDIG_IN_DOMAIN is set to true and LIMIT_TO_DIRECTORY is set to false. The latter should already be false from post two, but maybe it got switched back to true.

Also, in robot_functions.php find:
Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;
And replace with:
Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';
Then try an index of PDFs and see what prints onscreen.
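
The reason for appending 2>&1 is that PHP's exec() collects only standard output, so anything the binary writes to standard error is otherwise lost. A minimal sketch of the idea, assuming a Unix-like shell; run_and_capture() is a hypothetical helper, not a PhpDig function, and the failing command merely stands in for a pdftotext call:

```php
<?php
// Hypothetical helper (not part of PhpDig): run a shell command and
// return everything it printed, including error messages.
function run_and_capture($command)
{
    $output = array();
    $status = 0;
    // 2>&1 redirects stderr to stdout so exec() can collect it.
    exec('(' . $command . ') 2>&1', $output, $status);
    return array('output' => $output, 'status' => $status);
}

// Stand-in for a failing binary: writes to stderr and exits with status 3.
$result = run_and_capture('echo "could not open file" >&2; exit 3');
print_r($result['output']);
echo 'Return value is: ' . $result['status'] . "\n";
```

Without the redirect, the output array would come back empty even though the command printed an error.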
10-20-2005, 07:44 AM   #7
ripchen
Green Mole
 
Join Date: Oct 2005
Posts: 5
Here is the end of what printed on the screen:
---------------------------------------
is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/pdftotext ../admin/temp/38625112.tmp 2>&1
Result contains: Array ( )
Return value is: 0

90:http://192.168.13.80/services/leadership.pdf
(time : 00:07:59)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
-------------------------------------------------

Nothing else displays beneath, but the top of the screen indicates spidering is still in progress. The admin panel says "locked". It did seem to find a few more PDFs, but certainly not all.

On the admin panel, I ask it to use a subdir of "policies". Under policies are 12 subdirs, each containing 1-10 PDFs. All the PDFs are linked to the pages. From what printed on the screen, it did not go down under policies despite using a search depth of 40.

Thanks
10-20-2005, 08:02 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
What is the filesize of the PDF file that appears after leadership.pdf in the list? It seems like there is a big PDF in there somewhere that is using up all your PHP memory, which in turn kills PhpDig so it stops indexing and remains locked. Look at the filesizes and unlink the big ones. PDFs of two or three MBs are probably okay, but it depends on your PHP memory.
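
One way to act on this would be to compare each PDF's size against a threshold before it is read into memory. The sketch below is an illustration under stated assumptions, not PhpDig code: both function names are hypothetical, and the 3 MB default simply borrows the "two or three MBs" figure above.

```php
<?php
// Hypothetical helper: convert a php.ini size such as "8M" or "128M" to bytes.
function ini_size_to_bytes($value)
{
    $value = trim($value);
    $number = (int) $value;
    switch (strtoupper(substr($value, -1))) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

// Hypothetical check: is a PDF of $size_bytes small enough to index?
// The 3 MB default is an assumption, not a PhpDig setting.
function pdf_size_is_ok($size_bytes, $max_bytes = 3145728)
{
    return $size_bytes <= $max_bytes;
}

echo 'memory_limit in bytes: ' . ini_size_to_bytes(ini_get('memory_limit')) . "\n";
var_dump(pdf_size_is_ok(2 * 1024 * 1024)); // a 2 MB PDF passes
var_dump(pdf_size_is_ok(6 * 1024 * 1024)); // a 6 MB PDF is rejected
```

For a file on local disk, the size passed in could come from filesize(); how the size would be obtained for a remote PDF is left open here.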
10-20-2005, 08:39 AM   #9
ripchen
Green Mole
 
Join Date: Oct 2005
Posts: 5
In that directory the next PDF is 6.3 MB.

So I'll have to see what I can do about that file.
10-20-2005, 11:14 AM   #10
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Untested, but if you want to try and index part of the big PDFs...

In robot_functions.php find:
Code:
        while (!feof($fp)) {
            $file_content[] = fread($fp,8192);
        }
And replace with:
Code:
        $oh_stop_me = 0;
        while (!feof($fp) && $oh_stop_me < 125) {
            $file_content[] = fread($fp,8192);
            $oh_stop_me++;
        }
The 125 caps the loop at 125 reads of 8,192 bytes each, which is meant to allow for up to 1,024,000 bytes from bigger PDF files.
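
To check the arithmetic: each fread() call returns at most 8,192 bytes, so 125 calls yield at most 125 × 8,192 = 1,024,000 bytes. Below is a self-contained sketch of the same capped loop, using an in-memory stream in place of the PDF download. read_capped() is a hypothetical name, not a function in robot_functions.php.

```php
<?php
// Hypothetical helper mirroring the capped loop above.
function read_capped($fp, $max_chunks = 125, $chunk_size = 8192)
{
    $chunks = array();
    $count = 0;
    // Stop at end of stream or after $max_chunks reads, whichever comes first.
    while (!feof($fp) && $count < $max_chunks) {
        $chunks[] = fread($fp, $chunk_size);
        $count++;
    }
    return implode('', $chunks);
}

// Stand-in for a big PDF: 2,000,000 bytes held in a memory stream.
$fp = fopen('php://memory', 'w+');
fwrite($fp, str_repeat('x', 2000000));
rewind($fp);

$data = read_capped($fp);
fclose($fp);
echo strlen($data) . "\n"; // never more than 125 * 8192 bytes
```

On a network stream a single fread() may return fewer than 8,192 bytes, so the amount actually read can fall below the cap.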