PhpDig.net

PhpDig.net (http://www.phpdig.net/forum/index.php)
-   External Binaries (http://www.phpdig.net/forum/forumdisplay.php?f=36)
-   -   PDF indexing Problem (pdftotext) (http://www.phpdig.net/forum/showthread.php?t=2194)

ripchen 10-13-2005 10:22 AM

PDF indexing Problem (pdftotext)
 
I have an intranet site with linked PDFs in several directories under a directory called policies. I can't get PhpDig to index the PDFs.

I went down the checklist and everything checks out. Not sure where to go from here.

Here is the output from the echo statements.

Thanks

-----------

SITE : http://192.168.13.80/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
1:http://192.168.13.80/policies/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Duplicate of an existing document
2:http://192.168.13.80/policies/index.php
(time : 00:00:16)

No link in temporary table
links found : 2

http://192.168.13.80/policies/
http://192.168.13.80/policies/index.php
Optimizing tables...
Indexing complete !

Charter 10-14-2005 09:03 AM

The output looks okay. In the config file, if you set LIMIT_TO_DIRECTORY to false then does it index the PDF files?

ripchen 10-17-2005 03:25 AM

Sorry for the delay in responding.

That helped: it indexed a few of the PDFs, but not all of them.

Any other thoughts?

Charter 10-19-2005 07:49 AM

Try setting PHPDIG_IN_DOMAIN to true, LIMIT_TO_DIRECTORY to false, both in the config file, and then from the admin panel, use a large search depth, set links per to zero, and choose the no option. You can increase search depth beyond twenty by editing SPIDER_MAX_LIMIT in the config file. Once indexing completes, you should see an 'indexing complete' message onscreen. If, when indexing PDFs, the process seems to die in the middle, it might be a memory issue like in this thread.
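
For reference, the two config settings above might look roughly like this. This is a sketch that assumes the config file sets these constants with define(); the SPIDER_MAX_LIMIT value of 40 is only an example, and the comments paraphrase the advice in this post rather than the official documentation.

```php
<?php
// Hypothetical excerpt of the PhpDig config file (assumes define() is used).
define('PHPDIG_IN_DOMAIN', true);     // set to true as suggested above
define('LIMIT_TO_DIRECTORY', false);  // set to false as suggested above
define('SPIDER_MAX_LIMIT', 40);       // example value: allows a search depth beyond twenty
```

The search depth, links per, and yes/no choices are then made in the admin panel, not in this file.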

ripchen 10-20-2005 04:11 AM

Made those changes - back to square one. The indexing finishes but skips all the PDFs.

Before I made the last suggested config changes, the site was locked after indexing a few of the PDFs and I had to stop the spider in the admin panel.

Charter 10-20-2005 07:05 AM

Try double checking that PHPDIG_IN_DOMAIN is set to true and LIMIT_TO_DIRECTORY is set to false. The latter should already be false from post two, but maybe it got switched back to true.

Also, in robot_functions.php find:
Code:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;
And replace with:
Code:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';
Then try an index of PDFs and see what prints onscreen.
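
The point of appending 2>&1 is that anything the command writes to stderr ends up in the captured output instead of being lost. A minimal standalone sketch of the idea, assuming a Unix-like shell; run_and_capture is a hypothetical helper, not a PhpDig function, and 'echo hello' stands in for the real pdftotext command:

```php
<?php
// Hypothetical helper, not part of PhpDig: runs a shell command and
// returns its output lines together with its exit status.
function run_and_capture($command)
{
    $result = array();
    $retval = null;
    // 2>&1 redirects stderr to stdout, so error messages land in $result too.
    exec($command . ' 2>&1', $result, $retval);
    return array($result, $retval);
}

// 'echo hello' is a stand-in for the real command line.
list($result, $retval) = run_and_capture('echo hello');
echo 'Result contains: ';
print_r($result);
echo 'Return value is: ' . $retval . "\n";
```

An empty result array with a return value of 0 means the command exited normally without printing any message.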

ripchen 10-20-2005 07:44 AM

here is the end of what printed on the screen:
---------------------------------------
is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/pdftotext ../admin/temp/38625112.tmp 2>&1
Result contains: Array ( )
Return value is: 0

90:http://192.168.13.80/services/leadership.pdf
(time : 00:07:59)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
-------------------------------------------------

Nothing else displays beneath that, but the top of the screen indicates spidering is still in progress. The admin panel says "locked". It did seem to find a few more PDFs, but certainly not all.

On the admin panel, I ask it to use a subdir of "policies". Under policies are 12 subdirs, each containing 1-10 PDFs. All the PDFs are linked from the pages. From what printed on the screen, it did not go down under policies despite using a search depth of 40.

Thanks

Charter 10-20-2005 08:02 AM

What is the filesize of the PDF file that appears after leadership.pdf in the list? It seems like there is a big PDF in there somewhere that is using up all your PHP memory, which in turn kills PhpDig so it stops indexing and remains locked. Look at the filesizes and unlink the big ones. PDFs of two or three MBs are probably okay, but it depends on your PHP memory.
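
To find the big files quickly, something along these lines could list the PDFs above a size threshold. It is a rough sketch: find_large_pdfs is a hypothetical helper, not part of PhpDig, the 3 MB threshold in the usage comment is just the rule of thumb above, and the directory path is an assumption you would replace with your own.

```php
<?php
// Hypothetical helper, not part of PhpDig: returns path => size in bytes
// for every .pdf file under $dir that is larger than $max_bytes.
function find_large_pdfs($dir, $max_bytes)
{
    $large = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()
            && strtolower($file->getExtension()) === 'pdf'
            && $file->getSize() > $max_bytes) {
            $large[$file->getPathname()] = $file->getSize();
        }
    }
    arsort($large); // biggest files first
    return $large;
}

// The PHP memory limit that the file sizes should be weighed against.
echo 'memory_limit: ' . ini_get('memory_limit') . "\n";

// Usage (path is an assumption):
// print_r(find_large_pdfs('/path/to/policies', 3 * 1024 * 1024));
```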

ripchen 10-20-2005 08:39 AM

In that directory the next PDF is 6.3 MB.

So I'll have to see what I can do about that file.

Charter 10-20-2005 11:14 AM

Untested, but if you want to try and index part of the big PDFs...

In robot_functions.php find:
Code:

        while (!feof($fp)) {
            $file_content[] = fread($fp,8192);
        }

And replace with:
Code:

        $oh_stop_me = 0;
        while (!feof($fp) && $oh_stop_me < 125) {
            $file_content[] = fread($fp,8192);
            $oh_stop_me++;
        }

The 125 is meant to allow for 1,024,000 bytes from bigger PDF files.
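
The figure comes from 125 reads of 8192 bytes each: 125 x 8192 = 1,024,000. The same capped loop can be tried on its own as below. This is only an illustration: read_capped is a hypothetical wrapper, not a PhpDig function, and the in-memory stream stands in for the real file handle.

```php
<?php
// Hypothetical wrapper around the capped loop above: reads at most
// $max_chunks chunks of up to $chunk_size bytes from an open stream.
function read_capped($fp, $max_chunks = 125, $chunk_size = 8192)
{
    $file_content = array();
    $chunks_read = 0;
    while (!feof($fp) && $chunks_read < $max_chunks) {
        $file_content[] = fread($fp, $chunk_size);
        $chunks_read++;
    }
    return $file_content;
}

// Stand-in for a big PDF: an in-memory stream holding 2,000,000 bytes.
$fp = fopen('php://memory', 'w+');
fwrite($fp, str_repeat('x', 2000000));
rewind($fp);
$content = implode('', read_capped($fp));
fclose($fp);

// Never more than 125 * 8192 bytes, whatever the size of the source.
echo strlen($content) . "\n";
```

On a network stream a single fread may return fewer than 8192 bytes, so 1,024,000 is an upper bound rather than a guaranteed amount.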


All times are GMT -8.