spider hangs on indexing pdf (pstotext)


sushie
06-07-2005, 10:39 AM
hi there,

i'm trying to use phpdig for the first time...

i read a lot of threads about problems with pstotext, and tried several hints, but still can't get it to work...

my system:
------------------------
-FreeBSD 4.10
-PHP Version 4.3.1
-PHPDIG_VERSION 1.8.7
------------------------

from command line pstotext seems to work correctly (it outputs the file content on STDOUT as expected)

the paths in config.php are ok:
------------------------
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
------------------------

i tweaked spider.php and robot_functions.php as mentioned somewhere. these are the outputs:
------------------------
Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
------------------------

... just after printing that, the spider hangs without any error message...

can anyone help?

Charter
06-07-2005, 11:29 AM
Okay, that all looks good, so remove the code to print those outputs, and instead, in robot_functions.php find:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;

And replace with:

$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';

Then try an index of a PDF file and see what prints onscreen.

Also, if the PDFs were not from dvips, then try the following:

define('PHPDIG_OPTION_PDF','');

And of course, since output is STDOUT, use the following:

define('PHPDIG_PDF_EXTENSION','');
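
For anyone following along, the debugging step above (appending 2>&1 so error messages become visible) can be tried outside PHPDig with a small standalone script. This is only a sketch: the helper name run_and_report and the command no_such_tool_xyz are made up for illustration and are not part of PHPDig; only exec() and the 2>&1 redirection come from the post above.

```php
<?php
// Hypothetical helper (not part of PHPDig): run a shell command with
// stderr merged into stdout, so error messages end up in 'result'.
function run_and_report($command)
{
    $result = array();
    $retval = 0;
    // exec() only captures stdout; 2>&1 redirects stderr into it.
    $full = $command . ' 2>&1';
    exec($full, $result, $retval);
    return array('command' => $full, 'result' => $result, 'retval' => $retval);
}

// Placeholder command that should not exist on most systems.
$report = run_and_report('no_such_tool_xyz');
echo 'Command is: ' . $report['command'] . "\n";
echo 'Result contains: ' . print_r($report['result'], true) . "\n";
echo 'Return value is: ' . $report['retval'] . "\n";
```

Without the redirection, a message the tool writes to stderr never reaches $result, and the spider gives no hint about why the tool failed.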

sushie
06-10-2005, 07:31 AM
hi charter,
thanks for your reply!

i tried your advice, but without success... the spider still hangs on indexing the pdf.

this is the last thing the spider prints out:
----------
Is result test http an array: 1
What is result test http status: PDF
----------

these are my settings:
----------
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');
----------

i saw that the file permissions for '/usr/local/bin/pstotext' are all set to 755 except the file itself, which has 555 ... could that be a problem?

since i am not the administrator of the server (it's a commercial provider) i'm not able to change any of the file permissions...

*thanks for further support!

Charter
06-10-2005, 10:11 AM
As you cannot change permission on pstotext, see if your host will change the permission or try pdftotext instead. There are instructions for pdftotext here (http://www.phpdig.net/forum/faq.php?faq=phpdig_ext_bin#faq_phpdig_pdftotext).

sushie
06-13-2005, 12:31 PM
hi charter,

there was a problem with 'allow_url_fopen'. now it still doesn't index the pdf, but the spider doesn't hang anymore (still trying with 'pstotext') ... here's the output:

-----------------------
Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/pstotext ../admin/temp/66912182.tmp 2>&1
Result contains: Array ( [0] => gs: not found )
Return value is: 3
-----------------------

what does 'gs: not found' mean?

*thanks for your support

(... i'm now going to try 'pdftotext')

sushie
06-13-2005, 12:49 PM
hi again,

with 'pdftotext' it doesn't work either (i use the linux binary on the FreeBSD host...)

config:
---------------------------
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/home/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
---------------------------

output:
---------------------------
Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/ekifch/bin/pdftotext ../admin/temp/89121942.tmp 2>&1
Result contains: Array ( [0] => Abort trap )
Return value is: 134
---------------------------

*any idea?

Charter
06-13-2005, 06:57 PM
> Result contains: Array ( [0] => gs: not found )

That probably means that Ghostscript cannot be found.

> Result contains: Array ( [0] => Abort trap )

That might be a memory issue. Try pdftotext on a small PDF file.
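
As a side note on the two return values seen above (3 and 134), here is a rough sketch of how such numbers can be read. The function name describe_exit_status is hypothetical; the 126, 127 and 128+N conventions are those of POSIX-style shells, and a value like 3 is simply whatever the tool itself chose to return. Under that convention 134 is 128 + 6, and signal 6 is SIGABRT, which fits the "Abort trap" message. It does not say why the program aborted.

```php
<?php
// Hypothetical helper: give a rough reading of an exit status returned
// through exec(). Based on common POSIX shell conventions only.
function describe_exit_status($retval)
{
    if ($retval === 0) {
        return 'success';
    }
    if ($retval === 126) {
        return 'command found but not executable';
    }
    if ($retval === 127) {
        return 'command not found';
    }
    if ($retval > 128) {
        // The shell reports "terminated by signal N" as 128 + N.
        return 'terminated by signal ' . ($retval - 128);
    }
    return 'tool-specific error code ' . $retval;
}

echo describe_exit_status(134) . "\n"; // terminated by signal 6
echo describe_exit_status(3) . "\n";   // tool-specific error code 3
```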

sushie
06-15-2005, 05:57 AM
thanks to your support, some help from my server admin and lots of hours searching for a solution, i finally got it to work!

the problem was that somehow 'pstotext' did not find the ghostscript executable ('gs') when run from a php script through the web server.

i had to add "export PATH=$PATH:my_path_to_lib; " to the exec command in 'robot_functions.php'...

here's the full change-instruction in case anyone runs into the same problem:

in config.php (somewhere near 'EXTERNAL TOOLS SETUP') add:
define('PHPDIG_PATH_TO_BIN','/usr/local/bin');

in robot_functions.php (near line #1089) find:
if ($usetool) {
rename($tempfile1,$tempfile2);
exec($command,$result,$retval);

and replace with:
if ($usetool) {
// single quotes, so that the shell (not PHP) expands $PATH
$setpath = '';
if (PHPDIG_PATH_TO_BIN)
$setpath = 'export PATH=$PATH:'.PHPDIG_PATH_TO_BIN.'; ';
rename($tempfile1,$tempfile2);
exec($setpath.$command,$result,$retval);

maybe that helps someone
*cheers*
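
A general note on this technique, as a standalone sketch rather than PHPDig code: in PHP, $PATH inside a double-quoted string is treated as a PHP variable, so the prefix is built with single quotes here and the shell does the expansion. The helper name with_extra_path is hypothetical, and escapeshellarg() is added as a precaution that the original change does not use.

```php
<?php
// Hypothetical helper: prefix a shell command with an extended PATH.
// Single-quoted PHP strings keep $PATH literal, so the shell expands it.
function with_extra_path($command, $extraDir)
{
    if ($extraDir === '') {
        return $command; // nothing to add, run the command unchanged
    }
    return 'export PATH="$PATH":' . escapeshellarg($extraDir) . '; ' . $command;
}

// Show the command string that would be handed to exec().
echo with_extra_path('/usr/local/bin/pstotext file.tmp', '/usr/local/bin') . "\n";

// Check that the child shell really sees the extra directory.
$out = array();
exec(with_extra_path('echo "$PATH"', '/usr/local/bin'), $out);
echo $out[0] . "\n";
```

The web server usually runs PHP with a shorter PATH than a login shell has, which would explain why pstotext found gs on the command line but not when started by the spider.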