PhpDig.net

02-14-2004, 07:27 AM   #1
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
pdftotext with phpdig does not work

hello board,

phpdig works great for html and php files - but
pdf files don't work.

i tried on several of our machines (debian/redhat, php 4.2.2/4.2.3).
pdftotext works fine from bash.
when phpdig calls it, only one or two files are opened and
only partial content ends up in the temp and text files.


any ideas - anyone???
tomas
02-14-2004, 12:20 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. In the config file, set the following, and make sure that both the directories leading to pdftotext and the pdftotext file itself have 755 permissions.
PHP Code:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/path/to/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt'); 
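The permission check described above can also be scripted. The following is a minimal sketch, not part of PhpDig: `check_binary_path()` is a hypothetical helper, it assumes a Unix-like host, and it tests access for whichever user runs the script, which should be the web server user if the spider is started from the browser.

```php
<?php
// Hypothetical helper (not part of PhpDig): report every reason the current
// user might be unable to reach or run the given binary.
function check_binary_path($binary)
{
    $problems = array();
    if (!is_file($binary)) {
        $problems[] = "$binary does not exist or is not a regular file";
    } elseif (!is_executable($binary)) {
        $problems[] = "$binary is not executable by this user";
    }
    // Each parent directory needs the execute (search) bit to be traversable.
    $dir = dirname($binary);
    while (true) {
        if (!is_dir($dir)) {
            $problems[] = "$dir is not a directory";
        } elseif (!is_executable($dir)) {
            $problems[] = "$dir is not searchable by this user";
        }
        $parent = dirname($dir);
        if ($parent === $dir) {
            break; // reached the top of the path
        }
        $dir = $parent;
    }
    return $problems;
}

// '/path/to/pdftotext' is the placeholder from the config lines above.
foreach (check_binary_path('/path/to/pdftotext') as $problem) {
    echo $problem, "\n";
}
```

An empty result only means the path looks reachable; it does not prove that the binary runs correctly when called from PHP.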
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
02-14-2004, 12:44 PM   #3
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hello charter,

thanks for quick response -
i checked all of those points - but the files are all still empty.
if i set define('PHPDIG_PDF_EXTENSION',''); i can see the
temp files, and they are empty too

???
02-14-2004, 12:49 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. What version of PhpDig are you using?
02-14-2004, 12:57 PM   #5
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
1.80
and the files aren't empty - they have only one
page break.

i tried a lot of different pdfs
and a lot of settings in define('PHPDIG_OPTION_PDF','');
-q
-nopgbrk
empty

but nothing works
02-14-2004, 01:27 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. A similar problem with PHP 4.2.2 was described in this post. Not sure if it is related to your problem. What do you get onscreen when you add the code in this post?
02-14-2004, 01:48 PM   #7
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
3:http://192.168.1.240/mysite/pdf/02.pdf
(time : 00:00:21)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1



Is result test http an array: 1
What is result test http status: PDF
02-14-2004, 02:10 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. That all looks fine. robot_functions.php contains the following line:
PHP Code:
exec($command,$result,$retval); 
Right after that line place the following lines:
PHP Code:
echo "<br><br>Result contains: ";
print_r($result);
echo "<br>Return value is: " . $retval . "<br><br>";
What shows onscreen for these echo statements?
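One thing to keep in mind when reading that output: exec() only collects what the command prints on stdout, while error messages normally go to stderr. The sketch below shows one way to capture both. It assumes a POSIX shell and a single command without redirections of its own; `run_and_capture()` is a hypothetical helper, not a PhpDig function.

```php
<?php
// Hypothetical helper: run one shell command and collect stdout and stderr
// together, along with the exit status.
function run_and_capture($command)
{
    $output = array();
    $status = 0;
    // "2>&1" sends stderr to stdout so that exec() can capture it as well.
    exec($command . ' 2>&1', $output, $status);
    return array('status' => $status, 'output' => $output);
}

// Demo: a command that prints only to stderr and exits with a non-zero status.
$result = run_and_capture("sh -c 'echo oops >&2; exit 3'");
echo "Return value is: " . $result['status'] . "\n";
print_r($result['output']);
```

If the converter writes its text to an output file rather than to stdout, an empty array is expected even on success, so the captured stderr lines and the return value are the more useful signals.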
02-14-2004, 02:16 PM   #9
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1



Result contains: Array ( )
Return value is: 0



Is result test http an array: 1
What is result test http status: PDF
02-14-2004, 02:21 PM   #10
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
charter - by the way

how can i return the favour for your friendly support here
and for your work on the phpdig project?
02-14-2004, 02:52 PM   #11
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The following means that the exec command is succeeding:

Return value is: 0

However the following means that the output from the exec command has no content:

Result contains: Array ( )

pdftotext version 1.01 lists the following known bug:

Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.

As you are able to run pdftotext from bash, I don't think this is the problem.

I would say that there is a problem with PHP trying to exec pdftotext from the script. Perhaps try to upgrade to the latest stable version of PHP or try a different converter.
02-15-2004, 06:59 AM   #12
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hello charter,

ok, i tried it on another server (fedora core 1 / php 4.3.3)
-> and grabbing pdf files now works fine.

so the conclusion is: pdf indexing does not work with php 4.2.2/4.2.3!

thanks a lot
tomas
02-25-2004, 07:20 AM   #13
alivin70
Orange Mole
 
Join Date: Sep 2003
Posts: 40
After a lot of work I've found the bug!

PHP 4.2.2 handles binary files incorrectly when they are read with the function file($remote_url).

That function is used in robot_functions.php during indexing.

I posted a patch here
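The linked patch is not reproduced in this thread. As a general illustration only, and not alivin70's actual fix, a binary-safe alternative to file() is to open the source in binary mode and copy it in fixed-size chunks. `fetch_binary()` and its parameters are hypothetical names; reading from a URL this way assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper, not the patch from the thread: copy a URL or file to a
// local path in binary chunks instead of loading it line by line with file().
function fetch_binary($source, $destination, $chunk_size = 8192)
{
    $in = @fopen($source, 'rb');
    if (!$in) {
        return false;
    }
    $out = @fopen($destination, 'wb');
    if (!$out) {
        fclose($in);
        return false;
    }
    $total = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunk_size);
        if ($chunk === false || $chunk === '') {
            break; // read error, timeout, or end of data
        }
        $written = fwrite($out, $chunk);
        if ($written === false) {
            break;
        }
        $total += $written;
    }
    fclose($in);
    fclose($out);
    return $total; // number of bytes copied
}

// Demo with a local file that contains every byte value, as a PDF might.
$src = tempnam(sys_get_temp_dir(), 'src');
$dst = tempnam(sys_get_temp_dir(), 'dst');
$fh = fopen($src, 'wb');
fwrite($fh, "%PDF-1.4\n" . implode('', array_map('chr', range(0, 255))));
fclose($fh);

$bytes = fetch_binary($src, $dst);
echo "copied $bytes bytes, identical: ";
echo (md5_file($src) === md5_file($dst)) ? "yes\n" : "no\n";
unlink($src);
unlink($dst);
```

Copying in chunks also keeps only one small buffer in memory at a time, whereas file() loads the whole document into an array.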
02-25-2004, 12:13 PM   #14
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hello alivin,

great job :-)

now pdf digging works fine even with php 4.2.x -
and in my opinion the file() function also has a bug in php 4.3.x:
to dig larger pdfs, the php.ini setting had to be overridden with:
ini_set('memory_limit', '64M');
with your workaround there are no more memory problems.

thanks again for posting back to this thread
maybe these little ideas are helpful for you:

http://www.phpdig.net/showthread.php?s=&threadid=500
http://www.phpdig.net/showthread.php...=2338#post2338


kind regards from monaco di bavaria
tomas
02-25-2004, 01:45 PM   #15
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hi alivin,

the memory issue does not go away - even with your workaround
=> my earlier test used the wrong php.ini setting

so if anybody has a problem spidering large pdfs, especially ones with
large vector graphics in them, override php.ini this way:

in spider.php, add this as the first line:
ini_set('memory_limit', '64M');
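A small sketch of the same override with a check that it took effect. The option name is quoted here because an unquoted memory_limit is read as a constant name. Whether a script is allowed to change the limit depends on the host configuration, so treat this as an assumption to verify, not a guarantee.

```php
<?php
// Raise the limit for this script only; ini_set() returns the previous
// value on success and false on failure.
$previous = ini_set('memory_limit', '64M');

if ($previous === false) {
    echo "memory_limit could not be changed\n";
} else {
    echo "memory_limit: was $previous, now " . ini_get('memory_limit') . "\n";
}
```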

anyway - your bugfix works great :-)

regards
tomas

Last edited by tomas; 02-25-2004 at 02:27 PM.
All times are GMT -8.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.