pdftotext with phpdig does not work


tomas
02-14-2004, 07:27 AM
hello board,

phpdig works great for html and php files - but
pdf files don't work.

i tried on several of our machines (debian/redhat, php 4.2.2/4.2.3).
pdftotext works fine from bash.
if i call it through phpdig, only one or two files are opened and
only partial content ends up in the temp and text files.


any ideas - anyone???
tomas

Charter
02-14-2004, 12:20 PM
Hi. In the config file, set the following, and make sure that the directories leading to pdftotext, as well as the pdftotext file itself, have 755 permissions.

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/path/to/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
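
A minimal sketch for double-checking the configured path from PHP itself, since a binary that runs fine from bash is not necessarily visible to the web server user. The helper name check_converter is hypothetical and not part of PhpDig; it relies only on the standard is_file, is_executable and fileperms functions.

```php
<?php
// Hypothetical helper (not part of PhpDig): report whether the converter
// configured in PHPDIG_PARSE_PDF is visible to the PHP process.
function check_converter($path)
{
    $exists = is_file($path);
    return array(
        'exists'     => $exists,
        'executable' => $exists && is_executable($path),
        // Last three octal digits of the file mode, e.g. "755"; null if missing.
        'mode'       => $exists ? substr(sprintf('%o', fileperms($path)), -3) : null,
    );
}
```

Calling print_r(check_converter('/path/to/pdftotext')); from a script served by the web server shows what that user sees, which may differ from what a shell login sees.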

tomas
02-14-2004, 12:44 PM
hello charter,

thanks for the quick response -
i checked all of those points - but all files are still empty.
if i set define('PHPDIG_PDF_EXTENSION',''); i can see the
temp files, and they are empty too

???

Charter
02-14-2004, 12:49 PM
Hi. What version of PhpDig are you using?

tomas
02-14-2004, 12:57 PM
1.80
and the files aren't empty - they contain only one
page break.

i tried a lot of different pdfs
and tried a lot of settings in define('PHPDIG_OPTION_PDF','');
-q
-nopgbrk
empty

but nothing works

Charter
02-14-2004, 01:27 PM
Hi. There was a similar problem with PHP 4.2.2 described in this (http://www.phpdig.net/showthread.php?postid=1239#post1239) post. Not sure if this is related to your problem. What do you get onscreen when you add the code in this (http://www.phpdig.net/showthread.php?postid=1641#post1641) post?

tomas
02-14-2004, 01:48 PM
3:http://192.168.1.240/mysite/pdf/02.pdf
(time : 00:00:21)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1



Is result test http an array: 1
What is result test http status: PDF

Charter
02-14-2004, 02:10 PM
Hi. That all looks fine. In robot_functions is the following line:

exec($command,$result,$retval);

Right after that line place the following lines:

echo "<br><br>Result contains: ";
print_r($result);
echo "<br>Return value is: " . $retval . "<br><br>";

What shows onscreen for these echo statements?

tomas
02-14-2004, 02:16 PM
Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1



Result contains: Array ( )
Return value is: 0



Is result test http an array: 1
What is result test http status: PDF

tomas
02-14-2004, 02:21 PM
charter - by the way

how can i do you a little favour in return for your friendly way
of working here and for the phpdig project?

Charter
02-14-2004, 02:52 PM
Hi. The following means that the exec command is succeeding:

Return value is: 0

However the following means that the output from the exec command has no content:

Result contains: Array ( )

The pdftotext version 1.01 has the following bugs:

Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.

As you are able to run pdftotext from bash, I don't think this is the problem.

I would say that there is a problem with PHP trying to exec pdftotext from the script. Perhaps try to upgrade to the latest stable version of PHP or try a different converter.
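
One way to narrow this down is to capture error messages along with the output. The sketch below is an assumption-laden illustration, not PhpDig code: the helper names build_pdftotext_command and run_and_report are hypothetical. It passes "-" as the text file so that pdftotext prints to stdout, and appends 2>&1 so that anything written to stderr also lands in the result array. Note that if a command writes its text to a file rather than to stdout, an empty result array can be expected even when the conversion succeeds.

```php
<?php
// Hypothetical helper: build a pdftotext command line that prints the
// extracted text to stdout and folds stderr into the captured output.
function build_pdftotext_command($binary, $options, $pdfFile)
{
    $parts = array(escapeshellarg($binary));
    if ($options !== '') {
        $parts[] = $options; // trusted config value such as "-q" or "-nopgbrk"
    }
    $parts[] = escapeshellarg($pdfFile);
    $parts[] = '-';    // "-" as the text file makes pdftotext write to stdout
    $parts[] = '2>&1'; // capture error messages as well
    return implode(' ', $parts);
}

// Hypothetical helper: run a command and return its output lines and status.
function run_and_report($command)
{
    $result = array();
    $retval = -1;
    exec($command, $result, $retval);
    return array('lines' => $result, 'status' => $retval);
}
```

For example, print_r(run_and_report(build_pdftotext_command('/path/to/pdftotext', '', '/path/to/test.pdf'))); should show either the extracted text or whatever error the converter reports when started by PHP.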

tomas
02-15-2004, 06:59 AM
hello charter,

ok, i tried it on another server (fedora core 1 / php 4.3.3)
-> and grabbing pdf files now works fine.

the result: pdf indexing with php 4.2.2/4.2.3 does not work!

thanks a lot
tomas

alivin70
02-25-2004, 07:20 AM
After a lot of work, I've found the bug!

PHP 4.2.2 handles binary files incorrectly when they are read with the function file($remote_url).

That function is used in robot_functions.php during indexing.

I posted a patch here (http://www.phpdig.net/showthread.php?s=&threadid=570)
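
The linked patch is not reproduced here. As a rough sketch of the general idea only, and assuming the workaround amounts to reading the file in binary mode in fixed-size chunks instead of through file(), something like the following could be used. The helper name copy_binary and the 8192-byte chunk size are illustrative choices, not taken from the patch.

```php
<?php
// Hypothetical helper: copy a (possibly remote) file to a local path in
// binary-safe chunks, without splitting it into lines as file() does.
// Returns the number of bytes written, or false if a stream cannot be opened.
function copy_binary($source, $destination, $chunkSize = 8192)
{
    $in = @fopen($source, 'rb'); // 'b' keeps the bytes untranslated
    if ($in === false) {
        return false;
    }
    $out = @fopen($destination, 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }
    $total = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing more to read
        }
        $written = fwrite($out, $chunk);
        if ($written === false) {
            break; // write error: stop and report what was written so far
        }
        $total += $written;
    }
    fclose($in);
    fclose($out);
    return $total;
}
```

Because only one chunk is held in memory at a time, this approach also avoids loading a whole PDF into an array of lines.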

tomas
02-25-2004, 12:13 PM
hello alivin,

great job :-)

now pdf digging works fine even with php 4.2.x -
and in my opinion the file function also has a bug in php 4.3.x:
to dig larger pdfs, the php.ini setting had to be overridden with:
ini_set('memory_limit', '64M');
using your workaround there are no more memory problems.

thanks again for posting back to this thread
maybe these little ideas are helpful for you:

http://www.phpdig.net/showthread.php?s=&threadid=500
http://www.phpdig.net/showthread.php?s=&postid=2338#post2338


kind regards from monaco di bavaria
tomas

tomas
02-25-2004, 01:45 PM
hi alivin,

the memory issue has not changed - even with your workaround
=> i had tested with the wrong setting in php.ini

so if anybody has a problem spidering large pdfs, especially ones with
large vector graphics in them - override php.ini in this way:

in spider.php - write this as the first line:
ini_set('memory_limit', '64M');

anyway - your bugfix works great :-)

regards
tomas
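
A small sketch of the override with a check that it actually took effect. The setting name has to be passed as a quoted string: an unquoted memory_limit is an undefined constant, which older PHP versions only tolerate with a notice and PHP 8 rejects with an error. The helper name raise_memory_limit is hypothetical, and 64M is simply the value mentioned above.

```php
<?php
// Hypothetical helper: raise the memory limit for the current script and
// return the value now in effect, or false if the setting could not be changed.
function raise_memory_limit($limit)
{
    $previous = ini_set('memory_limit', $limit); // returns the old value, false on failure
    if ($previous === false) {
        return false;
    }
    return ini_get('memory_limit');
}
```

Calling raise_memory_limit('64M') near the top of the spidering script and checking the return value confirms whether the host allows the override at all.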