pdftotext with phpdig does not work
Hello board,
PhpDig works great for HTML and PHP files, but PDF files don't work. I tried on several of our machines (Debian/Red Hat, PHP 4.2.2/4.2.3). pdftotext works fine from bash. When it is called from PhpDig, only one or two files are opened, and only partial content ends up in the temp and text files. Any ideas, anyone? tomas |
Hi. In the config file set the following, and make sure that the pdftotext file and the directories leading to it have 755 permissions.
PHP Code:
|
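A minimal sketch of how the permission advice above could be checked from PHP itself, so that the test runs as the same user as the spider. The helper name `check_converter_path` is hypothetical and not part of PhpDig; the path is the one that appears in the debug output later in this thread.

```php
<?php
// Hypothetical helper, not part of PhpDig: report whether a converter binary
// and every directory leading to it can be used by the PHP process.
function check_converter_path($binary)
{
    $report = array();
    $report[$binary] = is_file($binary) && is_executable($binary);

    // Walk up the directory chain; each directory needs the execute
    // (search) bit for the PHP process to reach the binary.
    $dir = dirname($binary);
    while (true) {
        $report[$dir] = is_dir($dir) && is_executable($dir);
        $parent = dirname($dir);
        if ($parent === $dir) {
            break; // reached the top of the path
        }
        $dir = $parent;
    }
    return $report;
}

// Path taken from the debug output in this thread; adjust to your install.
$binary = '/var/www/html/mysite/phpdig/pdftotext/pdftotext';
foreach (check_converter_path($binary) as $path => $ok) {
    echo ($ok ? 'ok      ' : 'PROBLEM ') . $path . "\n";
}
```

Any line marked PROBLEM is a candidate for `chmod 755`. Running the check through the web server rather than from a shell matters, because the shell user and the web server user can have different access rights.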
Hello charter,
thanks for the quick response. I checked all of those points, but all files are still empty. If I set define('PHPDIG_PDF_EXTENSION',''); I can see the temp files, and they are empty too. |
Hi. What version of PhpDig are you using?
|
1.80
And the files aren't completely empty: they contain only a single page break. I tried a lot of different PDFs and a lot of settings in define('PHPDIG_OPTION_PDF',''); (-q, -nopgbrk, empty), but nothing works. |
|
3:http://192.168.1.240/mysite/pdf/02.pdf
(time : 00:00:21)
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Is result test http an array: 1
What is result test http status: PDF |
Hi. That all looks fine. In robot_functions there is the following line:
PHP Code:
PHP Code:
|
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Result contains: Array ( )
Return value is: 0
Is result test http an array: 1
What is result test http status: PDF |
charter, by the way:
how can I do you a small favour in return for your friendly work here and on the PhpDig project? |
Hi. The following means that the exec command is succeeding:
Return value is: 0
However, the following means that the output from the exec command has no content:
Result contains: Array ( )
The pdftotext version 1.01 has the following bugs:
Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.
As you are able to run pdftotext from bash, I don't think this is the problem. I would say that there is a problem with PHP trying to exec pdftotext from the script. Perhaps try upgrading to the latest stable version of PHP, or try a different converter. |
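To make the two debug values above concrete, here is a minimal sketch of how PHP's `exec()` reports an exit status and captured output separately. The wrapper `run_converter` and the local file `/tmp/02.pdf` are hypothetical and are not PhpDig code; the trailing `-` asks pdftotext to write the text to standard output.

```php
<?php
// Hypothetical wrapper, not PhpDig code: run a command and report both the
// exit status and the captured output lines.
function run_converter($command)
{
    $output = array();
    $status = -1;
    // 2>&1 folds stderr into the captured output so that error messages
    // from the converter are not silently lost.
    exec($command . ' 2>&1', $output, $status);
    return array('status' => $status, 'lines' => $output);
}

// Binary path from the debug output in this thread; the PDF path is an
// assumed local copy used only for illustration.
$binary = '/var/www/html/mysite/phpdig/pdftotext/pdftotext';
$pdf = '/tmp/02.pdf';
$result = run_converter(escapeshellarg($binary) . ' ' . escapeshellarg($pdf) . ' -');

echo 'Return value is: ' . $result['status'] . "\n";
echo 'Lines captured: ' . count($result['lines']) . "\n";
```

A status of 0 together with zero captured lines is the combination described above: the command ran, but nothing reached PHP. Capturing stderr as well can help tell an empty conversion apart from an error message that was never shown.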
Hello charter,
OK, I tried it on another server (Fedora Core 1, PHP 4.3.3), and grabbing PDF files now works fine. So the result is: PDF indexing with PHP 4.2.2/4.2.3 does not work! Thanks a lot, tomas |
Quote:
PHP 4.2.2 incorrectly handles binary files that are read with the function file($remote_url). That function is used in robot_functions.php during indexing. I posted a patch here |
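The patch itself is not reproduced in this thread. As an illustration only, one common way to avoid `file()` for binary content is to copy the data in fixed-size chunks with `fopen()` in binary mode. The helper name `fetch_binary` is hypothetical, and this sketch should not be read as the posted patch.

```php
<?php
// Hypothetical helper, not the patch from this thread: copy a URL or file
// to a local path in fixed-size binary chunks instead of using file().
function fetch_binary($source, $target, $chunk_size = 8192)
{
    $in = fopen($source, 'rb'); // 'b' requests binary-safe reading
    if ($in === false) {
        return false;
    }
    $out = fopen($target, 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }
    $bytes = 0;
    while (!feof($in)) {
        $data = fread($in, $chunk_size);
        if ($data === false) {
            break; // read error: stop copying
        }
        $bytes += strlen($data);
        fwrite($out, $data);
    }
    fclose($in);
    fclose($out);
    return $bytes; // number of bytes copied
}
```

Reading from an http:// source this way requires `allow_url_fopen` to be enabled. Because only one chunk is held in memory at a time, the copy step itself stays small even for large PDFs.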
Hello alivin,
great job :-) Now PDF digging works fine even with PHP 4.2.x. In my opinion the file() function also has a bug in PHP 4.3.x: to dig larger PDFs, the php.ini setting had to be overridden with ini_set('memory_limit', '64M'); With your workaround there are no more memory problems. Thanks again for posting back to this thread. Maybe these little ideas are helpful for you: http://www.phpdig.net/showthread.php?s=&threadid=500 http://www.phpdig.net/showthread.php...=2338#post2338 Kind regards from monaco di bavaria, tomas |
Hi alivin,
a correction: the memory issue does not change, even with your workaround. I had tested with a wrong setting in php.ini. So if anybody has a problem spidering large PDFs, especially ones containing large vector graphics, override php.ini like this: make the following the first line in spider.php: ini_set('memory_limit', '64M'); Anyway, your bugfix works great :-) Regards, tomas |
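A minimal sketch of the override described above, with a check that it actually took effect. The 64M value comes from this thread; whether it is enough depends on the PDFs being spidered.

```php
<?php
// Change the limit for this script only; php.ini itself is left untouched.
// ini_set() returns the previous value on success and false on failure.
$previous = ini_set('memory_limit', '64M');

if ($previous === false) {
    echo "memory_limit could not be changed\n";
} else {
    echo 'memory_limit: ' . $previous . ' -> ' . ini_get('memory_limit') . "\n";
}
```

Note that the setting name is passed as a quoted string. If the server already allows more than 64M, this call would lower the limit, so it is worth checking the printed previous value first.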