Indexing PDF files
I installed the pstotext binary, but PDF files are not being indexed: no green checkmark appears when indexing the site. Does the Ghostscript binary need to be installed on the server? And do I need to upgrade PHP (I use version 4.2.2 now)? I have read about problems with newer PHP versions and the indexing of HTML tags. Is this problem solved in version 1.8?
|
Check this thread. I believe it will answer your question. :)
|
Hi. From http://research.compaq.com/SRC/virtu.../pstotext.html ...
pstotext is a program that works with Ghostscript (version 3.33 or later) to extract plain text from PostScript and PDF files (you should have Ghostscript 3.51 or later for PDF). PHP version 4.2.2/3 seems to have an issue running pdftotext through exec, as in this thread, but I am not sure whether pstotext would have the same problem. The PHP strip_tags function was replaced with a regular expression in version 1.6.3. Version 1.8.0 should not index HTML tags. |
I have fixed the path to GS, set the permissions and modified the config.php to this:
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND',true); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

Still no PDF files are indexed. What am I doing wrong? |
Hi. Try doing as in this thread. What onscreen output do you get?
|
I get the following output:
SITE : http://www.professioneel-handhaven.nl/
Uit te sluiten paden : - @NONE@
Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
HTML <--- Status
1:http://www.professioneel-handhaven.nl/Bibliotheek/ (tijd : 00:00:05)
+ + levels 1...
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 0
2:http://www.professioneel-handhaven.n...et_oordeel.pdf (tijd : 00:00:16)
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 0
3:http://www.professioneel-handhaven.n..._Verbeterd.pdf (tijd : 00:00:21)
Geen link in tijdelijke tabel

The PDF files are still not being indexed. |
>> ...i use version 4.2.2. now...
Hi. The same issue and echo results are in this thread. If I remember correctly, there have been three cases of 4.2.2 not working and one case of 4.2.3 not working. Upgrading to a later version of PHP solved the problems. |
Hello Charter, thank you for your help so far, but the problem still exists. I upgraded PHP to 4.3.4 and installed the pdftotext binary. Unfortunately, there is still no green checkmark for the indexed PDF files. Here is the on-screen output; I hope you have some new tips.
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/www.professioneel-handhaven.nl/www/Zoeken/xpdf/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 1
3:http://www.professioneel-handhaven.n...et_oordeel.pdf (tijd : 00:00:20) |
Hi. What happens when you run pdftotext from shell on a PDF file?
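A minimal sketch of that shell check, in case it helps. The helper name check_converter is made up for illustration and is not part of PhpDig. It runs whatever command you give it, keeps error messages visible on the terminal, and reports the exit status plus the number of bytes written to stdout, which roughly corresponds to the "Return value is" and "Result contains" lines in the debug output.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhpDig): run a converter command and
# report its exit status and how many bytes it wrote to stdout.
# stderr is left alone so library or Ghostscript errors stay visible.
check_converter() {
    out=$(mktemp) || return 2
    status=0
    "$@" >"$out" || status=$?
    bytes=$(wc -c <"$out" | tr -d ' ')
    rm -f "$out"
    echo "exit=$status stdout_bytes=$bytes"
}
```

For example, check_converter /usr/local/bin/pdftotext sample.pdf - (sample.pdf is a placeholder for any PDF on the server; the trailing "-" asks pdftotext to write the text to stdout). A non-zero exit or zero bytes at the shell points at the binary or its libraries rather than at PhpDig.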
|
When running pdftotext from the shell, there was at first a problem with the glibc library. We decided to recompile pdftotext from the xpdf source, installed in /usr/local/bin, and now PDF indexing works fine! The settings in config.php are as follows:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

Running pstotext from the shell gives a Ghostscript error (exit code 1) and will definitely not work on our server. pdftotext is a good alternative. Thanks again to all members for the assistance! |
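A note on the '.txt' extension setting: when pdftotext is called with only an input file, it writes the text to a file next to the input instead of to stdout, which fits the config.php comment about binaries that write to filename.txt. The sketch below shows the naming rule as I understand it from the xpdf documentation. The function name expected_txt_name is hypothetical, and how PhpDig itself builds the file name is not shown in this thread, so treat this as an illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print the file name pdftotext is expected to write
# when called with an input file and no output argument.
# Assumed rule: a trailing .pdf or .PDF is replaced by .txt; for any
# other name, .txt is appended.
expected_txt_name() {
    case $1 in
        *.pdf|*.PDF) printf '%s.txt\n' "${1%.???}" ;;
        *)           printf '%s.txt\n' "$1" ;;
    esac
}
```

For example, expected_txt_name /tmp/report.pdf prints /tmp/report.txt. Running pdftotext on a test PDF from the shell and looking for that file is a quick way to confirm the behaviour on your own build.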