View Full Version : indexing pdf


Hoek
02-16-2004, 10:26 AM
I installed the pstotext binary, but PDF files are not being indexed: no green checkmark appears when indexing the site. Does the Ghostscript binary need to be installed on the server? And do I need to upgrade PHP (I use version 4.2.2 now)? I have read about problems with newer PHP versions indexing HTML tags. Is this problem solved in version 1.8?

vinyl-junkie
02-16-2004, 12:32 PM
Check this thread (http://www.phpdig.net/showthread.php?s=&threadid=516&highlight=pstotext). I believe it will answer your question. :)

Charter
02-16-2004, 02:37 PM
Hi. From http://research.compaq.com/SRC/virtualpaper/pstotext.html ...

pstotext is a program that works with Ghostscript (version 3.33 or later) to extract plain text from PostScript and PDF files (you should have Ghostscript 3.51 or later for PDF).

PHP version 4.2.2/3 seems to have an issue running pdftotext via exec, as in this (http://www.phpdig.net/showthread.php?threadid=522) thread, but I am not sure whether pstotext would have the same problem.

The PHP strip_tags function was replaced with a regular expression in version 1.6.3. Version 1.8.0 should not index HTML tags.

Hoek
02-20-2004, 07:20 AM
I have fixed the path to GS, set the permissions and modified the config.php to this:

// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND',true); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

and still no PDFs are indexed. What am I doing wrong?
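
For anyone following along, the configured binaries can also be checked from a shell before re-running the indexer. This is only a minimal sketch, assuming a POSIX shell; `report_binary` is a hypothetical helper name (not part of PhpDig), and the paths are the ones from the config above. Passing this check only shows that the file can be run, not that it produces any text.

```shell
#!/bin/sh
# report_binary is a hypothetical helper (not part of PhpDig): it prints
# whether a path exists and is executable for the current user.
report_binary() {
    if [ ! -e "$1" ]; then
        echo "$1: missing"
    elif [ ! -x "$1" ]; then
        echo "$1: not executable"
    else
        echo "$1: ok"
    fi
}

# Paths taken from the config.php settings above.
for bin in /usr/local/bin/catdoc /usr/local/bin/pstotext /usr/local/bin/xls2csv; do
    report_binary "$bin"
done
```

Ideally this is run as the user the web server runs as, since permissions can differ between accounts.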

Charter
02-20-2004, 06:09 PM
Hi. Try doing as in this (http://www.phpdig.net/showthread.php?threadid=522) thread. What onscreen output do you get?

Hoek
02-21-2004, 12:27 PM
I get the following output:

SITE : http://www.professioneel-handhaven.nl/
Paths to exclude :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

HTML <--- Status
1:http://www.professioneel-handhaven.nl/Bibliotheek/
(time : 00:00:05)
+ +
levels 1...


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

2:http://www.professioneel-handhaven.nl/Bibliotheek/documenten/kwaliteitscriteria_doe_je_voordeel_met_het_oordeel.pdf
(time : 00:00:16)



Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

3:http://www.professioneel-handhaven.nl/Bibliotheek/documenten/Goed_Verbeterd.pdf
(time : 00:00:21)

No link in temporary table

Still no PDF files get indexed.

Charter
02-23-2004, 02:10 PM
>> ...i use version 4.2.2. now...

Hi. The same issue and echo results are in this (http://www.phpdig.net/showthread.php?threadid=522) thread. If I remember correctly, there have been three cases of 4.2.2 not working and one case of 4.2.3 not working. Upgrading to a later version of PHP solved the problems.

Hoek
02-24-2004, 04:10 AM
Hello Charter, thank you for your help so far, but the problem still exists... I upgraded PHP to 4.3.4 and installed the pdftotext binary. Unfortunately, there is still no green checkmark for the indexed PDF files... Here is the on-screen output; I hope for new tips.

Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/www.professioneel-handhaven.nl/www/Zoeken/xpdf/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 1

3:http://www.professioneel-handhaven.nl/Bibliotheek/documenten/kwaliteitscriteria_doe_je_voordeel_met_het_oordeel.pdf
(time : 00:00:20)

Charter
02-24-2004, 01:32 PM
Hi. What happens when you run pdftotext from shell on a PDF file?

Hoek
02-25-2004, 03:42 AM
When running pdftotext from the shell, there was at first a problem with the glibc library. We decided to recompile from the xpdf source in /usr/local/bin, and now PDF indexing works fine! The settings in config.php are as follows:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

Running pstotext from the shell gives a Ghostscript error (exit code 1) and will definitely not work on our server. pdftotext is a good alternative.

Thanks again to all members for the assistance!
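
The shell check that Charter suggested can be sketched roughly as follows. This is a minimal sketch, assuming a POSIX shell; `check_converter` is a hypothetical helper (not part of PhpDig or xpdf) and `sample.pdf` is a placeholder file name. It runs the converter on one file with `-` as the output name, which pdftotext treats as stdout, and reports the exit code and whether any text came back, much like the "Result contains" and "Return value is" lines in the debug output above.

```shell
#!/bin/sh
# check_converter is a hypothetical helper: it runs a pdftotext-style
# converter on one file and reports the exit code and whether text was
# produced on stdout.
check_converter() {
    bin=$1
    file=$2
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin"
        return 2
    fi
    status=0
    # "-" as the output name asks pdftotext to write to stdout.
    out=$("$bin" "$file" - 2>/dev/null) || status=$?
    echo "exit code: $status"
    if [ -n "$out" ]; then
        echo "text extracted: yes"
    else
        echo "text extracted: no"
    fi
    return "$status"
}

# Example call (placeholder file name):
# check_converter /usr/local/bin/pdftotext sample.pdf
```

A non-zero exit code with no text, as in the output posted on 02-24, suggests the binary itself is failing (here, the glibc problem) rather than the PhpDig configuration.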