PhpDig.net

Hoek 02-16-2004 09:26 AM

indexing pdf
 
I installed the pstotext binary, but PDF files are not being indexed: no green checkmark appears when indexing the site. Does the Ghostscript binary need to be installed on the server? And do I need to upgrade PHP (I use version 4.2.2 now)? I have read about some problems with newer PHP versions and the indexing of HTML tags. Is this problem solved in version 1.8?

vinyl-junkie 02-16-2004 11:32 AM

Check this thread. I believe it will answer your question. :)

Charter 02-16-2004 01:37 PM

Hi. From http://research.compaq.com/SRC/virtu.../pstotext.html ...

pstotext is a program that works with Ghostscript (version 3.33 or later) to extract plain text from PostScript and PDF files (you should have Ghostscript 3.51 or later for PDF).

PHP version 4.2.2/3 seems to have an issue running pdftotext through exec, as in this thread, but I am not sure whether pstotext would have the same problem.

The PHP strip_tags function was replaced with a regular expression in version 1.6.3. Version 1.8.0 should not index HTML tags.

Hoek 02-20-2004 06:20 AM

I have fixed the path to GS, set the permissions and modified the config.php to this:

// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND',true); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

and still no PDF files are indexed. What am I doing wrong?

Charter 02-20-2004 05:09 PM

Hi. Try doing as in this thread. What onscreen output do you get?

Hoek 02-21-2004 11:27 AM

I get the following output:

SITE : http://www.professioneel-handhaven.nl/
Paths to exclude :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

HTML <--- Status
1:http://www.professioneel-handhaven.nl/Bibliotheek/
(time : 00:00:05)
+ +
levels 1...


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

2:http://www.professioneel-handhaven.n...et_oordeel.pdf
(time : 00:00:16)



Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

3:http://www.professioneel-handhaven.n..._Verbeterd.pdf
(time : 00:00:21)

No link in temporary table

Still no PDF files are indexed.

Charter 02-23-2004 01:10 PM

>> ...i use version 4.2.2. now...

Hi. The same issue and echo results are in this thread. If I remember correctly, there have been three cases of 4.2.2 not working and one case of 4.2.3 not working. Upgrading to a later version of PHP solved the problems.
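
For anyone following along, the checks behind echo output like the above can be approximated outside PhpDig. The sketch below is not PhpDig code: `probe_binary` and its report keys are made-up names for illustration, and it assumes `exec` is enabled on the server. It reports whether the binary exists, whether PHP considers it executable, and what `exec` hands back.

```php
<?php
// Hypothetical diagnostic helper (not part of PhpDig).
// Runs an external converter on one file and reports what PHP sees.
function probe_binary($binary, $file)
{
    $isFile = is_file($binary);
    $report = array(
        'exists'     => $isFile,
        'executable' => $isFile && is_executable($binary),
        'output'     => array(),
        'status'     => null, // stays null if the binary was never run
    );

    if ($report['executable']) {
        // 2>&1 folds error messages (for example a missing library)
        // into the captured output instead of discarding them
        $command = escapeshellarg($binary) . ' ' . escapeshellarg($file) . ' 2>&1';
        exec($command, $report['output'], $report['status']);
    }

    return $report;
}

// Example call (both paths are placeholders):
// print_r(probe_binary('/usr/local/bin/pdftotext', '/path/to/test.pdf'));
```

Folding STDERR into the output is the main point here. `exec` on its own only captures STDOUT, so an empty result array does not show whether the binary printed an error.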

Hoek 02-24-2004 03:10 AM

Hello Charter, thank you for your help so far, but the problem still exists... I upgraded PHP to 4.3.4 and installed the pdftotext binary. Unfortunately, there is still no green checkmark for the indexed PDF files... Here is the onscreen output; I hope for new tips.

Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/www.professioneel-handhaven.nl/www/Zoeken/xpdf/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 1

3:http://www.professioneel-handhaven.n...et_oordeel.pdf
(time : 00:00:20)

Charter 02-24-2004 12:32 PM

Hi. What happens when you run pdftotext from shell on a PDF file?

Hoek 02-25-2004 02:42 AM

When running pdftotext from the shell, there was first a problem with the glibc library. We decided to recompile pdftotext from the xpdf source and install it in /usr/local/bin, and now PDF indexing works fine! The settings in config.php are as follows:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

Running pstotext from the shell gives a Ghostscript error (exit code 1) and will definitely not work on our server. pdftotext is a good alternative.

Thanks again to all members for the assistance!
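
A note on the '.txt' extension in the working config: when pdftotext is given only a PDF name, it writes the text to a file of the same name with a .txt extension instead of printing to STDOUT. That matches the comment in config.php about binaries that write to filename.txt. The sketch below shows the two ways of collecting the text. The function names are hypothetical, and this is not a description of how PhpDig itself handles the file.

```php
<?php
// Hypothetical helpers (not PhpDig code) illustrating pdftotext's two output modes.

// Path pdftotext writes to when no text file is named: file.pdf -> file.txt
function default_text_path($pdfPath)
{
    return preg_replace('/\.(pdf|PDF)$/', '', $pdfPath) . '.txt';
}

// Command that leaves the text in default_text_path($pdfPath)
function command_to_file($binary, $pdfPath)
{
    return escapeshellarg($binary) . ' ' . escapeshellarg($pdfPath);
}

// Command that prints the text to STDOUT instead ('-' as the text file)
function command_to_stdout($binary, $pdfPath)
{
    return escapeshellarg($binary) . ' ' . escapeshellarg($pdfPath) . ' -';
}
```

With the first form, a caller that only reads STDOUT sees nothing even though the conversion succeeded, so the text has to be read from the .txt file afterwards.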


All times are GMT -8.