problem with pdftotext


freak
05-26-2004, 10:07 PM
hello there,

I'm having problems indexing PDFs. I have already read most of the posts here and couldn't find the cause of the problem.

I'm using Apache 2.0.45 + PHP 4.3.6 running on Windows 2000 SP4. I just downloaded xpdf-3.00-win32 and extracted the pdftotext.exe file.

This is my config file:

define('USE_IS_EXECUTABLE_COMMAND','0');
...
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\Apache Group\\Apache2\\htdocs\\phpdig\\bin\\pdftotext.exe');
define('PHPDIG_OPTION_PDF','');
...
define('PHPDIG_PDF_EXTENSION','.txt');

And this is what I get when I try to index a local site with one page that contains only one link to a PDF file.

I added some extra code for debugging...

--------------------------------------------------------------------------------
SITE : http://ivan02/
Exclude paths :
- @NONE@


Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\Apache Group\Apache2\htdocs\phpdig\bin\pdftotext.exe
Does parse pdf exist: 1
1:http://ivan02/test/
(time : 00:00:06)
+
level 1...


Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\Apache Group\Apache2\htdocs\phpdig\bin\pdftotext.exe
Does parse pdf exist: 1

Command is :C:\Apache Group\Apache2\htdocs\phpdig\bin\pdftotext.exe ../admin/temp/61216332.tmp

Result contains: Array ( )
Return value is: 1

2:http://ivan02/test/proy01.pdf
(time : 00:00:16)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://ivan02/test/
http://ivan02/test/proy01.pdf
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------

The spider finds the PDF file but doesn't extract any content from it. Also, there is no mark before link number 2: neither a "good mark" nor a "bad mark".

Could somebody please help me? I just don't know what's going on here.

Thanks!

PS: Please excuse my English!

Charter
06-02-2004, 07:20 AM
Hi. Perhaps the space in Apache Group is causing the command not to execute correctly. Try renaming Apache Group to ApacheGroup or try quoting the path in the PHPDIG_PARSE_PDF constant.
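
For the quoting option, a minimal sketch of what the config line might look like is below. It assumes PhpDig hands this constant to the shell more or less as-is, which the "Command is :" line in the debug output seems to suggest; I haven't verified that in the source. If the script also tests the same value with something like file_exists(), the added quotes could make that test fail, and renaming the directory would then be the simpler route.

```php
<?php
// Sketch only, not a confirmed fix: wrap the executable path in double
// quotes so the shell sees "C:\Apache Group\..." as one token despite
// the space in the directory name.
// Inside a single-quoted PHP string, \\ yields one backslash and the
// double quotes are kept literally as part of the value.
define('PHPDIG_PARSE_PDF', '"C:\\Apache Group\\Apache2\\htdocs\\phpdig\\bin\\pdftotext.exe"');
```

With this setting the debug line should read `Command is :"C:\Apache Group\Apache2\htdocs\phpdig\bin\pdftotext.exe" ../admin/temp/61216332.tmp` rather than the unquoted form shown in the first post.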