PDF indexing

lelandv

12-07-2003, 10:05 AM

Originally posted by Charter
Hi. Delete anything in the temp directory, and then try setting the following in the config file:

define('PHPDIG_PDF_EXTENSION','.txt');

Hi. I have a similar problem to the other poster. The difference here is that the debug test does successfully detect that it's a PDF file, creates the temporary file, and then promptly deletes it again.

I have added the define above as per the previous problem, but the contents of the PDF are still not indexed. I'm using "pdftohtml" with a wrapper which removes all HTML formatting, resulting in PDF -> TEXT (syntax: pdf2txt file.pdf, which writes plain text to STDOUT).

Of course, there is no hint of the PDF's contents in the database, so they are not indexed; only the filename is (which is not really what we want here).

Any help would be appreciated.

Leland

Charter

12-07-2003, 10:12 AM

Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.
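
A minimal sketch of the distinction described above, written in Python rather than PhpDig's own PHP. It is not PhpDig's actual code: the function name `extract_text`, the argument layout, and the way the output file name is built (input path plus extension) are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def extract_text(converter_cmd, pdf_path, extension=""):
    """Run a converter on pdf_path and return the text it produced.

    extension == ""  -> the converter is assumed to print text to STDOUT.
    extension != ""  -> the converter is assumed to write <pdf_path><extension>.
    Both conventions are assumptions of this sketch, not PhpDig internals.
    """
    result = subprocess.run(
        [*converter_cmd, str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the converter exits with an error
    )
    if extension == "":
        # STDOUT case: the captured output is the extracted text.
        return result.stdout
    # File case: read the file the converter is assumed to have written.
    return Path(str(pdf_path) + extension).read_text()
```

With a STDOUT converter the extension stays empty; with a converter that writes a .txt file next to its input, passing ".txt" tells the caller where to look.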

lelandv

12-07-2003, 10:18 AM

Originally posted by Charter
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.

Hiya. I've done this, but the PDF file is still not indexed, just the filename.

Am I missing something here?

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');

The actual PDF file is linked from another page, and the server logs do show the crawler retrieving the PDF document, yet it's still not indexed at all.

taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"
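
Log lines like the two above can be checked programmatically. The following is a rough sketch, assuming the server writes Apache's combined log format; the names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re

# One combined-format entry: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Apache logs "-" when no body was sent; treat that as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

Parsed this way, the HEAD request carries the PhpDig/1.6.2 user agent with 0 bytes, while the GET that transfers the 1262188-byte PDF is made with the user agent PHP/4.2.2.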

Leland

Charter

12-07-2003, 10:41 AM

Originally posted by lelandv
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"

Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

lelandv

12-07-2003, 11:26 AM

Originally posted by Charter
Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

Debian Linux for the OS with Apache for the server.

(Please note that the latest version stated on freshmeat/sourceforge is 1.6.2.. might want to update that when you get a chance.)

Will try 1.6.5 and let you know how it goes :)

Leland

lelandv

12-07-2003, 11:39 AM

hmm..

Version 1.6.5 returns a 404 error when submitting a search from the search page.

laptop1.discpro.org - - [07/Dec/2003:20:36:39 +0000] "GET /phpdig/index.php?template_demo=.%2Ftemplates%2Fphpdig.html&site=0&path=&result_page=index.php&query_string=transition&limite=10&option=start HTTP/1.1" 404 1146 "http://www.discpro.org/phpdig/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

lelandv

12-07-2003, 11:41 AM

disregard that... brain fart

Just tried it with the new version, the PDF content is still not indexed :(

L.

Charter

12-07-2003, 12:02 PM

Hi. All I can find is a Windows version of pdf2txt. Are you using xpdf? Unless renamed, it should be pdftotext that comes with xpdf.

If you have shell access and are allowed to run locate, just type locate pdftotext to find the path.
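
Where locate is not available, a PATH lookup is a rough alternative. This is a sketch only: `find_converter` is a hypothetical helper, and unlike locate it searches just the directories on PATH, so a binary installed elsewhere will not be found.

```python
import shutil


def find_converter(candidates=("pdftotext", "pdftohtml")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    print(find_converter() or "no converter found on PATH")
```

The path it prints is the kind of value that goes into PHPDIG_PARSE_PDF.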

The freshmeat listing was updated yesterday. :)

lelandv

12-07-2003, 12:09 PM

No, I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility I have available for this is pdftohtml, which writes HTML to STDOUT. I therefore call the binary from a wrapper that removes the HTML tags, leaving only the plain text. It's just a simple 4-line Perl script:

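#!/usr/bin/perl

# Convert the PDF named on the command line to HTML on STDOUT.
$filename = shift;
$output = `/usr/local/bin/pdftohtml -i -stdout -noframes "$filename"`;

# Strip one tag at a time; a greedy <.*> would also delete text between tags on the same line.
$output =~ s/<[^>]*>//g;
print $output;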

As a result, to get the text out of a PDF, simply run "pdf2txt myfile.pdf" at the command line, and it writes the text to STDOUT.
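
When tags are stripped with a regular expression, the choice of pattern matters. A greedy pattern such as <.*> removes everything from the first "<" to the last ">" on a line, text included, whereas <[^>]*> removes one tag at a time. A small sketch comparing the two (the function names are made up for the comparison):

```python
import re


def strip_tags_greedy(html):
    """Greedy match: swallows everything between the first '<' and last '>' per line."""
    return re.sub(r"<.*>", "", html)


def strip_tags(html):
    """Match a single tag at a time, leaving the text between tags intact."""
    return re.sub(r"<[^>]*>", "", html)


if __name__ == "__main__":
    line = "<b>Engine</b> Management<br>"
    print(repr(strip_tags_greedy(line)))  # the text is lost along with the tags
    print(repr(strip_tags(line)))         # only the tags are removed
```

Neither pattern is a real HTML parser; both are only adequate for simple converter output of this kind.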

Noted on the freshmeat site... guess I should have waited a day before downloading it then ;)

Really need to sort out this PDF indexing issue though... it's annoying, and I really need the search engine to be able to search the contents of a PDF file. There are several other spiders/search engines written in PHP, BUT none of them can do indexing and searching as comprehensive as PhpDig's... :(

Leland

Charter

12-07-2003, 12:46 PM

Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?

Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.

lelandv

12-07-2003, 12:52 PM

Originally posted by Charter
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?

Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.

The permissions on the wrapper are 0755 (executable), and its first line is #!/usr/bin/perl, which makes the system run it with perl.
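
Both conditions, the execute bits and the interpreter named on the first line, can be inspected without running the script. A minimal sketch, assuming a POSIX system; `describe_script` is a hypothetical helper:

```python
import os
import stat


def describe_script(path):
    """Report the permission bits and shebang interpreter of a script."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    with open(path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    # A shebang line starts with "#!" followed by the interpreter path.
    interpreter = first_line[2:].strip() if first_line.startswith("#!") else None
    return {
        "mode": format(mode, "04o"),           # e.g. "0755"
        "any_execute_bit": bool(mode & 0o111),
        "interpreter": interpreter,
        # The shebang only helps if the interpreter really lives at that path.
        "interpreter_exists": bool(interpreter)
        and os.path.exists(interpreter.split()[0]),
    }
```

If the interpreter path in the shebang does not exist on a given machine, the script cannot be launched directly there even with 0755 permissions.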

For example, if you do it from the command line itself:

leland@taranta:~/public_html/pdftest> /usr/local/bin/pdf2txt InstrumentPilot39.pdf

Engine Management
1
Intelligence Reports
2
Bashing the Beam
6
European Flight Planning
8
Dew Point Review
10
PPL/IR Europe Web Site
12
14
Bert Maes and I attended the engine efficiency and many others. It was very

<snip>

---
Having said that, I've just added a little hook in the wrapper to detect if the wrapper has even been called, but it looks like the spider isn't even attempting to use it.

Despite the settings in config.php:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');

The externals are called with "exec", are they not? If so, the spider should at least trip the hook, but it looks as if it's not even getting that far.
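
The kind of hook mentioned above can be as small as appending a line to a log file whenever the wrapper starts. The sketch below is in Python rather than the wrapper's Perl, and the function name and log format are assumptions, not what was actually used here.

```python
import datetime
import os


def record_invocation(log_path, argv):
    """Append one line per call so it is visible whether the wrapper ever ran."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} pid={os.getpid()} args={argv!r}\n")
```

If the log file stays empty after a spider run, the external command was never executed, which points at the caller rather than at the converter.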

L.

Charter

12-07-2003, 01:03 PM

Hi. Yes, exec is being used. I just tried your Perl program on my OS (Linux/Apache), but there #!/usr/bin/perl does not force execution by perl. I'll play around some more with this.

In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

lelandv

12-07-2003, 01:11 PM

Originally posted by Charter
Hi. Yes, exec is being used. I just tried your Perl program on my OS (Linux/Apache), but there #!/usr/bin/perl does not force execution by perl. I'll play around some more with this.

You have to, of course, make sure that the perl interpreter is in the right place ;)

In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

Did this as you suggested... the file contents are still not indexed, just the filename. It's as if it's not even bothering to look inside the file if it's a PDF.

Leland

lelandv

12-07-2003, 01:16 PM

Just looking at the output when running the spider:

SITE : http://www.discpro.org/
Exclude paths :
- @NONE@
1:http://www.discpro.org/
(time : 00:00:01)
+ + + + + + + + +
level 1...
2:http://www.discpro.org/pdftest/InstrumentPilot39.pdf
(time : 00:00:02)

3:http://www.discpro.org/?mode=pgpkey
(time : 00:00:02)

<etc>

#3 has the checkmark next to it; #2 doesn't.
Am I to presume that it only indexed the filename and not the contents of the file?

(It also seemed to finish a little TOO quickly, since it takes at least a few seconds just to convert the PDF to HTML or text. That tells me it's not even executing the external binary.)

Leland

Charter

12-07-2003, 01:20 PM

Hi. Try a search on some words in the PDF file. In the search results is there a result for the PDF file? Also, I am on chat right now if you'd rather chat through this.

lelandv

12-08-2003, 04:23 PM

Just for information (in case anyone else comes across similar problems).. phpdig will NOT work on PHP version 4.2.2 -- this was the cause of the problems. Upgrading to 4.3.4 solved the problem. All working now!
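
For anyone checking their own setup against the versions named above, dotted version numbers are best compared part by part rather than as strings. A small sketch: `needs_upgrade` is a hypothetical helper, and the known-good value is only the version reported working in this thread; which versions between 4.2.2 and 4.3.4 work is not established here.

```python
def version_tuple(text):
    """Turn a purely numeric dotted version like '4.3.4' into (4, 3, 4)."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(current, known_good="4.3.4"):
    """True if current is older than the version reported working in this thread."""
    return version_tuple(current) < version_tuple(known_good)


if __name__ == "__main__":
    for version in ("4.2.2", "4.3.4", "4.10.0"):
        print(version, needs_upgrade(version))
```

Comparing tuples of integers avoids the trap of string comparison, under which "4.10.0" would sort before "4.3.4".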

Serves me right for being complacent with my upgrades! ;)

Regards,

Leland