Thread: PDF indexing
Old 12-07-2003, 12:09 PM   #9
lelandv
No, I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility I have available for this is pdftohtml, which writes its output as HTML (to STDOUT). I therefore call the binary from a wrapper that removes the HTML tags, leaving only the plain text... just a short Perl script:

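#!/usr/bin/perl
use strict;
use warnings;

my $filename = shift or die "usage: pdf2txt file.pdf\n";
# List-form open bypasses the shell, so odd characters in the file name are safe.
open(my $pdf, '-|', '/usr/local/bin/pdftohtml', '-i', '-stdout', '-noframes', $filename)
    or die "cannot run pdftohtml: $!\n";
my $output = do { local $/; <$pdf> };    # slurp all of the HTML
close($pdf);

# [^>]* stops at the first '>', so text between two tags on one line is kept.
$output =~ s/<[^>]*>//g;
print $output;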

As a result, to get the text out of a PDF, simply run "pdf2txt myfile.pdf" at the command line, and it writes the text to STDOUT.
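
One thing worth checking in a wrapper like this is how the tag-stripping pattern behaves when a line carries more than one tag. Here is a minimal sketch comparing a greedy `<.*>` with a bounded `<[^>]*>`. The sample line is made up rather than captured from pdftohtml, `strip_tags` is a hypothetical helper name, and the entity list is only an assumption about what the HTML might contain:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove tags with a bounded pattern, then decode a few common entities.
# The entity list is an assumption; extend it to match the real output.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;      # '[^>]*' stops at the first '>'
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;       # last, so "&amp;lt;" becomes "&lt;", not "<"
    return $html;
}

# Made-up sample line, not captured from pdftohtml.
my $sample = "<b>Annual</b> report &amp; <i>accounts</i><br>";

# Greedy pattern for comparison: '.*' runs to the last '>' on the line,
# so the text between the tags is deleted along with them.
(my $greedy = $sample) =~ s/<.*>//g;

print "greedy:  [$greedy]\n";                      # prints "greedy:  []"
print "bounded: [", strip_tags($sample), "]\n";    # prints "bounded: [Annual report & accounts]"
```

Since an indexer only sees whatever text survives this step, it is probably worth running a check like this against a real page of output before pointing the spider at the wrapper.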

Noted on the freshmeat site... guess I should have waited a day before downloading it then

I really need to sort out this PDF indexing issue though... it's annoying, and I need the search engine to be able to search on the contents of a PDF file. There are several other spiders/search engines available in PHP, BUT none of them can do indexing and searching as comprehensively as phpdig...

Leland