PhpDig.net - View Single Post

lelandv · 12-07-2003, 12:09 PM

No, I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility I have available for this is pdftohtml, which writes its output as HTML (to STDOUT). I therefore put a wrapper around it and call the binary from the wrapper. The wrapper removes the HTML tags, leaving only the plain text... just a simple 4-line Perl script:

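#!/usr/bin/perl

# First command-line argument is the PDF to convert
$filename = shift;
# Quote the name so paths containing spaces survive the shell
$output = `/usr/local/bin/pdftohtml -i -stdout -noframes "$filename"`;

# Strip one tag at a time; a greedy <.*> would also eat the text between tags
$output =~ s/<[^>]*>//g;
print $output;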

As a result, to get the text out of a PDF, simply run "pdf2txt myfile.pdf" (the wrapper script) at the command line, and it prints the text to STDOUT.
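
One caveat when stripping tags with a regex: a greedy pattern such as <.*> matches from the first "<" to the last ">" on a line, so any text sitting between two tags on that line disappears with them. The sketch below compares that with a per-tag pattern and also decodes a few common HTML entities, on the assumption that the converter's output may contain them. The strip_tags helper and the sample string are hypothetical, made up for illustration, and are not part of the wrapper above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original wrapper): remove HTML tags
# and decode a handful of common entities.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;   # one tag at a time, never past the next '>'
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;    # last, so "&amp;lt;" ends up as "&lt;", not "<"
    return $html;
}

# Made-up sample line standing in for converter output
my $sample = '<b>Hello</b> <i>world</i> &amp; more<br>';

# Greedy version for comparison: everything from the first '<' to the last '>' goes
(my $greedy = $sample) =~ s/<.*>//g;

print "greedy:  [$greedy]\n";
print "per-tag: [", strip_tags($sample), "]\n";
```

The greedy line comes out empty for this sample because the line both starts and ends with a tag. A regex is still only a rough approximation; for messy HTML a real parser would be the safer choice.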

Noted on the freshmeat site... guess I should have waited a day before downloading it, then.

I really need to sort out this PDF indexing issue though... it's annoying, and I need the search engine to be able to search the contents of PDF files. There are several other spiders/search engines available in PHP, but none of them can index and search as comprehensively as PhpDig...

Leland
