I wrote a mod for indexing pdf without an external binary!!!

07-09-2004, 12:51 PM
Hello people

I have written a modification that lets me index PDF files.

The special thing is:
You don't need an external binary like ps2txt or another UNIX tool.

The mod sends the PDF to Adobe, which converts it to HTML code.
After that, this code is indexed by phpDig.

For more information, please visit my homepage.

07-09-2004, 07:11 PM
Is your robot_functions.php meant to completely replace the one that comes with phpdig? It's hard to tell since your site isn't in English. ;)

07-10-2004, 01:42 AM
I had to change code at 4 or 5 positions in the already existing file robot_functions.

So the easiest way is to replace this file (if you haven't made changes to this file yourself; otherwise make a backup!).

In the header of the file, I have listed all the changes I made.

The English part of my homepage is coming soon...
(Or does anyone want to do that?)

Sorry for my bad English ;)

07-10-2004, 04:20 AM
Please download and use only the current version from my site.
(The older version has a bug.)
I made it for phpDig V1.8.1. It won't work with older versions of phpDig.

07-10-2004, 05:33 AM
Hi. From the Adobe Terms of Use located here (http://www.adobe.com/misc/copyright.html):

In addition, you agree not to use any data mining, robots, or similar data gathering and extraction methods in connection with the Site.

Try one of the following external binaries for use with PhpDig.

catdoc (http://www.45.free.net/~vitus/ice/catdoc/)
wp2html (http://www.res.bbsrc.ac.uk/wp2html/)
xls2csv (http://www.45.free.net/~vitus/ice/catdoc/)
xlhtml (http://chicago.sourceforge.net/xlhtml/)
pstotext (http://research.compaq.com/SRC/virtualpaper/pstotext.html)
pdftotext (http://www.foolabs.com/xpdf/)
ppt2text (http://www.spocom.com/users/gjohnson/mutt/)
ppthtml (http://packages.debian.org/stable/utils/ppthtml)

Please help keep PhpDig an honest and viable open source product. Thanks.

07-10-2004, 07:35 AM
Hello charter

;( Sorry, I didn't read Adobe's terms.
I was very happy to have a solution for this sch*** PDF problem.
Oh, I really hate Adobe!!!

Because I can't install ps2txt or pdf2html on my webspace, I have to look for another solution.

Could I send the PDF to another server (a friend's or someone else's) that converts it for me with pdf2html and then sends it back to me?
I don't have enough Unix experience, so I'm not sure.

Or does anybody know a PDF-to-text converter written in Perl (CGI)?

Another solution would be to send it to Adobe once and then save the result in a database until the PDF's mtime changes. With this, I think Adobe couldn't object!!!
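
The caching part of that idea does not depend on which converter produces the text. A minimal sketch, assuming the converter is passed in as a callable and using a plain PHP array as a stand-in for the database; the name cached_pdf_text is made up for illustration and is not part of phpDig:

```php
<?php
// Hypothetical helper: returns the cached text for $pdfPath and calls
// $convert again only when the file's modification time has changed.
// $cache is a plain array standing in for a database table.
function cached_pdf_text($pdfPath, array &$cache, callable $convert)
{
    // Drop PHP's cached stat data so a changed mtime is noticed
    // even inside a long-running script.
    clearstatcache(true, $pdfPath);
    $mtime = filemtime($pdfPath);
    if ($mtime === false) {
        throw new RuntimeException("Cannot read mtime of $pdfPath");
    }

    // Cache hit: same file, same modification time.
    if (isset($cache[$pdfPath]) && $cache[$pdfPath]['mtime'] === $mtime) {
        return $cache[$pdfPath]['text'];
    }

    // Cache miss: convert once and remember the result with its mtime.
    $text = $convert($pdfPath);
    $cache[$pdfPath] = array('mtime' => $mtime, 'text' => $text);
    return $text;
}
```

Calling it twice for an unchanged file runs the converter only once; after the file is modified, the next call converts it again.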

P.S. I really like phpDig, but without PDF support I can only half use it.


07-10-2004, 08:13 AM
I know nothing about Perl, but a quick search on Google yielded this (http://user.cs.tu-berlin.de/~ccorn/software/rpm/packages/perl-Text-PDF.html). If that's not a workable solution for you, just do a search on "pdf to text perl script" (without the quotes).

Hope this helps. :)

07-10-2004, 09:01 AM
Hi. At FooLabs (http://www.foolabs.com/xpdf/download.html) is a mirror to PlanetMirror (http://public.planetmirror.com/pub/xpdf/) where you can find compiled versions of pdftotext.

Go to PlanetMirror (http://public.planetmirror.com/pub/xpdf/) and download xpdf-3.00-linux.tar.gz (assuming linux is your operating system).

Unzip xpdf-3.00-linux.tar.gz and extract only the pdftotext file (it's already been compiled and is a binary file).

FTP just the pdftotext file in binary mode to your account.

Once the file has been uploaded, change its permission to rwxr-xr-x (755 permission).

Now in the PhpDig config file, set the following:

define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/the/full/path/to/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between

Also be sure to set the following in the PhpDig config file too:

define('PHPDIG_PDF_EXTENSION','.txt'); // don't forget the period in .txt

Give PhpDig a whirl and see if it indexes PDF files.

From the admin panel of PhpDig version 1.8.1, just type in the link to a PDF file, and set search depth to zero and set links per to one, to test pdftotext on the one PDF file.
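
Before spidering, it can save time to confirm that the value given to PHPDIG_PARSE_PDF really names an executable file. A minimal sketch, not part of PhpDig; the function name check_pdf_parser and its status strings are made up for illustration:

```php
<?php
// Hypothetical helper, not part of PhpDig: classifies the path that would
// be passed to define('PHPDIG_PARSE_PDF', ...).
function check_pdf_parser($path)
{
    if (!file_exists($path)) {
        return 'missing';          // nothing exists at this path
    }
    if (is_dir($path)) {
        return 'is_directory';     // path stops at the folder, file name left off
    }
    if (!is_executable($path)) {
        return 'not_executable';   // execute bit not set for this user
    }
    return 'ok';
}

// Example call; the path is the placeholder from the config lines above.
echo check_pdf_parser('/the/full/path/to/pdftotext'), "\n";
```

Anything other than 'ok' points at the path or the 755 permission rather than at PhpDig itself.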

07-10-2004, 11:26 AM
Thanks for your tips, charter.

I followed your explanations.
(First I restored all phpDig files to their originals) ;)

Then I changed config.php like you said.
For the path, I used /home/ruinelli/public_html/cgi-bin
into which I also moved the file pdftotext (1 MB).

But I think I don't have permission to execute binaries in this dir!!!

Then I ran the spider with http://testdomain.ruinelli.ch/gpl.pdf
It spiders, but no keywords are put in the database. ;(

I think the problem is that the file pdftotext has to be in a bin folder like /bin or /usr/local/bin, to which I don't have access.

You can test it at: http://www.ruinelli.ch/phpdig/admin/index.php

Read about the problem at: http://forums.devshed.com/archive/t-121054
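
When the spider runs but nothing is indexed, running the converter by hand from PHP helps tell a permission problem from a wrong path. A minimal sketch, assuming the host allows exec() (many shared hosts disable it); run_converter is a made-up name, not a PhpDig function:

```php
<?php
// Hypothetical helper: runs "$binary $file" through the shell and reports
// the exit status plus anything printed to stdout or stderr.
function run_converter($binary, $file)
{
    // Quote both arguments so spaces or odd characters are passed safely.
    $cmd = escapeshellarg($binary) . ' ' . escapeshellarg($file) . ' 2>&1';
    $lines = array();
    $status = 0;
    exec($cmd, $lines, $status);
    // On POSIX shells a status of 126 usually means "found but not
    // executable" and 127 means "command not found".
    return array('status' => $status, 'output' => implode("\n", $lines));
}
```

Passing the configured pdftotext path and a local PDF file, then printing the returned status and output, shows whether the binary starts at all in that directory.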

07-10-2004, 11:38 AM
Originally posted by caco3
read the problem @: http://forums.devshed.com/archive/t-121054

Thanks for posting that link. I've had the luxury of being lazy and not having to figure out how to index pdf documents for my site. I do have them, but they are strictly for the purpose of a printer friendly version of certain documents which are also in HTML format on my site. ;)

07-10-2004, 11:44 AM
Hi. Make a new directory called binaries and move the pdftotext to this directory. Make sure pdftotext still has 755 permission. Then set the following in the PhpDig config file:
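
The settings this post refers to are not preserved here. Going by the reply that follows (the path has to end in the file name pdftotext), the relevant line presumably looked something like the sketch below; the directory is an assumption pieced together from the path mentioned earlier in the thread.

```php
// Assumed location: a "binaries" directory under the web root named earlier.
// The value must be the full path to the file itself, not just its folder.
define('PHPDIG_PARSE_PDF','/home/ruinelli/public_html/binaries/pdftotext');
```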


07-10-2004, 12:08 PM
yeeeees, it works!!!!!

I forgot the filename pdftotext in the path ;(
But now it works.

Thanks a lot!!!

I read so many explanations, but I couldn't get it to work with any of them.

Now I can send my mod to /dev/null ;)

I think it would be nice if the docs for phpDig explained more.

Greets, CaCO3 [a really happy man with a brilliant search engine on his page ;) ]