PhpDig.net

07-09-2004, 12:51 PM   #1
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
I wrote a mod for indexing PDFs without an external binary!

Hello everyone,

I have written a modification that lets PhpDig index PDF files.

What makes it special:
You don't need an external binary such as ps2txt or any other UNIX tool.

The mod sends the PDF to Adobe, which converts it to HTML.
PhpDig then indexes that HTML.

For more information, please visit my homepage:
<removed>
07-09-2004, 07:11 PM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Is your robot_functions.php meant to completely replace the one that comes with phpdig? It's hard to tell since your site isn't in English.
07-10-2004, 01:42 AM   #3
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
I had to change the code in four or five places in the existing robot_functions file.

So the easiest way is to replace that file (provided you haven't changed it yourself; otherwise, make a backup first!).

I have listed all my changes in the file's header.

The English part of my homepage is coming soon...
(Or would anyone like to do the translation?)

Sorry for my bad English.
07-10-2004, 04:20 AM   #4
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
Please download and use only the current version from my site.
(The older version has a bug.)
I made it for PhpDig 1.8.1. It won't work with older versions of PhpDig.
07-10-2004, 05:33 AM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. From the Adobe Terms of Use located here:
Quote:
In addition, you agree not to use any data mining, robots, or similar data gathering and extraction methods in connection with the Site.
Try one of the following external binaries for use with PhpDig.

Please help keep PhpDig an honest and viable open source product. Thanks.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
07-10-2004, 07:35 AM   #6
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9

Hello Charter,

;( Sorry, I didn't read Adobe's terms.
I was very happy to have a solution for this sch*** PDF problem.
Oh, I really hate Adobe!!!

Because I can't install ps2txt or pdf2html on my webspace, I have to look for another solution.

Could I send the PDF to another server (a friend's, for example) that converts it with pdf2html and sends the result back to me?
I don't have much UNIX experience, so I'm not sure.

Or does anybody know a PDF-to-text converter written in Perl (CGI)?

Another solution would be to send it to Adobe once and then store the result in a database until the file's mtime changes. That way, I think Adobe couldn't object!
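
The "convert once, then reuse until the mtime changes" idea can be sketched independently of which converter is used. This is only an illustration under stated assumptions: `cached_conversion()` and its cache layout are hypothetical, it stores results in plain files rather than in a database, and it assumes the PDF is a local file whose modification time can be read with `filemtime()`.

```php
<?php
// Sketch only: cached_conversion() and its cache layout are hypothetical.

// Return the converted text for $file, re-running $convert only when the
// file's modification time differs from the one stored in the cache.
function cached_conversion($file, $cacheDir, $convert)
{
    $mtime = filemtime($file);
    $cacheFile = $cacheDir . '/' . md5($file) . '.cache';

    if (is_file($cacheFile)) {
        $entry = unserialize(file_get_contents($cacheFile));
        if (is_array($entry) && $entry['mtime'] === $mtime) {
            return $entry['text']; // unchanged since the last conversion
        }
    }

    // Cache miss or stale entry: convert again and remember the new mtime.
    $text = call_user_func($convert, $file);
    file_put_contents(
        $cacheFile,
        serialize(array('mtime' => $mtime, 'text' => $text))
    );
    return $text;
}
```

Here `$convert` is any callable that takes a file path and returns text, so the expensive conversion runs only the first time and again after the file has been modified.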



P.S. I really like PhpDig, but without PDF support I can only half use it.

Greetings
CaCO3
07-10-2004, 08:13 AM   #7
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I know nothing about Perl, but a quick search on Google yielded this. If that's not a workable solution for you, just do a search on "pdf to text perl script" (without the quotes).

Hope this helps.
07-10-2004, 09:01 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. FooLabs has a mirror link to PlanetMirror, where you can find compiled versions of pdftotext.

Go to PlanetMirror and download xpdf-3.00-linux.tar.gz (assuming linux is your operating system).

Unzip xpdf-3.00-linux.tar.gz and extract only the pdftotext file (it's already been compiled and is a binary file).

FTP just the pdftotext file in binary mode to your account.

Once the file has been uploaded, change its permissions to rwxr-xr-x (755).

Now in the PhpDig config file, set the following:
PHP Code:
define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/the/full/path/to/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between
Also be sure to set the following in the PhpDig config file too:
PHP Code:
define('PHPDIG_PDF_EXTENSION','.txt'); // don't forget the period in .txt 
Give PhpDig a whirl and see if it indexes PDF files.

From the admin panel of PhpDig version 1.8.1, just type in the link to a PDF file, and set search depth to zero and set links per to one, to test pdftotext on the one PDF file.
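
As a check outside the admin panel, the binary can also be run by hand from a small PHP script. The following is only a sketch: `build_pdf_command()` and `pdf_to_text()` are made-up helper names, not part of PhpDig, and the script does not claim to reproduce how PhpDig itself calls the converter. It assumes the host allows `exec()` and that pdftotext is invoked as `pdftotext [options] input.pdf output.txt`.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of PhpDig.

// Build the shell command: <binary> [options] <pdf> <txt>.
// $options is passed through unescaped, like a PHPDIG_OPTION_PDF string.
function build_pdf_command($binary, $options, $pdfFile, $txtFile)
{
    $parts = array(escapeshellarg($binary));
    if (trim($options) !== '') {
        $parts[] = trim($options);
    }
    $parts[] = escapeshellarg($pdfFile);
    $parts[] = escapeshellarg($txtFile);
    return implode(' ', $parts);
}

// Run the converter; return the extracted text, or false on failure.
function pdf_to_text($binary, $options, $pdfFile)
{
    $txtFile = tempnam(sys_get_temp_dir(), 'pdf');
    if ($txtFile === false) {
        return false;
    }
    $output = array();
    $status = 1;
    $command = build_pdf_command($binary, $options, $pdfFile, $txtFile);
    exec($command . ' 2>&1', $output, $status);

    // A non-zero exit status means the binary was not found or failed.
    $text = ($status === 0) ? file_get_contents($txtFile) : false;
    unlink($txtFile);
    return $text;
}
```

For example, `var_dump(pdf_to_text('/the/full/path/to/pdftotext', '', '/tmp/gpl.pdf'));` (with `/tmp/gpl.pdf` standing in for any PDF on the server) shows either the extracted text or `false` if the command could not be run successfully.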
07-10-2004, 11:26 AM   #9
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
Thanks for your tips, Charter.

I followed your explanations.
(First I restored all PhpDig files to their originals.)

Then I changed config.php as you said.
For the path, I used /home/ruinelli/public_html/cgi-bin
and I also moved the pdftotext file (1 MB) into that directory.

But I think I don't have permission to execute binaries in this directory!

Then I ran the spider on http://testdomain.ruinelli.ch/gpl.pdf
It spiders, but no keywords are put into the database. ;(


I think the problem is that the pdftotext file has to be in a bin folder such as /bin or /usr/local/bin, to which I don't have access.

You can test it at: http://www.ruinelli.ch/phpdig/admin/index.php


@vinyl-junkie:
Read about the problem at: http://forums.devshed.com/archive/t-121054

Last edited by caco3; 07-10-2004 at 11:28 AM.
07-10-2004, 11:38 AM   #10
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Quote:
Originally posted by caco3
@vinyl-junkie:
read the problem @: http://forums.devshed.com/archive/t-121054
Thanks for posting that link. I've had the luxury of being lazy and not having to figure out how to index pdf documents for my site. I do have them, but they are strictly for the purpose of a printer friendly version of certain documents which are also in HTML format on my site.
07-10-2004, 11:44 AM   #11
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Make a new directory called binaries and move pdftotext into it. Make sure pdftotext still has 755 permissions. Then set the following in the PhpDig config file:
PHP Code:
define('PHPDIG_PARSE_PDF','/home/ruinelli/public_html/binaries/pdftotext'); 
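
Note that the path above names the pdftotext file itself, not just the directory that contains it. A minimal sketch for telling these cases apart follows; `check_parser_path()` is a hypothetical helper, not part of PhpDig, and the status strings are arbitrary.

```php
<?php
// Sketch only: check_parser_path() is a hypothetical helper, not part of PhpDig.

// Describe what is wrong (if anything) with a configured converter path.
function check_parser_path($path)
{
    if (!file_exists($path)) {
        return 'missing';        // nothing exists at this path
    }
    if (is_dir($path)) {
        return 'directory';      // path stops at the folder, not the binary
    }
    if (!is_executable($path)) {
        return 'not executable'; // file exists but lacks execute permission
    }
    return 'ok';
}
```

Calling `echo check_parser_path(PHPDIG_PARSE_PDF);` after the config file has been loaded would report `directory` when the setting stops at a folder, and `not executable` when the 755 permission has been lost.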
07-10-2004, 12:08 PM   #12
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
Yes, it works!

I had left the filename pdftotext out of the path ;(
But now it works.

Thanks a lot!

I read so many explanations, but I couldn't get it to work with any of them.

Now I can send my mod to /dev/null.

I think it would be nice if the PhpDig documentation explained more.

Greetings, CaCO3 [a really happy man with a brilliant search engine on his page]