PhpDig.net

12-07-2003, 10:05 AM   #1
lelandv
Green Mole
Join Date: Dec 2003
Posts: 11
Quote:
Originally posted by Charter
Hi. Delete anything in the temp directory, and then try setting the following in the config file:

define('PHPDIG_PDF_EXTENSION','.txt');

Hi. I have a similar problem to the other poster. The difference here is that, in the debug test, it does successfully detect that the file is a PDF, creates the temporary file, and then promptly deletes it again.

I have added the define above, as suggested for the previous problem, but the contents of the PDF are still not indexed. I'm using "pdftohtml" with a wrapper which removes all HTML formatting, resulting in PDF -> TEXT (syntax: pdf2txt file.pdf, which writes plain text to STDOUT).

In the database there is no hint of the contents of the PDF file, so they are not indexed; only the filename itself is there, which is not really what we want here.

Any help would be appreciated.



Leland

12-07-2003, 10:12 AM   #2
Charter
Head Mole
Join Date: May 2003
Posts: 2,539
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.
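
To decide which value PHPDIG_PDF_EXTENSION needs, it can help to confirm where a given converter actually sends its text. The following is a minimal shell sketch, not part of PhpDig: output_mode is a hypothetical helper that runs a converter on one file and reports whether anything arrived on STDOUT or whether a new file appeared next to the input. It assumes the converter takes the input path as its only argument.

```shell
#!/bin/sh
# Hypothetical helper, not part of PhpDig.
# Usage: output_mode /path/to/converter /path/to/file.pdf
# Prints "stdout" if the converter wrote anything to STDOUT, "file" if it
# created a new file beside the input, or "none" if it did neither.
output_mode() {
    converter=$1
    input=$2
    dir=$(dirname "$input")

    # Directory listing before the run.
    before=$(ls -A "$dir")
    # Count the bytes the converter writes to STDOUT (strip wc's padding).
    bytes=$("$converter" "$input" 2>/dev/null | wc -c | tr -d ' ')
    # Directory listing after the run.
    after=$(ls -A "$dir")

    if [ "$bytes" -gt 0 ]; then
        echo "stdout"
    elif [ "$before" != "$after" ]; then
        echo "file"
    else
        echo "none"
    fi
}
```

If this prints stdout, the empty extension described above applies; if it prints file, the extension should match the suffix of the file that was created.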
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.

12-07-2003, 10:18 AM   #3
lelandv
Quote:
Originally posted by Charter
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.

Hiya. I've done this, but the PDF file is still not indexed, just the filename.

Am I missing something here?

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');


The actual PDF file is linked from another page, and looking at the server logs I do see the crawler retrieving the PDF document in the first place. It's just that its contents are still not indexed at all.


taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"

Leland

12-07-2003, 10:41 AM   #4
Charter
Quote:
Originally posted by lelandv
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"

Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

12-07-2003, 11:26 AM   #5
lelandv
Quote:
Originally posted by Charter
Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

The OS is Debian Linux, with Apache as the server.

(Please note that the latest version listed on freshmeat/sourceforge is 1.6.2; you might want to update that when you get a chance.)

I will try 1.6.5 and let you know how it goes.

Leland

12-07-2003, 11:39 AM   #6
lelandv
Hmm.

Version 1.6.5 generates a 404 error when submitting a query on the search page.

laptop1.discpro.org - - [07/Dec/2003:20:36:39 +0000] "GET /phpdig/index.php?template_demo=.%2Ftemplates%2Fphpdig.html&site=0&path=&result_page=index.php&query_string=transition&limite=10&option=start HTTP/1.1" 404 1146 "http://www.discpro.org/phpdig/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

12-07-2003, 11:41 AM   #7
lelandv
Disregard that... brain fart.

I just tried it with the new version, and the PDF content is still not indexed.



L.

12-07-2003, 12:02 PM   #8
Charter
Hi. All I can find is a Windows version of pdf2txt. Are you using xpdf? Unless renamed, it should be pdftotext that comes with xpdf.

If you have shell access and are allowed to run locate, just type locate pdftotext to find the path.

The freshmeat listing was updated yesterday.

12-07-2003, 12:09 PM   #9
lelandv
No, I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility that I have available to do this is pdftohtml, which creates its output in HTML (to STDOUT). I therefore use a wrapper around it and call the binary from the wrapper. The wrapper removes the HTML tags, leaving only the plain text. It is just a simple Perl script of a few lines:

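#!/usr/bin/perl
use strict;
use warnings;

# Path of the PDF to convert, passed as the first argument.
my $filename = shift;
# pdftohtml writes HTML to STDOUT; capture it.
my $output = `/usr/local/bin/pdftohtml -i -stdout -noframes "$filename"`;

# Strip HTML tags; [^>]* keeps each match inside a single tag.
$output =~ s/<[^>]*>//g;
print $output;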

As a result, to get the text out of a PDF, simply run "pdf2txt myfile.pdf" at the command line, and it writes the text to STDOUT.
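
A side note on tag stripping in a wrapper like this: a greedy pattern such as <.*> removes everything from the first "<" to the last ">" on a line, so text sitting between two tags on the same line is lost, whereas <[^>]*> stops at each tag's own ">". The sketch below shows the difference with sed; the function names and the sample line are made up for illustration and are not taken from pdftohtml output.

```shell
#!/bin/sh
# Greedy pattern: one match runs from the first "<" to the last ">" on
# the line, so any text between two tags is removed along with them.
strip_greedy() {
    printf '%s\n' "$1" | sed 's/<.*>//g'
}

# Bounded pattern: each match ends at the first ">", so only tags go.
strip_tags() {
    printf '%s\n' "$1" | sed 's/<[^>]*>//g'
}

# Illustrative sample line with more than one tag on it.
line='<b>Engine Management</b> 1<br>'
strip_greedy "$line"   # prints an empty line: the text is lost
strip_tags "$line"     # prints: Engine Management 1
```

Lines that carry a single trailing tag come out the same either way, which may be why the greedy form appears to work on some documents.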

Noted on the freshmeat site... I guess I should have waited a day before downloading it, then.

I really need to sort out this PDF indexing issue, though. It's annoying, and I really need the search engine to be able to search on the contents of a PDF file. There are several other spiders/search engines available in PHP, BUT none of them can do indexing and searching as comprehensively as PhpDig.

Leland

12-07-2003, 12:46 PM   #10
Charter
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a Perl program?

Try using pdftohtml directly, as in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');, because I think PhpDig should clean the tags out of the results itself.

12-07-2003, 12:52 PM   #11
lelandv
Quote:
Originally posted by Charter
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?

Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.

The permissions on the wrapper are 0755 (executable), and the first line contains #!/usr/bin/perl, which tells the system to run it with Perl.

For example, if you do it from the command line itself:

leland@taranta:~/public_html/pdftest> /usr/local/bin/pdf2txt InstrumentPilot39.pdf

Engine Management
1
Intelligence Reports
2
Bashing the Beam
6
European Flight Planning
8
Dew Point Review
10
PPL/IR Europe Web Site
12
14
Bert Maes and I attended the engine efficiency and many others. It was very

<snip>

---
Having said that, I've just added a little hook in the wrapper to detect if the wrapper has even been called, but it looks like the spider isn't even attempting to use it.

Despite the settings in config.php:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');

The externals are called with "exec", are they not? If they are, then the call should at least trigger the hook, but it looks as if it's not even getting that far.
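
One way to narrow down why exec never reaches the wrapper is to check, as the account the web server runs under, that the script is executable and that its interpreter exists. A minimal sketch follows; check_wrapper is a hypothetical helper, not part of PhpDig, and it only covers script wrappers that start with a #! line. It does not look at PHP-side restrictions such as safe_mode or disable_functions, which can also stop exec from running a program.

```shell
#!/bin/sh
# Hypothetical helper, not part of PhpDig: report the first reason a
# script wrapper could fail to start when called by its absolute path.
check_wrapper() {
    wrapper=$1
    if [ ! -f "$wrapper" ]; then
        echo "missing: $wrapper"
        return 1
    fi
    if [ ! -x "$wrapper" ]; then
        echo "not executable: $wrapper"
        return 1
    fi
    # Read the first line and pull the interpreter path out of the #! line.
    first_line=$(head -n 1 "$wrapper")
    case $first_line in
        '#!'*)
            interp=$(printf '%s\n' "$first_line" |
                sed 's/^#![[:space:]]*//; s/[[:space:]].*$//')
            if [ ! -x "$interp" ]; then
                echo "interpreter not found: $interp"
                return 1
            fi
            ;;
        *)
            echo "no shebang line: $wrapper"
            return 1
            ;;
    esac
    echo "ok: $wrapper"
}
```

For example, check_wrapper /usr/local/bin/pdf2txt prints "ok:" followed by the path when all checks pass. Running it under the web server's account rather than a login account matters, because permissions and the environment can differ between the two.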


L.

12-07-2003, 01:03 PM   #12
Charter
Hi. Yes, exec is being used. I just tried your Perl program on my OS (Linux/Apache), but #!/usr/bin/perl does not force the execution of Perl programs there. I'll play around some more with this.

In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The settings above are used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

12-07-2003, 01:11 PM   #13
lelandv
Quote:
Originally posted by Charter
Hi. Yes, exec is being used. I just tried your Perl program on my OS (Linux/Apache), but #!/usr/bin/perl does not force the execution of Perl programs there. I'll play around some more with this.

You have to make sure, of course, that the Perl interpreter is in the right place.

Quote:
In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The settings above are used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

I did this as you suggested. The file contents are still not indexed, just the filename. It's as if it's not even bothering to look inside the file when it's a PDF.

Leland

12-07-2003, 01:16 PM   #14
lelandv
Just looking at the output when running the spider:

SITE : http://www.discpro.org/
Exclude paths :
- @NONE@
1:http://www.discpro.org/
(time : 00:00:01)
+ + + + + + + + +
level 1...
2:http://www.discpro.org/pdftest/InstrumentPilot39.pdf
(time : 00:00:02)

3:http://www.discpro.org/?mode=pgpkey
(time : 00:00:02)

<etc>

#3 has the checkmark next to it; #2 doesn't.
Am I to presume that it only indexed the file and not the contents of the file?

(It also seemed to do it a little TOO quickly, since it takes at least a few seconds just to convert the file from PDF to HTML or text. That tells me it's not even executing the external binary call.)

Leland

12-07-2003, 01:20 PM   #15
Charter
Hi. Try a search on some words in the PDF file. In the search results is there a result for the PDF file? Also, I am on chat right now if you'd rather chat through this.

All times are GMT -8.