PDF indexing with pstotext
Hi,
I'm running Apache 1.3.28 with PHP 4.3.4RC1 and phpDig 1.6.4 (hmm, I should upgrade...). But here is my problem: I have a lot of PDFs and I want them indexed. I've installed pstotext, which works correctly (pstotext "nameofthefile.pdf" prints the contents of the PDF file to STDOUT), and I've changed the phpDig config file to use it. Quote:
OK...? When I refresh my site in the phpDig admin, the PDF files are found and seem to be indexed, but when I search for a name that appears in the PDF text, I get no results. So where could the problem be? |
Hi. Are you using Windows? If so, set define('USE_IS_EXECUTABLE_COMMAND','0');
Also, are you indexing a page that links to the PDFs or trying to index the PDFs directly? |
I'm running Linux (Mandrake 9.1), but I've reinstalled Apache and PHP from source.
I'm indexing PDFs that are linked from some articles, for example: http://umvf.cochin.univ-paris5.fr/ar...id_article=295 |
Hi. From that PDF document, I get the following:
Code:
mysql> select keywords.* from engine,keywords |
hmm... ?
What am I supposed to search for? Excuse me, but I'm not quite sure. I've tried: SELECT * FROM `keywords` WHERE keyword = '1995'; SELECT * FROM `keywords` WHERE keyword = '500'; .... SELECT * FROM `keywords` WHERE keyword = 'nancy'; but I get no results for some of them, and the words that are found may come from other articles. I also tried searching for "carayon", who is an author of this PDF, and his name is not found, either in the MySQL database or, of course, in the search. Sorry, but I really don't know anything about the encoding used for PDF files... I've updated to version 1.6.5, but that changed nothing for this problem. |
Hi. Try saving the PDF at http://www.phpdig.net/demo/avare.pdf, placing it on your site, and linking to it from a simple HTML file like so, then try to crawl this HTML file with a search depth of one. Now when you search on Elise, do you see any results?
Code:
<html> |
OK, I've put avare.pdf and an HTML page on my site.
I've crawled this: Quote:
but when I search for "harpagon", for example... no results. Hmm... is it bad, doc? |
Hi. The avare.pdf file should be good. When you go into the text_content directory, and from shell type
grep -i harpagon * do you see anything? |
no response to that command..
No harpagon in text_content... |
Hi. Okay, it looks like pstotext is not successfully executing from exec($command,$result,$retval); in the robot_functions.php file. From shell type locate pstotext to check the path. If /usr/local/bin/pstotext is the correct path and the output goes to STDOUT, the configuration you posted looks correct. Right after exec($command,$result,$retval); try adding the following and then reindex the avare2.html:
PHP Code:
|
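The path check described above can be sketched from the shell as follows. This is only a minimal sketch: check_tool is a hypothetical helper (it is not part of phpDig or pstotext), and /usr/local/bin/pstotext is the path mentioned in this thread. Note that exec() in PHP runs as the web server user, so the result may differ from what your own login shell sees.

```shell
# Hypothetical helper: report whether a converter binary exists and
# is executable by the user running this shell.
check_tool() {
  if [ ! -e "$1" ]; then
    echo "missing"
  elif [ ! -x "$1" ]; then
    echo "not executable"
  else
    echo "ok"
  fi
}

# Path taken from the thread; adjust it to whatever `locate pstotext` reports.
check_tool /usr/local/bin/pstotext
```

Running the same check as the web server user (the account name varies by distribution) gives a closer picture of what PHP's exec() will find.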
hmmm.....:confused:
I've verified the path to pstotext, which is correct: /usr/local/bin/pstotext. "The output goes to STDOUT"...? The results of the pstotext command go directly to the console; is that OK? I now have this code in my robot_functions.php: PHP Code:
Is this OK, with the code you gave? I've tried deleting and re-indexing the avare HTML and PDF. I can't see the result of echo $command . "<br>"; and there is still no "harpagon" in text_content or in the search results. Argh... |
Hi. Yes, that is correct. It looks like $usetool remains set to false so the contents of the if statement are not getting executed. In robot_functions.php add the following and delete and reindex avare2.html. What does it output?
PHP Code:
|
here is the output :
Quote:
OK, I've tried to trace back through the code in the function phpdigTestUrl, where you set $status. I've verified that the response the browser gets is "application/pdf" and the encoding is iso-8859-1, as I thought, but I don't really understand where the problem is. It seems to stay in HTML mode only and never tries to crawl the PDF? |
Hi. When you go to http://umvf.cochin.univ-paris5.fr/avare2.pdf does your browser open the PDF in the browser window or does your browser prompt you to download the file?
|
It prompts for download in IE, but I think that's due to my Acrobat settings...
But what does that change for the bot? |
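Whether a browser opens or downloads the file is a client-side choice; a crawler only sees the HTTP response headers. One way to look at those headers independently of any browser settings is sketched below. It assumes curl is available, and content_type is a hypothetical helper written for this illustration, not something phpDig provides.

```shell
# Hypothetical helper: print the Content-Type value found in raw HTTP
# response headers read from stdin (header name matched case-insensitively).
content_type() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { print $2; exit }'
}

# Example use against the test file from this thread (needs network access):
#   curl -sI http://umvf.cochin.univ-paris5.fr/avare2.pdf | content_type
```

If the value printed is not a PDF type, the server configuration rather than the browser would be the place to look.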