PhpDig.net

Old 01-06-2004, 05:39 AM   #1
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
pdf indexing with pstotext

Hi,

I'm running Apache 1.3.28 with PHP 4.3.4RC1 and PhpDig 1.6.4 (hmm, I should upgrade...).
But here is my problem.
I've got a lot of PDFs, and I want them to be indexed.

I've installed pstotext, which works correctly (pstotext "nameofthefile.pdf" prints the contents of the PDF file to STDOUT).

I've changed the PhpDig config file to use it:

Quote:
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',false);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension needed
// for example, use .txt if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

ok... ?

When I refresh my site in the PhpDig admin, the PDF files are found and seem to be indexed, but when I search for a name from the PDF text, I get no results.

So where could the problem be?
Old 01-06-2004, 10:02 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Are you using Windows? If so, set define('USE_IS_EXECUTABLE_COMMAND','0');

Also, are you indexing a page that links to the PDFs or trying to index the PDFs directly?
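
A minimal sketch of the kind of test a flag like USE_IS_EXECUTABLE_COMMAND presumably switches on or off, assuming the indexer only calls the external tool when the test passes. The function name tool_is_usable is hypothetical and not taken from PhpDig; is_file() and is_executable() are standard PHP functions.

```php
<?php
// Hypothetical helper, not PhpDig's actual code: decide whether an
// external converter such as pstotext can be called.
function tool_is_usable($path, $useIsExecutable)
{
    if ($path === '' || !is_file($path)) {
        return false; // the full path must point at an existing file
    }
    if (!$useIsExecutable) {
        return true; // skip the permission test when the flag is off
    }
    // True only if the user PHP runs as is allowed to execute the file.
    return is_executable($path);
}

var_dump(tool_is_usable('/usr/local/bin/pstotext', true));
```

Run through the web server rather than from a shell login, this reports what the web server user sees, which may differ from what your own account sees.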
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-06-2004, 12:24 PM   #3
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
I'm running Linux (Mandrake 9.1), but I've reinstalled Apache and PHP from source.

I'm indexing PDFs that are linked from some articles, for example:
http://umvf.cochin.univ-paris5.fr/ar...id_article=295
Old 01-06-2004, 02:23 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. From that PDF document, I get the following:
Code:
mysql> select keywords.* from engine,keywords
where engine.key_id = keywords.key_id and
engine.spider_id = xxxxx;
+--------+------------+-----------+
| key_id | twoletters | keyword   |
+--------+------------+-----------+
|  xxxxx | 19         | 1995      |
|  xxxxx | 50         | 500       |
|  xxxxx | 30         | 300       |
|  xxxxx | 80         | 80-100    |
|  xxxxx | in         | infection |
|  xxxxx | na         | nantes    |
|  xxxxx | na         | nancy     |
+--------+------------+-----------+
7 rows in set (0.01 sec)
Are you able to find results for any of the keywords in the keyword column above? Also, do you know what encoding was used to make this PDF file?
Old 01-07-2004, 01:46 AM   #5
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
Hmm... what am I supposed to search for? Sorry, but I'm not quite sure.

I've tried:
SELECT * FROM `keywords` WHERE keyword = '1995';
SELECT * FROM `keywords` WHERE keyword = '500';
....
SELECT * FROM `keywords` WHERE keyword = 'nancy';

but I get no results for some of them, and the words that are found may come from other articles.
I've also tried searching for "carayon", who is an author of this PDF, and his name is found neither in the MySQL database nor, of course, in the search.

Sorry, but I really don't know anything about the encoding used for the PDF files...


I've updated my version to 1.6.5, but that didn't change anything for this problem.
Old 01-07-2004, 04:22 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Try saving the PDF at http://www.phpdig.net/demo/avare.pdf and placing it on your site, linked from a simple HTML file like the one below. Then crawl this HTML file with search depth one. When you search for Elise, do you see any results?
Code:
<html>
<body>
<a href="http://umvf.cochin.univ-paris5.fr/avare.pdf">test</a>
</body>
</html>
Old 01-07-2004, 04:54 AM   #7
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
OK, I've put avare.pdf and an HTML page on the site.
I've crawled this:


but when I search for "harpagon", for example...

No results.


Hmm... is it bad, doc?
Old 01-07-2004, 05:04 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The avare.pdf file should be good. When you go into the text_content directory, and from shell type
grep -i harpagon *
do you see anything?
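
If shell access is awkward, roughly the same case-insensitive check can be sketched in PHP. The directory name text_content comes from the post above; the function name files_containing is hypothetical, and the script is assumed to run from the directory that contains text_content.

```php
<?php
// Hypothetical helper: list the files in a directory whose contents
// contain $needle, ignoring case (a rough equivalent of `grep -il`).
function files_containing($dir, $needle)
{
    $matches = array();
    $paths = glob(rtrim($dir, '/') . '/*');
    if ($paths === false) {
        return $matches; // glob() failed, treat as "nothing found"
    }
    foreach ($paths as $path) {
        if (!is_file($path)) {
            continue; // skip sub-directories
        }
        $text = file_get_contents($path);
        if ($text !== false && stripos($text, $needle) !== false) {
            $matches[] = basename($path);
        }
    }
    sort($matches);
    return $matches;
}

print_r(files_containing('text_content', 'harpagon'));
```

An empty array here means the same thing as no output from grep: the word was not stored in any text_content file.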
Old 01-07-2004, 05:53 AM   #9
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
no response to that command..
No harpagon in text_content...
Old 01-07-2004, 06:29 AM   #10
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Okay, it looks like pstotext is not successfully executing from exec($command,$result,$retval); in the robot_functions.php file. From shell type locate pstotext to check the path. If /usr/local/bin/pstotext is the correct path and the output goes to STDOUT, the configuration you posted looks correct. Right after exec($command,$result,$retval); try adding the following and then reindex the avare2.html:
PHP Code:
echo $command . "<br>"; // try running this from shell in admin dir
print_r($result); // holds the output sent to STDOUT
echo "<br>" . $retval; // is zero if command succeeded
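
One thing to keep in mind with this kind of debugging: exec() only collects what the command writes to STDOUT, so an error message from the tool itself would not appear in $result. A minimal sketch that also captures STDERR, assuming a POSIX shell on the server; the helper name run_and_report is hypothetical and not part of PhpDig.

```php
<?php
// Hypothetical debugging helper, not part of PhpDig: run a command and
// collect its output together with any error messages.
function run_and_report($command)
{
    $result = array();
    $retval = -1;
    // exec() captures STDOUT only; "2>&1" folds STDERR into it so that
    // messages such as "command not found" become visible.
    exec($command . ' 2>&1', $result, $retval);
    return array('output' => $result, 'retval' => $retval);
}

// Assumed path and file name, taken from the posts above.
$report = run_and_report('/usr/local/bin/pstotext ' . escapeshellarg('avare.pdf'));
echo htmlspecialchars(implode("\n", $report['output'])), "<br>";
echo 'retval: ', $report['retval'], "<br>"; // zero if the command succeeded
```

A non-zero retval together with an error line in the output usually points at a path or permission problem rather than at the PDF itself.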
Old 01-07-2004, 07:22 AM   #11
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
Hmmm...

I've verified the path to pstotext, which is right:
/usr/local/bin/pstotext

Does the output go to STDOUT? The results of the pstotext command are printed directly on the console. Is that OK?

This is the code I now have in my robot_functions.php:

PHP Code:
    if ($usetool) {
        rename($tempfile1,$tempfile2);
        exec($command,$result,$retval);
        echo $command . "<br>"; // try running this from shell in admin dir
        print_r($result); // holds the output sent to STDOUT
        echo "<br>" . $retval; // is zero if command succeeded
        unlink($tempfile2);
        if (!$retval) {
             // the replacement if š is for unbreaking spaces
             // returned by catdoc parsing msword files
             // and '0xAD' "tiret quadratin" returned by pstotext
             // in iso-8859-1
             // Adjust with your encoding and/or your tools
             if ((is_array($result)) && (count($result) > 0)) {
                $f_handler = fopen($tempfile1,'wb');
                fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
                fclose($f_handler);
             }
        }
        else {
              return array('tempfile'=>0,'tempfilesize'=>0);
        }
    }

Is this OK, with the code you gave?
I've tried deleting and re-indexing the avare HTML and PDF files.

I can't see the output of the echo $command . "<br>"; line...

and there is still no "harpagon" in text_content or in the search results.

Argh...

Last edited by zevince; 01-07-2004 at 07:28 AM.
Old 01-07-2004, 07:52 AM   #12
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Yes, that is correct. It looks like $usetool remains set to false so the contents of the if statement are not getting executed. In robot_functions.php add the following and delete and reindex avare2.html. What does it output?
PHP Code:
$usetool = false;
echo $result_test['status'] . " <--- Status<br>"; // add this line
Old 01-08-2004, 01:56 AM   #13
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
here is the output :

Quote:
HTML <--- Status
Duplicate of an existing document
43:http://umvf.cochin.univ-paris5.fr/ar...id_article=177
(time : 00:00:13)

File date unchanged
44:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=91
(time : 00:00:13)

File date unchanged
45:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=99
(time : 00:00:13)

HTML <--- Status
46:http://umvf.cochin.univ-paris5.fr/avare3.html
(time : 00:00:13)
+
level 1...
47:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(time : 00:00:14)

HTML <--- Status
Duplicate of an existing document
48:http://umvf.cochin.univ-paris5.fr/spip_login.php3
(time : 00:00:14)

49:http://umvf.cochin.univ-paris5.fr/IMG/pdf/albumine.pdf
(time : 00:00:14)

No links in the temporary table

OK, I've tried to trace back through the code in the function phpdigTestUrl, where you set $status.
I've verified that the response the browser gets is "application/pdf", and the encoding is iso-8859-1, as I thought,
but I don't really understand where the problem is...

It seems to run in HTML mode only and never tries to crawl the PDF?
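
To make the question concrete, here is a rough sketch of how a status could be derived from a Content-Type header. This is an assumption for illustration only: the internals of phpdigTestUrl are not shown in this thread, the function name status_from_content_type is hypothetical, and apart from 'HTML' (which appears in the output above) the status names are invented.

```php
<?php
// Hypothetical mapping, NOT PhpDig's phpdigTestUrl: derive a document
// status from a Content-Type header value.
function status_from_content_type($contentType)
{
    // Drop parameters such as "; charset=iso-8859-1" and normalise case.
    $parts = explode(';', $contentType);
    $mime = strtolower(trim($parts[0]));
    $map = array(
        'text/html'       => 'HTML',
        'text/plain'      => 'PLAINTEXT', // assumed status name
        'application/pdf' => 'PDF',       // assumed status name
    );
    return isset($map[$mime]) ? $map[$mime] : 'NOFILE'; // assumed fallback
}

echo status_from_content_type('application/pdf'), "\n";
echo status_from_content_type('text/html; charset=iso-8859-1'), "\n";
```

If the spider works along these lines, printing the raw header it receives for avare2.pdf would show whether the server sends it the same "application/pdf" that the browser sees.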

Last edited by zevince; 01-08-2004 at 05:47 AM.
Old 01-08-2004, 05:56 AM   #14
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. When you go to http://umvf.cochin.univ-paris5.fr/avare2.pdf does your browser open the PDF in the browser window or does your browser prompt you to download the file?
Old 01-08-2004, 06:02 AM   #15
zevince
Green Mole
 
Join Date: Dec 2003
Posts: 26
It prompts for download in IE, but I think that's because of my Acrobat settings...
But what difference does that make for the bot?