PhpDig.net

zevince 01-06-2004 05:39 AM

pdf indexing with pstotext
 
Hi,

I'm running Apache 1.3.28 with PHP 4.3.4RC1 and PhpDig 1.6.4 (hmm, I should upgrade...).
But here is my problem:
I've got a lot of PDFs, and I want them to be indexed.

I've installed pstotext, which is working correctly (pstotext "nameofthefile.pdf" prints the contents of the PDF file to STDOUT).

I've changed the PhpDig config file to use it:

Quote:

define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',false);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension needed
// for example, use .txt if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

OK...?

When I refresh my site in the PhpDig admin, the PDF files are found and seem to be indexed, but when I search for a name that appears in the PDF text, I get no results.

So where could the problem be?

Charter 01-06-2004 10:02 AM

Hi. Are you using Windows? If so, set define('USE_IS_EXECUTABLE_COMMAND','0');

Also, are you indexing a page that links to the PDFs or trying to index the PDFs directly?

zevince 01-06-2004 12:24 PM

I'm running Linux (Mandrake 9.1), but I've reinstalled Apache and PHP from source.

I'm indexing PDFs that are linked from some articles, for example:
http://umvf.cochin.univ-paris5.fr/ar...id_article=295

Charter 01-06-2004 02:23 PM

Hi. From that PDF document, I get the following:
Code:

mysql> select keywords.* from engine,keywords
where engine.key_id = keywords.key_id and
engine.spider_id = xxxxx;
+--------+------------+-----------+
| key_id | twoletters | keyword   |
+--------+------------+-----------+
|  xxxxx | 19         | 1995      |
|  xxxxx | 50         | 500       |
|  xxxxx | 30         | 300       |
|  xxxxx | 80         | 80-100    |
|  xxxxx | in         | infection |
|  xxxxx | na         | nantes    |
|  xxxxx | na         | nancy     |
+--------+------------+-----------+
7 rows in set (0.01 sec)

Are you able to find results for any of the keywords in the keyword column above? Also, do you know what encoding was used to make this PDF file?

zevince 01-07-2004 01:46 AM

Hmm...?
What am I supposed to search for? Excuse me, but I'm not quite sure.

I've tried:
SELECT * FROM `keywords` WHERE keyword = '1995';
SELECT * FROM `keywords` WHERE keyword = '500';
....
SELECT * FROM `keywords` WHERE keyword = 'nancy';

But I get no results for some of them, and the words that are found may come from other articles.
I also tried searching for "carayon", who is an author of this PDF, and his name is not found, neither in the MySQL database nor, of course, in the search.

Sorry, but I really don't know anything about the encoding used for PDF files...


I've upgraded to version 1.6.5, but the problem is unchanged.

Charter 01-07-2004 04:22 AM

Hi. Try saving the PDF at http://www.phpdig.net/demo/avare.pdf to your site, link to it from a simple HTML file like the one below, and then crawl that HTML file with a search depth of one. When you search for Elise, do you see any results?
Code:

<html>
<body>
<a href="http://umvf.cochin.univ-paris5.fr/avare.pdf">test</a>
</body>
</html>


zevince 01-07-2004 04:54 AM

OK, I've uploaded avare.pdf and an HTML page.
I've crawled this:


But when I search for "harpagon", for example...

No results.


Hmm... is it bad, doc?

Charter 01-07-2004 05:04 AM

Hi. The avare.pdf file should be good. When you go into the text_content directory, and from shell type
grep -i harpagon *
do you see anything?

zevince 01-07-2004 05:53 AM

No output from that command.
No "harpagon" in text_content...

Charter 01-07-2004 06:29 AM

Hi. Okay, it looks like pstotext is not successfully executing from exec($command,$result,$retval); in the robot_functions.php file. From shell type locate pstotext to check the path. If /usr/local/bin/pstotext is the correct path and the output goes to STDOUT, the configuration you posted looks correct. Right after exec($command,$result,$retval); try adding the following and then reindex the avare2.html:
PHP Code:

echo $command . "<br>"; // try running this from shell in admin dir
print_r($result); // holds the output sent to STDOUT
echo "<br>" . $retval; // is zero if command succeeded


zevince 01-07-2004 07:22 AM

hmmm.....:confused:

I've verified the path to pstotext, which is correct:
/usr/local/bin/pstotext

The output goes to STDOUT...? The results of the pstotext command go directly to the console. Is that OK?

I now have this code in my robot_functions.php:

PHP Code:

    if ($usetool) {
        rename($tempfile1,$tempfile2);
        exec($command,$result,$retval);
        echo $command . "<br>"; // try running this from shell in admin dir
        print_r($result); // holds the output sent to STDOUT
        echo "<br>" . $retval; // is zero if command succeeded
        unlink($tempfile2);
        if (!$retval) {
             // the replacement if š is for unbreaking spaces
             // returned by catdoc parsing msword files
             // and '0xAD' "tiret quadratin" returned by pstotext
             // in iso-8859-1
             // Adjust with your encoding and/or your tools
             if ((is_array($result)) && (count($result) > 0)) {
                $f_handler = fopen($tempfile1,'wb');
                fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
                fclose($f_handler);
             }
        }
        else {
              return array('tempfile'=>0,'tempfilesize'=>0);
        }
    }


Is this OK, with the code you gave?
I've tried deleting and re-indexing the avare HTML and PDF.

I can't see the output of echo $command . "<br>";

And there is still no "harpagon" in text_content, nor in the search results.

Argh...
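
For reference, the cleanup that the fwrite() line in the excerpt above applies to the tool output can be tried in isolation. This is a rough sketch only: cleanToolOutput is a hypothetical name, and using chr(0xA0) (the no-break space in ISO-8859-1) in place of the literal character in the posted code is an assumption based on the comment about unbreaking spaces.

```php
<?php
// Hypothetical helper mirroring the fwrite() line above; not PhpDig's own function.
function cleanToolOutput(array $lines): string
{
    // exec() returns one array entry per output line; join them with spaces.
    $text = implode(' ', $lines);
    // Byte 0xAD is the soft hyphen in ISO-8859-1; the posted code maps it to '-'.
    $text = str_replace(chr(0xAD), '-', $text);
    // Byte 0xA0 is the no-break space in ISO-8859-1 (assumed target of the first replacement).
    $text = str_replace(chr(0xA0), ' ', $text);
    return $text;
}
```

Writing the bytes with chr() rather than as literal characters keeps the replacement independent of the encoding the PHP file itself is saved in.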

Charter 01-07-2004 07:52 AM

Hi. Yes, that is correct. It looks like $usetool remains set to false so the contents of the if statement are not getting executed. In robot_functions.php add the following and delete and reindex avare2.html. What does it output?
PHP Code:

$usetool = false;
echo $result_test['status'] . " <--- Status<br>"; // add this line


zevince 01-08-2004 01:56 AM

here is the output :

Quote:

HTML <--- Status
Doublon avec un document existant
43:http://umvf.cochin.univ-paris5.fr/ar...id_article=177
(temps : 00:00:13)

File date unchanged
44:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=91
(temps : 00:00:13)

File date unchanged
45:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=99
(temps : 00:00:13)

HTML <--- Status
46:http://umvf.cochin.univ-paris5.fr/avare3.html
(temps : 00:00:13)
+
niveau 1...
47:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(temps : 00:00:14)

HTML <--- Status
Doublon avec un document existant
48:http://umvf.cochin.univ-paris5.fr/spip_login.php3
(temps : 00:00:14)

49:http://umvf.cochin.univ-paris5.fr/IMG/pdf/albumine.pdf
(temps : 00:00:14)

Pas de liens dans la table temporaire

OK, I've tried to trace back through the code in the function phpdigTestUrl, where you set the $status.
I've verified that the response seen by the browser is "application/pdf" and that the encoding is iso-8859-1, as I thought,
but I don't really understand where the problem is...

It seems to be in HTML mode only and never tries to crawl the PDF?
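
One way to see the content type the server reports, independent of any browser, is to read the response headers from PHP. The sketch below assumes PHP 7.1 or later; contentTypeFromHeaders is a hypothetical helper, and nothing here reflects how phpdigTestUrl itself decides on a status.

```php
<?php
// Hypothetical helper, not PhpDig code: picks the media type out of raw
// response header lines such as those returned by get_headers($url).
function contentTypeFromHeaders(array $headerLines): ?string
{
    $type = null;
    foreach ($headerLines as $line) {
        if (!is_string($line)) {
            continue;
        }
        if (stripos($line, 'Content-Type:') === 0) {
            $value = trim(substr($line, strlen('Content-Type:')));
            // Drop parameters such as "; charset=iso-8859-1" and normalise case.
            $type = strtolower(trim(explode(';', $value)[0]));
        }
    }
    // If several responses are listed (for example after a redirect), the last one wins.
    return $type;
}
```

For example, $headers = get_headers('http://umvf.cochin.univ-paris5.fr/avare2.pdf'); followed by var_dump(contentTypeFromHeaders($headers ?: [])); prints whatever type the server sends for that file.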

Charter 01-08-2004 05:56 AM

Hi. When you go to http://umvf.cochin.univ-paris5.fr/avare2.pdf does your browser open the PDF in the browser window or does your browser prompt you to download the file?

zevince 01-08-2004 06:02 AM

It prompts for download in IE, but I think that's due to my Acrobat settings...
But what difference does that make for the bot?


All times are GMT -8.