pdf indexing with pstotext


zevince
01-06-2004, 05:39 AM
Hi,

I'm running Apache 1.3.28 with PHP 4.3.4RC1 and PhpDig 1.6.4 (hmm, I should upgrade...).
But here is my problem:
I have a lot of PDF files, and I want them to be indexed.

I've installed pstotext, which works correctly (pstotext "nameofthefile.pdf" prints the contents of the PDF file to STDOUT).

I've changed the PhpDig config file to use it:

define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',false);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension needed
// for example, use .txt if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');



ok... ?

When I try to refresh my site in the PhpDig admin, the PDF files are found and seem to be indexed, but when I search for a name from the PDF text, I get no results.

So where could the problem be?

Charter
01-06-2004, 10:02 AM
Hi. Are you using Windows? If so, set define('USE_IS_EXECUTABLE_COMMAND','0');

Also, are you indexing a page that links to the PDFs or trying to index the PDFs directly?

zevince
01-06-2004, 12:24 PM
I'm running Linux, Mandrake 9.1, but I've reinstalled Apache and PHP from source.

I'm indexing PDFs that are linked from some articles, for example:
http://umvf.cochin.univ-paris5.fr/article.php3?id_article=295

Charter
01-06-2004, 02:23 PM
Hi. From that PDF document, I get the following:

mysql> select keywords.* from engine,keywords
where engine.key_id = keywords.key_id and
engine.spider_id = xxxxx;
+--------+------------+-----------+
| key_id | twoletters | keyword |
+--------+------------+-----------+
| xxxxx | 19 | 1995 |
| xxxxx | 50 | 500 |
| xxxxx | 30 | 300 |
| xxxxx | 80 | 80-100 |
| xxxxx | in | infection |
| xxxxx | na | nantes |
| xxxxx | na | nancy |
+--------+------------+-----------+
7 rows in set (0.01 sec)

Are you able to find results for any of the keywords in the keyword column above? Also, do you know what encoding was used to make this PDF file?

zevince
01-07-2004, 01:46 AM
hmm... ?
What am I supposed to search for? Excuse me, but I'm not quite sure.

I've tried:
SELECT * FROM `keywords` WHERE keyword = '1995';
SELECT * FROM `keywords` WHERE keyword = '500';
....
SELECT * FROM `keywords` WHERE keyword = 'nancy';

but I got no results for some of them, and the words that are found may come from other articles.
But I tried a search for "carayon", who is an author of this PDF, and his name is not found, either in the MySQL database or in the search, of course.

Sorry, but I really don't know anything about the encoding used for PDF files...


I've updated to version 1.6.5, but that changes nothing for this problem.

Charter
01-07-2004, 04:22 AM
Hi. Try saving the PDF at http://www.phpdig.net/demo/avare.pdf and placing it on your site, linked from a simple HTML file like the one below. Then crawl this HTML file with search depth one. When you search on Elise, do you see any result?

<html>
<body>
<a href="http://umvf.cochin.univ-paris5.fr/avare.pdf">test</a>
</body>
</html>

zevince
01-07-2004, 04:54 AM
ok, I've put up avare.pdf and an HTML page.
I've crawled this:

46:http://umvf.cochin.univ-paris5.fr/avare2.html
(temps : 00:00:12)
+
niveau 1...
47:http://umvf.cochin.univ-paris5.fr/avare.pdf
(temps : 00:00:12)


(...)

liens trouvés : 7
http://umvf.cochin.univ-paris5.fr/article.php3?id_article=238
http://umvf.cochin.univ-paris5.fr/article.php3?id_article=130
http://umvf.cochin.univ-paris5.fr/article.php3?id_article=125
http://umvf.cochin.univ-paris5.fr/article.php3?id_article=177
http://umvf.cochin.univ-paris5.fr/avare2.html
http://umvf.cochin.univ-paris5.fr/avare.pdf
http://umvf.cochin.univ-paris5.fr/spip_login.php3
Optimizing tables...
Indexation terminée !




but when i search "harpagon" for example...

No results..


Hmm.. is it bad, doc ?

Charter
01-07-2004, 05:04 AM
Hi. The avare.pdf file should be good. When you go into the text_content directory, and from shell type
grep -i harpagon *
do you see anything?
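
A minimal sketch of the same check as a reusable shell function. The name search_text_content is hypothetical (it is not part of PhpDig), and the -r option assumes GNU or BSD grep. It lists the files under a directory that mention a word, ignoring case, and says so explicitly when nothing matches:

```shell
# Hypothetical helper: list files under a directory that contain a word,
# case-insensitively. Assumes a grep that supports -r (GNU or BSD grep).
search_text_content() {
    word=$1
    dir=${2:-text_content}   # default to the directory named in the thread
    if matches=$(grep -ril -- "$word" "$dir" 2>/dev/null); then
        printf '%s\n' "$matches"
    else
        echo "no file under $dir contains '$word'"
        return 1
    fi
}

# Example (run from the PhpDig directory that holds text_content):
#   search_text_content harpagon
```

An empty result here means the extracted text never reached text_content, which points at the extraction step rather than at the search itself.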

zevince
01-07-2004, 05:53 AM
no response to that command..
No harpagon in text_content...

Charter
01-07-2004, 06:29 AM
Hi. Okay, it looks like pstotext is not successfully executing from exec($command,$result,$retval); in the robot_functions.php file. From shell type locate pstotext to check the path. If /usr/local/bin/pstotext is the correct path and the output goes to STDOUT, the configuration you posted looks correct. Right after exec($command,$result,$retval); try adding the following and then reindex the avare2.html:

echo $command . "<br>"; // try running this from shell in admin dir
print_r($result); // holds the output sent to STDOUT
echo "<br>" . $retval; // is zero if command succeeded
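
For running the echoed command by hand, here is a rough shell counterpart of the three debug lines above. The name run_and_report is hypothetical. It prints the command, its captured standard output, and its exit status, where zero means success:

```shell
# Hypothetical helper mirroring the PHP debug lines: show the command,
# the text it wrote to standard output, and its exit status.
run_and_report() {
    echo "command: $*"
    output=$("$@")
    status=$?
    echo "output:  $output"
    echo "status:  $status"   # 0 means the command succeeded
    return "$status"
}

# Example, using the path from the posted configuration:
#   run_and_report /usr/local/bin/pstotext avare.pdf
```

Keep in mind that a command that works from your own shell account can still fail under the web server account, because the two may have different permissions.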

zevince
01-07-2004, 07:22 AM
hmmm.....

I've verified the path to pstotext, which is right:
/usr/local/bin/pstotext

The output goes to STDOUT...? The results of the pstotext command go directly to the console. Is that ok?

I now have this code in my robot_functions.php:

if ($usetool) {
    rename($tempfile1,$tempfile2);
    exec($command,$result,$retval);
    echo $command . "<br>"; // try running this from shell in admin dir
    print_r($result); // holds the output sent to STDOUT
    echo "<br>" . $retval; // is zero if command succeeded
    unlink($tempfile2);
    if (!$retval) {
        // the replacement if š is for unbreaking spaces
        // returned by catdoc parsing msword files
        // and '0xAD' "tiret quadratin" returned by pstotext
        // in iso-8859-1
        // Adjust with your encoding and/or your tools
        if ((is_array($result)) && (count($result) > 0)) {
            $f_handler = fopen($tempfile1,'wb');
            fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
            fclose($f_handler);
        }
    }
    else {
        return array('tempfile'=>0,'tempfilesize'=>0);
    }
}




Is this ok, with the code you gave?
I've tried deleting and re-indexing the avare HTML & PDF.

I can't see the result of echo $command . "<br>"; ...

and there is still no "harpagon" in text_content or in the search results.

argh...

Charter
01-07-2004, 07:52 AM
Hi. Yes, that is correct. It looks like $usetool remains set to false so the contents of the if statement are not getting executed. In robot_functions.php add the following and delete and reindex avare2.html. What does it output?

$usetool = false;
echo $result_test['status'] . " <--- Status<br>"; // add this line

zevince
01-08-2004, 01:56 AM
here is the output :

HTML <--- Status
Doublon avec un document existant
43:http://umvf.cochin.univ-paris5.fr/article.php3?id_article=177
(temps : 00:00:13)

File date unchanged
44:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=91
(temps : 00:00:13)

File date unchanged
45:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=99
(temps : 00:00:13)

HTML <--- Status
46:http://umvf.cochin.univ-paris5.fr/avare3.html
(temps : 00:00:13)
+
niveau 1...
47:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(temps : 00:00:14)

HTML <--- Status
Doublon avec un document existant
48:http://umvf.cochin.univ-paris5.fr/spip_login.php3
(temps : 00:00:14)

49:http://umvf.cochin.univ-paris5.fr/IMG/pdf/albumine.pdf
(temps : 00:00:14)

Pas de liens dans la table temporaire


Ok, I've tried to trace back through the code of the phpdigTestUrl function, where you set the $status.
I've verified that the response received by the browser is "application/pdf" and the encoding is iso-8859-1, as I thought,
but I don't really understand where the problem is...

It seems to be in HTML mode only and never tries to crawl the PDF?

Charter
01-08-2004, 05:56 AM
Hi. When you go to http://umvf.cochin.univ-paris5.fr/avare2.pdf does your browser open the PDF in the browser window or does your browser prompt you to download the file?

zevince
01-08-2004, 06:02 AM
It prompts for download in IE, but that's because of my Acrobat settings, I think...
But what does that change for the bot?

Charter
01-08-2004, 06:38 AM
Hi. On the very last else of the phpdigTempFile function, add the following:

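else {
    // add the two echo lines
    // note: print_r() without a second argument prints the array itself
    // and returns true, so the second line shows the array followed by "1"

    echo "Evaluation is false for URI: " . $uri . "<br>";
    echo "Result test contains: " . print_r($result_test) . "<br>";

    return array('tempfile'=>0,'tempfilesize'=>0);
}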

zevince
01-08-2004, 07:05 AM
ok here is the output :

HTML <--- Status
Doublon avec un document existant
43:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=91
(temps : 00:00:12)
File date unchanged
44:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=99
(temps : 00:00:12)
HTML <--- Status
45:http://umvf.cochin.univ-paris5.fr/avare3.html
(temps : 00:00:12)
+
niveau 1...
Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
46:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(temps : 00:00:12)

Charter
01-08-2004, 07:47 AM
Hi. Okay, it is the below that is failing because print_r($result_test) outputs one:

if (is_array($result_test)
&& $result_test['status'] == 'HTML'
|| $result_test['status'] == 'PLAINTEXT'
|| $result_test['status'] == 'MSWORD' && PHPDIG_INDEX_MSWORD == true && file_exists(PHPDIG_PARSE_MSWORD) && $is_exec_command_msword
|| $result_test['status'] == 'MSEXCEL' && PHPDIG_INDEX_MSEXCEL == true && file_exists(PHPDIG_PARSE_MSEXCEL) && $is_exec_command_msexcel
|| $result_test['status'] == 'PDF' && PHPDIG_INDEX_PDF == true && file_exists(PHPDIG_PARSE_PDF) && $is_exec_command_pdf
)

There was another strange occurrence here (http://www.phpdig.net/showthread.php?threadid=248&pagenumber=2) that dealt with cookies. If cookies are not the issue, try crawling again and copy paste the info from the raw Apache logs for the crawl.

zevince
01-09-2004, 01:48 AM
hmm, I've tried setting cookies for http://umvf.cochin.univ-paris5.fr/ to always allowed, but it doesn't seem to change anything in the crawl:

File date unchanged
41:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=99
(temps : 00:00:10)

HTML <--- Status
42:http://umvf.cochin.univ-paris5.fr/avare3.html
(temps : 00:00:10)
+
niveau 1...
Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
43:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(temps : 00:00:10)

HTML <--- Status
Doublon avec un document existant
44:http://umvf.cochin.univ-paris5.fr/spip_login.php3
(temps : 00:00:10)

Pas de liens dans la table temporaire



And here is the output of the "access_combined_log" from apache

umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/style_uol.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/habillage.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/impression.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/lien.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /article.php3?id_article=177 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /article.php3?id_article=177 HTTP/1.0" 302 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /desole.html HTTP/1.0" 200 1036 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /menu.css HTTP/1.1" 404 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/spip_style.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /rubrique.php3?id_rubrique=99 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /menu.css HTTP/1.1" 404 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/spip_style.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /avare3.html HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /avare3.html HTTP/1.0" 200 61 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /avare2.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /avare2.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /ecrire/ HTTP/1.1" 302 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD //../spip_login.php3 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "GET /spip_login.php3 HTTP/1.0" 200 2222 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
nestor.lurt-cochin.prd.fr - admin [09/Jan/2004:11:36:38 +0100] "POST /recherche/admin/spider.php HTTP/1.1" 200 8691 "http://umvf.cochin.univ-paris5.fr/recherche/admin/index.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Charter
01-09-2004, 03:26 AM
Hi. It really seems like the following returns false: $result_test['status'] == 'PDF' && PHPDIG_INDEX_PDF == true && file_exists(PHPDIG_PARSE_PDF) && $is_exec_command_pdf

However, it looks like you echo $result_test_http which says status is PDF but then later $result_test says one: Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1

Let's echo the below items right before and right after the phpdigTempFile function is called and try another index:

In spider.php, add the following echo statements:

// sets $tempfile and $tempfilesize

/*****/
echo "<br><br>Is result test http an array: " . is_array($result_test_http) . "<br>";
echo "What is result test http status: " . $result_test_http['status'] . "<br><br>";
/*****/

extract(phpdigTempFile($url_indexing,$result_test_http,$relative_script_path.'/admin/temp/'));

In robot_functions.php, add the following echo statements:

function phpdigTempFile($uri,$result_test,$prefix='temp/',$suffix1='1.tmp',$suffix2='2.tmp') {

/*****/
echo "<br><br>Is result test an array: " . is_array($result_test) . "<br>";
echo "What is result test status: " . $result_test['status'] . "<br>";
echo "Use is executable is set to: " . USE_IS_EXECUTABLE_COMMAND . "<br>";
echo "Index the pdf is set to: " . PHPDIG_INDEX_PDF . "<br>";
echo "Parse the pdf is set to: " . PHPDIG_PARSE_PDF . "<br>";
echo "Does parse pdf exist: " . file_exists(PHPDIG_PARSE_PDF) . "<br>";
echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br><br>";
/*****/

// $temp_filename = md5(time()+getmypid()).$suffix;

zevince
01-12-2004, 03:40 AM
ok, I've added the echo statements and respidered avare3.html:

HTML <--- Status
Doublon avec un document existant
40:http://umvf.cochin.univ-paris5.fr/article.php3?id_article=177
(temps : 00:00:50)

File date unchanged
41:http://umvf.cochin.univ-paris5.fr/rubrique.php3?id_rubrique=99
(temps : 00:00:50)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
42:http://umvf.cochin.univ-paris5.fr/avare3.html
(temps : 00:00:50)
+
niveau 1...


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
43:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(temps : 00:00:50)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
Doublon avec un document existant
44:http://umvf.cochin.univ-paris5.fr/article.php3?id_article=131
(temps : 00:00:51)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
Doublon avec un document existant
45:http://umvf.cochin.univ-paris5.fr/article.php3?id_article=134
(temps : 00:00:52)

Charter
01-12-2004, 04:19 AM
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');

echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br><br>";

Is parse pdf executable: // empty meaning false

Hi. The "is parse pdf executable" is not returning a result. This is why the expression in the if statement evaluates to false.

From php.net (http://www.php.net/manual/en/function.is-executable.php) "If a directory is not executable, then you cannot get details on the files in the directory - this includes the permissions."

Try checking that the usr, local, and bin directories, as well as the pstotext file, are all 755 permissions.
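
The permission check above can be sketched as a small shell helper. This is only an illustration: show_path_modes is a hypothetical name, and the -c option assumes GNU coreutils stat, as found on most Linux systems. It prints the octal mode of the file and then of each directory above it, so any component that is not 755 stands out:

```shell
# Hypothetical helper: print the octal permissions of a file and of every
# directory above it. Assumes GNU coreutils stat (the -c option).
show_path_modes() {
    p=$1
    # Walk upward from the file until the root (or "." for relative paths).
    while [ "$p" != "/" ] && [ "$p" != "." ]; do
        stat -c '%a %n' "$p"
        p=$(dirname "$p")
    done
}

# Example, using the path from the posted configuration:
#   show_path_modes /usr/local/bin/pstotext
```

Any component that shows a mode other than 755 can then be changed with chmod 755 on that path.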

zevince
01-12-2004, 04:51 AM
Ok, it's working. My /usr/local/bin/pstotext directories and binary were chmod 751 and not 755.

Sorry it took so long to solve this, and thanks very much for your help!