PhpDig.net forum, External Binaries
pdf indexing with pstotext (http://www.phpdig.net/forum/showthread.php?t=360)

Charter 01-08-2004 06:38 AM

Hi. In the very last else of the phpdigTempFile function, add the following:
PHP Code:

else {
      // add the two echo lines
      echo "Evaluation is false for URI: " . $uri . "<br>";
      echo "Result test contains: " . print_r($result_test) . "<br>";

      return array('tempfile'=>0,'tempfilesize'=>0);
}



zevince 01-08-2004 07:05 AM

OK, here is the output:

Quote:

HTML <--- Status
Duplicate of an existing document
43:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=91
(time : 00:00:12)
File date unchanged
44:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=99
(time : 00:00:12)
HTML <--- Status
45:http://umvf.cochin.univ-paris5.fr/avare3.html
(time : 00:00:12)
+
level 1...
Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
46:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(time : 00:00:12)

Charter 01-08-2004 07:47 AM

Hi. Okay, it is the condition below that is failing, because print_r($result_test) shows 1:
PHP Code:

if (is_array($result_test)
     && $result_test['status'] == 'HTML'
     || $result_test['status'] == 'PLAINTEXT'
     || $result_test['status'] == 'MSWORD' && PHPDIG_INDEX_MSWORD == true && file_exists(PHPDIG_PARSE_MSWORD) && $is_exec_command_msword
     || $result_test['status'] == 'MSEXCEL' && PHPDIG_INDEX_MSEXCEL == true && file_exists(PHPDIG_PARSE_MSEXCEL) && $is_exec_command_msexcel
     || $result_test['status'] == 'PDF' && PHPDIG_INDEX_PDF == true && file_exists(PHPDIG_PARSE_PDF) && $is_exec_command_pdf
     ) {


There was another strange occurrence here that dealt with cookies. If cookies are not the issue, try crawling again, then copy and paste the lines for that crawl from the raw Apache logs.
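
A note on how the condition above groups: in PHP, && binds tighter than ||, so is_array($result_test) only guards the HTML comparison, and every later alternative is its own && chain that passes or fails as a unit. The following is a minimal sketch of that grouping, not PhpDig code. The function name is made up, and plain booleans stand in for PHPDIG_INDEX_PDF, file_exists(PHPDIG_PARSE_PDF) and $is_exec_command_pdf.

```php
<?php
// Hypothetical stand-in for the HTML and PDF branches of the condition.
// $indexPdf, $parserExists and $isExecutable are illustrative parameters,
// replacing the PhpDig constants and $is_exec_command_pdf.
function pdfBranchPasses($result_test, $indexPdf, $parserExists, $isExecutable) {
    return is_array($result_test)
        && $result_test['status'] == 'HTML'
        || $result_test['status'] == 'PDF' && $indexPdf && $parserExists && $isExecutable;
}

$pdf = array('status' => 'PDF');

var_dump(pdfBranchPasses($pdf, true, true, true));   // bool(true)
var_dump(pdfBranchPasses($pdf, true, true, false));  // bool(false): one false operand fails the whole PDF chain
```

So a status of PDF is not enough on its own; each of the three remaining operands in the PDF chain has to be true as well.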

zevince 01-09-2004 01:48 AM

Hmm, I've tried setting the cookies for http://umvf.cochin.univ-paris5.fr/ to always permitted, but it doesn't seem to change anything in the crawl:

Quote:

File date unchanged
41:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=99
(time : 00:00:10)

HTML <--- Status
42:http://umvf.cochin.univ-paris5.fr/avare3.html
(time : 00:00:10)
+
level 1...
Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
43:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(time : 00:00:10)

HTML <--- Status
Duplicate of an existing document
44:http://umvf.cochin.univ-paris5.fr/spip_login.php3
(time : 00:00:10)

No links in the temporary table

And here is the output of the "access_combined_log" from Apache:

Quote:

umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/style_uol.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/habillage.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/impression.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/lien.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /article.php3?id_article=177 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /article.php3?id_article=177 HTTP/1.0" 302 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /desole.html HTTP/1.0" 200 1036 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /menu.css HTTP/1.1" 404 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/spip_style.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /rubrique.php3?id_rubrique=99 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /menu.css HTTP/1.1" 404 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /css/spip_style.css HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "HEAD /avare3.html HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:37 +0100] "GET /avare3.html HTTP/1.0" 200 61 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /avare2.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /avare2.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD /ecrire/ HTTP/1.1" 302 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "HEAD //../spip_login.php3 HTTP/1.1" 200 0 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
umvf.cochin.univ-paris5.fr - - [09/Jan/2004:11:36:38 +0100] "GET /spip_login.php3 HTTP/1.0" 200 2222 "-" "PhpDig/1.6.4 (+http://www.phpdig.net/robot.php)"
nestor.lurt-cochin.prd.fr - admin [09/Jan/2004:11:36:38 +0100] "POST /recherche/admin/spider.php HTTP/1.1" 200 8691 "http://umvf.cochin.univ-paris5.fr/recherche/admin/index.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Charter 01-09-2004 03:26 AM

Hi. It really seems like the following returns false: $result_test['status'] == 'PDF' && PHPDIG_INDEX_PDF == true && file_exists(PHPDIG_PARSE_PDF) && $is_exec_command_pdf

However, it looks like you echo $result_test_http, which says the status is PDF, but then later $result_test shows as 1: Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
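
The array dump followed by a trailing 1 is consistent with how print_r() behaves. Called with one argument, it prints the dump itself and returns true, so inside an echo the dump appears first and the return value is then echoed as 1. Passing true as the second argument makes it return the dump as a string instead. A small sketch with made-up sample data:

```php
<?php
$result_test = array('status' => 'PDF', 'path' => '/avare2.pdf'); // sample data for illustration

// One argument: print_r() writes the dump immediately and returns true.
ob_start();                        // capture the output so it can be inspected
$returned = print_r($result_test);
$printed  = ob_get_clean();

// Second argument true: nothing is printed, the dump comes back as a string.
$asString = print_r($result_test, true);

echo "Result test contains: " . $asString . "<br>";
```

With the second form, the dump lands after the label rather than in front of it.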

Let's echo the below items right before and right after the phpdigTempFile function is called and try another index:

In spider.php, add the following echo statements:
PHP Code:

// sets $tempfile and $tempfilesize

/*****/
echo "<br><br>Is result test http an array: " . is_array($result_test_http) . "<br>";
echo "What is result test http status: " . $result_test_http['status'] . "<br><br>";
/*****/

extract(phpdigTempFile($url_indexing,$result_test_http,$relative_script_path.'/admin/temp/'));

In robot_functions.php, add the following echo statements:
PHP Code:

function phpdigTempFile($uri,$result_test,$prefix='temp/',$suffix1='1.tmp',$suffix2='2.tmp') {

/*****/
echo "<br><br>Is result test an array: " . is_array($result_test) . "<br>";
echo "What is result test status: " . $result_test['status'] . "<br>";
echo "Use is executable is set to: " . USE_IS_EXECUTABLE_COMMAND . "<br>";
echo "Index the pdf is set to: " . PHPDIG_INDEX_PDF . "<br>";
echo "Parse the pdf is set to: " . PHPDIG_PARSE_PDF . "<br>";
echo "Does parse pdf exist: " . file_exists(PHPDIG_PARSE_PDF) . "<br>";
echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br><br>";
/*****/

// $temp_filename = md5(time()+getmypid()).$suffix;


zevince 01-12-2004 03:40 AM

OK, I've added the echo statements and respidered avare3.html:

Quote:

HTML <--- Status
Duplicate of an existing document
40:http://umvf.cochin.univ-paris5.fr/ar...id_article=177
(time : 00:00:50)

File date unchanged
41:http://umvf.cochin.univ-paris5.fr/ru...id_rubrique=99
(time : 00:00:50)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
42:http://umvf.cochin.univ-paris5.fr/avare3.html
(time : 00:00:50)
+
level 1...


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

Evaluation is false for URI: http://umvf.cochin.univ-paris5.fr/avare2.pdf
Array ( [status] => PDF [lm_date] => Wed, 07 Jan 2004 13:39:43 GMT [path] => /avare2.pdf [host] => umvf.cochin.univ-paris5.fr [cookies] => Array ( ) ) Result test contains: 1
43:http://umvf.cochin.univ-paris5.fr/avare2.pdf
(time : 00:00:50)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
Duplicate of an existing document
44:http://umvf.cochin.univ-paris5.fr/ar...id_article=131
(time : 00:00:51)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable:

HTML <--- Status
Duplicate of an existing document
45:http://umvf.cochin.univ-paris5.fr/ar...id_article=134
(time : 00:00:52)

Charter 01-12-2004 04:19 AM

PHP Code:

define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');

echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br><br>";

// Output: "Is parse pdf executable:" followed by nothing (empty, meaning false)

Hi. The "Is parse pdf executable" line shows no value, which means is_executable(PHPDIG_PARSE_PDF) returned false. This is why the expression in the if statement evaluates to false.
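
The empty value is ordinary PHP behaviour: false converts to an empty string when echoed, while true converts to "1". A sketch of a tiny helper that makes such debug lines unambiguous follows. The name boolLabel is made up for this example and is not part of PhpDig.

```php
<?php
// Hypothetical helper: render a boolean so that false is visible in debug output.
function boolLabel($value) {
    return var_export((bool) $value, true); // "true" or "false"
}

echo "Plain echo of false: [" . false . "]\n";            // nothing between the brackets
echo "With helper:         [" . boolLabel(false) . "]\n"; // "false" between the brackets
```

Used in the debug line, this would read: echo "Is parse pdf executable: " . boolLabel(is_executable(PHPDIG_PARSE_PDF)) . "<br>";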

From php.net: "If a directory is not executable, then you cannot get details on the files in the directory - this includes the permissions."

Try checking that the usr, local, and bin directories, as well as the pstotext file, all have 755 permissions.
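
One way to check every component of the path in a single pass is sketched below. It walks from the root down to the file and reports each component's permission bits and whether PHP, running as the current user, considers it executable. The function name and output format are illustrative and not part of PhpDig. The sketch assumes an absolute Unix-style path and ignores special bits such as setuid.

```php
<?php
// Hypothetical diagnostic: list permission bits for each component of a path.
function describePathPermissions($path) {
    $report  = array();
    $current = '';
    foreach (explode('/', trim($path, '/')) as $part) {
        $current .= '/' . $part;
        if (!file_exists($current)) {
            $report[$current] = 'missing (or not visible to this user)';
            break;
        }
        // Keep only the last three octal digits, e.g. "755".
        $mode = substr(sprintf('%o', fileperms($current)), -3);
        $report[$current] = $mode . (is_executable($current) ? ' executable' : ' NOT executable');
    }
    return $report;
}

foreach (describePathPermissions('/usr/local/bin/pstotext') as $component => $status) {
    echo $component . ': ' . $status . "<br>";
}
```

Run it through the web server rather than from a shell, since the result depends on which user PHP runs as.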

zevince 01-12-2004 04:51 AM

OK, it's working. The directories in /usr/local/bin/pstotext and the binary itself were set to chmod 751 and not 755.

Sorry for the time it took to solve this, and thanks very much for your help!
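
For reference, the two modes mentioned above differ only in the last digit, which covers users who are neither the owner nor in the group: 5 is read plus execute, 1 is execute only. A short sketch that decodes an octal mode into the familiar rwx notation (the helper name is made up):

```php
<?php
// Hypothetical helper: turn an octal mode such as 0755 into "rwxr-xr-x".
function modeToString($mode) {
    $out = '';
    foreach (array(6, 3, 0) as $shift) {   // owner, group, others
        $bits = ($mode >> $shift) & 7;
        $out .= ($bits & 4) ? 'r' : '-';
        $out .= ($bits & 2) ? 'w' : '-';
        $out .= ($bits & 1) ? 'x' : '-';
    }
    return $out;
}

echo modeToString(0755) . "\n"; // rwxr-xr-x
echo modeToString(0751) . "\n"; // rwxr-x--x
```

Which of the three digits applies to PHP depends on the account the web server runs under, which the thread does not state.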


All times are GMT -8.