11-12-2003, 01:22 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Edit: Note that the below applies to version 1.6.3 only.

Thanks, but my response speed is not always this fast.

To answer your questions, let's assume the following:
PHP Code:
// in config.php
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION','');

// in robot_functions.php
$prefix='temp/';
$suffix='.tmp';
$temp_filename = 'abcdef'.$suffix; // abcdef.tmp
$tempfile = $prefix.$temp_filename; // temp/abcdef.tmp
First, let's consider catdoc. When doc.doc is crawled, a temporary file, abcdef.tmp, is created. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and the document is converted to text (exec($command,$result,$retval);), which puts the converter's output in $result (an array, one element per output line) and its exit status in $retval (0 on success, non-zero on failure). On success, !$retval is true, so $result gets some work done on it and is written back to abcdef.tmp, and abcdef.tmp is returned from the last switch statement in the function. The unlink($tempfile.'2'); then deletes the abcdef.tmp2 file.

This all works fine because catdoc is sending output to STDOUT, and it is this STDOUT output that is contained in the $result variable.
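To make the STDOUT case concrete, here is a minimal sketch of the steps above. It is not the actual code from robot_functions.php: the function name convert_via_stdout() and its parameters are hypothetical, and the "work done on" $result is reduced to a plain implode().

```php
<?php
// Hypothetical helper (not PhpDig's real function): rename, convert, write back.
function convert_via_stdout(string $tempfile, string $parser, string $option = ''): ?string
{
    $source = $tempfile . '2';

    // Move the crawled document aside, e.g. abcdef.tmp -> abcdef.tmp2
    if (!rename($tempfile, $source)) {
        return null;
    }

    // Build the command line; the file name is escaped for the shell
    $command = trim($parser . ' ' . $option) . ' ' . escapeshellarg($source);

    $result = [];
    $retval = 1;
    // exec() stores each line the converter prints to STDOUT in $result
    exec($command, $result, $retval);

    $converted = null;
    if (!$retval) {
        // Exit status 0: write the captured text back to abcdef.tmp
        file_put_contents($tempfile, implode("\n", $result));
        $converted = $tempfile;
    }

    // Delete abcdef.tmp2 whether or not the conversion succeeded
    unlink($source);

    return $converted;
}
```

On a Unix-like system you can try it without catdoc by passing 'cat' as the parser, since cat also prints the file to STDOUT.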

Now let's consider pdftotext. When doc.pdf is crawled, a temporary file, abcdef.tmp, is created. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and the document is converted to text (exec($command,$result,$retval);), which puts the converter's output in $result (an array) and its exit status in $retval (0 on success, non-zero on failure). As before, on success, !$retval is true, so $result gets some work done on it and is written back to abcdef.tmp. This time, however, abcdef.tmp2.txt is returned from the last switch statement in the function. Again, the unlink($tempfile.'2'); deletes the abcdef.tmp2 file.

This does not work, because pdftotext doesn't send its output to STDOUT. Instead, it writes to a file called abcdef.tmp2.txt, leaving $result empty (note that 2.txt is the value of PHPDIG_PDF_EXTENSION). Hence, when $result is written back to abcdef.tmp, the abcdef.tmp file is empty. The reason for adding count($result) to the if statement is to prevent that empty file from being written.
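As a rough sketch of how the count($result) check and the extension fit together, something like the following could be used. It is an illustration only, not the real PhpDig code: finish_conversion() and its parameters are hypothetical names, and the is_file() check at the end is my own addition.

```php
<?php
// Hypothetical helper: decide which file holds the converted text.
// $extension plays the role of PHPDIG_PDF_EXTENSION and friends.
function finish_conversion(string $tempfile, array $result, int $retval, string $extension): ?string
{
    if ($retval !== 0) {
        return null; // the converter reported a failure
    }

    // Only write back when STDOUT actually produced something,
    // so an empty abcdef.tmp is never created
    if (count($result) > 0) {
        file_put_contents($tempfile, implode("\n", $result));
    }

    // '' -> abcdef.tmp (STDOUT case), '2.txt' -> abcdef.tmp2.txt (file case)
    $textfile = $tempfile . $extension;

    // Assumption: treat a missing text file as a failed conversion
    return is_file($textfile) ? $textfile : null;
}
```

With an empty $result and an extension of '2.txt' this returns temp/abcdef.tmp2.txt if pdftotext created it. With a non-empty $result and an empty extension it returns temp/abcdef.tmp.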

On other operating systems it should work the same way, so if the converter writes its output to STDOUT, the following can be left empty:
PHP Code:
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION',''); 
However, if the converter writes its output to a file, the matching extension should be set to whatever follows abcdef.tmp in the output file name (for abcdef.tmp2.txt, that is 2.txt):
PHP Code:
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION',''); 
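If you are not sure what value to use, you can work it out from the two file names. A small sketch follows; extension_for() is a hypothetical helper for illustration and is not part of PhpDig.

```php
<?php
// Hypothetical helper: return the part of the output file name
// that follows the temp file name, or null if the names don't match up.
function extension_for(string $tempName, string $outputName): ?string
{
    // The output name must start with the temp name
    if (strncmp($outputName, $tempName, strlen($tempName)) !== 0) {
        return null;
    }
    return substr($outputName, strlen($tempName));
}

$extension = extension_for('abcdef.tmp', 'abcdef.tmp2.txt'); // '2.txt'
```

The result is what would go into the matching PHPDIG_*_EXTENSION define.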
Personally, I'd like to change this extension stuff and try something different, but I suspect that there may have been a memory or OS issue with reading and writing to one file.