1.6.2 fix to crawl binary files


Charter
09-13-2003, 02:35 PM
This is a temporary fix for version 1.6.2 that lets it crawl binary files. The fix is not included in the 1.6.2 download, but it will be improved and included in the next release.

First make a backup of the robot_functions.php file. Then in robot_functions.php, find the function phpdigTempFile. In the function phpdigTempFile, find the following:

return array('tempfile'=>$tempfile,'tempfilesize'=>$tempfilesize);

and replace with the following:

switch ($result_test['status']) {
    case 'MSWORD':
        $my_new_tempfile = $tempfile;
        break;

    //case 'MSEXCEL':
    //    $my_new_tempfile = "<fill in>";
    //    break;

    case 'PDF':
        $my_new_tempfile = $tempfile."2.txt";
        break;

    default:
        $my_new_tempfile = $tempfile;
}

return array('tempfile'=>$my_new_tempfile,'tempfilesize'=>$tempfilesize);

It seems that $my_new_tempfile can differ depending on the defaults of the external binary, which is something to work on for the next release. In the meantime, after crawling a binary file, go to the temp directory, look at the extension of the file that was written there, and modify the switch statement above as necessary.
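If you would rather not hard-code the suffix, something like the following might work. It is only a sketch: find_converted_file is a made-up helper, not part of PhpDig, and the suffix list ("2.txt" from the switch above, plus ".txt") is an assumption you would adjust to whatever your external binary actually writes.

```php
<?php
// Hypothetical helper, NOT part of PhpDig: look in the temp directory for a
// non-empty file named $tempfile plus one of the candidate suffixes.
// Falls back to $tempfile itself when no such file is found.
function find_converted_file($tempfile, $suffixes = array('2.txt', '.txt'))
{
    clearstatcache(); // avoid stale is_file()/filesize() results
    foreach ($suffixes as $suffix) {
        $candidate = $tempfile . $suffix;
        if (is_file($candidate) && filesize($candidate) > 0) {
            return $candidate;
        }
    }
    return $tempfile;
}
```

The result could then stand in for $my_new_tempfile in the return statement, but I have not tried that against 1.6.2 itself.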

Charter
09-16-2003, 07:52 PM
Here's an example of what's going on with external binaries.

catdoc spits output to stdout so $result contains output
pdftotext spits output to filename.txt so $result is empty

This means that if the external binary that you are using outputs to stdout, then there is no need to add the switch statement given in the previous post, as $result contains the necessary info for indexing the document.

However, if the external binary does not output to stdout but rather outputs to a file, and the document is not indexed, then check the file extension in the temp directory, modifying the switch statement as necessary.
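To see the two behaviours side by side without installing catdoc or pdftotext, here is a rough sketch. It assumes the command is run through PHP's exec() and that a shell with echo and > redirection is available; run_converter is a made-up name, and the echo commands only stand in for the real binaries.

```php
<?php
// Hypothetical helper, NOT part of PhpDig: run a converter command and return
// its text, whether the tool printed to stdout or wrote to $outfile.
function run_converter($command, $outfile)
{
    $result = array();
    exec($command, $result);
    if (count($result) > 0) {
        // stdout-style tool: exec() collected the output lines in $result
        return implode("\n", $result);
    }
    if (is_file($outfile)) {
        // file-style tool: $result is empty, so read the file it wrote
        return file_get_contents($outfile);
    }
    return '';
}

$out = tempnam(sys_get_temp_dir(), 'dig');
// Stand-in for a tool that writes to stdout
$from_stdout = run_converter('echo indexed text', $out);
// Stand-in for a tool that writes to a file and prints nothing
$from_file = run_converter('echo indexed text > ' . escapeshellarg($out), $out);
unlink($out);
```

In the second call $result comes back empty, which is the case where the file name in the temp directory matters.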

EDIT: external binary process modified in version 1.6.4.