
Version 1.6.3 and some bugs/ideas


manfred
11-11-2003, 08:51 AM
A new member has joined!

I just installed the new version, and some bugs/ideas came to mind.

1. If a filename contains an & character, it cannot be spidered (possibly some other special characters too).
- The solution is to change this in robot_functions

$path = $url['path'];

to

$path = ereg_replace('& amp;*','&',$url['path']); // edit: remove space between & and amp
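For anyone reading this on a current PHP: ereg_replace was removed in PHP 7.0, so here is a minimal sketch of the same fix using a literal replacement instead. The helper name normalize_path_ampersands and the example URL are made up for illustration and are not part of PhpDig.

```php
<?php
// Hypothetical helper (not part of PhpDig): undo HTML escaping of "&"
// in a URL path before the spider requests it.
function normalize_path_ampersands(string $path): string
{
    // str_replace performs a literal substitution, so no regex is needed.
    return str_replace('&amp;', '&', $path);
}

// Example with an assumed URL; parse_url() returns the path component.
$url = parse_url('http://example.com/docs/R&amp;D.html');
echo normalize_path_ampersands($url['path']), "\n";
```

A literal replace only touches the exact entity "&amp;", which seems to be all the original pattern was after.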

2. On Windows, the is_executable function is not available before PHP 5.0.0. Comment out those calls and external binaries will start to work.

3. Antiword cannot handle long filenames, only DOS 8.3. Change $temp_filename to something like this:

srand ((double)microtime()*1000000);
$temp_filename = rand(0,999999).$suffix;

and remember to change $suffix as well.
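A rough sketch of how the short name could also be checked for uniqueness, assuming a three-character suffix such as '.tmp' so the result stays within 8.3. The function name make_short_temp_name is hypothetical and is not what PhpDig uses.

```php
<?php
// Hypothetical helper: build a temp filename that fits the DOS 8.3 limit
// (six digits before the dot, a three-character extension after it).
function make_short_temp_name(string $dir, string $suffix = '.tmp'): string
{
    do {
        // mt_rand() is seeded automatically since PHP 4.2.0, so no srand() call.
        $name = sprintf('%06d', mt_rand(0, 999999)) . $suffix;
        // Try again in the unlikely case that the name is already taken.
    } while (file_exists($dir . DIRECTORY_SEPARATOR . $name));

    return $name;
}

echo make_short_temp_name(sys_get_temp_dir()), "\n";
```

Zero-padding keeps every name the same length, which makes leftover files easier to spot in the temp directory.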

Question: the temp directory is not completely cleared after these mods. How can this be fixed?

4. How to change %20 in file or folder names to space?
(I have a mod for this but it is so quick and dirty)

Otherwise I think this is a great piece of software!

-Manfred

Charter
11-11-2003, 10:49 PM
Hi. Thanks for the comments.

For points one through three, I've made the appropriate changes for version 1.6.4 of PhpDig. For point two, I've added an option in the config file whether to use the is_executable command. For point three, I've added an option in the config file to set the length of the temp filenames and set a check for uniqueness just in case.
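To illustrate the idea behind such an option for point two, here is a minimal sketch. The constant EXAMPLE_USE_IS_EXECUTABLE and the function binary_is_usable are invented for this example; the actual option name in the PhpDig config file is not given in this thread.

```php
<?php
// Hypothetical setting: set to false where is_executable() is unavailable
// (for example, Windows before PHP 5.0.0).
define('EXAMPLE_USE_IS_EXECUTABLE', false);

// Hypothetical helper: decide whether an external binary can be used.
function binary_is_usable(string $path, bool $check_executable): bool
{
    if (!is_file($path)) {
        return false; // nothing at that path
    }
    // Only ask about the executable bit when the option is switched on.
    return $check_executable ? is_executable($path) : true;
}

var_dump(binary_is_usable('/usr/local/bin/catdoc', EXAMPLE_USE_IS_EXECUTABLE));
```

With the check switched off, the existence of the file is the only test, which matches the effect of commenting the is_executable calls out.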

To your questions, are the files remaining in the temp directory empty? If so, in the robot_functions.php file find:

if (!$retval) {

and within this if statement, change:

if (is_array($result)) {

to the following:

if ((is_array($result)) && (count($result) > 0)) {

EDIT: The above doesn't quite take care of the empty files. PhpDig version 1.6.4 has addressed this problem.

If the files remaining in the temp directory are not empty, what are the file extensions and what external binaries are you using?

For point four, is it that space is changed to %20 in the search results, or where do you see this? If you could, please post your mod to give me a better idea of what you mean.

manfred
11-12-2003, 06:58 AM
Great support, thanks!

Sorry about that typo/copy&paste error.

Yes those temp files are empty. I'll implement your suggested patch and see what happens.

This space conversion thing is exactly what you said. In the search results it would be nice to have all those names without %20s. This is just a minor thing but hey, why not be perfect if it is easy to correct!

Here is something I have used. It is not the right way to do it, but it works.
In robot_functions.php, insert these lines at the very start of the phpdigUpdSpiderRow function body:
$path=ereg_replace('%20*',' ',$path);
$file=ereg_replace('%20*',' ',$file);
and also replace this:
$titre_resume = $file;
with this:
$titre_resume = ereg_replace('%20*',' ',$file);

As you may guess, this has some side effects. During the first spidering round an error shows up in the Apache log, but in the second round it finds all folders and files with spaces. On the browser side this is not a problem, because the browser converts all spaces back to %20s.

Manfred

Charter
11-12-2003, 09:42 AM
Hi. I'm thinking that if the %20s are not wanted in the displayed search results, the search_function.php file could be modified so that the displayed text has no %20s but the links themselves keep them. I think several browsers convert spaces back to %20, but aren't there browsers out there that don't do this? Maybe instead try the following:

In search_function.php, find:

$l_path = ", ".phpdigMsg('this_path')." : <a class='phpdig' href='".SEARCH_PAGE."?refine=1&amp;query_string=".urlencode($query_string)."&amp;site=".$content['site_id']."&amp;path=".$content['path']."&amp;limite=$limite&amp;option=$option' >".$content['path']."</a>";

and replace with:

$l_path = ", ".phpdigMsg('this_path')." : <a class='phpdig' href='".SEARCH_PAGE."?refine=1&amp;query_string=".urlencode($query_string)."&amp;site=".$content['site_id']."&amp;path=".$content['path']."&amp;limite=$limite&amp;option=$option' >".ereg_replace('%20*',' ',$content['path'])."</a>";

Also in search_function.php, find:

'page_link' => "<a class=\"phpdig\" href=\"".$url."\" target=\"".LINK_TARGET."\" >$title</a>",

and replace with:

'page_link' => "<a class=\"phpdig\" href=\"".$url."\" target=\"".LINK_TARGET."\" >".ereg_replace('%20*',' ',$title)."</a>",

Remember to remove any "word" wrapping in the above code. ;)
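The same split between link target and link text can be sketched without ereg_replace (removed in PHP 7.0). This is only an illustration: display_link is a hypothetical helper, the URL is made up, and rawurldecode decodes every percent escape in the label, not just %20.

```php
<?php
// Hypothetical helper: keep the encoded URL in href, show a decoded label.
function display_link(string $url, string $label): string
{
    // Decode percent escapes for display only, then escape for HTML output.
    $text = htmlspecialchars(rawurldecode($label), ENT_QUOTES);
    // The href keeps its %20s so every browser requests the right resource.
    $href = htmlspecialchars($url, ENT_QUOTES);

    return '<a class="phpdig" href="' . $href . '">' . $text . '</a>';
}

echo display_link('http://example.com/my%20docs/a%20file.pdf', 'a%20file.pdf'), "\n";
// prints <a class="phpdig" href="http://example.com/my%20docs/a%20file.pdf">a file.pdf</a>
```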

manfred
11-12-2003, 12:15 PM
Awesome response speed!!! Why doesn't commercial product support work like this?

Good news first!
You are absolutely right about the compatibility issue which cannot be compromised. Btw your patch works like a dream.

Then some clarification about external binary usage in Windows.
What I posted earlier was a cure for antiword, but pdftotext did not like it. It seems that numbers in the filename extension are not recognized at all, so I removed '.2' from all lines in the $command definitions.

Then there is a part of the code that I don't understand at all. What is the purpose of
rename($tempfile,$tempfile.'2'); and unlink($tempfile.'2'); These don't seem to have any effect at all?!

After commenting out those lines, PDF documents can be spidered as well, and $suffix can stay at its default.

Maybe there are already threads about these issues, but I couldn't find any. Hopefully this will help others solve problems on Windows.

Charter
11-12-2003, 01:22 PM
Edit: Note the below is for version 1.6.3 only.

Thanks, but my response speed is not always this fast.

To answer your questions, let's assume the following:

// in config.php
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION','');

// in robot_functions.php
$prefix='temp/';
$suffix='.tmp';
$temp_filename = 'abcdef'.$suffix; // abcdef.tmp
$tempfile = $prefix.$temp_filename; // temp/abcdef.tmp

First let's consider catdoc. When doc.doc is crawled, abcdef.tmp is formed. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and then doc.doc is converted to text (exec($command,$result,$retval);), sticking the results in $result (an array) and returning $retval (success or failure). On success, !$retval is true so $result gets some work done on it and is written back to abcdef.tmp and abcdef.tmp is returned from the last switch statement in the function. The unlink($tempfile.'2'); deletes the abcdef.tmp2 file.

This all works fine because catdoc is sending output to STDOUT, and it is this STDOUT output that is contained in the $result variable.

Now let's consider pdftotext. When doc.pdf is crawled abcdef.tmp is formed. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and then doc.pdf is converted to text (exec($command,$result,$retval);), sticking the results in $result (an array) and returning $retval (success or failure). As before, on success, !$retval is true so $result gets some work done on it and is written back to abcdef.tmp and abcdef.tmp2.txt is returned from the last switch statement in the function. Again, the unlink($tempfile.'2'); deletes the abcdef.tmp2 file.

This doesn't work fine because pdftotext doesn't send output to STDOUT, but rather sends output to a file called abcdef.tmp2.txt leaving $result empty (note that 2.txt is the value of PHPDIG_PDF_EXTENSION). Hence, when $result is written back to abcdef.tmp, the abcdef.tmp file is empty. The reason for adding count($result) into the if statement is to prevent the writing of the empty file.
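The two cases (output on STDOUT versus output in a side file) can be summarised in a small sketch. This is not the 1.6.3 code: read_converter_output is a hypothetical helper, and it assumes $result is the array filled by exec() and that the side file is named $tempfile plus the configured extension.

```php
<?php
// Hypothetical helper: return the converted text wherever the binary put it.
function read_converter_output(array $result, string $tempfile, string $extension): string
{
    // Case 1 (catdoc style): the text arrived on STDOUT, so exec() filled $result.
    if (count($result) > 0) {
        return implode("\n", $result);
    }

    // Case 2 (pdftotext style): the text was written to a side file,
    // e.g. temp/abcdef.tmp plus an extension such as "2.txt".
    $outfile = $tempfile . $extension;
    if ($extension !== '' && is_file($outfile)) {
        return (string) file_get_contents($outfile);
    }

    // Neither source produced text; the caller should skip writing a file.
    return '';
}
```

Returning an empty string in the last case gives the caller one place to decide not to write an empty temp file, which is what the count($result) check was aiming at.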

On other operating systems it should work the same way, so if output is written to STDOUT, the following can be left empty:

define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

However, if output is written to a file, then the extensions defined in the following should be whatever is after abcdef.tmp in abcdef.tmp2.txt (i.e., 2.txt):

define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION','');

Personally, I'd like to change this extension stuff and try something different, but I suspect that there may have been a memory or OS issue with reading and writing to one file.

manfred
11-17-2003, 11:06 AM
Version 1.6.4 solved all problems mentioned above.

Great work Charter!