PhpDig.net

Old 11-11-2003, 08:51 AM   #1
manfred
Orange Mole
 
Join Date: Nov 2003
Posts: 42
Version 1.6.3 and some bugs/ideas

A new member has joined!

Just installed the new version and some bugs/ideas came to my mind.

1. If a filename contains the & character, it cannot be spidered (other special characters may have the same problem).
- The solution is to change this line in robot_functions.php:

$path = $url['path'];

to

$path = ereg_replace('&amp;*','&',$url['path']);
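
For anyone on a newer PHP where ereg_replace no longer exists (it was removed in PHP 7), the same replacement can be sketched with preg_replace. This is only a sketch: the function name decode_path_ampersands is made up for illustration and is not part of PhpDig.

```php
<?php
// Hypothetical helper, not part of PhpDig: undo HTML-escaped ampersands
// in a URL path before the spider requests it.
function decode_path_ampersands(string $path): string
{
    // Same pattern as the ereg version: "&amp" followed by any number of ";".
    return preg_replace('/&amp;*/', '&', $path);
}

$url = parse_url('http://example.com/docs/a&amp;b.html');
echo decode_path_ampersands($url['path']), "\n"; // prints /docs/a&b.html
```

Keeping the pattern identical to the ereg one means the behaviour should match the fix above; only the regex function changes.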

2. On Windows, the is_executable function is not available before PHP 5.0.0. Comment out the is_executable checks and the external binaries will start to work.
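
Instead of commenting the checks out, the test could be made optional. A minimal sketch, assuming a made-up flag USE_IS_EXECUTABLE and a made-up helper binary_is_usable (neither is a real PhpDig name):

```php
<?php
// Hypothetical config flag, not a real PhpDig constant.
define('USE_IS_EXECUTABLE', false);

// Hypothetical helper: returns true when the external binary looks usable.
function binary_is_usable(string $binary, bool $checkExecutable): bool
{
    if (!is_file($binary)) {
        return false; // nothing at that path
    }
    // Skip the executable check when it is switched off or unavailable.
    if (!$checkExecutable || !function_exists('is_executable')) {
        return true;
    }
    return is_executable($binary);
}

var_dump(binary_is_usable('/usr/local/bin/catdoc', USE_IS_EXECUTABLE));
```

With the flag off, only the existence of the file is tested, which is roughly what commenting out the is_executable calls achieves.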

3. Antiword cannot handle long filenames, only DOS 8.3. Change $temp_filename to something like this:

srand ((double)microtime()*1000000);
$temp_filename = rand(0,999999).$suffix;

and remember to also change $suffix.

Question: the temp directory is not completely cleared after these mods. How can I fix this?
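
A slightly more careful variant of the idea above, as a sketch only: the helper name make_short_temp_name is invented here, and it assumes the temp directory is only written by one spider process at a time (there is no locking between the check and the later file creation).

```php
<?php
// Hypothetical helper: build a temp filename that fits the DOS 8.3 limit
// (at most 8 characters before the dot, at most 3 after) and is not yet
// present in $dir.
function make_short_temp_name(string $dir, string $suffix = '.tmp'): string
{
    do {
        // Six zero-padded digits plus the suffix, e.g. "004217.tmp".
        $name = sprintf('%06d', mt_rand(0, 999999)) . $suffix;
    } while (file_exists($dir . '/' . $name));
    return $name;
}

echo make_short_temp_name(sys_get_temp_dir()), "\n";
```

mt_rand seeds itself, so the srand call is not needed here. The zero padding keeps every name the same length.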

4. How can %20 in file or folder names be changed to a space?
(I have a mod for this, but it is quick and dirty.)

Otherwise I think this is a great piece of software!

-Manfred
Old 11-11-2003, 10:49 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Thanks for the comments.

For points one through three, I've made the appropriate changes for version 1.6.4 of PhpDig. For point two, I've added an option in the config file whether to use the is_executable command. For point three, I've added an option in the config file to set the length of the temp filenames and set a check for uniqueness just in case.

To your questions, are the files remaining in the temp directory empty? If so, in the robot_functions.php file find:
PHP Code:
if (!$retval) { 
and within this if statement, change:
PHP Code:
if (is_array($result)) { 
to the following:
PHP Code:
if ((is_array($result)) && (count($result) > 0)) { 
EDIT: The above doesn't quite take care of the empty files. PhpDig version 1.6.4 has addressed this problem.

If the files remaining in the temp directory are not empty, what are the file extensions and what external binaries are you using?

For point four, is it that space is changed to %20 in the search results, or where do you see this? If you could, please post your mod to give me a better idea of what you mean.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 11-12-2003, 06:58 AM   #3
manfred
Orange Mole
 
Join Date: Nov 2003
Posts: 42
Great support, thanks!

Sorry about that typo/copy&paste error.

Yes, those temp files are empty. I'll implement your suggested patch and see what happens.

The space conversion is exactly what you described. In the search results it would be nice to show all those names without the %20s. This is just a minor thing but hey, why not be perfect if it is easy to correct!

Here is something I have used. It is not the right way to do it, but it works.
In robot_functions.php, insert these lines right at the start of the phpdigUpdSpiderRow function:
$path=ereg_replace('%20*',' ',$path);
$file=ereg_replace('%20*',' ',$file);
and also replace this
$titre_resume = $file;
with
$titre_resume = ereg_replace('%20*',' ',$file);

As you may guess, this has some side effects. On the first spidering round, errors show up in the Apache log, but on the second round it finds all folders and files with spaces. On the browser side this is not a problem, because the browser converts all spaces back to %20s.

Manfred
Old 11-12-2003, 09:42 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. I'm thinking that if the %20s are not wanted in the displayed search results, the search_function.php file could be modified so that the displayed text has no %20s while the links themselves keep them. I think several browsers convert spaces back to %20, but aren't there browsers out there that don't do this? Maybe instead try the following:

In search_function.php, find:
PHP Code:
$l_path = ", ".phpdigMsg('this_path')." : <a class='phpdig' href='".SEARCH_PAGE."?refine=1&amp;query_string=".urlencode($query_string)."&amp;site=".$content['site_id']."&amp;path=".$content['path']."&amp;limite=$limite&amp;option=$option' >".$content['path']."</a>"
and replace with:
PHP Code:
$l_path = ", ".phpdigMsg('this_path')." : <a class='phpdig' href='".SEARCH_PAGE."?refine=1&amp;query_string=".urlencode($query_string)."&amp;site=".$content['site_id']."&amp;path=".$content['path']."&amp;limite=$limite&amp;option=$option' >".ereg_replace('%20*',' ',$content['path'])."</a>"
Also in search_function.php, find:
PHP Code:
'page_link' => "<a class=\"phpdig\" href=\"".$url."\" target=\"".LINK_TARGET."\" >$title</a>"
and replace with:
PHP Code:
'page_link' => "<a class=\"phpdig\" href=\"".$url."\" target=\"".LINK_TARGET."\" >".ereg_replace('%20*',' ',$title)."</a>"
Remember to remove any "word" wrapping in the above code.
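
The idea behind both changes (decoded text for display, encoded URL in the link) can be sketched without ereg_replace. This sketch swaps in rawurldecode, which decodes every percent sequence and not only %20, so it is not an exact equivalent of the patch. The function name result_link is made up for illustration.

```php
<?php
// Hypothetical helper: build a result link whose href keeps the
// percent-encoded URL while the visible text shows decoded characters.
function result_link(string $url, string $title): string
{
    $href = htmlspecialchars($url, ENT_QUOTES);
    // Decode for display only, then escape for safe HTML output.
    $text = htmlspecialchars(rawurldecode($title), ENT_QUOTES);
    return '<a class="phpdig" href="' . $href . '">' . $text . '</a>';
}

echo result_link('http://example.com/my%20file.doc', 'my%20file.doc'), "\n";
// prints <a class="phpdig" href="http://example.com/my%20file.doc">my file.doc</a>
```

Because the href is left untouched, browsers that do not re-encode spaces still get a valid link.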
Old 11-12-2003, 12:15 PM   #5
manfred
Orange Mole
 
Join Date: Nov 2003
Posts: 42
Awesome response speed!!! Why doesn't commercial product support work like this?

Good news first!
You are absolutely right about the compatibility issue which cannot be compromised. Btw your patch works like a dream.

Then some clarification about external binary usage in Windows.
What I posted earlier was a cure for antiword, but pdftotext did not like it. It seems that digits in the filename extension are not recognized at all. So I removed '.2' from all lines in the $command definitions.

Then there is a part of code that I don't understand at all. What is the meaning of
PHP Code:
rename($tempfile,$tempfile.'2'); and unlink($tempfile.'2'); 
These seem to have no effect at all?!

After commenting out those lines, PDF documents can be spidered too. And $suffix can be left at its default.

Maybe there are already threads about these issues, but I couldn't find any. Hopefully this will help others solve problems on Windows.
Old 11-12-2003, 01:22 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Edit: Note the below is for version 1.6.3 only.

Thanks, but my response speed is not always this fast.

To answer your questions, let's assume the following:
PHP Code:
// in config.php
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',false);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION','');

// in robot_functions.php
$prefix='temp/';
$suffix='.tmp';
$temp_filename = 'abcdef'.$suffix; // abcdef.tmp
$tempfile = $prefix.$temp_filename; // temp/abcdef.tmp
First let's consider catdoc. When doc.doc is crawled, abcdef.tmp is formed. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and then doc.doc is converted to text (exec($command,$result,$retval);), sticking the results in $result (an array) and returning $retval (success or failure). On success, !$retval is true so $result gets some work done on it and is written back to abcdef.tmp and abcdef.tmp is returned from the last switch statement in the function. The unlink($tempfile.'2'); deletes the abcdef.tmp2 file.

This all works fine because catdoc is sending output to STDOUT, and it is this STDOUT output that is contained in the $result variable.

Now let's consider pdftotext. When doc.pdf is crawled abcdef.tmp is formed. Then abcdef.tmp is renamed to abcdef.tmp2 (rename($tempfile,$tempfile.'2');) and then doc.pdf is converted to text (exec($command,$result,$retval);), sticking the results in $result (an array) and returning $retval (success or failure). As before, on success, !$retval is true so $result gets some work done on it and is written back to abcdef.tmp and abcdef.tmp2.txt is returned from the last switch statement in the function. Again, the unlink($tempfile.'2'); deletes the abcdef.tmp2 file.

This doesn't work fine because pdftotext doesn't send output to STDOUT, but rather sends output to a file called abcdef.tmp2.txt leaving $result empty (note that 2.txt is the value of PHPDIG_PDF_EXTENSION). Hence, when $result is written back to abcdef.tmp, the abcdef.tmp file is empty. The reason for adding count($result) into the if statement is to prevent the writing of the empty file.

On other operating systems it should work the same way, so if output is written to STDOUT, the following can be left empty:
PHP Code:
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION',''); 
However, if output is written to a file, then the extensions defined in the following should be whatever is after abcdef.tmp in abcdef.tmp2.txt (i.e., 2.txt):
PHP Code:
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','2.txt'); // 2.txt
define('PHPDIG_MSEXCEL_EXTENSION',''); 
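
The two cases described above (converter output on STDOUT versus output in a file) can be sketched as one function. This is not the actual robot_functions.php code: the name read_converted_text and its parameters are assumptions made for illustration, with $result standing for the array filled by exec and $extension for the value of the matching PHPDIG_*_EXTENSION constant.

```php
<?php
// Hypothetical helper: collect the converted text either from the STDOUT
// lines captured by exec() or from the file the converter wrote itself.
function read_converted_text(array $result, string $tempfile, string $extension): string
{
    if ($extension === '') {
        // STDOUT case (e.g. catdoc): exec() filled $result line by line.
        return implode("\n", $result);
    }
    // File case (e.g. pdftotext): the text is in $tempfile . $extension,
    // such as temp/abcdef.tmp2.txt when the extension is '2.txt'.
    $outfile = $tempfile . $extension;
    return is_file($outfile) ? file_get_contents($outfile) : '';
}
```

An empty extension means "trust $result"; a non-empty one means $result is expected to be empty and the file is read instead.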
Personally, I'd like to change this extension stuff and try something different, but I suspect that there may have been a memory or OS issue with reading and writing to one file.
Old 11-17-2003, 11:06 AM   #7
manfred
Orange Mole
 
Join Date: Nov 2003
Posts: 42
Version 1.6.4 solved all problems mentioned above.

Great work Charter!