PhpDig.net forum, Troubleshooting: Words after SMALL_WORDS_SIZE not indexed (http://www.phpdig.net/forum/showthread.php?t=165)

Rolandks 10-23-2003 01:54 AM

Words after SMALL_WORDS_SIZE not indexed
 
On pages where a short word (one excluded by SMALL_WORDS_SIZE = 2) is joined to another word by a hyphen, none of the words after the hyphen are indexed.

Example (for Demo 1.6.2):
In "If-Modified", the word "Modified" is NOT found on this page (other words on this page are indexed):
http://httpd.apache.org/docs/misc/perf-tuning.html

Okay, "Modified" is in the index, but NOT this occurrence of "Modified" (and I can't find any other word that follows a hyphen either)!

:: Another example to test ::
- Add the word or-juzutuziopa to a page and index that page.
juzutuziopa was NOT found, and or-juzutuziopa was not found either.

juzutuziopa is not in the keyword table!

Any hints?

Rolandks 10-23-2003 09:00 AM

Indexing and the exclusion of small words happen in:

admin\robot_functions.php (line 873)
admin\robot_functions.php (line 913)

The function phpdigEpureText($text,$min_word_length=2,$encoding = PHPDIG_ENCODING)

is in libs\phpdig_functions.php (line 213).

I think or-juzutuziopa must be indexed as one word! It could be a name, a city, or something else.

-Roland-

Rolandks 10-24-2003 12:50 AM

I have tried it on another machine: or-juzutuziopa is indexed and works with PHP 4.3.0 -> so it is again a PHP >= 4.3.2 problem!

Hmm, :confused:

Move this thread to Bugs, please.

-Roland-

Rolandks 10-27-2003 08:41 AM

I am not an expert in regular expressions, :rolleyes: but I think this is the reason for all the bugs that involve ereg_replace in PHP >= 4.3.2:

libs\phpdig_functions.php (line 213):
PHP Code:

$text = ereg_replace('[[:blank:]][^ ]{1,'.$min_word_length.'}[[:blank:]]',' ',' '.$text.' ');

See: http://bugs.php.net/bug.php?id=25730

Can anyone change ALL ereg_replace calls to an SGML-conformant version, because this has changed since PHP 4.3.2?


Thanks
-Roland-

Rolandks 11-06-2003 08:24 AM

Hello?!
Does no one have an idea why a word joined with a hyphen, and ALL words after the hyphen, are NOT indexed in PHP >= 4.3.2 but are indexed in PHP < 4.3.2? :confused:

no-index-this

It's important :D - thanks.

-Roland-

Charter 11-07-2003 02:12 PM

Hi. When you run the following, what do you see when you look at the HTML source?
PHP Code:

<?php
$min_word_length = 2;
$text = " or-juzutuziopa ";
echo $text . " <- orig text<br>\n";
$text = ereg_replace('[[:blank:]][^ ]{1,'.$min_word_length.'}[[:blank:]]',' ',' '.$text.' ');
echo $text . " <- new text<br>\n";
?>

I get the following:
Code:

or-juzutuziopa  <- orig text<br>
  or-juzutuziopa  <- new text<br>

Also, thanks for pointing out these problems. It certainly will help make PhpDig better. :)
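
For anyone trying this on PHP 7 or later, where the ereg_* functions no longer exist, here is a rough sketch of the same small-word step using preg_replace. The function name strip_small_words is made up for illustration and is not part of PhpDig, and it assumes PCRE treats these POSIX classes the same way ereg did for this input.

```php
<?php
// Hypothetical stand-in for the small-word step of phpdigEpureText(),
// written with preg_replace because ereg_replace was removed in PHP 7.
function strip_small_words(string $text, int $min_word_length = 2): string {
    // Pad with spaces so words at either end are also delimited by blanks.
    $pattern = '/[[:blank:]][^ ]{1,' . $min_word_length . '}[[:blank:]]/';
    return preg_replace($pattern, ' ', ' ' . $text . ' ');
}

// A hyphenated term has no blank after "or", so it is left alone.
$hyphenated = strip_small_words('or-juzutuziopa');
// Stand-alone short words ("my", "is") are removed.
$sentence = strip_small_words('my t-shirt is blue.');

echo '[', $hyphenated, "]\n";
echo '[', $sentence, "]\n";
```

If this sketch is right, the small-word filter by itself should not drop the part after the hyphen, which fits the output posted above.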

Rolandks 11-08-2003 07:54 AM

Okay, I get the same result :confused:

See this search:

x-compress is NOT found. "compress" is in the keyword table because the word "compress" occurs elsewhere in the pages.

Try adding or-juzutuziopa to one of the Apache pages and reindexing the site. If you are using PHP 4.3.2 or 4.3.3 on the server, the word juzutuziopa is NOT indexed and NOT in the keyword table. But with PHP 4.3.1 or PHP 4.3.0 it is indexed. I don't know why :confused:

-Roland-

Charter 11-08-2003 08:27 AM

Hi. In search_function.php find:
PHP Code:

if (eregi("[^[:alnum:]^ +]+",$query_to_parse)) { $query_to_parse = eregi_replace("[^[:alnum:]^ ]+"," ",$query_to_parse); }

and replace with:
PHP Code:

if (eregi("[^[:alnum:]^ +^-]+",$query_to_parse)) { $query_to_parse = eregi_replace("[^[:alnum:]^ ]+"," ",$query_to_parse); }

The latter line allows alnum, space, and dash in the searches whereas the former line allows alnum and space.

Of course, remove any "word" wrapping in the above code. ;)
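
A rough way to see what the changed condition does, sketched with preg_* since the ereg family was removed in PHP 7. The helper name sanitize_query and its $allowed parameter are invented for this illustration and do not exist in PhpDig; the replacement pattern is copied from the line above.

```php
<?php
// Hypothetical helper, not part of PhpDig: mirrors the if/replace pair above.
// $allowed is the body of the negated bracket expression used in the test.
function sanitize_query(string $query, string $allowed): string {
    if (preg_match('/[^' . $allowed . ']+/i', $query)) {
        // Same replacement as in the thread: everything that is not
        // alphanumeric, a caret, or a space becomes a space.
        $query = preg_replace('/[^[:alnum:]^ ]+/i', ' ', $query);
    }
    return $query;
}

$before = sanitize_query('t-shirt', '[:alnum:]^ +');   // original condition
$after  = sanitize_query('t-shirt', '[:alnum:]^ +^-'); // patched condition

echo $before, "\n";
echo $after, "\n";
```

With the original condition the dash triggers the replacement and the query is split into "t" and "shirt"; with the patched condition a query containing only letters, digits, spaces, plus signs, and dashes is passed through unchanged. Note that the replacement pattern itself still strips dashes whenever some other disallowed character is present.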

Rolandks 11-08-2003 09:21 AM

In search_function.php? I do NOT find this PHP code (if (eregi("[^[:alnum:]^ ....) anywhere in the complete PhpDig code.

Why search_function.php? The words after the hyphen are NOT indexed! I think the problem is in admin\robot_functions.php!

-Roland-

Charter 11-08-2003 09:26 AM

Oh I see. I was going off of the example search posted above. ;)

I use the code above so it now allows dashes in the searches. Not indexed is the problem, as you posted. Silly me.

Charter 11-08-2003 10:34 AM

Hi. Try running the following code (remove any "word" wrapping if necessary).
PHP Code:

<?php

$text = "My t-shirt is blue.";

define('PHPDIG_ENCODING','iso-8859-1');
$phpdig_string_subst['iso-8859-1'] = 'A:ÀÁÂÃÄÅ,a:àáâãäå,O:ÒÓÔÕÖØ,o:òóôõöø,E:ÈÉÊË,e:èéêë,C:Ç,c:ç,I:ÌÍÎÏ,i:ìíîï,U:ÙÚÛÜ,u:ùúûü,Y:Ý,y:ÿý,N:Ñ,n:ñ';
$phpdig_words_chars['iso-8859-1'] = '[:alnum:]ðþß';

$text = phpdigEpureText($text);

function phpdigEpureText($text,$min_word_length=2,$encoding=PHPDIG_ENCODING) {
global $phpdig_words_chars;

echo $text . " A<---<br><br>\n";
$text = phpdigStripAccents(strtolower($text));
echo $text . " B<---<br><br>\n";
//non-Latin uppercase to lowercase - currently Icelandic
switch (PHPDIG_ENCODING) {
   case 'iso-8859-1':
   $text = strtr($text,'ÐÞ','ðþ');
   break;
}
echo $text . " C<---<br><br>\n";
$text = ereg_replace('[[:blank:]][0-9]+[[:blank:]]',' ',ereg_replace('[^'.$phpdig_words_chars[$encoding].'._&%/-]+',' ',$text));
echo $text . " D<---<br><br>\n";
$text = ereg_replace('[[:blank:]][^ ]{1,'.$min_word_length.'}[[:blank:]]',' ',' '.$text.' ');
echo $text . " E<---<br><br>\n";
$text = ereg_replace('\\.+[[:blank:]]|\\.+$|\\.{2,}',' ',$text);
echo $text . " F<---<br><br>\n";
return trim(ereg_replace("[[:blank:]]+"," ",$text));
}

function phpdigStripAccents($chaine,$encoding=PHPDIG_ENCODING) {
$phpdigEncode = array();
global $phpdigEncode;
if (!isset($phpdigEncode[$encoding])) {
   $encoding = PHPDIG_ENCODING;
}
// exceptions
if ($encoding == 'iso-8859-1') {
    $chaine = str_replace('Æ','ae',str_replace('æ','ae',$chaine));
}
return( strtr($chaine,$phpdigEncode[$encoding]['str'],$phpdigEncode[$encoding]['tr']) );
}

echo $text . " G<---<br><br>\n";

?>

What is the output when viewing the HTML source? The output I get is the following.
Code:

My t-shirt is blue. A<---<br><br>
my t-shirt is blue. B<---<br><br>
my t-shirt is blue. C<---<br><br>
my t-shirt is blue. D<---<br><br>
 t-shirt blue.  E<---<br><br>
 t-shirt blue  F<---<br><br>
t-shirt blue G<---<br><br>


Rolandks 11-08-2003 01:11 PM

Hmm, a difficult problem - I get exactly the same output :confused:
Code:

My t-shirt is blue. A<---<br><br>
my t-shirt is blue. B<---<br><br>
my t-shirt is blue. C<---<br><br>
my t-shirt is blue. D<---<br><br>
 t-shirt blue.  E<---<br><br>
 t-shirt blue  F<---<br><br>
t-shirt blue G<---<br><br>

Tomorrow I will try a new index with "or-juzutuziopa" and "My t-shirt is blue" on the page :rolleyes:

-Roland-

Rolandks 11-10-2003 05:54 AM

Okay, I found the problem. :D
t-shirt is indexed in the keyword table as: t-shirt
or-juzutuziopa is indexed in the keyword table as: or-juzutuziopa

BUT if you search for "t-shirt" or "or-juzutuziopa" you get:

"t", are too short words and were ignored. :eek:
"or", are too short words and were ignored. :eek:

AND a search for "shirt" or "juzutuziopa" returns empty results.

The problem is in search_function.php with PHP 4.3.2 or 4.3.3!

With PHP 4.3.0 / 4.3.1
If you search for "t-shirt" you get:
Results 1-2, 2 total, on "t-shirt" (0.48 seconds)

If you search for "shirt" you get an empty result.

-Roland-

Charter 11-10-2003 06:13 AM

Hi. First apply the patch in post five of this thread, and then apply the patch in post eight above, and make sure that in search_function.php the following line is commented out.
PHP Code:

//$query_to_parse = ereg_replace("([^ ])-([^ ])","\\1 \\2",$query_to_parse); 
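
To illustrate why that line matters, here is a small sketch of what it would do to a hyphenated query if left active. It uses preg_replace as a stand-in (ereg_replace is gone in PHP 7 and later), and the variable names are only for this example.

```php
<?php
// Stand-in for the commented-out line: a dash between two non-space
// characters is turned into a space, splitting the hyphenated term.
$query_to_parse = 't-shirt';
$split = preg_replace('/([^ ])-([^ ])/', '\\1 \\2', $query_to_parse);

echo $split, "\n";
```

The term comes out as two separate words, and the one-letter "t" would then fall under the small-word limit, which would match the "too short words" message reported above.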


Rolandks 11-10-2003 06:35 AM

Okay thanks, patch five has been included for many weeks, and the line is also commented out. But what does this mean?
Quote:

Originally posted by Charter
... apply the patch in post eight above ...
Quote:

Hi. The code eregi_replace("[^[:alnum:]^ ]+"," ",$query_to_parse); takes everything that is not a number, letter, or space and replaces it with a space. This happens before $kconds[$ncrit] is formed, where $kconds[$ncrit] is used to make the mysql query from the search field. Please do examine the code. The more eyes, the better.
-Roland


All times are GMT -8.