What's in a word for 1.8.0?


renehaentjens
02-06-2004, 07:06 AM
I'm just beginning with 1.8.0...

The doc states: "Characters '._~@#$:&%/;,=- now allowed in indexing and searches".

A first test seems to indicate that "word1;word2;word3;...", which was considered as separate words in 1.6.5, is now considered as one. Therefore I no longer find the page when searching word2. Moreover, as in my website the combination always grows beyond 30 characters, it is not indexed, which means that I cannot find the page when searching word1 either!

Is this indeed a changed behavior in 1.8.0? Why was it changed in this way? Is there a config parameter for it?

This has other "funny" side effects: if my webpage contains the phrase "... word1, word2 ...", an exact phrase search for "word1 word2" does not find it; I have to search for "word1, word2", with the comma! (I've tested it.) Are you sure this is the behaviour people expect?
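
The effect described above can be reproduced outside PhpDig. The following is only a minimal sketch, not PhpDig's own code: the function name sketch_index_words is made up for illustration, a-z0-9 stands in for $phpdig_words_chars, and the limit of 30 is taken from the post above. It uses preg_split rather than the ereg functions so that it runs on current PHP.

```php
<?php
// Hypothetical helper, not part of PhpDig: split text into index words.
// $extra lists the punctuation that is allowed inside a word.
function sketch_index_words(string $text, string $extra, int $maxLen = 30): array
{
    // Anything outside a-z, 0-9 and $extra acts as a word separator.
    $pattern = '/[^a-z0-9' . preg_quote($extra, '/') . ']+/';
    $words = preg_split($pattern, strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Drop words longer than the assumed maximum length.
    return array_values(array_filter($words, function ($w) use ($maxLen) {
        return strlen($w) <= $maxLen;
    }));
}

$extra = '\'._~@#$:&%/;,=-';   // the characters named in the 1.8.0 doc

print_r(sketch_index_words('alpha;beta;gamma', ''));      // alpha, beta, gamma
print_r(sketch_index_words('alpha;beta;gamma', $extra));  // alpha;beta;gamma
print_r(sketch_index_words('word1, word2', $extra));      // "word1," and "word2"
```

With the extra characters allowed, the semicolon-separated list stays one word, the comma sticks to "word1", and a combination longer than 30 characters yields no words at all.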

Charter
02-06-2004, 10:03 AM
Hi. A cool aspect of GNU/GPL software is that users get the source code free so they can change it to fit their individual needs. If you want to remove all or some of the ._~@#$:&%/;,=- characters, just do the following and then reindex.

In search_function.php find and modify the following as desired:

$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \\'.\_~@#$:&\%/;,=-]+";

In phpdig_functions.php find and modify the following as desired:

$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \'._~@#$:&%/;,=-]+',' ',$text);

In robot_functions.php find and modify the following as desired:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))

renehaentjens
02-09-2004, 01:04 AM
Thanks, Charter. Indeed GNU/GPL has this "extralegal benefit" but you have to know what you're doing or get a little help from someone who does!

The piece of code from robot_functions seems to say: insert in database if the word is not too small or too large, if it is not a stop word, and if it starts with a "words_char" or a "#" or a "$". Wouldn't it be easier to simply add "#" and "$" to the default lists of $phpdig_words_chars in config.php?
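
That reading of the condition can be written out as a small standalone function. This is only a sketch: the function name and the two constants are made up for illustration, the value 2 for the small-word limit is an assumption (30 comes from the first post), and a-z0-9 stands in for $phpdig_words_chars.

```php
<?php
// Assumed limits for illustration only; the real values live in config.php.
const SKETCH_SMALL_WORDS_SIZE = 2;
const SKETCH_MAX_WORDS_SIZE = 30;

// Hypothetical helper mirroring the condition quoted from robot_functions.php.
function sketch_is_indexable(string $key, array $commonWords, string $wordChars = 'a-z0-9'): bool
{
    $len = strlen($key);

    return $len > SKETCH_SMALL_WORDS_SIZE              // not too small
        && $len <= SKETCH_MAX_WORDS_SIZE               // not too large
        && !isset($commonWords[$key])                  // not a stop word
        // first character is a word character, "#" or "$"
        && preg_match('/^[' . $wordChars . '#$]/', $key) === 1;
}

var_dump(sketch_is_indexable('#hash', []));            // bool(true)
var_dump(sketch_is_indexable('.hidden', []));          // bool(false)
var_dump(sketch_is_indexable('the', ['the' => 1]));    // bool(false)
```

Only the first character is checked here, which is why "#" and "$" can start a word while the other special characters can only appear later in it.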

For the rest of the special characters that are allowed in words (but not at the beginning) may I suggest an additional config parameter?

There is a little inconsistency with the backslash as could be expected from the fact that the two regexps (in search_function and in phpdig_functions) are not identical.

I haven't completely figured out why, but if a page contains "word1\word2", the separate words are indexed, yet there is no way to find the page with an exact match such as "word1\word2" or "word1 word2" or similar. The exception is CONTENT_TEXT = 0, in which case the exact match "word1 word2" does find it (one day I'll have to understand what CONTENT_TEXT means...)

Charter
02-09-2004, 09:28 AM
>> Wouldn't it be easier to simply add "#" and "$" to the default lists of $phpdig_words_chars in config.php?

Hi. TMTOWTDI, but # and $ are not word characters.

>> For the rest of the special characters that are allowed in words (but not at the beginning) may I suggest an additional config parameter?

Maybe, but then there may be issues with what to escape and where... read on.

>> There is a little inconsistency with the backslash as could be expected from the fact that the two regexps (in search_function and in phpdig_functions) are not identical.

In search_function.php the backslash allows escaping '_% from user input to make literal characters. In phpdig_functions the backslash escapes ' to prevent a parse error.

>> I haven't completely figured out why, but if a page contains "word1\word2", the separate words are indexed, but...

The '._~@#$:&%/;,=- characters are allowed in search results, no backslash.

>> ...what CONTENT_TEXT means...

Basically, CONTENT_TEXT set to one stores the text content of crawled pages in the text_content directory. With CONTENT_TEXT set to zero, first_words from the spider table is used instead.

renehaentjens
02-10-2004, 06:53 AM
Bear with me, Charter, I'm doing this with honorable intentions.

(I had to look up "TMTOWTDI", I'm not a Perl fan...)

Thanks for explaining CONTENT_TEXT. Now I wonder why anyone would want to turn it on. Isn't using the DB *always* better than using a directory?

Concerning the backslash, I took a piece of the search_function code, added a line in front to simulate user input, and a line at the end to see the result of the code.

Here's the code:
<?php
$query_to_parse = addslashes('w5+6w*56[_ww5]:ww6°!w55(w66)w5w5\w6w6');


$query_to_parse = str_replace('_','\_',$query_to_parse); // avoid '_' in the query
$query_to_parse = str_replace('%','\%',$query_to_parse); // avoid '%' in the query
$query_to_parse = str_replace('\"',' ',$query_to_parse); // avoid '"' in the query
$query_to_parse = strtolower($query_to_parse); //made all lowercase

$what_query_chars = "[^w56 \'.\_~@#$:&\%/;,=-]+"; // epure chars \'._~@#$:&%/;,=-

$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
$query_to_parse = trim(ereg_replace(" +"," ",$query_to_parse)); // no more than 1 blank

echo htmlspecialchars($query_to_parse);
?>

Here's the output, scraped from my screen:
w5 6w 56 \_ww5 :ww6 w55 w66 w5w5\\w6w6

The backslash hasn't gone away... Shouldn't it have by now?

Charter
02-11-2004, 08:08 AM
>> Now I wonder why anyone would want to set it on.

Hi. Only the first words from an indexed page are in the database table.

>> Isn't using the DB *always* better than using a directory?

http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=99830
http://www.faqts.com/knowledge_base/answer/versions/index.phtml?id=11839
http://lists.evolt.org/archive/Week-of-Mon-20020408/109650.html

>> The backslash hasn't gone away... Shouldn't it have by now?

With unmodified version 1.8.0, when you index word1\word2 only word1 word2 are in the keywords table, no backslash. The backslashes in the PhpDig code are there for escaping purposes.

The following code will remove the backslashes from your example:

<?php
$query_to_parse = addslashes('w5+6w*56[_ww5]:ww6°!w55(w66)w5w5\w6w6');
$query_to_parse = str_replace('_','\_',$query_to_parse);
$query_to_parse = str_replace('%','\%',$query_to_parse);
$query_to_parse = str_replace('\"',' ',$query_to_parse);
$query_to_parse = strtolower($query_to_parse);
$text = ereg_replace('[^w56 \'._~@#$:&%/;,=-]+',' ',$query_to_parse);
$query_to_parse = trim(ereg_replace(" +"," ",$text));
echo htmlspecialchars($query_to_parse);
?>
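
A likely reason the two versions differ: in a double-quoted PHP string, \' is not an escape sequence, so the backslash stays in the pattern, and a POSIX bracket expression (as used by ereg) treats a backslash as an ordinary character. The double-quoted pattern therefore lists the backslash among the allowed characters, while the single-quoted one does not. The sketch below reproduces that difference with preg_replace, since the ereg functions are gone from current PHP. The function name and the $allowBackslash switch are made up for illustration, and a-z0-9 stands in for $phpdig_words_chars.

```php
<?php
// Hypothetical helper: replace every run of characters outside the allowed
// set with a single space, then collapse repeated blanks.
function sketch_clean_query(string $query, bool $allowBackslash): string
{
    // In a PCRE class a literal backslash is written \\ ('\\\\' in PHP source).
    // It goes first so it cannot form a range with the trailing "-".
    $allowed = ($allowBackslash ? '\\\\' : '') . 'a-z0-9 \'._~@#$:&%\/;,=-';

    $clean = preg_replace('/[^' . $allowed . ']+/', ' ', strtolower($query));

    return trim(preg_replace('/ +/', ' ', $clean));
}

$input = addslashes('abc\\def');               // abc\def becomes abc\\def

echo sketch_clean_query($input, true), "\n";   // abc\\def (still one word)
echo sketch_clean_query($input, false), "\n";  // abc def
echo sketch_clean_query('abc!def', true), "\n"; // abc def
```

With the backslash in the allowed set, the query survives as one word with a doubled backslash; without it, the query splits into two words the same way "abc!def" does.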

renehaentjens
02-12-2004, 03:40 AM
Thanks Charter for being patient with me.

I understand your reply as follows:

For 1.8.0, in libs/search_function.php, replace the line:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=-]+"; // epure chars \'._~@#$:&%/;,=-
by:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING].' \'._~@#$:&%/;,=-]+'; // epure chars \'._~@#$:&%/;,=-

(The part after ENCODING]. reads: opening apostrophe, space, backslash-apostrophe, point, underline, ... - I'm having some trouble getting this into this forum post...)

Thanks for the pointers to the interesting debates on flat files vs. database. Of course the PhpDig case is different, because a database is already being used. The question then is why would anyone want to replace a table field which perfectly fits the purpose - you can always put a little bit more text in the field if needed - by a potentially very large number of files in one directory?

I'm biased, I admit. In my career as a developer I have seen several cases where performance dropped drastically when the application had to manage hundreds or thousands of little files in one directory, on Unix, Windows and other platforms. And we could never guarantee that the files would always remain in sync with the records in the database table...

Charter
02-13-2004, 09:41 PM
>> For 1.8.0, in libs/search_function.php, replace the line...

Hi. It was code that would remove the backslashes from your example. The backslash in the search_function.php file is there for escaping purposes. Maybe the following links will help, or perhaps add a line in search_function.php that removes backslashes, but only if not followed by a character that should be escaped.

http://www.mysql.com/doc/en/String_syntax.html
http://www.mysql.com/doc/en/String_comparison_functions.html

renehaentjens
02-16-2004, 03:57 AM
Charter, I may be a beginner with 1.8.0 and only recently promoted from junior to member in this forum, but I am not a novice on string literals!

The fact remains that, when looking at the mysql_query around line 217 in search_function:
1. with user query "abc!def" it is executed twice, with: (1) ... AND k.keyword like 'abc%' ... (2) ... AND k.keyword like 'def%', whereas
2. with user query "abc<backslash>def", it is executed once, with: ... AND k.keyword like 'abc<backslash><backslash>def%' ...

Charter
02-16-2004, 12:41 PM
>> ...may be a beginner with 1.8.0 and only recently promoted junior to member in this forum, I am not a novice on string literals...

Hi. It seems you have taken offense where none was intended. Please keep in mind that, if I provide code or answer questions, I do so free of charge, on my own time, to be helpful.

>> The fact remains that, when looking at the mysql_query around line 217 in search_function...

Like I said, perhaps add a line in search_function.php that removes backslashes, but only if not followed by a character that should be escaped.

<?php

$query_to_parse = "I\'m_wearing_a%white%shirt\with\sleeves!";
$query_to_parse = addslashes($query_to_parse);

$query_to_parse = str_replace('_','\_',$query_to_parse);
$query_to_parse = str_replace('%','\%',$query_to_parse);
$query_to_parse = str_replace('\"',' ',$query_to_parse);

$what_query_chars = "[^ a-z0-9\\'.\_~@#$:&\%/;,=-]+";

$query_to_parse = eregi_replace("[\][^_%'\"]"," ",preg_replace('/[\0]/is',' ',$query_to_parse));
// TMTOWTDI $query_to_parse = eregi_replace("[\]{2}"," ",$query_to_parse);

if (eregi($what_query_chars,$query_to_parse)) {
$query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse);
}

echo $query_to_parse; // I\\'m\_wearing\_a\%white\%shirt with sleeves

?>

If this method does not suit your fancy, then just rework the code to something that would be a palatable solution for you.
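
For anyone on a PHP version without the ereg functions, the same idea (remove backslashes unless they escape something) might be sketched with preg_replace and a negative lookahead. This is an illustration under assumptions, not PhpDig code: the function name is made up, and the set of escaped characters (_ % ' ") is taken from the pattern above.

```php
<?php
// Hypothetical helper: blank out backslashes unless they escape _ % ' or ".
function sketch_strip_stray_backslashes(string $query): string
{
    // '\\\\' in PHP source is one literal backslash for PCRE.
    // (?![_%'"]) is a negative lookahead, so the character after the
    // backslash is only inspected, never replaced.
    return preg_replace('/\\\\(?![_%\'"])/', ' ', $query);
}

echo sketch_strip_stray_backslashes('shirt\\with\\sleeves'), "\n"; // shirt with sleeves
echo sketch_strip_stray_backslashes('a\\_b\\%c'), "\n";            // a\_b\%c
```

Because the lookahead does not consume the following character, a single stray backslash is replaced without eating the letter after it.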

renehaentjens
02-18-2004, 03:37 AM
Thanks, Charter.

No offense taken! I appreciate your work and your advice, as I stated and will repeat every now and then in other posts. Talking to each other over a forum line isn't always that easy...

I should devote more time to understanding the code in order to know where I could do something about the d*** backslash.

Anyway, not so many users are going to put backslashes between query words. Why would they? It's only perfectionists like myself, with a never sleeping suspicion about possible havoc caused by funny characters, who try out such things.