Too few pages indexed, Umlaut problem

salzbermat
12-15-2004, 05:53 AM
Hi there,

I just upgraded from 1.6.0 to 1.8.5.

The site contains about 800 pages, but now all of a sudden only 250 are indexed. I made sure the max level of depth and links is set to 20 in both the index admin panel and the config file, but to no avail. I am making heavy use of phpdiginclude and exclude comments in the "middle" of the code, but this hasn't changed. What might be the problem?

Secondly, when at the beginning of a title or description string, umlauts (e.g. Ä or &Auml; as an HTML entity) are displayed in lower case even if they're upper case.

Any clue?

Thanks,
Bernd

Charter
12-15-2004, 06:27 AM
1) Set search depth to a large number, links per to zero, and LIMIT_TO_DIRECTORY to false.

2) In config.php find:

"&auml" => "ä",

And afterwards add:

"&Auml" => "Ä",
"&Euml" => "Ë",
"&Iuml" => "Ï",
"&Uuml" => "Ü",

In robot_functions.php find:

//tries to replace htmlentities by ascii equivalent
foreach ($spec as $entity => $char) {
$text = eregi_replace ($entity."[;]?",$char,$text);
$title = eregi_replace ($entity."[;]?",$char,$title);
}

And beforehand add:

//tries to replace htmlentities by ascii equivalent
foreach ($spec as $entity => $char) {
$text = ereg_replace ($entity."[;]?",$char,$text);
$title = ereg_replace ($entity."[;]?",$char,$title);
}
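
To see why the case-sensitive pass has to run first, here is a small sketch. It is not the PhpDig code itself: it uses preg_replace as a stand-in for ereg_replace/eregi_replace (the ereg functions are gone in current PHP), the $spec array is a two-entry subset, and replace_entities() is a hypothetical helper made up for the illustration.

```php
<?php
// Hypothetical two-entry subset of the $spec array from config.php.
$spec = array(
    "&auml" => "ä",
    "&Auml" => "Ä",
);

// Hypothetical helper: replaces every entity in $spec, trailing semicolon optional.
// $flags is '' for a case-sensitive pass or 'i' for a case-insensitive pass.
function replace_entities($text, $spec, $flags) {
    foreach ($spec as $entity => $char) {
        $pattern = '/' . preg_quote($entity, '/') . ';?/' . $flags;
        $text = preg_replace($pattern, $char, $text);
    }
    return $text;
}

$title = "&Auml;rzte und &auml;rzte";

// Case-insensitive pass only: "&auml" also matches "&Auml", so the capital is lost.
$insensitive_only = replace_entities($title, $spec, 'i');

// Case-sensitive pass first, then the case-insensitive pass as a fallback.
$both = replace_entities(replace_entities($title, $spec, ''), $spec, 'i');

echo $insensitive_only, "\n"; // ärzte und ärzte
echo $both, "\n";             // Ärzte und ärzte
```

The case-insensitive pass stays in place as a catch-all for odd spellings such as &AUML, which the case-sensitive pass does not touch.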

salzbermat
12-15-2004, 08:38 AM
Thanks a lot! Works great!

oli
12-16-2004, 08:24 AM
As of 1.8.6, more entities are shown incorrectly in the search results. So I dug around in the code and came across the following question:

Why do you use the custom $spec array instead of just reversing the function of htmlentities?

For example, replace your existing code in robot_functions.php:


// first case-sensitive and then case-insensitive
//tries to replace htmlentities by ascii equivalent

foreach ($spec as $entity => $char) {
$text = ereg_replace ($entity."[;]?",$char,$text);
$title = ereg_replace ($entity."[;]?",$char,$title);
}
//tries to replace htmlentities by ascii equivalent
foreach ($spec as $entity => $char) {
$text = eregi_replace ($entity."[;]?",$char,$text);
$title = eregi_replace ($entity."[;]?",$char,$title);
}


With this:


$trans = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
$trans = array_flip($trans);
$text = strtr($text, $trans);
$title = strtr($title, $trans);


With PHP 4.3 and later, you could even use the new html_entity_decode() function.
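
A minimal sketch of the html_entity_decode() variant, next to the flipped translation table for comparison. The sample string and the explicit 'UTF-8' argument are my assumptions (the page encoding may differ on a real site), and nothing here is wired into robot_functions.php.

```php
<?php
// Sample input with well-formed entities (each ends in a semicolon).
$text = "&Auml;rzte &amp; &#039;Co&#039;";

// Variant 1: flip the table that htmlentities() uses and apply it with strtr().
$trans = array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8'));
$via_table = strtr($text, $trans);

// Variant 2: let PHP do the reversal directly.
$via_decode = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

echo $via_table, "\n";  // Ärzte & 'Co'
echo $via_decode, "\n"; // Ärzte & 'Co'
```

ENT_QUOTES is needed in both variants so that the single-quote entity is converted as well.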

Charter
12-16-2004, 10:00 AM
>> Why do you use the custom $spec array instead of just reversing the function of htmlentities?

Because in a land long, long ago and far, far away... HTML page content may not be in correct form, and & # 039; versus & # 39; (without spaces) may cause an issue.

$text = "Ä ä &Auml &auml"; // and so forth

$trans = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
$trans = array_flip($trans);
$text = strtr($text, $trans);

echo $text; // prints Ä ä &Auml &auml

So you specify them in the $spec array, and PhpDig "tries to replace htmlentities by ascii equivalent." Just add to the $spec array those entities you want translated, and PhpDig should do the rest. Of course TMTOWTDI.
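
One of those other ways, as a rough sketch only: run html_entity_decode() for the well-formed entities first, then fall back to a $spec-style list for entities that are missing their semicolon. This is not what PhpDig does; decode_entities() is a hypothetical helper, the $spec array is a two-entry subset, preg_replace stands in for ereg_replace, and UTF-8 content is assumed.

```php
<?php
// Hypothetical two-entry subset of $spec; keys carry no trailing semicolon.
$spec = array(
    "&auml" => "ä",
    "&Auml" => "Ä",
);

// Hypothetical helper combining both approaches.
function decode_entities($text, $spec) {
    // Step 1: well-formed entities, including &#39; and &#039;.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Step 2: leftovers without a semicolon, matched case-sensitively.
    foreach ($spec as $entity => $char) {
        $text = preg_replace('/' . preg_quote($entity, '/') . ';?/', $char, $text);
    }
    return $text;
}

$decoded = decode_entities("&Auml; &auml; &Auml &auml &#39;x&#039;", $spec);

echo $decoded, "\n"; // Ä ä Ä ä 'x'
```

The fallback list still has to be maintained by hand, so this only saves work for pages whose entities are mostly well formed.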