need help: phpdig suddenly reads html-comments!


manute
10-09-2003, 06:09 AM
hi!

i've been using phpdig for about 1 year now and it has always worked fine.
but now suddenly - and i haven't changed anything - it has started to read html-comments in the source code and puts them into the description.
and as that of course has no place in the results page for the user, i'd like to get rid of it.
has anyone else experienced this problem and know a "cure"?

Charter
10-09-2003, 03:32 PM
How about looking at this thread (http://www.phpdig.net/showthread.php?threadid=67) and modifying the if statement?

manute
10-10-2003, 05:49 AM
no, that is not my problem. it doesn't only read that exclude-comment, but all html-comments! so i get stuff like "main table starts here" etc. in my results page.
that really sucks! any ideas why that is?

Charter
10-11-2003, 06:20 AM
Hi. What version of PHP do you have? Try running the following. What are the results when viewing the HTML source?

<?php
$text = "<!-- test -->";
$text2 = phpdigCleanHtml($text);

function phpdigCleanHtml($text) {
    //htmlentities
    //global $spec;

    //replace blank characters by spaces
    $text = ereg_replace("[\r\n\t]+"," ",$text);
    echo $text . "A<br>\n";

    //extracts title
    if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
        $title = $regs[1];
    }
    else {
        $title = "";
    }
    //delete content of head, script, and style tags
    $text = eregi_replace("<head[^<>]*>.*</head>"," ",$text);
    echo $text . "B<br>\n";
    $text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
    echo $text . "C<br>\n";
    $text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
    echo $text . "D<br>\n";
    // clean tags
    $text = eregi_replace("(</?[a-z0-9 ]+>)",'\\1 ',$text);
    echo $text . "E<br>\n";
    //tries to replace htmlentities by ascii equivalent
    /*
    foreach ($spec as $entity => $char) {
        $text = eregi_replace ($entity."[;]?",$char,$text);
        $title = eregi_replace ($entity."[;]?",$char,$title);
    }
    */
    $text = ereg_replace('&#([0-9]+);',chr('\\1').' ',$text);
    echo $text . "F<br>\n";
    //replace blank characters by spaces
    $text = eregi_replace("--|[{}();\"]+|</[a-z0-9]+>|[\r\n\t]+",' ',$text);
    echo $text . "G<br>\n";
    //f..k <!SOMETHING tags !!
    $text = eregi_replace('(<)!([^-])','\\1\\2',$text);
    echo $text . "H<br>\n";
    //replace any group of blank characters by an unique space
    $text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
    echo $text . "I<br>\n";
    //$retour['content'] = $text;
    //$retour['title'] = $title;
    return $text;
}

echo $text2."J<br>";
?>

I get the following when I view the HTML source:

<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<! test >G<br>
< test >H<br>
I<br>
J<br>

Rolandks
10-11-2003, 08:50 AM
Hmm :D these are the same problems that I also posted about here (http://www.phpdig.net/showthread.php?s=&threadid=139):

PHP 4.3.2 - Result:

<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<! test >G<br>
< test >H<br>
< test >I<br>
< test >J<br>

I think this must change, because more and more people use the newer PHP versions (4.3.2 and up), and with those versions all html-comments and META tags are indexed.

-Roland-

Charter
10-11-2003, 10:26 AM
Hi. It seems that strip_tags in PHP 4.3.2 has been reworked, making it so that it doesn't eliminate as much as before. The following will remove everything between the < and > symbols.

In robot_functions.php, replace:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

with the following:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));
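
For readers on current PHP, where the ereg functions are no longer available, here is a minimal sketch of the same idea using preg_replace. The function name stripMarkup is made up for illustration and is not part of PhpDig. It removes whole comments first (text included), then any remaining tag, then collapses whitespace.

```php
<?php
// Hypothetical helper for illustration only; not part of PhpDig.
function stripMarkup(string $html): string
{
    // Remove complete comments first, including the text inside them.
    // The "s" modifier lets "." match newlines in multi-line comments.
    $text = preg_replace('/<!--.*?-->/s', ' ', $html);

    // Remove any remaining tag: "<", anything that is not ">", then ">".
    $text = preg_replace('/<[^>]*>/', ' ', $text);

    // Replace any run of whitespace with a single space.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo stripMarkup("<p>Hello</p><!----------sub-navbar table ends here---------->\n<b>world</b>"), "\n";
// prints "Hello world"
```

Unlike strip_tags, this does not care whether whitespace follows the "<", so the result should not depend on the PHP version's tag parsing.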

Rolandks
10-11-2003, 10:40 AM
I also found something:

$text = preg_replace('/<.*>/U', '', $text);
echo $text . "K<br>\n";

That also works for this, but it does not solve the META tag indexing :confused:

-Roland-

Charter
10-11-2003, 11:28 AM
$text = preg_replace('/<.*>/U', '', $text);
echo $text . "K<br>\n";


Try adding the 's' modifier to account for newlines, like so:

$text = preg_replace('/<.*>/sU', '', $text);
echo $text . "K<br>\n";

You may want to remove all the whitespace though. ;)
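
A small sketch of what the two modifiers do, using made-up sample strings (the variable names are only for illustration). 'U' makes the quantifier ungreedy, and 's' lets the dot match newlines, which matters for tags that span more than one line.

```php
<?php
// Sample inputs, invented for this illustration.
$inline    = "<b>x</b>";
$multiline = "a<img\nsrc='x'>b";

// No modifiers: ".*" is greedy and swallows everything up to the last ">".
$greedy = preg_replace('/<.*>/', '', $inline);       // ""

// "U" makes ".*" ungreedy, so each tag is matched on its own.
$ungreedy = preg_replace('/<.*>/U', '', $inline);    // "x"

// Without "s" the dot stops at the newline, so the split tag is left alone.
$noDotAll = preg_replace('/<.*>/U', '', $multiline); // unchanged

// With "s" the dot also matches the newline and the tag is removed.
$dotAll = preg_replace('/<.*>/sU', '', $multiline);  // "ab"

var_dump($greedy, $ungreedy, $noDotAll, $dotAll);
```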

manute
10-12-2003, 03:01 PM
@charter:

that really seems to be the problem. my hoster must have updated php. what i get is

<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<! test >G<br>
< test >H<br>
< test >I<br>
< test >J<br>

the server is running PHP Version 4.3.3.
i'm now gonna try indexing with the

$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

workaround you posted. thanks! :-)

manute
10-12-2003, 03:08 PM
and that seems to work. great. thanks again!

Rolandks
10-12-2003, 05:01 PM
Hm okay, which is the better solution for the future?


//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

OR

//replace any group of blank characters by an unique space
$text = preg_replace('/<.*>/sU', '', $text);

-Roland-

Charter
10-12-2003, 05:32 PM
Hi. My personal preference would be to use

$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

Of the two choices you mentioned, I would modify and choose

$text = ereg_replace("[[:blank:]]+"," ",preg_replace('/<.*>/sU', '', $text));

It seems strip_tags is more lenient now than in older PHP versions.

manute
10-12-2003, 05:40 PM
i just used this line

$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

and it seems to work perfectly. and that's good enough i guess. ;-)

manute
10-13-2003, 06:30 AM
dammit. it still doesn't work. after completely reindexing the page (that takes some hours) i got comments in the results page again.

the original comment-line was

<!----------sub-navbar table ends here---------->

and in the html-source of the results page now i still find this:

&lt; sub-navbar table ends here &gt;

all my comments start with <!---------- and end with ---------->.
unfortunately i'm not a php expert, but it has to be possible somehow to get rid of that and everything in between. is it?

Rolandks
10-13-2003, 07:58 AM
Originally posted by Charter

$text = ereg_replace("[[:blank:]]+"," ",preg_replace('/<.*>/sU', '', $text));

It seems strip_tags is more lenient now than in older PHP versions.

NO. This doesn't work in the future! See the comment from /manute/.

This is quite expected behaviour. The SGML specification doesn't allow whitespace to appear right after the less-than sign: http://bugs.php.net/bug.php?id=25730

-Roland-

Charter
10-13-2003, 03:39 PM
Hi Rolandks. The bug is that strip_tags is more lenient than before, meaning that certain things that used to be stripped no longer are. With preg_replace('/<.*>/sU', '', $text); and eregi_replace("<[^>]*>","",$text); everything between the < and > should be stripped. My personal preference is to use eregi_replace("<[^>]*>","",$text); over preg_replace('/<.*>/sU', '', $text); but I don't want to keep using strip_tags($text); because of the problems encountered. ;)

Hi manute. What version of PhpDig are you running? In robot_functions.php, the phpdigCleanHtml function in version 1.6.2 is as follows:

function phpdigCleanHtml($text) {
    //htmlentities
    global $spec;

    //replace blank characters by spaces
    $text = ereg_replace("[\r\n\t]+"," ",$text);

    //extracts title
    if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
        $title = $regs[1];
    }
    else {
        $title = "";
    }
    //delete content of head, script, and style tags
    $text = eregi_replace("<head[^<>]*>.*</head>"," ",$text);
    $text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
    $text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
    // clean tags
    $text = eregi_replace("(</?[a-z0-9 ]+>)",'\\1 ',$text);

    //tries to replace htmlentities by ascii equivalent
    foreach ($spec as $entity => $char) {
        $text = eregi_replace ($entity."[;]?",$char,$text);
        $title = eregi_replace ($entity."[;]?",$char,$title);
    }
    $text = ereg_replace('&#([0-9]+);',chr('\\1').' ',$text);

    //replace blank characters by spaces
    $text = eregi_replace("--|[{}();\"]+|</[a-z0-9]+>|[\r\n\t]+",' ',$text);

    //f..k <!SOMETHING tags !!
    $text = eregi_replace('(<)!([^-])','\\1\\2',$text);

    //replace any group of blank characters by an unique space
    $text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

    $retour['content'] = $text;
    $retour['title'] = $title;
    return $retour;
}

and in config.php, the $spec array in version 1.6.2 is as follows:

//----------HTML ENTITIES
$spec = array( "&amp" => "&",
"&agrave" => "à",
"&egrave" => "è",
"&ugrave" => "ù",
"&oacute;" => "ó",
"&eacute" => "é",
"&icirc" => "î",
"&ocirc" => "ô",
"&ucirc" => "û",
"&ecirc" => "ê",
"&ccedil" => "ç",
"œ" => "oe",
"&gt" => " ",
"&lt" => " ",
"&deg" => " ",
"&apos" => "'",
"&quot" => " ",
"&acirc" => "â",
"&iuml" => "ï",
"&euml" => "ë",
"&auml" => "ä",
"&ouml" => "ö",
"&uuml" => "ü",
"&nbsp" => " ",
"&szlig" => "ß",
"&iacute" => "í",
"&reg" => " ",
"&copy" => " ",
"&aacute" => "á",
"&Aacute" => "Á",
"&eth" => "ð",
"&ETH" => "Ð",
"&Eacute" => "É",
"&Iacute" => "Í",
"&Oacute" => "Ó",
"&uacute" => "ú",
"&Uacute" => "Ú",
"&THORN" => "Þ",
"&thorn" => "þ",
"&Ouml" => "Ö",
"&aelig" => "æ",
"&AELIG" => "Æ",
"&aring" => "å",
"&Aring" => "Å",
"&oslash" => "ø",
"&Oslash" => "Ø"
);

manute
10-14-2003, 05:29 AM
hmm okay, then it's a bug. but what can i now do?
can't anyone just tell me how to get rid of "<!----------", "---------->" and everything in between?

Charter
10-14-2003, 02:59 PM
What version of PhpDig are you running?

manute
10-14-2003, 03:59 PM
i'm running "PhpDig 1.6.x" as it seems. that's written in the index.php.
i'm gonna try to understand the code you posted up there, but now i'm gonna go to bed... :-) cu!

manute
10-15-2003, 05:54 AM
okay charter, i changed the robot_functions the way you posted it up there. now i'm reindexing. hope it's gonna work now. :-/

manute
10-15-2003, 06:05 AM
yes, that finally seems to work now. thank you charter!

manute
10-16-2003, 05:03 AM
"§/$%&"$ i'm really going crazy with that ****. it still doesn't work.
i tried putting

$text = eregi_replace("<[^>]*>","",$text);

and

$text = preg_replace('/<.*>/sU', '', $text);

into the cleanhtml-function, but it still doesn't work.
haven't you tried it yourself - does that work with your sites?

manute
10-16-2003, 05:53 AM
now i got an idea. :-)
the html-entities replacing also happens in the cleanhtml-function. so after that, of course, there is no "<" left to be replaced any more, it's "&lt;" by then.
i've tried putting the tag removal at the beginning of the function, before the html-entity replacing. unfortunately spidering my site always takes a couple of hours, so i can't say right now if it works. but i'll report back later.
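
A rough sketch of why the order of the steps can matter. It uses PHP's htmlspecialchars as a stand-in for whatever step turns "<" into "&lt;"; that is an assumption made for illustration, not necessarily what PhpDig does at that point. Once the angle brackets have been escaped, a pattern that looks for a literal "<" has nothing left to match.

```php
<?php
// Sample input, invented for this illustration.
$raw = 'text <!-- note --> more';

// Escaping first: "<" becomes "&lt;", so the tag pattern finds nothing
// and the comment text survives into the output.
$escapedFirst = preg_replace('/<[^>]*>/', '', htmlspecialchars($raw));

// Stripping first: the comment is removed while "<" is still a real "<".
$strippedFirst = htmlspecialchars(preg_replace('/<[^>]*>/', '', $raw));

var_dump($escapedFirst);  // string(28) "text &lt;!-- note --&gt; more"
var_dump($strippedFirst); // string(10) "text  more"
```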

manute
10-16-2003, 05:54 AM
hey, this thread more and more looks like a discussion between me and myself... ;-)
is anyone else still reading here at all? :-)

Rolandks
10-16-2003, 09:39 AM
Originally posted by manute
hey, this thread more and more looks like a discussion between me and myself... ;-)

Seems so ;)
The following should work as a possible solution:

Change ONLY this in robot_functions.php Line 160:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

to

//replace any group of blank characters by
$text = preg_replace('/<.*>/U', '', $text);


It works with PHP 4.3.2 and PhpDig 1.6.2. NO html-comments are indexed!

-Roland-

manute
10-16-2003, 06:17 PM
yes! now it finally works. thanks rolandks and charter! :-)

Charter
10-17-2003, 03:33 PM
Great, glad it's now working. :)

manute
10-17-2003, 03:59 PM
yeah, me too! :-)

ZAP
01-19-2004, 05:25 PM
Works for me also. I just noticed those ugly comments in my search result snippets today. I'm not sure when my host updated PHP, but as far as I'm concerned it's very naughty of them to change the behavior of a standard function so radically.