PhpDig.net

Old 10-13-2003, 03:39 PM   #16
Charter
Head Mole
Join Date: May 2003
Posts: 2,539
Hi Rolandks. The bug is that strip_tags is more lenient than before, meaning that certain things that used to be stripped no longer are. With preg_replace('/<.*>/sU', '', $text); and eregi_replace("<[^>]*>","",$text); everything between the < and > should be stripped. My personal preference is eregi_replace("<[^>]*>","",$text); over preg_replace('/<.*>/sU', '', $text); but either way I don't want to keep using strip_tags($text); because of the problems encountered.
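
A minimal sketch of the two patterns side by side, for anyone comparing them. The sample string is made up for illustration and is not from this thread. Since the ereg family was removed in PHP 7.0, the second pattern is shown here as a PCRE port of the eregi_replace one rather than the original call.

```php
<?php
// Hypothetical sample input, including a comment that spans a newline.
$text = "before <!---------- hidden\nnote ----------> after <b>bold</b>";

// Ungreedy (U) pattern with dot-all (s), as quoted above.
$a = preg_replace('/<.*>/sU', '', $text);

// Negated character class: a PCRE port of eregi_replace("<[^>]*>", "", $text).
$b = preg_replace('/<[^>]*>/', '', $text);

var_dump($a === $b);   // both patterns remove the same spans for this input
echo $a, "\n";
```

Both variants stop at the first > after each <, so on input like this they should behave the same; the s modifier is what lets the first pattern cross a newline.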

Hi manute. What version of PhpDig are you running? In robot_functions.php, the phpdigCleanHtml function in version 1.6.2 is as follows:
PHP Code:
function phpdigCleanHtml($text) {
//htmlentities
global $spec;

//replace blank characters by spaces
$text = ereg_replace("[\r\n\t]+"," ",$text);

//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
    $title = $regs[1];
}
else {
    $title = "";
}
//delete content of head, script, and style tags
$text = eregi_replace("<head[^<>]*>.*</head>"," ",$text);
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
// clean tags
$text = eregi_replace("(</?[a-z0-9 ]+>)",'\\1 ',$text);

//tries to replace htmlentities by ascii equivalent
foreach ($spec as $entity => $char) {
      $text = eregi_replace ($entity."[;]?",$char,$text);
      $title = eregi_replace ($entity."[;]?",$char,$title);
}
$text = ereg_replace('&#([0-9]+);',chr('\\1').' ',$text);

//replace blank characters by spaces
$text = eregi_replace("--|[{}();\"]+|</[a-z0-9]+>|[\r\n\t]+",' ',$text);

//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\\1\\2',$text);

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

$retour['content'] = $text;
$retour['title'] = $title;
return $retour;
}

and in config.php, the $spec array in version 1.6.2 is as follows:
PHP Code:
//----------HTML ENTITIES
$spec = array( "&amp" => "&",
               "&agrave" => "à",
               "&egrave" => "è",
               "&ugrave" => "ù",
               "&oacute;" => "ó",
               "&eacute" => "é",
               "&icirc" => "î",
               "&ocirc" => "ô",
               "&ucirc" => "û",
               "&ecirc" => "ê",
               "&ccedil" => "ç",
               "&#156" => "oe",
               "&gt" => " ",
               "&lt" => " ",
               "&deg" => " ",
               "&apos" => "'",
               "&quot" => " ",
               "&acirc" => "â",
               "&iuml" => "ï",
               "&euml" => "ë",
               "&auml" => "ä",
               "&ouml" => "ö",
               "&uuml" => "ü",
               "&nbsp" => " ",
               "&szlig" => "ß",
               "&iacute" => "í",
               "&reg" => " ",
               "&copy" => " ",
               "&aacute" => "á",
               "&Aacute" => "Á",
               "&eth" => "ð",
               "&ETH" => "Ð",
               "&Eacute" => "É",
               "&Iacute" => "Í",
               "&Oacute" => "Ó",
               "&uacute" => "ú",
               "&Uacute" => "Ú",
               "&THORN" => "Þ",
               "&thorn" => "þ",
               "&Ouml" => "Ö",
               "&aelig" => "æ",
               "&AELIG" => "Æ",
               "&aring" => "å",
               "&Aring" => "Å",
               "&oslash" => "ø",
               "&Oslash" => "Ø"
);
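
For readers on newer PHP, here is a rough sketch of how the entity loop above might look with PCRE functions. The helper name replaceEntities and the three-entry excerpt of $spec are assumptions made for illustration, not PhpDig code. The numeric-entity step uses preg_replace_callback so the character is computed per match; in the original line, chr('\\1') is evaluated once, before ereg_replace ever runs. chr() only suits single-byte codes, so treat that part as a sketch too.

```php
<?php
// Three-entry excerpt of $spec, chosen for illustration only.
$spec = array("&amp" => "&", "&eacute" => "é", "&nbsp" => " ");

// Hypothetical helper; mirrors the foreach loop in phpdigCleanHtml.
function replaceEntities($text, array $spec) {
    foreach ($spec as $entity => $char) {
        // preg_quote escapes any regex metacharacters in the entity name;
        // ';?' plays the role of "[;]?" and 'i' keeps eregi's case-insensitivity.
        $pattern = '/' . preg_quote($entity, '/') . ';?/i';
        $text = preg_replace($pattern, $char, $text);
    }
    // Numeric entities: convert each captured code when it is matched.
    return preg_replace_callback('/&#([0-9]+);/', function ($m) {
        return chr((int) $m[1]) . ' ';
    }, $text);
}

echo replaceEntities("caf&eacute;&nbsp;&amp; th&eacute &#65;", $spec), "\n";
```

One thing the sketch makes visible: because the match is case-insensitive, an entry such as "&eacute" will also consume "&Eacute" if it comes first in the array.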
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 10-14-2003, 05:29 AM   #17
manute
Orange Mole
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
hmm okay, then it's a bug. but what can i now do?
can't anyone just tell me how to get rid of "<!----------", "---------->" and everything in between?
Old 10-14-2003, 02:59 PM   #18
Charter
What version of PhpDig are you running?
Old 10-14-2003, 03:59 PM   #19
manute
i'm running "PhpDig 1.6.x" as it seems. that's written in the index.php.
i'm gonna try to understand the code you posted up there, but now i'm gonna go to bed... :-) cu!
Old 10-15-2003, 05:54 AM   #20
manute
okay charter, i changed the robot_functions the way you posted it up there. now i'm reindexing. hope it's gonna work now. :-/
Old 10-15-2003, 06:05 AM   #21
manute
yes, that finally seems to work now. thank you charter!
Old 10-16-2003, 05:03 AM   #22
manute
"§/$%&"$ i'm really going crazy with that ****. it still doesn't work.
i tried putting

$text = eregi_replace("<[^>]*>","",$text);

and

$text = preg_replace('/<.*>/sU', '', $text);

into the cleanhtml-function, but it still doesn't work.
haven't you tried it yourself - does that work with your sites?
Old 10-16-2003, 05:53 AM   #23
manute
now i got an idea. :-)
the html-entity replacing also happens in the cleanhtml-function. so after that, of course, there is no "<" left to be replaced, it's "&lt;" then.
i've tried to put it at the beginning of the function, before the html-entity-replacing. unfortunately spidering my site always takes a couple of hours, so i can't say if it works right now. but i'll do later.
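
To see why the position of the tag-stripping line matters, it can help to trace what the earlier steps of the posted phpdigCleanHtml do to a comment before that line runs. The sketch below is only an approximation: the patterns are PCRE ports of the ereg ones, and the input string is made up.

```php
<?php
// Hypothetical input; newlines are assumed to be replaced by spaces already.
$text = "a <!-- note --> b";

// Step 1: "--", some punctuation, closing tags and control whitespace become spaces.
$step1 = preg_replace('/--|[{}();"]+|<\/[a-z0-9]+>|[\r\n\t]+/i', ' ', $text);

// Step 2: drop the "!" after "<" unless a "-" follows it.
$step2 = preg_replace('/(<)!([^-])/', '$1$2', $step1);

// Step 3: strip everything between "<" and ">".
$step3 = preg_replace('/<[^>]*>/', '', $step2);

// Step 4: collapse runs of blanks into a single space.
$step4 = preg_replace('/[[:blank:]]+/', ' ', $step3);

foreach (array($step1, $step2, $step3, $step4) as $s) {
    echo "[", $s, "]\n";
}
```

After step 2 the comment no longer looks like a comment at all, just "<", some text and ">", so whatever does the final stripping has to cope with that form.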
Old 10-16-2003, 05:54 AM   #24
manute
hey, this thread more and more looks like a discussion between me and myself... ;-)
is anyone else still reading here at all? :-)
Old 10-16-2003, 09:39 AM   #25
Rolandks
Purple Mole
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
Quote:
Originally posted by manute
hey, this thread more and more looks like a discussion between me and myself... ;-)
Seems so
The following should work as a possible solution:

Change ONLY this in robot_functions.php Line 160:
Code:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
to
Code:
//replace any group of blank characters by
$text = preg_replace('/<.*>/U', '', $text);
It works with PHP 4.3.2 and PhpDig 1.6.2. No HTML comments are indexed!

-Roland-
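
The replacement line above strips tags but no longer collapses runs of blanks, and without the s modifier the dot will not cross a newline (which may not matter here, since phpdigCleanHtml replaces newlines with spaces at its start). A small sketch that keeps both behaviours, using a hypothetical helper name that is not part of PhpDig:

```php
<?php
// Hypothetical helper, for illustration only.
function stripTagsAndCollapse($text) {
    // Remove everything between "<" and ">", including across newlines (s).
    $text = preg_replace('/<.*>/sU', '', $text);
    // Collapse any run of spaces or tabs into one space.
    return preg_replace('/[[:blank:]]+/', ' ', $text);
}

echo stripTagsAndCollapse("one <!-- a\nb -->   two\t<i>three</i>"), "\n";
```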
Old 10-16-2003, 06:17 PM   #26
manute
yes! now it finally works. thanks rolandks and charter! :-)
Old 10-17-2003, 03:33 PM   #27
Charter
Great, glad it's now working.
Old 10-17-2003, 03:59 PM   #28
manute
yeah, me too! :-)
Old 01-19-2004, 05:25 PM   #29
ZAP
Green Mole
 
Join Date: Nov 2003
Posts: 7
Works for me also. I just noticed those ugly comments in my search result snippets today. I'm not sure when my host updated PHP, but as far as I'm concerned it's very naughty of them to change the behavior of a standard function so radically.