Some fixes for phpDigCleanHtml()
I was confused by the results of indexing one site. I looked into phpDigCleanHtml() and saw that the regular expressions it uses to find tags are not robust enough. Take a look:
//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
If the page stores its title as <TITLE>Title of my site</TITLE>, this code does not work.
A more robust version is:
preg_match("/< *title *>(.*?)< *\/ *title *>/i",$text,$regs) // the "/" of the closing tag must be escaped because "/" is also the pattern delimiter
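To make the idea easier to try out, here is a minimal standalone sketch of PCRE-based title extraction. The helper name extract_title() is hypothetical and not part of phpDig, and the pattern is slightly broader than the one above: it uses "#" as the delimiter so the closing-tag slash needs no escaping, and adds the "s" modifier on the assumption that titles may span several lines.

```php
<?php
// Hypothetical helper (not part of phpDig): extract the page title with PCRE.
function extract_title($html)
{
    // "#" is the delimiter, so the "/" in the closing tag needs no escaping.
    // i: case-insensitive; s: "." also matches newlines (multi-line titles).
    if (preg_match('#<\s*title\b[^>]*>(.*?)<\s*/\s*title\s*>#is', $html, $regs)) {
        // Drop any tags nested inside the title and collapse whitespace runs.
        return trim(preg_replace('/\s+/', ' ', strip_tags($regs[1])));
    }
    return ''; // no title element found
}

echo extract_title("< TITLE >Title of\n my site</ title >"), "\n";
```

If the pattern behaves as intended, the call above should print the title on a single line, with the line break inside it collapsed to one space.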
The same applies to this code:
//delete content of head, script, and style tags
$text = eregi_replace("<head[^>]*>.*</head>"," ",$text);
//$text = eregi_replace("<script[^>]*>.*</script>"," ",$text); // more conservative
$text = preg_replace("/<script[^>]*?>.*?<\/script>/is","",$text); // less conservative
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
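The practical difference between ".*" and ".*?" in the lines above shows up when a page contains more than one script block. The sketch below is only an illustration: the two function names are hypothetical, and both variants use preg_replace() (the ereg functions were removed in PHP 7), so it demonstrates PCRE behaviour only.

```php
<?php
// Hypothetical helpers (not part of phpDig) comparing greedy and lazy matching.

// Greedy: ".*" runs to the LAST </script>, swallowing text between blocks.
function strip_scripts_greedy($html)
{
    return preg_replace('#<script[^>]*>.*</script>#is', ' ', $html);
}

// Lazy: ".*?" stops at the FIRST </script>, so each block is removed separately.
function strip_scripts_lazy($html)
{
    return preg_replace('#<script[^>]*>.*?</script>#is', ' ', $html);
}

$html = "<script>a()</script>Keep me<script>b()</script>";
var_dump(strip_scripts_greedy($html)); // "Keep me" is lost
var_dump(strip_scripts_lazy($html));   // "Keep me" survives
```

With the greedy pattern, everything from the first opening tag to the last closing tag is treated as one match, so any text between two script blocks disappears from the index.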
I think it would be better to replace every tag with a space. For example, change
$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>","",$text));
to
$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>"," ",$text));
(Now <td>Hello</td><td>Pavel</td> is indexed as two words, "Hello" and "Pavel", rather than as the single word "HelloPavel".)
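The effect of this change can be checked in isolation with a small sketch. The helper strip_tags_with() is hypothetical and not part of phpDig; it uses preg_replace() in place of the ereg functions and adds a trim() that the original line does not have, purely to keep the output tidy.

```php
<?php
// Hypothetical helper (not part of phpDig): replace every tag with
// $replacement, then collapse whitespace runs to a single space.
function strip_tags_with($html, $replacement)
{
    $text = preg_replace('/<[^>]*>/', $replacement, $html);
    // trim() is an addition for this demo only.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$cells = "<td>Hello</td><td>Pavel</td>";
echo strip_tags_with($cells, ""), "\n";  // HelloPavel
echo strip_tags_with($cells, " "), "\n"; // Hello Pavel
```

Because whitespace is collapsed afterwards anyway, inserting a space for every tag does not leave extra gaps in the indexed text.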
PS: Sorry for my English.