PhpDig.net


pavel 08-24-2004 12:15 AM

Some fixes for phpDigCleanHtml()
 
I was confused by the results of indexing one site. I looked into phpDigCleanHtml() and saw that the regular expressions used to find tags are not robust enough. Take a look:

//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {

If the page title is stored as <TITLE>Title of my site</TITLE>, this code does not work.
A more robust version is:

preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)

The same applies to this code:

//delete content of head, script, and style tags
$text = eregi_replace("<head[^>]*>.*</head>"," ",$text);
//$text = eregi_replace("<script[^>]*>.*</script>"," ",$text); // more conservative
$text = preg_replace("/<script[^>]*?>.*?<\/script>/is","",$text); // less conservative
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
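
A minimal sketch of why the non-greedy .*? in the preg_replace() line matters. The sample HTML and variable names are made up for illustration, and preg_replace() is used for both variants because the ereg functions were removed in PHP 7:

```php
<?php
// Hypothetical sample page: two script blocks with real text between them.
$html = '<script>var a = 1;</script>Visible text<script>var b = 2;</script>';

// Greedy: .* runs to the LAST </script>, so the text in between is lost.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/is', ' ', $html);

// Non-greedy: .*? stops at the FIRST </script>, so each block is removed separately.
$lazy = preg_replace('/<script[^>]*>.*?<\/script>/is', ' ', $html);

var_dump($greedy); // " "
var_dump($lazy);   // " Visible text "
```

The greedy head and style patterns would behave the same way on a page that contains more than one such block.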

I think it would be better to replace every tag with a space. For example, replace

$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>","",$text));

with

$text = ereg_replace("[[:space:]]+"," ",eregi_replace("<[^>]*>"," ",$text));

(Now <td>Hello</td><td>Pavel</td> will be indexed correctly as two words, "Hello" and "Pavel", instead of "HelloPavel".)
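
A small sketch of the difference, assuming preg_replace() as a stand-in for ereg_replace()/eregi_replace(). The variable names and the extra trim() call are my own additions, not part of phpDigCleanHtml():

```php
<?php
// Hypothetical table row with two adjacent cells.
$html = '<td>Hello</td><td>Pavel</td>';

// Tags replaced by an empty string: the two words run together.
$joined = preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', '', $html));

// Tags replaced by a space: the words stay separate; trim() drops the edge spaces.
$separate = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', ' ', $html)));

var_dump($joined);   // "HelloPavel"
var_dump($separate); // "Hello Pavel"
```

The outer whitespace replacement already collapses the extra spaces that the inner replacement introduces, so the change costs nothing in the indexed text.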

PS: Sorry for my English.

pavel 08-24-2004 12:19 AM

There was an error
 
In my previous post, please read

preg_match("/< *title *>(.*?)< */ *title *>/i",$text,$regs)

as

preg_match('/< *title *>(.*?)< *\/ *title *>/i',$text,$regs)
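
As a possible alternative to escaping the slash, PCRE accepts other delimiters. This sketch uses # and a made-up input string to show the same pattern matching upper-case tags with stray spaces:

```php
<?php
// Using # as the delimiter means the slash in the closing tag needs no escaping.
$pattern = '#< *title *>(.*?)< */ *title *>#i';

// Hypothetical input with upper-case tags and stray spaces inside them.
$text = '<html><head>< TITLE >Title of my site</ title ></head></html>';

$found = preg_match($pattern, $text, $regs);
if ($found) {
    var_dump($regs[1]); // "Title of my site"
}
```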

pavel 08-24-2004 12:46 AM

Bad example
 
I gave a bad example, sorry. Try to dig HTML with this title:
<TITLE>HOME > NEWS</TITLE>
I know that > must be written as &gt;, but not all webmasters know this :)
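
To make the failing case concrete, here is a sketch that runs both patterns against that title. The first pattern is my preg_match() port of the original eregi() expression (an assumption, since eregi() is gone in PHP 7), and the variable names are illustrative:

```php
<?php
$text = '<TITLE>HOME > NEWS</TITLE>';

// Port of the original eregi() pattern: [^<>]* cannot cross the bare ">".
$strict = preg_match('/<title *>([^<>]*)<\/title *>/i', $text, $old);

// Proposed pattern: .*? accepts any characters up to the closing tag.
$loose = preg_match('/< *title *>(.*?)< *\/ *title *>/i', $text, $new);

var_dump($strict); // int(0)
var_dump($new[1]); // "HOME > NEWS"
```

Note that without the s modifier the dot does not match a newline, so a title that spans several lines would still be missed by the proposed pattern.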


All times are GMT -8.