phpdigCleanHtml cleans too much
With these two lines in the function phpdigCleanHtml from robot_functions.php:
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
if we have, for example:
<script> fdlsm </script> important information <script> fdlsm </script>
the text "important information" will be erased, because the pattern is greedy: it matches from the first <script> to the last </script>. A possible correction:
$text = eregi_replace("<script[^>]*>([^<]+)?</script>","",$text);
$text = eregi_replace("<style[^>]*>([^<]+)?</style>","",$text); |
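A quick way to check the proposed pattern is the minimal sketch below. It is only an approximation: eregi_replace() was removed in PHP 7, so preg_replace() with the "i" modifier is used as a stand-in, and the sample strings are made up for illustration. It also shows one limitation of the negated class: a script body that itself contains a "<" is no longer matched.

```php
<?php
// Stand-in for the proposed eregi_replace() pattern, rewritten for preg_replace().
$pattern = '/<script[^>]*>([^<]+)?<\/script>/i';

// Case from the post: the text between the two script blocks survives.
$page = '<script> fdlsm </script> important information <script> fdlsm </script>';
$cleaned = preg_replace($pattern, '', $page);
echo '[', $cleaned, "]\n";

// Limitation: a '<' inside the script body stops [^<]+ before </script>,
// so the whole block is left in place.
$tricky = '<script> if (a < b) { go(); } </script> text';
$unchanged = preg_replace($pattern, '', $tricky);
echo '[', $unchanged, "]\n";
```

So the negated class fixes the greedy case but can leave real-world JavaScript untouched, since comparisons and embedded markup in scripts commonly contain "<".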
Hi. In the config file, perhaps try lowering define('CHUNK_SIZE',2048); to a number small enough that the 'important information' being cleaned isn't contained between the first and last tags like those posted.
|
I think my explanation was bad, so here is an example. With this HTML page:
******************
<script></script> some text, html code, a usual html page ... <script></script> another part of the text ... <script></script>
******************
this operation:
eregi("<script[^>]*>(.*)</script>",$txt,$regs);
will fill the variable $regs with:
***********************
print_r($regs):
Array
(
[0] => <script></script> some text, html code, a usual html page ... <script></script> another part of the text ... <script></script>
[1] => </script> some text, html code, a usual html page ... <script></script> another part of the text ... <script>
)
**********************
As you can see, $regs[1] contains all the HTML code! So the clean function will cut the whole page if it contains a script tag at the beginning and at the end! And JavaScript is pretty popular! |
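The greedy capture described above can be reproduced on a current PHP version with the sketch below. It assumes preg_match() as a replacement for the removed eregi(), and the sample page is a shortened, hypothetical version of the one in the post.

```php
<?php
// Hypothetical sample page, shortened from the example in the post.
$txt = '<script></script> some text <script></script> another part <script></script>';

// Greedy (.*) runs on to the LAST </script>, as the POSIX pattern in the post does.
preg_match('/<script[^>]*>(.*)<\/script>/is', $txt, $regs);

// $regs[0] is the whole match; here it covers the entire page.
var_dump($regs[0] === $txt);

// $regs[1] is everything between the first <script> and the last </script>.
echo $regs[1], "\n";
```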
Hi. I understood. ;)
Another approach would be to change define('CHUNK_SIZE',2048); to something small, like define('CHUNK_SIZE',20); in the config file and then index again. The chunk size is basically the string length of each chunk of text sent to the phpdigCleanHtml function. A smaller chunk size should pick up text between tags, but may increase indexing time. Of course, TMTOWTDI. |
Ah OK, I understand.
I thought I was taking the whole page with a big chunk size :D |
A better correction would be (with TAG standing for script or style):
$txt = preg_replace("/<TAG[^>]*>(.*?)<\/TAG>/is"," ",$txt);
The '?' after '.*' makes the quantifier lazy, so the match stops at the first closing tag instead of the last one. Moreover, the preg functions are faster than the ereg functions. |
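As a rough illustration of the lazy pattern, here is a minimal sketch. The helper name stripTagBlocks and the sample page are hypothetical and not part of phpDig; the replacement string " " is assumed from the original eregi_replace() calls.

```php
<?php
// Hypothetical helper: strip paired <script>/<style> blocks with a lazy match.
function stripTagBlocks(string $text, array $tags = ['script', 'style']): string
{
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '/');
        // .*? stops at the first closing tag; "i" ignores case, "s" lets "." match newlines.
        $text = preg_replace('/<' . $name . '\b[^>]*>.*?<\/' . $name . '>/is', ' ', $text);
    }
    return $text;
}

// Made-up page with script and style blocks around the text to keep.
$page = "<script>a();</script> important information <STYLE>\np{}\n</STYLE> more text <script>b();</script>";
$result = stripTagBlocks($page);
echo '[', $result, "]\n";
```

Each block is removed on its own, so the text between blocks is kept. Unlike the [^<]+ variant, a "<" inside a script body does not prevent the match.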