phpdigCleanHtml cleans too much
With these two lines in the function phpdigCleanHtml from robot_functions.php:
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
if we have, for example:
<script> fdlsm </script>
important information
<script> fdlsm </script>
the text "important information" will be erased, because the pattern is greedy: the match runs from the first <script> to the last </script>.
A possible correction:
$text = eregi_replace("<script[^>]*>([^<]+)?</script>","",$text);
$text = eregi_replace("<style[^>]*>([^<]+)?</style>","",$text);
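A minimal sketch of the difference between the two patterns. It uses preg_replace rather than eregi_replace, since the ereg functions were removed in PHP 7; the /i and /s modifiers are assumed here to stand in for eregi's case-insensitivity and for '.' matching newlines. The sample strings are made up for illustration.

```php
<?php
// Hypothetical sample page: useful text sits between two script blocks.
$text = "<script> fdlsm </script>\nimportant information\n<script> fdlsm </script>";

// Greedy pattern, equivalent in spirit to the original eregi_replace call.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/is', ' ', $text);

// Proposed correction: the script body may not contain '<'.
$fixed = preg_replace('/<script[^>]*>([^<]+)?<\/script>/i', '', $text);

// Limitation of the correction: a body containing '<' is not matched at all.
$withLt = "<script> if (a < b) { go(); } </script>";
$missed = preg_replace('/<script[^>]*>([^<]+)?<\/script>/i', '', $withLt);

var_dump(trim($greedy));       // empty string: the whole page was removed
var_dump(trim($fixed));        // "important information"
var_dump($missed === $withLt); // true: the script block was left in place
```

The last case suggests the [^<]+ version keeps the text but can leave behind any script that uses a '<' comparison or writes markup.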
Charter
04-11-2004, 01:55 PM
Hi. In the config file perhaps try setting define('CHUNK_SIZE',2048); to a lower number, that number being something small enough so that the 'important information' being cleaned isn't contained between first-last tags like those posted.
I think my explanation was bad,
so here is an example.
Given this HTML page:
******************
<script></script>
some text, html code, a usually page in html ...
<script></script>
another part for text ...
<script></script>
******************
this operation :
eregi("<script[^>]*>(.*)</script>",$txt,$regs);
will fill the variable $regs with:
***********************
print_r($regs):
array(
[0] => <script></script>
some text, html code, a usually page in html ...
<script></script>
another part for text ...
<script></script>
[1] => </script>
some text, html code, a usually page in html ...
<script></script>
another part for text ...
<script>
)
**********************
As you can see, $regs[1] contains nearly all of the HTML code! So the clean function will cut out the whole page if it contains a script tag at the beginning and another at the end!
And JavaScript is pretty popular!
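The capture in the example above can be reproduced roughly as follows. This is a sketch that assumes preg_match with the /is modifiers behaves like the eregi call for this input; the page content is shortened and made up.

```php
<?php
// Hypothetical page with three empty script blocks, as in the example above.
$txt = "<script></script>\nsome text\n<script></script>\nanother part\n<script></script>";

// Greedy capture: (.*) runs up to the last </script> in the page.
preg_match('/<script[^>]*>(.*)<\/script>/is', $txt, $regs);

var_dump($regs[0] === $txt);                       // true: the match spans the whole page
var_dump(strpos($regs[1], 'some text') !== false); // true: page text is inside the capture
```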
Charter
04-11-2004, 07:40 PM
Hi. I understood. ;)
Another approach would be to change define('CHUNK_SIZE',2048); to something small like define('CHUNK_SIZE',20); in the config file and then index.
The chunk size is basically the string length of a chunk of text sent to the phpdigCleanHtml function. A smaller chunk size should pick up text between tags but may increase index time.
Of course, TMTOWTDI.
Ah ok, I understand.
I thought that with a big chunk size I was getting the whole page :D
A better correction would be:
$txt = preg_replace("/<TAG[^>]*>(.*?)<\/TAG>/is"," ",$txt);
The '?' after '.*' makes the quantifier lazy, so each match stops at the first closing tag.
Moreover, preg functions are generally faster than ereg functions.
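A minimal sketch of the lazy approach applied to both tags. The helper name stripElements is hypothetical and not part of PhpDig; the \b after the tag name and the single-space replacement are assumptions added for the illustration.

```php
<?php
// Hypothetical helper (not part of PhpDig): remove whole elements for the given tag names.
function stripElements(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '/');
        // .*? is lazy; /i ignores case; /s lets '.' match newlines.
        $pattern = '/<' . $name . '\b[^>]*>.*?<\/' . $name . '>/is';
        $html = preg_replace($pattern, ' ', $html);
    }
    return $html;
}

// Made-up page: text sits between script and style blocks.
$page = "<script>a()</script>\nsome text\n<STYLE>p{}</STYLE>\nanother part\n<script>b()</script>";
$clean = stripElements($page, ['script', 'style']);
echo $clean; // the text between the blocks is kept, the blocks are replaced by spaces
```

Replacing each block with a space rather than an empty string keeps words on either side of a removed block from being joined together.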