phpdigCleanHtml clean too much


Jer
04-09-2004, 11:15 AM
With these two lines in the function phpdigCleanHtml from robot_functions.php:

$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);

if we have, for example:

<script> fdlsm </script>
important information
<script> fdlsm </script>

the text "important information" will be erased, because the ereg pattern is greedy: it matches from the first <script> to the last </script>.
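
The behaviour described above can be reproduced with a minimal sketch. It assumes preg_replace() instead of eregi_replace(), since the ereg functions were removed in PHP 7, and it uses the "s" modifier so that "." also matches newlines, which is what the output in this thread suggests ereg does. The variable names are illustrative only.

```php
<?php
// Sample input from the post: text sandwiched between two script blocks.
$text = "<script> fdlsm </script>\nimportant information\n<script> fdlsm </script>";

// Greedy ".*" runs to the end of the input and backtracks to the LAST
// </script>, so a single match covers both blocks and the text between them.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/is', ' ', $text);

// Only the replacement string is left: the useful text is gone.
var_dump($greedy);
```

Because the whole input is one match, nothing between the two script blocks survives the replacement.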

the correction may be:

$text = eregi_replace("<script[^>]*>([^<]+)?</script>","",$text);
$text = eregi_replace("<style[^>]*>([^<]+)?</style>","",$text);
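
One caveat with this correction, shown as a hedged sketch: "[^<]+" cannot cross a "<" character, so a script body that itself contains "<" (a comparison, for instance) is left in place. The pattern below is the same one rewritten for preg_replace(), which is an assumption made so the example runs on current PHP; the sample strings are made up.

```php
<?php
// The proposed pattern, expressed for preg_replace() (case-insensitive).
$pattern = '/<script[^>]*>([^<]+)?<\/script>/i';

// Case 1: the example from the post. Each block is removed separately.
$simple = "<script> fdlsm </script>\nimportant information\n<script> fdlsm </script>";
$cleanSimple = preg_replace($pattern, '', $simple);

// Case 2: the script body contains "<", so "[^<]+" stops before it
// and the pattern never reaches </script>. Nothing is removed.
$withLt = "<script>if (a < b) { go(); }</script>\nimportant information";
$cleanWithLt = preg_replace($pattern, '', $withLt);

var_dump($cleanSimple);
var_dump($cleanWithLt === $withLt);
```

So the correction fixes the over-cleaning, but it can leave some script code in the indexed text.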

Charter
04-11-2004, 01:55 PM
Hi. In the config file, perhaps try setting define('CHUNK_SIZE',2048); to a lower number, small enough that the 'important information' being cleaned isn't caught between the first and last tags like those posted.

Jer
04-11-2004, 04:18 PM
I think my explanation was bad, so here is an example.

With this HTML page:
******************
<script></script>

some text, html code, a usually page in html ...

<script></script>

another part for text ...

<script></script>
******************
this operation:
eregi("<script[^>]*>(.*)</script>",$txt,$regs);

will fill the variable $regs with:
***********************
print_r($regs):
array(
[0] => <script></script>

some text, html code, a usually page in html ...

<script></script>

another part for text ...

<script></script>
[1] => </script>

some text, html code, a usually page in html ...

<script></script>

another part for text ...

<script>
)
**********************
As you can see, $regs[1] contains all the HTML code! So the clean function will cut the whole page if it has a script tag at the beginning and another at the end.

And JavaScript is pretty popular!

Charter
04-11-2004, 07:40 PM
Hi. I understood. ;)

Another approach would be to change define('CHUNK_SIZE',2048); to something small like define('CHUNK_SIZE',20); in the config file and then index.

The chunk size is basically the string length of a chunk of text sent to the phpdigCleanHtml function. A smaller chunk size should pick up text between tags but may increase index time.

Of course, TMTOWTDI.

Jer
04-12-2004, 01:40 AM
Ah OK, I understand.

I thought I was taking the whole page with a big chunk size :D

Jer
04-12-2004, 02:57 AM
A better correction would be:

$txt = preg_replace("/<TAG[^>]*>(.*?)<\/TAG>/is"," ",$txt);

where TAG stands for script or style. The '?' after '.*' makes the quantifier lazy, so each match stops at the first closing tag.
Moreover, the preg functions are generally faster than the ereg functions.
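
The lazy approach can be sketched as a small helper. The function name stripTagBlocks and its parameters are hypothetical and not part of PhpDig; the replacement string " " is an assumption carried over from the original phpdigCleanHtml lines.

```php
<?php
// Hypothetical helper (not PhpDig code): removes <tag ...>...</tag> blocks.
function stripTagBlocks(string $text, array $tags = ['script', 'style']): string
{
    foreach ($tags as $tag) {
        // Escape the tag name in case it contains regex metacharacters.
        $name = preg_quote($tag, '/');
        // ".*?" is lazy, so each match ends at the nearest closing tag.
        // "i" ignores case, "s" lets "." match newlines.
        $pattern = '/<' . $name . '[^>]*>.*?<\/' . $name . '>/is';
        $result = preg_replace($pattern, ' ', $text);
        // preg_replace() returns null on a PCRE error; keep the text then.
        if ($result !== null) {
            $text = $result;
        }
    }
    return $text;
}

$page = "<script></script>\nsome text\n<SCRIPT>x < y</SCRIPT>\nanother part\n<style>p{}</style>";
$clean = stripTagBlocks($page);
var_dump($clean);
```

Each block is replaced on its own, the text between blocks is kept, and a "<" inside a script body no longer prevents the match.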