PhpDig.net

Old 04-09-2004, 11:15 AM   #1
Jer
Green Mole
 
Join Date: Apr 2004
Posts: 4
phpdigCleanHtml cleans too much

With these two lines in the phpdigCleanHtml function in robot_functions.php:

$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);

if we have, for example:

<script> fdlsm </script>
important information
<script> fdlsm </script>

the text "important information" will be erased, because the greedy ".*" makes the ereg function match from the first <script> to the last </script>.

A possible correction:

$text = eregi_replace("<script[^>]*>([^<]+)?</script>","",$text);
$text = eregi_replace("<style[^>]*>([^<]+)?</style>","",$text);
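The behaviour described above can be reproduced with a small sketch. It assumes a current PHP version, where the ereg functions are no longer available (they were removed in PHP 7.0), so preg_replace stands in for eregi_replace. The sample strings are made up for illustration. It also shows one limitation of the proposed pattern: a script body that contains "<" is not matched at all.

```php
<?php
// Sketch only: preg_replace stands in for eregi_replace (removed in PHP 7.0).
$page = "<script> fdlsm </script>\nimportant information\n<script> fdlsm </script>";

// Greedy ".*" with /s (so "." also crosses newlines, mirroring the multi-line
// match described in the post) runs from the first <script> to the last </script>.
$greedy = preg_replace('/<script[^>]*>.*<\/script>/is', ' ', $page);

// Proposed pattern: the body may not contain "<", so each block is matched alone.
$proposed = preg_replace('/<script[^>]*>([^<]+)?<\/script>/i', '', $page);

// Limitation: a script body containing "<" does not match, so nothing is removed.
$withLessThan = '<script> if (a < b) { go(); } </script>kept text';
$unchanged = preg_replace('/<script[^>]*>([^<]+)?<\/script>/i', '', $withLessThan);

var_dump($greedy);                      // the whole page collapses to one space
var_dump($proposed);                    // the text between the blocks survives
var_dump($unchanged === $withLessThan); // true: the script block is still there
```

So the "[^<]" approach keeps the text between blocks, but only as long as no script or style body contains a "<" character.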
Old 04-11-2004, 01:55 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. In the config file, perhaps try lowering define('CHUNK_SIZE',2048); to a number small enough that the 'important information' being cleaned no longer falls between a first and last tag like those posted.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 04-11-2004, 04:18 PM   #3
Jer
Green Mole
 
Join Date: Apr 2004
Posts: 4
I think my explanation was bad, so here is an example.

With this HTML page:
******************
<script></script>

some text, HTML code, a usual HTML page ...

<script></script>

another part of the text ...

<script></script>
******************
this operation:
eregi("<script[^>]*>(.*)</script>",$txt,$regs);

fills the variable $regs with:
***********************
print_r($regs):
array(
[0] => <script></script>

some text, HTML code, a usual HTML page ...

<script></script>

another part of the text ...

<script></script>
[1] => </script>

some text, HTML code, a usual HTML page ...

<script></script>

another part of the text ...

<script>
)
**********************
As you can see, $regs[1] contains all of the page's HTML! So the clean function will cut the whole page if it contains a script tag at the beginning and at the end!

And JavaScript is pretty popular!
Old 04-11-2004, 07:40 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. I understood.

Another approach would be to change define('CHUNK_SIZE',2048); to something small like define('CHUNK_SIZE',20); in the config file and then index.

The chunk size is basically the string length of a chunk of text sent to the phpdigCleanHtml function. A smaller chunk size should pick up text between tags but may increase index time.
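To make the chunk-size idea concrete, here is a rough sketch. It is not PhpDig's actual code: cleanChunk() and cleanInChunks() are hypothetical helpers, the sample page is made up, and splitting with str_split is only an assumption about how fixed-length chunks might be formed. preg_replace stands in for eregi_replace.

```php
<?php
// Hypothetical stand-in for the cleaning step; not PhpDig's actual function.
function cleanChunk(string $chunk): string
{
    // Same greedy pattern as in the thread, written for preg_replace.
    return preg_replace('/<script[^>]*>.*<\/script>/is', ' ', $chunk) ?? $chunk;
}

// Hypothetical helper: cut the page into fixed-size chunks and clean each one.
function cleanInChunks(string $html, int $chunkSize): string
{
    $cleaned = '';
    foreach (str_split($html, $chunkSize) as $chunk) {
        $cleaned .= cleanChunk($chunk);
    }
    return $cleaned;
}

$page = "<script>a()</script>\nimportant information\n<script>b()</script>";

$large = cleanInChunks($page, 2048); // whole page in one chunk: everything is cut
$small = cleanInChunks($page, 20);   // text survives, but a tag split across
                                     // two chunks is no longer recognised

var_dump($large);
var_dump($small);
```

In this sketch the small chunk size does keep "important information", but the second script block is cut in half by a chunk boundary and passes through uncleaned, so the result depends on where the boundaries happen to fall.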

Of course, TMTOWTDI.
Old 04-12-2004, 01:40 AM   #5
Jer
Green Mole
 
Join Date: Apr 2004
Posts: 4
Ah, OK, I understand.

I thought that with a big chunk size the whole page was taken at once.
Old 04-12-2004, 02:57 AM   #6
Jer
Green Mole
 
Join Date: Apr 2004
Posts: 4
A better correction would be:

$txt = preg_replace("/<TAG[^>]*>(.*?)<\/TAG>/is"," ",$txt);

where TAG stands for script or style. The '?' makes the '.*' quantifier lazy, so each match stops at the first closing tag.
Moreover, preg functions are generally faster than ereg functions.
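A minimal sketch of this lazy approach applied to both tags follows. stripBlocks() is a hypothetical helper name, not part of PhpDig, and the sample page is made up. A single space is used as the replacement, as in the original two lines, so that words on either side of a removed block do not run together.

```php
<?php
// Hypothetical helper: remove every block for the given tags with a lazy match.
function stripBlocks(string $text, array $tags = ['script', 'style']): string
{
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '/');
        // ".*?" stops at the first closing tag; /i ignores case,
        // /s lets "." match newlines so multi-line blocks are covered.
        $pattern = '/<' . $name . '[^>]*>.*?<\/' . $name . '>/is';
        $text = preg_replace($pattern, ' ', $text) ?? $text;
    }
    return $text;
}

$page = "<SCRIPT> if (a < b) { go(); } </SCRIPT>\nimportant information\n"
      . "<style>p { color: red; }</style>more text<script>x()</script>";

$clean = stripBlocks($page);
var_dump($clean);
```

Unlike the "[^<]" pattern, the lazy version also handles a script body that contains "<", and the text between blocks is kept.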

Last edited by Jer; 04-12-2004 at 03:00 AM.

All times are GMT -8.