Old 10-09-2003, 07:09 AM   #1
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
need help: phpdig suddenly reads html-comments!

hi!

i've been using phpdig for about a year now and it has always worked fine.
but now suddenly - and i haven't changed anything - it has started reading html comments in the source code and putting them into the description.
since those have no place on the results page the user sees, i'd like to get rid of them.
has anyone else run into this problem and knows a "cure"?
Old 10-09-2003, 04:32 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
How about looking at this and modifying the if statement?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 10-10-2003, 06:49 AM   #3
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
no, that is not my problem. it doesn't only read that exclude-comment, but all html comments! so i get stuff like "main table starts here" etc. in my results page.
that really sucks! any ideas why that is?
Old 10-11-2003, 07:20 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. What version of PHP do you have? Try running the following. What are the results when viewing the HTML source?
PHP Code:
<?php
$text = "<!-- test -->";
$text2 = phpdigCleanHtml($text);

function phpdigCleanHtml($text) {
//htmlentities
//global $spec;

//replace blank characters by spaces
$text = ereg_replace("[\r\n\t]+"," ",$text);
echo $text . "A<br>\n";

//extracts title
if ( eregi("<title *>([^<>]*)</title *>",$text,$regs) ) {
    $title = $regs[1];
}
else {
    $title = "";
}
//delete content of head, script, and style tags
$text = eregi_replace("<head[^<>]*>.*</head>"," ",$text);
echo $text . "B<br>\n";
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
echo $text . "C<br>\n";
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
echo $text . "D<br>\n";
// clean tags
$text = eregi_replace("(</?[a-z0-9 ]+>)",'\\1 ',$text);
echo $text . "E<br>\n";
//tries to replace htmlentities by ascii equivalent
/*
foreach ($spec as $entity => $char) {
      $text = eregi_replace ($entity."[;]?",$char,$text);
      $title = eregi_replace ($entity."[;]?",$char,$title);
}
*/
$text = ereg_replace('&#([0-9]+);',chr('\\1').' ',$text);
echo $text . "F<br>\n";
//replace blank characters by spaces
$text = eregi_replace("--|[{}();\"]+|</[a-z0-9]+>|[\r\n\t]+",' ',$text);
echo $text . "G<br>\n";
//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\\1\\2',$text);
echo $text . "H<br>\n";
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
echo $text . "I<br>\n";
//$retour['content'] = $text;
//$retour['title'] = $title;
return $text;
}

echo $text2."J<br>";
?>
I get the following when I view the HTML source:
Code:
<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<!  test  >G<br>
<  test  >H<br>
I<br>
J<br>
Old 10-11-2003, 09:50 AM   #5
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
Hmm, these are the same problems I also posted about here:

PHP 4.3.2 - Result:
Code:
<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<!  test  >G<br>
<  test  >H<br>
< test >I<br>
< test >J<br>
I think this must change, because more and more people are using PHP 4.3.2 or newer, and with those versions all HTML comments and META tags get indexed.

-Roland-
Old 10-11-2003, 11:26 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. It seems that strip_tags in PHP 4.3.2 has been reworked, making it so that it doesn't eliminate as much as before. The following will remove everything between the < and > symbols.

In robot_functions.php, replace:
PHP Code:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
with the following:
PHP Code:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));
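For anyone reading this on a current PHP version: the ereg functions were removed in PHP 7, so the replacement above can only be sketched there with preg_replace. The following is a minimal sketch under that assumption; the helper name stripAngleBrackets is made up for illustration and is not part of PhpDig.

```php
<?php
// Hypothetical helper (not part of PhpDig): a PCRE version of the
// replacement shown above.
function stripAngleBrackets($text) {
    // Remove everything from a "<" up to the next ">". The negated
    // character class also matches newlines, so multi-line tags go too.
    $text = preg_replace('/<[^>]*>/', '', $text);
    // Collapse runs of spaces and tabs into a single space.
    return preg_replace('/[ \t]+/', ' ', $text);
}

echo stripAngleBrackets("a <  test  > b"), "\n"; // prints "a b"
```

Unlike strip_tags, this pattern does not care whether whitespace follows the "<", so it also removes the "<  test  >" leftovers from the test output.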
Old 10-11-2003, 11:40 AM   #7
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
I also found something:
Code:
$text = preg_replace('/<.*>/U', '', $text);
echo $text . "K<br>\n";
This also works for this case, but it doesn't solve the META tag indexing.

-Roland-
Old 10-11-2003, 12:28 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Quote:
PHP Code:
$text = preg_replace('/<.*>/U', '', $text);
echo $text . "K<br>\n";
Try adding the 's' to account for newline like so:
PHP Code:
$text = preg_replace('/<.*>/sU', '', $text);
echo $text . "K<br>\n";
You may want to remove all the whitespace though.
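A small sketch of what the 's' changes, using a made-up two-line comment as input. Without it, '.' does not match a newline, so a comment that spans lines is never matched; with it, the whole comment is removed.

```php
<?php
// Sample input (an assumption for illustration): a comment spanning two lines.
$html = "before<!-- main table\nstarts here -->after";

// Without "s", "." stops at the newline, so nothing matches.
$withoutS = preg_replace('/<.*>/U', '', $html);

// With "s", "." also matches newlines; "U" keeps the match ungreedy,
// so it ends at the first ">".
$withS = preg_replace('/<.*>/sU', '', $html);

var_dump($withoutS === $html); // bool(true): input left unchanged
echo $withS, "\n";             // prints "beforeafter"
```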
Old 10-12-2003, 04:01 PM   #9
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
@charter:

that really seems to be the problem. my host must have updated php. what i get is:

<!-- test -->A<br>
<!-- test -->B<br>
<!-- test -->C<br>
<!-- test -->D<br>
<!-- test -->E<br>
<!-- test -->F<br>
<! test >G<br>
< test >H<br>
< test >I<br>
< test >J<br>

the server is running PHP Version 4.3.3.
i'm now gonna try indexing with the

$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

workaround you posted. thanks! :-)
Old 10-12-2003, 04:08 PM   #10
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
and that seems to work. great. thanks again!
Old 10-12-2003, 06:01 PM   #11
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
Hm, okay, so which is the better solution going forward?

Code:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
OR
Code:
//replace any group of blank characters by an unique space
$text = preg_replace('/<.*>/sU', '', $text);
-Roland-
Old 10-12-2003, 06:32 PM   #12
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. My personal preference would be to use
PHP Code:
$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));
Of the two choices you mentioned, I would modify and choose
PHP Code:
$text = ereg_replace("[[:blank:]]+"," ",preg_replace('/<.*>/sU', '', $text));
It seems strip_tags is more lenient now than in older PHP versions.
Old 10-12-2003, 06:40 PM   #13
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
i just used this line

$text = ereg_replace("[[:blank:]]+"," ",eregi_replace("<[^>]*>","",$text));

and it seems to work perfectly. and that's good enough i guess. ;-)
Old 10-13-2003, 07:30 AM   #14
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
dammit. it still doesn't work. after completely reindexing the page (that takes some hours) i got comments in the results page again.

the original comment-line was

<!----------sub-navbar table ends here---------->

and in the html-source of the results page now i still find this:

&lt; sub-navbar table ends here &gt;

all my comments start with <!---------- and end with ---------->.
unfortunately i'm no php expert, but it must somehow be possible to get rid of those and everything in between. is it?
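One possible approach, as a minimal sketch only: remove whole comments from the raw HTML before any of the other cleaning steps touch the dashes. The helper name stripHtmlComments is made up and is not part of PhpDig, and where it would have to be called inside robot_functions.php is left open here. PhpDig's own exclude comments would presumably need to be handled before a step like this.

```php
<?php
// Hypothetical helper (not part of PhpDig).
function stripHtmlComments($html) {
    // Match "<!--", then as little as possible up to the first "-->".
    // The "s" modifier lets "." cross line breaks for multi-line comments.
    return preg_replace('/<!--.*?-->/s', ' ', $html);
}

$page = "<td>news</td><!----------sub-navbar table ends here---------->";
echo stripHtmlComments($page), "\n"; // prints "<td>news</td> "
```

The long runs of dashes are not a problem for this pattern: the comment still starts with "<!--" and ends with "-->", and everything in between is dropped with it.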
Old 10-13-2003, 08:58 AM   #15
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
Quote:
Originally posted by Charter
PHP Code:
$text = ereg_replace("[[:blank:]]+"," ",preg_replace('/<.*>/sU', '', $text));
It seems strip_tags is more lenient now than in older PHP versions.
NO. This won't work in the future! See the comment from manute.

This is quite expected behaviour. The SGML specification doesn't allow whitespaces to appear right after the less than sign: http://bugs.php.net/bug.php?id=25730
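To see where that whitespace comes from, here is a minimal sketch that replays steps G and H of the test script from earlier in this thread. The patterns are copied from that script but rewritten for preg_replace (an assumption made so the sketch runs on current PHP, where ereg is gone).

```php
<?php
$text = '<!-- test -->';

// Step G: "--", some punctuation, closing tags and line breaks become spaces.
$g = preg_replace('~--|[{}();"]+|</[a-z0-9]+>|[\r\n\t]+~i', ' ', $text);

// Step H: drop the "!" that directly follows "<".
$h = preg_replace('/(<)!([^-])/', '$1$2', $g);

// Collapse runs of blanks, as the final step of the script does.
$collapsed = preg_replace('/[ \t]+/', ' ', $h);

echo $g, "\n";         // <!  test  >
echo $h, "\n";         // <  test  >
echo $collapsed, "\n"; // < test >
```

By the time the text reaches strip_tags, the comment has already become "<  test  >", with whitespace right after the less-than sign. According to the bug report, that is exactly the case strip_tags no longer treats as a tag, which matches the "< test >" seen in outputs I and J on PHP 4.3.2 and 4.3.3.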

-Roland-