Indexing all HTML-Comments
Rolandks
09-22-2003, 09:53 AM
Win 2003 and IIS 6 don't like PhpDig :D
If you index a Win 2003 / IIS 6 site and PhpDig itself runs on a Win 2003 server, it indexes ALL HTML comments:
<!--LayoutTable-->
<tr>
<td width="10" height="114"> </td>
<td width="10"> </td>
<td width="675"> </td>
</tr>
<tr>
<!--Layout Empty Cell-->
<td height="164"> </td>
<td colspan="2" valign="top">
"LayoutTable", "Layout", "Empty" and "Cell" all end up in the keywords table.
If you index the same Win 2003 / IIS 6 site and PhpDig runs on a Linux or Win 2000 server, it does NOT index HTML comments !! :mad: :confused:
Any ideas? Where is the PHP code that excludes HTML comments? Hmm, I haven't found it ....
-Roland-
PS: This fix is already included: http://www.phpdig.net/showthread.php?s=&threadid=67
Rolandks
09-30-2003, 04:54 AM
Any ideas? Where is the PHP code that excludes HTML comments? Hmm, I haven't found it ....
I think it may be the same wrong \r\n bug as in the earlier thread, but I can't find the PHP code that excludes all HTML comments in general :confused:
In the text-content files (.txt), every HTML comment shows up as < ... >. Example, 14.txt:
< Navigations-Table end > And here is the Real text
from the Page < Table-Image begin > Real Text from page......
Ps: I found it: :)
robot_functions.php Line 156
//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);
But why would this not work on Win 2003?
Thanks
-Roland-
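For anyone following along, here is a minimal sketch of what the quoted line does. It assumes a current PHP where the ereg functions no longer exist, so `preg_replace` with the `i` modifier stands in for `eregi_replace`; the function name is made up for illustration. The pattern only drops the "!" from declarations such as <!DOCTYPE, and because of `[^-]` it deliberately leaves "<!--" alone, so this line is not the one that removes comments.

```php
<?php
// Hypothetical helper name; the pattern itself is the one quoted from robot_functions.php.
function drop_bang_from_declarations(string $text): string
{
    // "<!" followed by any character except "-" loses its "!".
    // "<!--" does not match, so HTML comments pass through untouched.
    return preg_replace('/(<)!([^-])/i', '$1$2', $text) ?? $text;
}

echo drop_bang_from_declarations('<!DOCTYPE html><!--LayoutTable--><tr>'), "\n";
```

So after this step the comments are still in the text, and removing them is left to a later line.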
Rolandks
10-02-2003, 05:22 AM
I found it: :)
robot_functions.php Line 156
//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);
I found out that the line above is OK and is the same on all servers and all PHP versions.
The following line "kills" the comments, but NOT with PHP 4.3.2 on Win 2003: :confused:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
I think "strip_tags" doesn't work for this: "< Navigations-Table end > And here is the Real text from the Page < Table-Image begin > Real Text from page.."
-Roland-
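A quick way to check the suspicion is to feed `strip_tags` both kinds of input directly. This is only a sketch: the first string is the remnant from the text-content file above, the second is a well-formed comment like the ones in the indexed page. As far as I can tell, `strip_tags` should return the first string unchanged, because a "<" followed by whitespace is not treated as the start of a tag, while the real comment and the td tags are removed.

```php
<?php
// Remnant as it appears in the text-content file: whitespace right after "<".
$remnant = '< Table-Image begin > Real Text from page';
// A well-formed HTML comment plus a normal tag, as in the indexed page.
$comment = '<!--LayoutTable--><td>Real Text from page</td>';

var_dump(strip_tags($remnant)); // the "< ... >" part is expected to survive
var_dump(strip_tags($comment)); // comment and tags are expected to be stripped
```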
Rolandks
10-02-2003, 07:13 AM
PHP Bug #25730 : ereg_replace or strip_tags unexpected result:
This is quite expected behaviour. The SGML specification doesn't allow
whitespaces to appear right after the less than sign.
see: http://bugs.php.net/bug.php?id=25730
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
Can anyone change this to be SGML-conformant? :confused: It must be fixed for the future, because it will never work with PHP 4.3.2 or later !!
Thanks
-Roland-
PS: I changed the headline of this thread: it affects ALL operating systems!
Rolandks
10-04-2003, 06:38 AM
The following should work as a possible solution:
Change this in robot_functions.php, line 160:
//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));
to
//remove everything between < and > (ungreedy match)
$text = preg_replace('/<.*>/U', '', $text);
I hope it works on all operating systems and all PHP versions :confused:
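One thing to watch with the replacement above: it removes the "< ... >" spans but no longer collapses blanks, and without the `s` modifier the dot does not match newlines, so a tag or comment spread over several lines would stay in. A hedged sketch of a variant that does both steps is below; the function name is made up, and the `s` modifier and the second `preg_replace` are my additions, not part of PhpDig.

```php
<?php
// Hypothetical helper: strip "<...>" spans, then collapse runs of blanks.
function strip_angle_spans(string $text): string
{
    // U = ungreedy, s = dot also matches newlines (assumption: multi-line tags should go too).
    $text = preg_replace('/<.*>/Us', '', $text) ?? $text;
    // Same intent as the original ereg_replace("[[:blank:]]+", " ", ...) call.
    return preg_replace('/[[:blank:]]+/', ' ', $text) ?? $text;
}

$sample = '< Navigations-Table end > And here is the Real text from the Page < Table-Image begin > Real Text';
echo strip_angle_spans($sample), "\n";
```

Note that a pattern like this also deletes ordinary text that happens to sit between a literal "<" and a later ">" (for example "a < b and c > d"), so it is a blunt tool compared with strip_tags.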