
Indexing all HTML-Comments


Rolandks
09-22-2003, 09:53 AM
Win 2003 and IIS 6 don't like PhpDig :D

If you are indexing a Win 2003 / IIS 6 site and PhpDig itself runs on a Win 2003 server, it indexes ALL HTML comments:

<!--LayoutTable-->
<tr>
<td width="10" height="114">&nbsp;</td>
<td width="10">&nbsp;</td>
<td width="675">&nbsp;</td>
</tr>
<tr>
<!--Layout Empty Cell-->
<td height="164">&nbsp;</td>
<td colspan="2" valign="top">

"LayoutTable", "Layout", "Empty" and "Cell" all end up in the keywords table.

If you index the same Win 2003 / IIS 6 site while PhpDig runs on a Linux or Win 2000 server, it does NOT index HTML comments !! :mad: :confused:

Any ideas where the PHP code that excludes HTML comments is? Hmm, I haven't found it ....

-Roland-
PS: This fix is already included: http://www.phpdig.net/showthread.php?s=&threadid=67

Rolandks
09-30-2003, 04:54 AM
Any ideas where the PHP code that excludes HTML comments is? Hmm, I haven't found it ....
I think it is perhaps the same wrong \r\n bug as in the thread linked before, but I can't find the PHP code that generally excludes all HTML comments :confused:

In the text-content files (TXT), all HTML comments show up as < ... >. Example, 14.txt:
< Navigations-Table end > And here is the Real text
from the Page < Table-Image begin > Real Text from page......

PS: I found it: :)
robot_functions.php, line 156

//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);


But why would this not work on Win 2003?
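
For reference, here is a minimal sketch of what that line does, written with preg_replace instead of eregi_replace. The function name drop_bang is hypothetical and not part of PhpDig. It suggests that the line only rewrites tags such as <!DOCTYPE ...> and never touches <!-- ... --> comments, because the character after the "!" must not be a "-".

```php
<?php
// Hypothetical helper (not PhpDig code): same pattern as line 156,
// expressed with the PCRE functions; /i mirrors the case-insensitive eregi_replace.
function drop_bang($text)
{
    // "<!X" becomes "<X" unless X is "-", so "<!--" is left alone.
    return preg_replace('/(<)!([^-])/i', '$1$2', $text);
}

echo drop_bang('<!DOCTYPE html><!-- note --><p>'), "\n";
// prints: <DOCTYPE html><!-- note --><p>
```

So the comment markers must be changed or removed by some other step.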

Thanks
-Roland-

Rolandks
10-02-2003, 05:22 AM
I found it: :)
robot_functions.php, line 156

//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);

I have found out that the line above is OK and behaves the same on all servers and all PHP versions.
The following line is the one that "kills" the comments, but NOT with PHP 4.3.2 on Win 2003: :confused:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

I think "strip_tags" doesn't work for input like this: "< Navigations-Table end > And here is the Real text from the Page < Table-Image begin > Real Text from page.."


-Roland-

Rolandks
10-02-2003, 07:13 AM
PHP Bug #25730 : ereg_replace or strip_tags unexpected result:

This is quite expected behaviour. The SGML specification doesn't allow
whitespaces to appear right after the less than sign.

see: http://bugs.php.net/bug.php?id=25730


//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

Can anyone change this to be SGML-conformant? :confused: It has to be fixed for the future, because it never works with PHP > 4.3.2 !!

Thanks
-Roland-
PS: I changed the headline of this thread: it affects ALL operating systems!

Rolandks
10-04-2003, 06:38 AM
The following should work as a possible solution:

Change this in robot_functions.php, line 160:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

to

//remove everything between < and > (ungreedy), including the "< ... >" comment remnants
$text = preg_replace('/<.*>/U', '', $text);


Hope it works on all operating systems and all PHP versions :confused:
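
One thing to watch: the replacement line above no longer collapses runs of blanks, which the original line did. A minimal sketch that keeps both steps could look like the following. The function name strip_markup_and_collapse is hypothetical and not part of PhpDig. Using the /s modifier and replacing each match with a space (rather than an empty string) are assumptions made here so that tags spanning several lines are matched and adjacent words are not glued together.

```php
<?php
// Hypothetical helper (not PhpDig code): strip "<...>" spans, then collapse blanks.
function strip_markup_and_collapse($text)
{
    // Replace every "<...>" span with a space. ".*?" keeps each match as short
    // as possible; /s lets "." match line breaks inside a tag.
    $text = preg_replace('/<.*?>/s', ' ', $text);
    // Collapse any run of spaces and tabs into a single space
    // (the same job the [[:blank:]]+ replacement did in the original line).
    return preg_replace('/[ \t]+/', ' ', $text);
}

$sample = "< Navigations-Table end > And here is the Real text\n"
        . "from the Page < Table-Image begin > Real Text from page";
echo strip_markup_and_collapse($sample), "\n";
```

Note that a pattern like this also removes any ordinary text that happens to sit between a literal "<" and a later ">", so it is only a rough stand-in for strip_tags.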