PhpDig.net

Indexing all HTML-Comments (http://www.phpdig.net/forum/showthread.php?t=85)

Rolandks 09-22-2003 09:53 AM

Indexing all HTML-Comments PHP > 4.3.2
 
Win 2003 and IIS 6 don't like PhpDig :D

If you index a Win 2003 / IIS 6 site and PhpDig itself runs on a Win 2003 server, it indexes ALL HTML comments:
Code:

<!--LayoutTable-->
  <tr>
    <td width="10" height="114">&nbsp;</td>
    <td width="10">&nbsp;</td>
    <td width="675">&nbsp;</td>
    </tr>
  <tr>
<!--Layout Empty Cell-->
<td height="164">&nbsp;</td>
    <td colspan="2" valign="top">

"LayoutTable", "Layout", "Empty" and "Cell" all end up in the keywords table.

If you index the same Win 2003 / IIS 6 site but PhpDig runs on a Linux or Win 2000 server, it does NOT index the HTML comments!! :mad: :confused:

Any ideas? Where is the PHP code that excludes HTML comments? Hmm, I can't find it....

-Roland-
PS: This fix is already included: http://www.phpdig.net/showthread.php?s=&threadid=67

Rolandks 09-30-2003 04:54 AM

Re: Again Win 2003: indexing all HTML-Comments
 
Quote:


Any ideas? Where is the PHP code that excludes HTML comments? Hmm, I can't find it....

I think it may be the same \r\n bug as in the earlier thread, but I can't find the PHP code that excludes all HTML comments in general :confused:

In the text-content files (TXT), every HTML comment shows up as < ... >. Example, 14.txt:
Code:

< Navigations-Table end > And here is the Real text
from the Page < Table-Image begin > Real Text from page......

PS: I found it: :)
robot_functions.php, line 156
Code:

//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);

But why would this not work on Win 2003?
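
To see what the line above actually does, here is a minimal sketch. It assumes preg_replace() as a stand-in for eregi_replace() (the ereg functions are gone in PHP 7 and later), and drop_bang() is a hypothetical helper name, not part of PhpDig. The pattern only removes the "!" from "<!" when the next character is not "-", so it rewrites things like <!DOCTYPE but leaves <!-- comments alone:

```php
<?php
// Hypothetical helper, not PhpDig code: a PCRE version of the
// eregi_replace('(<)!([^-])','\1\2',$text) call quoted above.
function drop_bang(string $text): string
{
    // "<!" followed by any character except "-" loses its "!"
    return preg_replace('/(<)!([^-])/', '$1$2', $text);
}

echo drop_bang('<!DOCTYPE html>'), "\n";     // <DOCTYPE html>
echo drop_bang('<!--LayoutTable-->'), "\n";  // <!--LayoutTable-->
```

So this line by itself does not remove HTML comments on any server; it only turns <!SOMETHING declarations into ordinary-looking tags for a later step to strip.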

Thanks
-Roland-

Rolandks 10-02-2003 05:22 AM

Re: Re: Again Win 2003: indexing all HTML-Comments
 
Quote:

I found it: :)
robot_functions.php Line 156
Code:

//f..k <!SOMETHING tags !!
$text = eregi_replace('(<)!([^-])','\1\2',$text);


I found out that the line above is OK and behaves the same on all servers and all PHP versions.
The following line "kills" the comments, but NOT with PHP 4.3.2 on Win 2003: :confused:
Code:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

I think strip_tags() doesn't work for this: "< Navigations-Table end > And here is the Real text from the Page < Table-Image begin > Real Text from page.."
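
A quick way to check this suspicion is to feed strip_tags() the sample string directly. This is only a sketch: the first string is taken from the text above, the second (with a real <b> tag) is an added comparison case. On current PHP versions, "<" followed by whitespace is not treated as the start of a tag, so the "< ... >" pieces should come back unchanged:

```php
<?php
// $pseudo comes from the 14.txt example; $real is an added comparison case.
$pseudo = '< Navigations-Table end > And here is the Real text';
$real   = '<b>Navigations-Table end</b> And here is the Real text';

// "<" followed by a space does not open a tag, so nothing is stripped.
var_dump(strip_tags($pseudo) === $pseudo);  // bool(true)

// A well-formed tag is removed as usual.
echo strip_tags($real), "\n";  // Navigations-Table end And here is the Real text
```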


-Roland-

Rolandks 10-02-2003 07:13 AM

PHP Bug #25730 : ereg_replace or strip_tags unexpected result:
Quote:

This is quite expected behaviour. The SGML specification doesn't allow
whitespaces to appear right after the less than sign.

see: http://bugs.php.net/bug.php?id=25730
Code:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

Can anyone change this so that it is SGML-conformant? :confused: It has to be fixed for the future, because it will never work with PHP > 4.3.2!!

Thanks
-Roland-
PS: I changed the headline of this thread: it affects ALL operating systems!

Rolandks 10-04-2003 06:38 AM

The following should work as a possible solution:

Change this in robot_functions.php Line 160:
Code:

//replace any group of blank characters by an unique space
$text = ereg_replace("[[:blank:]]+"," ",strip_tags($text));

to
Code:

//remove everything between < and > (ungreedy match), including "< ... >" pseudo-tags
$text = preg_replace('/<.*>/U', '', $text);

Hope it works on all operating systems and all PHP versions :confused:
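
Note that the replacement line no longer collapses whitespace the way the original one did, and that "." does not match line breaks without the "s" modifier. A minimal sketch that keeps both steps could look like the following. clean_text() is a hypothetical name, not part of PhpDig, and replacing tags with a space instead of an empty string is an assumption made here so that words separated only by a tag do not run together:

```php
<?php
// Hypothetical helper, not PhpDig code.
function clean_text(string $text): string
{
    // Remove everything between "<" and ">" (U = ungreedy, s = "." also
    // matches line breaks). This also catches "< ... >" pseudo-tags that
    // strip_tags() leaves alone.
    $text = preg_replace('/<.*>/Us', ' ', $text);

    // Collapse each run of spaces and tabs into a single space,
    // as the original ereg_replace() line did.
    return preg_replace('/[[:blank:]]+/', ' ', $text);
}

echo clean_text("< Navigations-Table end > And here is the Real text\nfrom the Page < Table-Image begin > Real Text"), "\n";
```

One caveat with any pattern of this kind: a literal "<" in the page text (for example "a < b and c > d") is treated as the start of a tag, so everything up to the next ">" is dropped as well.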


All times are GMT -8.