PhpDig.net


revenazb 01-09-2005 06:55 AM

urls with collection of weird characters
 
Hi,

phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of the URLs use a number of special characters, and I think they are just not being picked up by phpdig.

The following is missed for example:
http://my.intranet.com/WBSITE/INTRAN...489784,00.html

These are static URLs, not dynamic URLs.

Where can I change the regular expression, and how, so that these pages get indexed?

Bert

vinyl-junkie 01-09-2005 07:20 AM

Welcome to the forum, revenazb. :D

I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet.

revenazb 01-09-2005 08:11 AM

Hi,

Thanks for the welcome.
It is not so much the character set, as the site is in English.
The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I put above. While the individual characters are in there, I don't think the pattern would be picked up. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below.

Regular expression:
Code:

"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"
Sample URL that was not found by phpdig:
Code:

my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Any spaces in "ht ml" are a typo. The actual URL reads "html".

vinyl-junkie 01-09-2005 08:36 AM

I don't even know how to read a URL like that, let alone how to modify the regular expression so that phpdig would index it. If no one else comes along to help you modify it, I know of another great forum (here) where you could probably get some help with that regular expression.

Charter 01-09-2005 09:15 AM

If the link were as follows:
Code:

http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Then the request sent to the server is as follows:
Code:

127.0.0.1 - - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"
See how at the first ":" it busts?

There are actually two spots in robot_functions.php to edit:

- One
Code:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
- Two
Code:

while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
I don't have the time now to diagnose further, but maybe this tidbit will help you edit the two regexes.

Oh, and vB inserts a space if there are too many characters in a row without one, so take that into account when reading the code posted here.

revenazb 01-09-2005 06:16 PM

Thanks for the help,

I have identified the problem. It is not in the regular expressions above but in the function that rewrites URLs, where the following line

$url = @parse_url(str_replace('\'"','',$eval));

should be replaced with

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

What happens is that the parse_url function interprets the first colon in the path and therefore messes up.
Using the split function fixes this problem.

Hope this helps someone else.
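
The claim that the regular expressions are not at fault can be checked in isolation. The following is a minimal sketch, not PhpDig code: it copies only the path character class from the first pattern and tests it with preg_match, because eregi was removed in PHP 7. The function name pathCharsAccepted is hypothetical.

```php
<?php
// Hypothetical helper, not part of PhpDig: reports whether every character
// of $link belongs to the path character class used in robot_functions.php.
function pathCharsAccepted(string $link): bool
{
    // Character class copied from the posted pattern. The four backslashes
    // in this single-quoted string become one literal backslash in the class.
    $pathClass = '[:%/?=&;\\\\,._a-zA-Z0-9|+ ()~-]';

    // "#" is the delimiter because the class itself contains "/" and "~".
    return preg_match('#^' . $pathClass . '*$#', $link) === 1;
}

$sample = 'my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,'
        . 'contentMDK:20295425~pagePK:64156298~piPK:64152276~'
        . 'theSitePK:489784,00.html';

var_dump(pathCharsAccepted($sample));
```

For the sample URL this reports true, since commas, colons and tildes are all in the class, which is consistent with the problem lying in the later URL handling rather than in link extraction.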

Charter 01-10-2005 01:09 AM

Yes, I see what you mean about parse_url with those types of links.

If you plan to use:
Code:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
In place of:
Code:

$url = @parse_url(str_replace('\'"','',$eval));
Then in phpdigRewriteUrl add:
Code:

if (!eregi("[?]",$eval)) {
    $eval .= "?";
}

Right before:
Code:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Otherwise, you can receive "undefined offset" notices when error reporting is on high.

Note: it's not enough to comment out the "remove ending question mark" line as phpdigRewriteUrl is called in various places with various content.
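
For anyone applying this on a newer PHP: split and eregi were removed in PHP 7, so the lines above will not run there as written. Below is a rough sketch of the same idea using explode, assuming the only goal is to separate the path from the query at the first "?". The function name splitPathAndQuery is hypothetical and is not part of PhpDig; it does not reproduce the str_replace call or the rest of phpdigRewriteUrl.

```php
<?php
// Hypothetical helper, not part of PhpDig: splits a link into path and
// query at the first "?" without relying on parse_url().
function splitPathAndQuery(string $link): array
{
    // A limit of 2 means only the first "?" is used as the separator.
    // array_pad() guarantees two elements, so a link without "?" yields an
    // empty query instead of an "undefined offset" notice.
    [$path, $query] = array_pad(explode('?', $link, 2), 2, '');

    return ['path' => $path, 'query' => $query];
}

$parts = splitPathAndQuery('dir/0,,contentMDK:20295425~pagePK:64156298,00.html');
var_dump($parts);
```

Padding the result plays the same role as appending "?" to $eval before the split: a link with no query string still produces both a path and a query element.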


All times are GMT -8.