urls with collection of weird characters


revenazb
01-09-2005, 06:55 AM
Hi,

phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of the URLs use a number of special characters, and I think they are just not being picked up by phpdig.

The following URL is missed, for example:
http://my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml

These are static URLs, not dynamic URLs.

Where can I change the regular expression, and how, so that these pages get indexed?

Bert

vinyl-junkie
01-09-2005, 07:20 AM
Welcome to the forum, revenazb. :D

I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet.

revenazb
01-09-2005, 08:11 AM
Hi,

Thanks for the welcome.
It is not so much the character set, as the site is in English.
The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I posted above. While the individual characters are in there, I don't think the pattern would be matched. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below.

Regular expression:

"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"

Sample URL that was not found by phpdig:

my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml


The space in "ht ml" is a typo. The actual URL reads "html".

vinyl-junkie
01-09-2005, 08:36 AM
I don't even know how to read a URL like that, let alone modify the regular expression so that phpdig would index it. If nobody else comes along to help you modify it, I know of another great forum (here (http://www.sitepoint.com/forums/index.php?referrerid=37747)) where you could probably get some help with that regular expression.

Charter
01-09-2005, 09:15 AM
If the link were as follows:

http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml

Then the request sent to the server is as follows:

127.0.0.1 - - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"

See how at the first ":" it busts?

There are actually two spots in robot_functions.php to edit:

- One

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {

- Two

while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {

I don't have the time now to diagnose further, but maybe this tidbit will help you edit the two regexes.

Oh, and vB inserts space if there are too many chars in a row without space, so take that into account when considering the code posted herein.
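
One way to check whether the path character class in those two patterns is what rejects the link is to run the class on its own against the sample path. What follows is only a minimal sketch. It assumes a PHP version where the ereg functions are gone (they were removed in PHP 7.0), so it uses preg_match instead of eregi. The variable names are made up for the example, and the literal backslash from the original class is left out for simplicity.

```php
<?php
// Path character class from the two patterns above, minus the literal backslash.
// Assumption: the space inside the class is intentional, not a vB line-break artifact.
$pathClass = '[:%/?=&;,._a-zA-Z0-9|+ ()~-]';

// Sample path from this thread, with "ht ml" written as "html".
$samplePath = '/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425'
            . '~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';

// Anchor the class so the whole path has to consist of allowed characters.
$classAcceptsPath = preg_match('#^' . $pathClass . '*$#', $samplePath) === 1;

echo $classAcceptsPath ? "class accepts the path\n" : "class rejects the path\n";
```

If the class accepts the path, the character class itself is probably not what drops the link, and the code that processes the captured URL afterwards is the next place to look.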

revenazb
01-09-2005, 06:16 PM
Thanks for the help,

I have identified the problem. It is not in the regular expressions above but in the URL rewriting function (phpdigRewriteUrl), where the following line

$url = @parse_url(str_replace('\'"','',$eval));

Should be replaced with

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

What happens is that the parse_url function tries to interpret the first colon in the path and therefore messes up.
Using the split function fixes this problem.
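
For anyone on a newer PHP: split() was removed in PHP 7.0 along with the other ereg functions, so the replacement line above would need explode() or preg_split() there. A rough sketch of the same idea, assuming PHP 7 or later; splitPathAndQuery is a hypothetical helper name for illustration, not something that exists in PhpDig.

```php
<?php
// Hypothetical helper for illustration; it is not a PhpDig function.
// Splits a link into path and query at the first "?" without parse_url().
function splitPathAndQuery(string $eval): array
{
    // Same cleanup call as in the lines quoted above.
    $clean = str_replace('\'"', '', $eval);

    // Limit 2: everything after the first "?" stays in the query part.
    $parts = explode('?', $clean, 2);

    return [
        'path'  => $parts[0],
        'query' => $parts[1] ?? '',   // no "?" in the link means an empty query
    ];
}

$link = '0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';
$url  = splitPathAndQuery($link);
$withQuery = splitPathAndQuery('page.html?a=1&b=2');
```

Because nothing here looks at colons, the path comes back untouched. One difference from split("\?", ...) is the limit of 2, which keeps any later "?" characters inside the query string instead of cutting them off.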

Hope this helps someone else.

Charter
01-10-2005, 01:09 AM
Yes, I see what you mean about parse_url with those types of links.

If you plan to use:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

In place of:

$url = @parse_url(str_replace('\'"','',$eval));

Then in phpdigRewriteUrl add:

if (!eregi("[?]",$eval)) {
    $eval .= "?";
}

Right before:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

Otherwise, you can receive "undefined offset" notices when error reporting is on high.

Note: it's not enough to comment out the "remove ending question mark" line as phpdigRewriteUrl is called in various places with various content.
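
To make the reason for the guard concrete, here is a small sketch of what happens with and without the trailing "?". It assumes a PHP version without eregi() and split() (both were removed in PHP 7.0), so strpos() and explode() stand in for them. ensureQuestionMark is a hypothetical name used only for this example.

```php
<?php
// Hypothetical helper: append "?" when the link has none,
// mirroring the eregi("[?]", $eval) guard above.
function ensureQuestionMark(string $eval): string
{
    if (strpos($eval, '?') === false) {
        $eval .= '?';
    }
    return $eval;
}

$plain = 'dir/page.html';

// Without the guard there is only one piece, so a second list() target
// would have no offset to read from.
$unguarded = explode('?', $plain);

// With the guard there are always at least two pieces.
list($path, $query) = explode('?', ensureQuestionMark($plain));
```

With the guard in place, a link without a query string simply ends up with an empty query instead of a missing array offset.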