phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of URLs use a number of special characters, and I think they are just not being picked up by phpdig.

The following is missed for example:
http://my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml

These are static URLs, not dynamic URLs.

Where and how can I change the regular expression so that these pages get indexed?


I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet.

Thanks for the welcome.
It is not so much the character set, as the site is in English.
The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I posted above. While the individual characters are in there, I don't think the pattern would be matched. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below.

Regular expression:

"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"

Sample URL that was not found by phpdig:

my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml

The space in "ht ml" is a typo; the actual URL reads "html".
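
One quick way to test the suspicion about the character class is to run the path part of the sample URL against it in isolation. The sketch below is only an assumption-laden check, not phpdig code: it copies the path character class from the regular expression above, ports it to preg_match (eregi is not available in PHP 7 and later), and adds ^ and $ anchors so that the whole path has to match. The variable names are made up for illustration.

```php
<?php
// Path character class copied from the regular expression above and ported
// to PCRE. The ^ and $ anchors are added for this check only.
$pathClass = '#^[:%/?=&;\\\\,._a-zA-Z0-9|+ ()~-]*$#';

// Path part of the sample URL, with the "ht ml" typo corrected.
$path = '/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425'
      . '~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';

// preg_match returns 1 when the whole path consists of allowed characters.
$pathMatches = preg_match($pathClass, $path) === 1;
var_dump($pathMatches);
```

If this reports a match, the commas, colons and tildes are all accepted by the class, which would point to a later processing step rather than the link-extraction pattern.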

I don't even know how to read a URL like that, let alone modify the regular expression so that phpdig would index it. If someone else doesn't come along that helps you with modifying that, I know of another great forum (here (http://www.sitepoint.com/forums/index.php?referrerid=37747)) where you could probably get some help with that regular expression.

If the link were as follows:

http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.ht ml

Then the request sent to the server is as follows:

- - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"

See how at the first ":" it busts?

There are actually two spots in robot_functions.php to edit:

- One

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {

- Two

while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {

I don't have the time now to diagnose further, but maybe this tidbit will help you edit the two regexes.

Oh, and vB inserts space if there are too many chars in a row without space, so take that into account when considering the code posted herein.

Thanks for the help,

I have identified the problem. It is not in the regular expressions above but in the function that rewrites URLs (phpdigRewriteUrl), where the following line

$url = @parse_url(str_replace('\'"','',$eval));

Should be replaced with

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

What happens is that the parse_url function misinterprets the first colon in the path and therefore messes up the URL.
Using the split function fixes this problem.

Hope this helps someone else

Yes, I see what you mean about parse_url with those types of links.

If you plan to use:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

In place of:

$url = @parse_url(str_replace('\'"','',$eval));

Then in phpdigRewriteUrl add:

if (!eregi("[?]",$eval)) {
$eval .= "?";
}

Right before:

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

Otherwise, you can receive "undefined offset" notices when error reporting is on high.

Note: it's not enough to comment out the "remove ending question mark" line as phpdigRewriteUrl is called in various places with various content.
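
For anyone on a newer PHP, where split and eregi no longer exist, the same idea can be sketched with explode. This is only an illustration under assumptions, not the phpdig source: splitPathAndQuery is a hypothetical helper name, it needs PHP 7.1 or later, and it leaves out the str_replace step from the lines above. Padding the result to two elements avoids the "undefined offset" notice without having to append a "?" first.

```php
<?php
// Hypothetical helper, not part of phpdig: split a link into path and query
// at the first "?" only, so colons in the path are never interpreted.
function splitPathAndQuery(string $link): array
{
    // The limit of 2 means only the first "?" acts as a separator.
    // array_pad supplies an empty query when the link has no "?" at all.
    [$path, $query] = array_pad(explode('?', $link, 2), 2, '');
    return ['path' => $path, 'query' => $query];
}

// A static URL with colons in the path and no query string.
$a = splitPathAndQuery('/dir/0,,contentMDK:20295425~pagePK:64156298,00.html');

// A dynamic URL; a later "?" stays inside the query part.
$b = splitPathAndQuery('/dir/page.php?id=1&x=a?b');

var_dump($a, $b);
```

Because the path is never handed to a URL parser, the first colon has no special meaning here, which is the same effect the split-based replacement relies on.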