URLs with a collection of weird characters
Hi,
phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of URLs use a number of special characters, and I think they are just not being picked up by phpdig. The following is missed, for example: http://my.intranet.com/WBSITE/INTRAN...489784,00.html These are static URLs, not dynamic URLs. Where can I change the regular expression, and how, so that these pages get indexed? Bert |
Welcome to the forum, revenazb. :D
I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet. |
Hi,
Thanks for the welcome. It is not so much the character set, as the site is in English. The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I put above. While the individual characters are in there, I don't think the pattern would be matched. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below. Regular expression: Code:
"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"
Code:
my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html |
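One way to test the suspicion about the pattern is to run the path part of the URL against the path character class on its own. The following is only a rough sketch, not PhpDig code: `pathClassAccepts` is a made-up helper, and it uses PCRE (`preg_match`) as a stand-in for the POSIX `eregi` call, which is an assumption, since the two engines are not identical.

```php
<?php
// Hypothetical check, not part of PhpDig: does the path character class
// from the link expression accept every character of the problem URL path?
// Assumption: PCRE (preg_match) stands in for the POSIX eregi() call.
function pathClassAccepts(string $path): bool
{
    // Same characters as the path class quoted above, reordered so the
    // class does not begin with "[:" (which resembles POSIX class syntax).
    $pattern = '#^[%:/?=&;\\\\,._a-zA-Z0-9|+ ()~-]*$#';
    return preg_match($pattern, $path) === 1;
}

$path = '/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425'
      . '~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';

// Dump the result for the URL path from this thread.
var_dump(pathClassAccepts($path));
```

If this reports true, the commas, colons and tildes are all covered by the character class, and the cause is more likely somewhere other than the class itself.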
I don't even know how to read a URL like that, let alone modify the regular expression so that phpdig would index it. If someone else doesn't come along that helps you with modifying that, I know of another great forum (here) where you could probably get some help with that regular expression.
|
If the link were as follows:
Code:
http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Code:
127.0.0.1 - - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"
There are actually two spots in robot_functions.php to edit:
- One
Code:
while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
Code:
while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
Oh, and vB inserts a space if there are too many characters in a row without one, so take that into account when reading the code posted here. |
Thanks for the help,
I have identified the problem. It is not in the regular expressions above but in the function that rewrites URLs, where the following line
$url = @parse_url(str_replace('\'"','',$eval));
should be replaced with
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
What happens is that the parse_url function tries to interpret the first colon in the path and therefore messes up. Using the split function fixes this problem. Hope this helps someone else. |
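For anyone reading this on a newer PHP: `split()` was removed in PHP 7, so the same idea needs another function there. Below is a minimal sketch of the split-at-the-question-mark approach using `explode()`. The helper name `splitPathAndQuery` is invented for illustration and is not a PhpDig function; the `str_replace` call simply mirrors the one quoted above.

```php
<?php
// Hypothetical helper, not part of PhpDig: split a link into path and
// query at the first "?" instead of handing it to parse_url().
function splitPathAndQuery(string $link): array
{
    // Mirrors the str_replace() call from the thread (it removes the
    // two-character sequence '" wherever it occurs).
    $clean = str_replace('\'"', '', $link);

    // Limit of 2: everything after the first "?" stays in the query.
    $parts = explode('?', $clean, 2);

    return [
        'path'  => $parts[0],
        'query' => $parts[1] ?? '',   // empty string when there is no "?"
    ];
}

$link = '0,,contentMDK:20295425~pagePK:64156298~piPK:64152276'
      . '~theSitePK:489784,00.html';
$url = splitPathAndQuery($link);

// The colons stay inside the path because nothing tries to read a
// scheme or port out of the string.
echo $url['path'], "\n";
```

Unlike the `split("\?", ...)` line, this version cuts only at the first question mark and always sets the query key, so treat it as a variation on the fix rather than a drop-in copy.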
Yes, I see what you mean about parse_url with those types of links.
If you plan to use: Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Code:
$url = @parse_url(str_replace('\'"','',$eval));
Code:
if (!eregi("[?]",$eval)) {
Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Note: it's not enough to comment out the "remove ending question mark" line as phpdigRewriteUrl is called in various places with various content. |