PhpDig.net

01-09-2005, 06:55 AM   #1
revenazb
Green Mole
 
Join Date: Jan 2005
Posts: 3
URLs with collection of weird characters

Hi,

phpdig is having trouble indexing large parts of my intranet. I think I have tracked the problem down to the following: a large number of URLs use a number of special characters, and I think they are just not being picked up by phpdig.

The following is missed for example:
http://my.intranet.com/WBSITE/INTRAN...489784,00.html

These are static URLs, not dynamic URLs.

Where and how can I change the regular expression so that these pages get indexed?

Bert
01-09-2005, 07:20 AM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Welcome to the forum, revenazb.

I suspect you need to change the value of the variable called PHPDIG_ENCODING in the config file to match the character set in your intranet.
01-09-2005, 08:11 AM   #3
revenazb
Green Mole
 
Join Date: Jan 2005
Posts: 3
Hi,

Thanks for the welcome.
It is not so much the character set, as the site is in English.
The problem is with the URL of the page: it contains a lot of commas, colons, tildes, etc. Looking at the regular expression that I think phpdig uses, it looks like it would not pick up the URL I put above. While the individual characters are in there, I don't think the pattern would be picked up. I am not very good with regular expressions. I'm putting the line of code that I think is relevant below.

Regular expression:
Code:
"(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|HREF[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?"
Sample URL that was not found by phpdig:
Code:
my.intranet.com/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
The spaces in "ht ml" are a typo; the actual URL reads "html".

Last edited by revenazb; 01-09-2005 at 08:47 AM. Reason: Small Typo
01-09-2005, 08:36 AM   #4
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I don't even know how to read a URL like that, let alone modify the regular expression so that phpdig would index it. If no one else comes along to help you modify it, I know of another great forum (here) where you could probably get some help with that regular expression.
01-09-2005, 09:15 AM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
If the link were as follows:
Code:
http://www.domain.com/dir/0,,contentMDK:20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html
Then the request sent to the server is as follows:
Code:
127.0.0.1 - - [09/Jan/2005:10:00:30 -0800] "HEAD /20295425~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html HTTP/1.1" 404 0 "-" "PhpDig/1.8.6 (+http://www.phpdig.net/robot.php)"
See how at the first ":" it busts?

There are actually two spots in robot_functions.php to edit:

- One
Code:
while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
- Two
Code:
while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
I don't have the time now to diagnose further, but maybe this tidbit will help you edit the two regexes.

Oh, and vB inserts space if there are too many chars in a row without space, so take that into account when considering the code posted herein.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
01-09-2005, 06:16 PM   #6
revenazb
Green Mole
 
Join Date: Jan 2005
Posts: 3
Thanks for the help,

I have identified the problem. It is not in the regular expressions above but in the function phpdigRewriteUrl, where the following line

$url = @parse_url(str_replace('\'"','',$eval));

Should be replaced with

list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));

What happens is that the parse_url function misinterprets the first colon in the path and therefore messes up the URL.
Using the split function fixes this problem.

Hope this helps someone else
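
The point that the link regexes are not the culprit can be checked in isolation. The sketch below is not PhpDig code: it copies the path character class from the expressions quoted earlier into a PCRE pattern (eregi was removed in PHP 7, so preg_match is used instead) and tests the sample path against it. The helper name pathClassAccepts is hypothetical, and the characters in the class are reordered but otherwise the same.

```php
<?php
// Hypothetical helper, not part of PhpDig: checks whether every character
// of $path falls inside the path character class used by the link regexes.
function pathClassAccepts(string $path): bool
{
    // Same set of characters as the class in the thread, ported to PCRE.
    // '\\\\' in a single-quoted PHP string is a regex-escaped literal backslash.
    $pattern = '#^[%/?=&;:\\\\,._a-zA-Z0-9|+ ()~-]*$#';
    return preg_match($pattern, $path) === 1;
}

// Path portion of the sample URL from the thread.
$sample = '/WBSITE/INTRANET/UNITS/INTINFNETWORK/0,,contentMDK:20295425'
        . '~pagePK:64156298~piPK:64152276~theSitePK:489784,00.html';

var_dump(pathClassAccepts($sample)); // bool(true): commas, colons and tildes are all allowed
```

Since the class accepts the whole path, the link is extracted intact and the truncation has to happen later, which is consistent with the parse_url explanation.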
01-10-2005, 01:09 AM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Yes, I see what you mean about parse_url with those types of links.

If you plan to use:
Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
In place of:
Code:
$url = @parse_url(str_replace('\'"','',$eval));
Then in phpdigRewriteUrl add:
Code:
if (!eregi("[?]",$eval)) {
    $eval .= "?";
}
Right before:
Code:
list($url['path'], $url['query']) = split("\?", str_replace('\'"','',$eval));
Otherwise, you can receive "undefined offset" notices when error reporting is on high.

Note: it's not enough to comment out the "remove ending question mark" line as phpdigRewriteUrl is called in various places with various content.
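
On current PHP versions split() and eregi() no longer exist (both were removed in PHP 7), so the replacement above would need a port. A minimal sketch, assuming only that the goal is an array whose 'path' and 'query' keys always exist; the helper name splitPathAndQuery is hypothetical and not part of PhpDig, and the quote stripping done by str_replace in the original line is left out:

```php
<?php
// Hypothetical helper, not part of PhpDig: splits a link into path and
// query at the first "?" without calling parse_url(), so colons in the
// path are left alone.
function splitPathAndQuery(string $link): array
{
    // A limit of 2 keeps any later "?" inside the query string.
    // array_pad() supplies an empty query when the link has no "?",
    // which plays the role of the "append ?" guard shown above.
    [$path, $query] = array_pad(explode('?', $link, 2), 2, '');
    return ['path' => $path, 'query' => $query];
}

$url = splitPathAndQuery('0,,contentMDK:20295425~pagePK:64156298,00.html');
// $url['path'] holds the whole link and $url['query'] is an empty string.
```

One difference from the split() version: split("\?", ...) combined with list() drops anything after a second "?", whereas explode() with a limit of 2 keeps it in the query part.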
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.