Does PHPDig ignore <base href...?


andybak
01-08-2004, 07:11 AM
I have used the BASE HREF directive in my dynamic site so that pages that appear to be in subfolders (but actually aren't) can point to external images and CSS files correctly.

This is correct as far as HTML goes and gives no trouble in any tested browsers.

However PHPDig seems to ignore this setting.

If a page that appears to be in a folder called 'news' links to index.html in the root, the link will read href="index.html" instead of href="../index.html". The base href tag tells the browser to resolve any relative URLs from the root rather than from the current folder (which in my case doesn't exist).

The result of this is that PHPDig finds multiple copies of each page. It thinks that index.html is inside the 'news' folder and thus spiders a complete duplicate of the whole site.
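
To make the duplication concrete, here is a minimal sketch of the two ways the same link can be resolved. The helper name resolveAgainst and the www.example.com URLs are made up for illustration and are not part of PHPDig.

```php
<?php
// Hypothetical helper (not a PHPDig function): join a directory URL
// and a plain relative link such as "index.html".
function resolveAgainst($dirUrl, $link)
{
    return rtrim($dirUrl, '/') . '/' . $link;
}

$pageDir = 'http://www.example.com/news'; // folder the page appears to be in
$baseDir = 'http://www.example.com';      // folder the base href points to
$link    = 'index.html';

// What a spider that ignores the base href ends up requesting:
$ignoringBase = resolveAgainst($pageDir, $link);

// What a browser that honours the base href requests:
$honouringBase = resolveAgainst($baseDir, $link);

echo $ignoringBase, "\n", $honouringBase, "\n";
```

Because the first URL also returns a page on a dynamic site like mine, every page ends up indexed once per apparent folder.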

Up till now I have been using exclusions to get round this but this requires a lot of manual fiddling every time the site is changed.

Is there a solution or is it a bug in PHPDig?

Charter
01-08-2004, 10:01 AM
Hi. PhpDig looks for links that match the following regex and then processes those links via the phpdigRewriteUrl function.

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\\"]refresh['\\"] *content=['\\"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\\'\\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\\'\\" ]?",$eval,$regs)) {

In its current form, when PhpDig crawls the following page from a dir1 directory, it follows dir1/index2.html rather than crawling http://www.domain.com/dir2/index2.html.

<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/index1.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>

andybak
01-08-2004, 10:06 AM
Wouldn't it be fairly simple to check the <head> for the existence of a BASE tag and prefix any relative URLs with that instead of the current path?
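
Roughly what I have in mind, as an untested-in-PHPDig sketch: the function names findBaseHref, directoryOf and resolveLink are my own inventions, not anything in the PHPDig source, and it only handles plain relative links on the default port.

```php
<?php
// Hypothetical helper: return the href of the first <base> tag, or null.
function findBaseHref($html)
{
    if (preg_match('/<base\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: scheme, host and directory of a URL, with a
// trailing slash. Ports, query strings and fragments are ignored here.
function directoryOf($url)
{
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $dir   = substr($path, 0, strrpos($path, '/') + 1);
    return $parts['scheme'] . '://' . $parts['host'] . $dir;
}

// Resolve a plain relative link against the base href if the page has
// one, otherwise against the page's own URL.
function resolveLink($pageUrl, $html, $link)
{
    $base = findBaseHref($html);
    return directoryOf($base !== null ? $base : $pageUrl) . $link;
}

$html = '<HTML><HEAD><BASE HREF="http://www.domain.com/dir2/index1.html"></HEAD>'
      . '<BODY><A HREF="index2.html">test</A></BODY></HTML>';

echo resolveLink('http://www.domain.com/dir1/index1.html', $html, 'index2.html'), "\n";
```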

Is it worth posting this to the suggestions forum?

Charter
01-08-2004, 03:08 PM
Hi. In robot_functions.php is a function called phpdigExplore.

In this function, replace the following:

else {
    $file_content = @file($tempfile);
}

with the following:

else {
    $file_content = @file($tempfile);
    // Join the lines so the <head> section can be searched as one string.
    $my_file_base_content = implode("",$file_content);
    if (eregi("<head>(.*)</head>",$my_file_base_content,$base_regs1)) {
        $base_regs1 = $base_regs1[1];
        // Look for a <base href> tag carrying an absolute URL.
        if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) {
            $new_base_path = parse_url($base_regs2[1]);
            // Base points at the site root: links resolve from the root.
            if ((!isset($new_base_path["path"])) || ($new_base_path["path"] == "/")) {
                $path = "";
            }
            else {
                // Drop the leading slash from the base path.
                $new_base_path = eregi_replace("^/","",$new_base_path["path"]);
                if (eregi("/$",$new_base_path)) {
                    // Base is a directory: use it as is.
                    $path = $new_base_path;
                }
                else {
                    // Base is a file: use its directory.
                    $path = dirname($new_base_path)."/";
                }
            }
        }
    }
}

Minimal testing was done on this, but it seems to work for the following situations, where the one HTML file is located at http://www.domain.com/dir1/index1.html:

<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/file.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>

Both http://www.domain.com/dir1/index1.html and http://www.domain.com/dir2/index2.html should be crawled. It should also work with the following tags:

<BASE HREF="http://www.domain.com/file.html">
<BASE HREF="http://www.domain.com/dir2/dir3/file.html">
<A HREF="index2.html">test</A>
<!--- or the following tags --->
<BASE HREF="http://www.domain.com/dir2/file.html">
<A HREF="/index2.html">test</A>

Remember to remove any "word" wrapping in the above code.
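
For readers on PHP 7 or later, where the ereg family of functions no longer exists, the same path calculation can be sketched with the preg functions. This is only an illustration under stated assumptions: the function name basePathFromHtml is hypothetical and not part of PhpDig, it has not been tested inside phpdigExplore, and unlike the patch above it returns an empty path (not "./") when the base file sits in the site root.

```php
<?php
// Hypothetical stand-alone helper (not a PhpDig function): return the
// directory path implied by a <base href> in the <head>, or null if
// the page has no usable base tag.
function basePathFromHtml($html)
{
    // Only look inside <head>, as the patch above does.
    if (!preg_match('#<head[^>]*>(.*?)</head>#is', $html, $head)) {
        return null;
    }
    // Require an absolute URL in the base tag.
    if (!preg_match('#<base\s[^>]*href\s*=\s*["\']?([a-z]{3,5}://[^"\'\s>]+)#i', $head[1], $m)) {
        return null;
    }
    $parts = parse_url($m[1]);
    if (!isset($parts['path']) || $parts['path'] === '/') {
        return '';                      // base is the site root
    }
    $path = ltrim($parts['path'], '/'); // drop the leading slash
    if (substr($path, -1) === '/') {
        return $path;                   // base is already a directory
    }
    $dir = dirname($path);              // base is a file: use its directory
    return $dir === '.' ? '' : $dir . '/';
}

$page = '<HTML><HEAD><BASE HREF="http://www.domain.com/dir2/file.html"></HEAD>'
      . '<BODY><A HREF="index2.html">test</A></BODY></HTML>';

echo basePathFromHtml($page), "\n";
```

The returned value plays the role of $path in the patch above: an empty string means links resolve from the site root, and anything else is a directory with a trailing slash.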

andybak
01-08-2004, 07:01 PM
Fantastic! Thanks...

flood6
04-19-2004, 11:43 AM
I looked forever for an open source site search done in php and tried several without much success. PHPDig has worked well, but I was having similar problems to the one mentioned above. With Charter's modified statement, I seem to be complaint free.

Nice program and stellar support. Thanks to all!