
andybak 01-08-2004 07:11 AM

Does PHPDig ignore <base href...?
 
I have used the BASE HREF directive in my dynamic site so that pages that appear to be in subfolders (but actually aren't) can point to external images and CSS files correctly.

This is correct as far as HTML goes and gives no trouble in any tested browsers.

However PHPDig seems to ignore this setting.

If a page that appears to be in a folder called 'news' links to index.html in the root, the link will read href="index.html" instead of href="../index.html". The base href tag tells the browser to calculate any relative URLs from the root rather than from the current folder (which in my case doesn't exist).

The result of this is that PHPDig finds multiple copies of each page. It thinks that index.html is inside the news folder and thus spiders a complete duplicate of the whole site.

Up till now I have been using exclusions to get round this but this requires a lot of manual fiddling every time the site is changed.

Is there a solution or is it a bug in PHPDig?

Charter 01-08-2004 10:01 AM

Hi. PhpDig looks for links that match the following regex and then processes those links via the phpdigRewriteUrl function.
PHP Code:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\\"]refresh['\\"] *content=['"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\\'\\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\\'\\" ]?",$eval,$regs)) { 

In its current form, when PhpDig crawls a page like the one below from a dir1 directory, it follows dir1/index2.html rather than crawling http://www.domain.com/dir2/index2.html.
Code:

<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/index1.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>
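
To make the difference concrete, here is a minimal sketch of how the same relative link resolves against the page's own URL versus the BASE HREF. The function name resolveRelative is hypothetical and not part of PhpDig, the page location http://www.domain.com/dir1/index1.html is an assumption for illustration, and only plain "file.html" and root-relative "/file.html" links are handled.

```php
<?php
// Hypothetical helper, not part of PhpDig: resolve a simple relative link
// against a base URL. Handles only "file.html" and "/file.html" style links.
function resolveRelative(string $baseUrl, string $href): string
{
    $parts = parse_url($baseUrl);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Root-relative links ignore the directory of the base URL.
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }
    // Otherwise keep everything up to and including the last slash of the path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

$page = 'http://www.domain.com/dir1/index1.html'; // assumed page location
$base = 'http://www.domain.com/dir2/index1.html'; // value of BASE HREF

echo resolveRelative($page, 'index2.html'), "\n"; // what a crawler ignoring BASE follows
echo resolveRelative($base, 'index2.html'), "\n"; // what a browser follows
```

The first call yields the dir1 URL and the second the dir2 URL, which is the mismatch described above.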


andybak 01-08-2004 10:06 AM

Wouldn't it be fairly simple to check the <head> for the existence of a BASE tag and prefix any relative URLs with that instead of the current path?

Is it worth posting this to the suggestions forum?
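
A rough sketch of that idea, assuming a current PHP version where preg_match is used (the eregi family was removed in PHP 7). The function name findBaseHref is hypothetical and not part of PhpDig; it only looks for a BASE tag inside the head and returns its href, or null if there is none.

```php
<?php
// Hypothetical helper, not part of PhpDig: return the href of the first
// <base> tag found inside <head>, or null when no such tag exists.
function findBaseHref(string $html): ?string
{
    // Restrict the search to the head section.
    if (!preg_match('~<head[^>]*>(.*?)</head>~is', $html, $head)) {
        return null;
    }
    // Accept quoted or unquoted href values, case-insensitively.
    if (preg_match('~<base\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)~i', $head[1], $m)) {
        return $m[1];
    }
    return null;
}

$html = '<HTML><HEAD><BASE HREF="http://www.domain.com/dir2/index1.html"></HEAD>'
      . '<BODY><A HREF="index2.html">test</A></BODY></HTML>';

var_dump(findBaseHref($html));
var_dump(findBaseHref('<html><head><title>x</title></head><body></body></html>'));
```

A crawler could then prefix relative links with the directory of the returned URL instead of the current path.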

Charter 01-08-2004 03:08 PM

Hi. In robot_functions.php is a function called phpdigExplore.

In this function, replace the following:
PHP Code:

else {
    $file_content = @file($tempfile);


with the following:
PHP Code:

else {
    $file_content = @file($tempfile);

    // Join the fetched lines so the <head> block can be searched as one string.
    $my_file_base_content = implode("",$file_content);
    if (eregi("<head>(.*)</head>",$my_file_base_content,$base_regs1)) {
      $base_regs1 = $base_regs1[1];
      // Look for an absolute <base href="..."> inside the head.
      if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) {
        $new_base_path = parse_url($base_regs2[1]);
        if ((!isset($new_base_path["path"])) || ($new_base_path["path"] == "/")) {
          // The base points at the site root.
          $path = "";
        }
        else {
          // Drop the leading slash, then keep only the directory part.
          $new_base_path = eregi_replace("^/","",$new_base_path["path"]);
          if (eregi("/$",$new_base_path)) {
            $path = $new_base_path;
          }
          else {
            $path = dirname($new_base_path)."/";
          }
        }
      }
   }


Minimal testing was done on this, but it seems to work for the following situations, where the one HTML file is located at http://www.domain.com/dir1/index1.html:
Code:

<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/file.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>

Both http://www.domain.com/dir1/index1.html and http://www.domain.com/dir2/index2.html should be crawled. It should also work with the following tags:
Code:

<BASE HREF="http://www.domain.com/file.html">
<BASE HREF="http://www.domain.com/dir2/dir3/file.html">
<A HREF="index2.html">test</A>
<!--- or the following tags --->
<BASE HREF="http://www.domain.com/dir2/file.html">
<A HREF="/index2.html">test</A>

Remember to remove any "word" wrapping in the above code.
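
For anyone who wants to check the path derivation outside PhpDig, here is a standalone sketch of the same branch logic for the BASE HREF values listed above. The function name basePathFor is hypothetical, and it uses parse_url, ltrim and dirname rather than the eregi calls, which were removed in PHP 7. One deliberate difference from the patch: a base file directly in the site root maps to an empty path here, because dirname() returns "." for a bare filename.

```php
<?php
// Hypothetical helper, not part of PhpDig: derive the directory prefix
// (relative to the site root) that relative links should be resolved against.
function basePathFor(string $baseHref): string
{
    $path = parse_url($baseHref, PHP_URL_PATH);
    // No path, or just "/", means the base is the site root.
    if ($path === null || $path === false || $path === '/') {
        return '';
    }
    $path = ltrim($path, '/');
    // A path ending in a slash is already a directory.
    if (substr($path, -1) === '/') {
        return $path;
    }
    // Otherwise strip the file name; dirname() gives "." for a bare filename.
    $dir = dirname($path);
    return $dir === '.' ? '' : $dir . '/';
}

foreach ([
    'http://www.domain.com/file.html',
    'http://www.domain.com/dir2/file.html',
    'http://www.domain.com/dir2/dir3/file.html',
] as $href) {
    printf("%s => '%s'\n", $href, basePathFor($href));
}
```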

andybak 01-08-2004 07:01 PM

Fantastic! Thanks...

flood6 04-19-2004 11:43 AM

PHPDig rocks
 
I looked forever for an open source site search done in php and tried several without much success. PHPDig has worked well, but I was having similar problems to the one mentioned above. With Charter's modified statement, I seem to be complaint free.

Nice program and stellar support. Thanks to all!

