PhpDig.net

Old 01-08-2004, 07:11 AM   #1
andybak
Green Mole
 
Join Date: Jan 2004
Posts: 3
Does PHPDig ignore <base href...?

I have used the BASE HREF directive on my dynamic site so that pages that appear to be in subfolders (but actually aren't) can point to external images and CSS files correctly.

This is correct as far as HTML goes and gives no trouble in any tested browsers.

However PHPDig seems to ignore this setting.

If a page that appears to be in a folder called 'news' links to index.html in the root, the link will read href='index.html' instead of href='../index.html'. The base href tag tells the browser to calculate any relative URLs from the root rather than from the current folder (which in my case doesn't exist).

The result is that PHPDig finds multiple copies of each page. It thinks that index.html is inside the 'news' folder and thus spiders a complete duplicate of the whole site.

Up till now I have been using exclusions to get around this, but this requires a lot of manual fiddling every time the site is changed.

Is there a solution or is it a bug in PHPDig?
Old 01-08-2004, 10:01 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. PhpDig looks for links that match the following regex and then processes those links via the phpdigRewriteUrl function.
PHP Code:
while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
In its current form, when PhpDig crawls a page like the one below from a dir1 directory, it follows dir1/index2.html rather than http://www.domain.com/dir2/index2.html, which is where the BASE tag points.
Code:
<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/index1.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>
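To make the difference concrete, here is a minimal sketch of how the same relative link resolves against the page URL versus the BASE URL. The function resolve_link is hypothetical and is not part of PhpDig; it assumes PHP 7 or later and only handles absolute links, root-relative links and plain relative paths (no "../" segments).

```php
<?php
// Hypothetical helper, not part of PhpDig: resolve a link against a base URL.
// Handles only absolute URLs, root-relative paths ("/x") and plain relative
// paths ("x" or "sub/x"); "../" segments are deliberately not handled.
function resolve_link(string $base, string $link): string
{
    // An absolute link ignores the base entirely.
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $link)) {
        return $link;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // A root-relative link replaces the whole path.
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }
    // Otherwise keep the base path up to and including its last slash.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

$page = 'http://www.domain.com/dir1/index1.html'; // where the file really lives
$base = 'http://www.domain.com/dir2/index1.html'; // value of the BASE tag

echo resolve_link($page, 'index2.html'), "\n"; // what PhpDig follows today
echo resolve_link($base, 'index2.html'), "\n"; // what the BASE tag asks for
```

The first call gives the dir1 URL and the second gives the dir2 URL, which is the one a browser would request.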
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-08-2004, 10:06 AM   #3
andybak
Green Mole
 
Join Date: Jan 2004
Posts: 3
Wouldn't it be fairly simple to check the <head> for the existence of a BASE tag and prefix any relative URLs with that instead of the current path?

Is it worth posting this to the suggestions forum?
Old 01-08-2004, 03:08 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. In robot_functions.php is a function called phpdigExplore.

In this function, replace the following:
PHP Code:
else {
    $file_content = @file($tempfile);

with the following:
PHP Code:
else {
    $file_content = @file($tempfile);

    // Look for an absolute <base href> inside <head>. If one is found,
    // use its directory as the path for resolving relative links.
    $my_file_base_content = implode("",$file_content);
    if (eregi("<head>(.*)</head>",$my_file_base_content,$base_regs1)) {
      $base_regs1 = $base_regs1[1];
      if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) {
        $new_base_path = parse_url($base_regs2[1]);
        // No path, or just "/", means the base is the site root.
        if ((!isset($new_base_path["path"])) || ($new_base_path["path"] == "/")) {
          $path = "";
        }
        else {
          // Drop the leading slash, then keep only the directory part.
          $new_base_path = eregi_replace("^/","",$new_base_path["path"]);
          if (eregi("/$",$new_base_path)) {
            $path = $new_base_path;
          }
          else {
            $path = dirname($new_base_path)."/";
          }
        }
      }
   }

Minimal testing was done on this, but it seems to work for the following situations, where the HTML file is located at http://www.domain.com/dir1/index1.html:
Code:
<HTML>
<HEAD>
<BASE HREF="http://www.domain.com/dir2/file.html">
</HEAD>
<BODY>
<A HREF="index2.html">test</A>
</BODY>
</HTML>
Both http://www.domain.com/dir1/index1.html and http://www.domain.com/dir2/index2.html should be crawled. It should also work with the following tags:
Code:
<BASE HREF="http://www.domain.com/file.html">
<BASE HREF="http://www.domain.com/dir2/dir3/file.html">
<A HREF="index2.html">test</A>
<!--- or the following tags --->
<BASE HREF="http://www.domain.com/dir2/file.html">
<A HREF="/index2.html">test</A>
Remember to remove any "word" wrapping in the above code.
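For anyone on PHP 7 or later, where the ereg functions no longer exist, the same idea can be sketched with preg_match. This is a loose equivalent rather than a drop-in replacement: the function name base_path_from_html is hypothetical, PhpDig itself sets $path inline, and returning an empty string for a file at the site root (where the code above would give "./") is an assumption about what is wanted.

```php
<?php
// Hypothetical standalone function, not part of PhpDig.
// Returns the directory path to use for relative links, or null when the
// page has no absolute <base href> inside <head>.
function base_path_from_html(string $html): ?string
{
    // Only look inside <head>, as the patch above does.
    if (!preg_match('~<head(?:\s[^>]*)?>(.*?)</head>~is', $html, $head)) {
        return null;
    }
    // Match <base href=scheme://host/...> with optional quotes.
    $re = '~<base\s+href\s*=\s*[\'"]?([a-z]{3,5}://[.a-z0-9-]+[^\'"\s>]*)~i';
    if (!preg_match($re, $head[1], $m)) {
        return null;
    }
    $path = parse_url($m[1], PHP_URL_PATH);
    if ($path === null || $path === false || $path === '/') {
        return '';                          // base points at the site root
    }
    $path = ltrim($path, '/');
    if (substr($path, -1) === '/') {
        return $path;                       // base is already a directory
    }
    $dir = dirname($path);
    // Assumption: a file at the root yields '' here instead of './'.
    return $dir === '.' ? '' : $dir . '/';
}

$html = '<HTML><HEAD><BASE HREF="http://www.domain.com/dir2/file.html"></HEAD>'
      . '<BODY><A HREF="index2.html">test</A></BODY></HTML>';
var_dump(base_path_from_html($html));
```

With the sample page, the function returns the dir2 directory, so index2.html would be looked up there rather than under dir1.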
Old 01-08-2004, 07:01 PM   #5
andybak
Green Mole
 
Join Date: Jan 2004
Posts: 3
Fantastic! Thanks...
Old 04-19-2004, 11:43 AM   #6
flood6
Green Mole
 
Join Date: Apr 2004
Location: Texas
Posts: 1
PHPDig rocks

I looked forever for an open source site search done in php and tried several without much success. PHPDig has worked well, but I was having similar problems to the one mentioned above. With Charter's modified statement, I seem to be complaint free.

Nice program and stellar support. Thanks to all!