
Exclude links with certain URL variables


jclementson
01-26-2004, 02:35 AM
Hi there,

Every page on my website has a link to a printer-friendly version of the same page, done with [thispage.php?print=y]

I need to exclude these links from the spidering process, but without excluding other url variables such as [news.php?story=11]

Basically I need a way to tell the spidering process not to follow links containing a specific string (in this case '?print=y'). I can't find an existing feature for this, so can someone point me to the right function and explain how to modify it?

Thanks

TSO
01-26-2004, 04:12 AM
Every page on my website has a link to a printer-friendly version of the same page, done with [thispage.php?print=y]

I have just started working on this case as well. After line 412 in "search_function.php" add:
$content['file'] = preg_replace("'print=y'si","", $content['file']);
(line before: $url = eregi_replace("([a-z0-9])[/]+... )

This strips "print=y" from the URL. The downside is that you get duplicate results when searching (the pages without "print" and those with "print"), because only the URL is filtered. Let's keep looking...
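
For anyone who wants to strip the variable without leaving a dangling "?" or "&" behind, here is a minimal sketch. The helper name stripQueryVar is made up for illustration and is not part of phpDig; it only cleans the URL string and does nothing about the duplicate entries in the index.

```php
<?php
// Hypothetical helper, not part of phpDig: drop one query variable from a URL.
function stripQueryVar(string $url, string $name): string
{
    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url; // no query string, nothing to strip
    }
    parse_str($parts[1], $vars);      // query string -> associative array
    unset($vars[$name]);              // remove the unwanted variable
    $query = http_build_query($vars); // rebuild what is left
    return $query === '' ? $parts[0] : $parts[0] . '?' . $query;
}
```

With this, 'thispage.php?print=y' becomes 'thispage.php', while 'news.php?story=11' is left alone. URL fragments (#...) are not handled in this sketch.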

jclementson
01-26-2004, 04:20 AM
Thanks, that's a useful start.

I'm looking at function phpdigExplore in robot_functions.php, but I can't figure it out yet.

jclementson
01-26-2004, 04:54 AM
Got it!

In robot_functions.php, I've added a test at the end of function phpdigDetectDir.

This is how I've done it for the test I need, showing lines 537 onwards. My addition is at line 543:

//test the exclude with robots.txt
if (phpdigReadRobots($exclude,$link['path'].$link['file']) == 1
    || isset($exclude['@ALL@'])
    ) {
    $link['ok'] = 0;
}
//exclude if specific variable set
//(strpos returns 0 for a match at position 0, so compare against false)
if (strpos($link['file'],'print=y') !== false) {
    $link['ok'] = 0;
}
//print "<pre>"; print_r($link); print "</pre>\n";
return $link;
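
The same check can be tried outside phpDig. This is only a sketch: the function name containsExcludedString and the sample $link array are assumptions for illustration, and only the 'file' and 'ok' keys are taken from the snippet in this post.

```php
<?php
// Hypothetical standalone version of the exclusion test (not phpDig code).
function containsExcludedString(string $file, string $needle = 'print=y'): bool
{
    // strpos() returns 0 when the match is at the very start of the string,
    // which a plain if() would treat as "not found", so compare with false.
    return strpos($file, $needle) !== false;
}

// Sample link array, shaped like the one in the post (assumed structure).
$link = ['file' => 'thispage.php?print=y', 'ok' => 1];
if (containsExcludedString($link['file'])) {
    $link['ok'] = 0; // mark the link as not to be followed
}
```

Note that a plain substring test also matches longer values such as 'print=yes'. If that matters, a stricter comparison on the parsed query string would be needed.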

TSO
01-26-2004, 06:42 AM
I got it too... somehow.
I edited "search_function.php" a bit. It is a bit messy, so I won't post it here. Anyway, it works pretty well, though not perfectly. This feature would be a nice addition to future versions.
I have different language versions, so I don't want to remove search results permanently.

JoNtE
02-25-2004, 12:19 AM
Found this in the config.php file:

// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|doubleclick');

change it to:

define('BANNED','^ad\.|banner|doubleclick|print=y');


I guess this could be used to exclude URLs that match the regular expression.
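
To see which URLs such a pattern would catch, it can be tried in isolation. This is a sketch under assumptions: the thread does not show how phpDig applies BANNED, so the isBanned helper and the use of preg_match here are mine. The comment in config.php also mentions external links, so the setting may not apply to links within the same site.

```php
<?php
// Pattern copied from the modified define() in this post.
$banned = '^ad\.|banner|doubleclick|print=y';

// Hypothetical helper (not phpDig code): true if the URL matches the pattern.
function isBanned(string $url, string $pattern): bool
{
    // '~' is used as delimiter because the pattern contains no tilde;
    // the i flag makes the match case-insensitive.
    return preg_match('~' . $pattern . '~i', $url) === 1;
}
```

Here isBanned('thispage.php?print=y', $banned) is true and isBanned('news.php?story=11', $banned) is false. Only the first alternative is anchored with ^, so 'print=y' matches anywhere in the URL.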

I have the same problem, but I have not tested this possible solution yet. I will be back with the result.

// JoNtE