PhpDig.net

Slider 12-18-2004 01:27 PM

Wildcard for banned external links?
 
I was looking over this part of the config and wondered whether there is a way to use a wildcard such as banner* so that it also matches banners and other plurals.
Code:

// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|affiliates');

I'm also not sure I understand how ^ad\. is being used here.
The ^ represents what?
The \. represents what?

vinyl-junkie 12-18-2004 06:53 PM

^ anchors the match to the start of the string, so the characters following it must appear at the very beginning.

\. escapes the period, so it matches a literal dot instead of any character.

Putting this regular expression back together and interpreting it in English, it means:

An expression that begins with the characters "ad." (without the quotes), and is followed by one of the following words:
banner
banners
doubleclick
links
forum
affiliates

Expressed another way, it's looking for one of the following strings of characters:

ad.banner
ad.banners
ad.doubleclick
ad.links
ad.forum
ad.affiliates

Hope this helps.

Slider 12-19-2004 02:19 AM

Thanks vinyl-junkie,

You explained that very well.
I want to ban links like "links", as in a links page or links/index.html, as well as forum directories. Will I have to add a new line with a brand new define and then simply imitate what BANNED is doing?

Line 1264 in robot_functions.php is the only reference I found to BANNED
Code:

if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && ereg('[a-z]+',$regs[5])) {
So I could define a BANNED2 just under BANNED in the config.
Would line 1264 in robot_functions.php then be written this way?
Code:

if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && !eregi(BANNED2,$regs[5]) && ereg('[a-z]+',$regs[5])) {
I'm crawling a lot of websites, and a universal setting like this would really help with the huge MySQL database I have now. I have ended up with far too many uninformative links in the database.

A little info:
800 sites at level 1 depth have me at a 20 MB database size. Time to downsize and then decide whether to get more database space.

Thanks again for your reply.

vinyl-junkie 12-19-2004 07:40 AM

Quote:

Originally Posted by Slider
So I could define a BANNED2 just under BANNED in the config.
Would line 1264 in robot_functions.php then be written this way?
Code:

if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && !eregi(BANNED2,$regs[5]) && ereg('[a-z]+',$regs[5])) {

Yes, that would work. Just make sure you have your regular expression for BANNED2 defined correctly. I've found that if you mess those up, you tend to get a blank web page back. Turning on error reporting doesn't seem to help either.
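
Since a malformed pattern can fail quietly, it may help to check that a pattern compiles before relying on it. The following is only a rough sketch under some assumptions: the value given to BANNED2 is a made-up example, the helpers pattern_is_valid() and is_banned() are hypothetical and not part of PhpDig, and preg_match() with the "i" flag stands in for eregi(), which was removed in PHP 7.

```php
<?php
// Sketch only. BANNED2's value is a made-up example; preg_match() stands in
// for eregi(), which PhpDig uses but which no longer exists in PHP 7+.
define('BANNED', '^ad\.|banner|banners|doubleclick|links|forum|affiliates');
define('BANNED2', 'webring|toplist');

// Hypothetical helper: true if the pattern compiles.
function pattern_is_valid(string $pattern): bool
{
    // preg_match() returns false (and raises a warning, silenced here)
    // when the pattern cannot be compiled.
    return @preg_match('~' . $pattern . '~i', '') !== false;
}

// Hypothetical helper: true if any of the valid patterns matches $subject.
function is_banned(string $subject, array $patterns): bool
{
    foreach ($patterns as $p) {
        if (pattern_is_valid($p) && preg_match('~' . $p . '~i', $subject) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(pattern_is_valid(BANNED2));   // a well-formed pattern
var_dump(pattern_is_valid('(links'));  // unbalanced parenthesis
var_dump(is_banned('toplist.example.com', [BANNED, BANNED2]));
```

Another option that avoids touching robot_functions.php at all would be to append the extra alternatives to BANNED itself, separated by |.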

vinyl-junkie 12-19-2004 07:50 AM

I think I gave you some incorrect information about just what BANNED means. I've been struggling to learn regular expressions. The | separates complete alternatives, so the ^ applies only to the first one. What the pattern actually says is that BANNED matches any one of the following:

"ad." (without the quotes) at the beginning of the string, or
"banner" (without the quotes) anywhere in the string, or
"banners" (without the quotes) anywhere in the string, or
"doubleclick" (without the quotes) anywhere in the string, or
"links" (without the quotes) anywhere in the string, or
"forum" (without the quotes) anywhere in the string, or
"affiliates" (without the quotes) anywhere in the string

Just wanted to set that straight.
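
The reading above can be checked with a few made-up sample strings. This is a minimal sketch, assuming preg_match() with the "i" flag as a stand-in for eregi() (which was removed in PHP 7); matches_banned() is a hypothetical helper, not a PhpDig function.

```php
<?php
// Sketch only: PhpDig itself calls eregi(BANNED, ...); preg_match() is used
// here as a stand-in because the ereg family is gone in PHP 7+.
$banned = '^ad\.|banner|banners|doubleclick|links|forum|affiliates';

// Hypothetical helper, not part of PhpDig.
function matches_banned(string $pattern, string $subject): bool
{
    // "~" is the delimiter, so the pattern needs no extra escaping.
    return preg_match('~' . $pattern . '~i', $subject) === 1;
}

// Made-up sample strings.
$samples = [
    'ad.example.com',       // "ad." at the start of the string
    'bad.example.com',      // "ad." present, but not at the start
    'banners.example.com',  // already caught by plain "banner"
    'www.example.com',      // none of the alternatives
    'forums.example.org',   // contains "forum"
];

foreach ($samples as $s) {
    printf("%-22s %s\n", $s, matches_banned($banned, $s) ? 'banned' : 'allowed');
}
```

Because each alternative may match anywhere in the string, "banner" on its own already catches "banners", so no wildcard is needed for plurals. In a regular expression, banner* would mean "banne" followed by zero or more "r" characters, which is not the same as a shell-style wildcard.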

Slider 12-19-2004 08:07 AM

That was how I was reading it. Thank you so much for clarifying it for me. I did go to php.net and saw examples that match what you are now saying.
I will start crawling all over again and see whether it ignores links directories now.
I'm trying very hard to reduce the size of the MySQL database and get rid of non-informative links.

Thanks again
P.S. I don't mind getting a response even if only some of the information is correct. Any response to a question is much appreciated. :) Thanks for being here.


All times are GMT -8.