PhpDig.net

Old 12-18-2004, 01:27 PM   #1
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
Wildcard for banned external links?

I was looking over this part of the config and wondered whether a wildcard such as banner* could be used so that the pattern also catches banners and other plurals.
Code:
// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|affiliates');
I'm also not sure how ^ad\. is being used here.
What does the ^ represent?
What does the \. represent?
Old 12-18-2004, 06:53 PM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
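^ anchors the match to the start of the string, so the characters that follow it must appear at the very beginning.

\. escapes the period, so it matches a literal dot rather than any single character.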
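
To see both pieces in action, the first alternative can be tested on its own. This is only a minimal sketch and not PhpDig code: it assumes preg_match (the ereg family was removed in PHP 7), and both the helper names and the host names are made up for illustration.

```php
<?php
// Hypothetical helper: true when $host begins with the literal text "ad.".
function starts_with_ad_dot(string $host): bool
{
    // ^ anchors the match at the start of the string;
    // \. matches a literal dot instead of "any character".
    return preg_match('/^ad\./i', $host) === 1;
}

// Hypothetical helper: the same pattern without the backslash,
// so the dot matches any single character.
function starts_with_ad_any(string $host): bool
{
    return preg_match('/^ad./i', $host) === 1;
}

// Made-up host names, for illustration only.
foreach (['ad.example.com', 'www.ad.example.com', 'add.example.com'] as $host) {
    printf(
        "%-20s escaped=%s unescaped=%s\n",
        $host,
        var_export(starts_with_ad_dot($host), true),
        var_export(starts_with_ad_any($host), true)
    );
}
```

The second host fails both checks because "ad." is not at the start of the string, and the third host only passes the unescaped version because the bare dot accepts the second "d".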

Putting this regular expression back together and interpreting it in English, it means:

An expression that begins with the characters "ad." (without the quotes), and is followed by one of the following words:
banner
banners
doubleclick
links
forum
affiliates

Expressed another way, it's looking for one of the following strings of characters:

ad.banner
ad.banners
ad.doubleclick
ad.links
ad.forum
ad.affiliates

Hope this helps.
Old 12-19-2004, 02:19 AM   #3
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
Thanks vinyl-junkie,

You explained that very well.
I want to ban links like "links", as in a links page, links/index.html, or forum directories. Will I have to add a new line with a brand-new define and then simply imitate what BANNED is doing?

Line 1264 in robot_functions.php is the only reference I found to BANNED
Code:
if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && ereg('[a-z]+',$regs[5])) {
So I could make a BANNED2 just under BANNED in config
then would Line 1264 in robot_functions.php be written this way?
Code:
if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && !eregi(BANNED2,$regs[5]) && ereg('[a-z]+',$regs[5])) {
I'm crawling a lot of websites and a universal setting like this would really help the huge mysql database I have now. I have produced way too many uninformative links in the database.

A little info:
800 sites at a depth of 1 level has put me at a 20 MB database. Time to downsize and then decide whether to get more database space if needed.

Thanks again for your reply.

Last edited by Slider; 12-19-2004 at 02:29 AM.
Old 12-19-2004, 07:40 AM   #4
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Quote:
Originally Posted by Slider
So I could make a BANNED2 just under BANNED in config
then would Line 1264 in robot_functions.php be written this way?
Code:
if ($regs[5] && $regs[5] != $localdomain && !eregi(BANNED,$regs[5]) && !eregi(BANNED2,$regs[5]) && ereg('[a-z]+',$regs[5])) {
Yes, that would work. Just make sure you have your regular expression for BANNED2 defined correctly. I've found that if you mess those up, you tend to get a blank web page back. Turning on error reporting doesn't seem to help either.
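
One way to catch a malformed pattern before it reaches the spider is to test-compile it first. The sketch below is an assumption-laden illustration rather than PhpDig code: it uses preg_match instead of eregi (the ereg functions were removed in PHP 7), the value of BANNED2 is invented, and is_valid_pattern and is_banned are hypothetical helpers.

```php
<?php
// BANNED is the value quoted from the config; BANNED2 is a made-up example.
define('BANNED',  '^ad\.|banner|banners|doubleclick|links|forum|affiliates');
define('BANNED2', 'guestbook|webring');

// Returns true when $pattern compiles as a regular expression.
// "~" is used as the delimiter, so the pattern itself must not contain "~".
function is_valid_pattern(string $pattern): bool
{
    // preg_match() returns false (and raises a warning) on a bad pattern;
    // @ hides the warning so only the return value is inspected.
    return @preg_match('~' . $pattern . '~i', '') !== false;
}

// Returns true when $host matches any valid pattern in $patterns.
// Invalid patterns are skipped rather than aborting the check.
function is_banned(string $host, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (is_valid_pattern($pattern)
            && preg_match('~' . $pattern . '~i', $host) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(is_valid_pattern(BANNED2));   // well-formed pattern
var_dump(is_valid_pattern('(links'));  // unbalanced parenthesis
var_dump(is_banned('webring.example.org', [BANNED, BANNED2]));
```

Passing the patterns as a list also means a third or fourth constant could be added later without touching the condition again, although a single longer BANNED value with more alternatives would give the same result.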
Old 12-19-2004, 07:50 AM   #5
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I think I gave you some incorrect information about what BANNED means. I'm still learning regular expressions. What the pattern actually says is that BANNED matches any one of the following strings:

"ad." (without the quotes) at the beginning of the string, or
"banner" (without the quotes) anywhere in the string, or
"banners" (without the quotes) anywhere in the string, or
"doubleclick" (without the quotes) anywhere in the string, or
"links" (without the quotes) anywhere in the string, or
"forum" (without the quotes) anywhere in the string, or
"affiliates" (without the quotes) anywhere in the string

Just wanted to set that straight.
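
The reading above can be checked with a short script. This is a sketch under stated assumptions, not PhpDig code: it uses preg_match rather than eregi (removed in PHP 7), matched_alternative is a hypothetical helper, and the host names are invented.

```php
<?php
// Pattern quoted from the config, wrapped in "~" delimiters with the
// case-insensitive flag to mimic eregi.
$banned = '~^ad\.|banner|banners|doubleclick|links|forum|affiliates~i';

// Returns the text that matched, or null when nothing matched.
function matched_alternative(string $regex, string $host): ?string
{
    return preg_match($regex, $host, $m) === 1 ? $m[0] : null;
}

// Made-up host names, for illustration only.
$hosts = [
    'ad.example.com',         // starts with "ad."
    'bad.example.com',        // contains "ad." but not at the start
    'mybanners.example.com',  // contains "banner" (and "banners")
    'forum.example.org',      // contains "forum"
    'www.example.org',        // contains none of the alternatives
];

foreach ($hosts as $host) {
    var_dump(matched_alternative($banned, $host));
}
```

Because the alternatives are not anchored, plain "banner" already matches inside "banners", so no wildcard is needed for plurals and the separate "banners" entry is redundant.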
Old 12-19-2004, 08:07 AM   #6
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
That was the way I was reading it. Thank you so much for clarifying it for me. I went to php.net and saw an example that matches what you are now saying.
I will start crawling all over again and see if it ignores links directories now.
I'm trying very hard to reduce the size of the MySQL database by getting rid of non-informative links.

Thanks again
P.S. I don't mind getting a response even if only some of the information is correct. Any response to a question is much appreciated. Thanks for being here.