Ban features
Slider
12-19-2004, 09:24 AM
When indexing a huge list of sites, I may add the list without looking through it, simply because it is so large.
In the config:
- AutoBan certain Domains. EX: freeservers Geocities and others
- AutoBan universal Directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all sites that are crawled). I understand that this is already in PhpDig, but I have yet to get it working.
- Autoban certain domains by ext: .biz .info .this .that (if you run an English-only site this would be very helpful to weed out sites like .ru and others)
This would be a great feature to reduce the size of a bloated MySQL database.
Charter
12-19-2004, 09:31 AM
Set the words you want to ban in the BANNED constant in the config file.
Slider
12-19-2004, 11:51 AM
If I want to use a . would it have to be escaped, as in about\.com, or can I just use about.com?
Charter
12-19-2004, 12:36 PM
Basically a . matches any character whereas a \. matches a period.
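A quick way to see the difference is the sketch below. It uses preg_match() with the i flag as a stand-in for eregi() (eregi() was removed in PHP 7), and the host names are made up for illustration.

```php
<?php
// Unescaped dot: matches any single character, so "aboutxcom" also matches.
$loose = '~about.com~i';
// Escaped dot: matches a literal period only.
$strict = '~about\.com~i';

// Hypothetical host names, for illustration only.
$hosts = ['www.about.com', 'www.aboutxcom.net'];

foreach ($hosts as $host) {
    printf(
        "%-20s loose=%d strict=%d\n",
        $host,
        preg_match($loose, $host),
        preg_match($strict, $host)
    );
}
```

The second host is caught only by the unescaped pattern, which is usually not what you want in a ban list.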
Slider
12-19-2004, 01:26 PM
Thanks Charter
Your help greatly reduced the mysql database with all the crud pages that I should not have allowed to be crawled.
I also found an excellent tutorial at this place (http://www.mkssoftware.com/docs/man5/regexp.5.asp)
It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look. php.net wasn't as helpful as it usually is.
Slider
12-19-2004, 01:45 PM
ok it kept me from asking you 9 more questions maybe......
http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes
I added guestbook to BANNED so it would be ignored (it was a no go for some reason)
I added cgi to FORBIDDEN_EXTENSIONS to have the cgi filetype ignored (it was a no go also)
This is after a fresh crawl with nothing else done.
// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|guestbook|geo cities|8m|directory|affiliate|groups|');
// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|ta r|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');
Any ideas?
The code box above made geocities and tar look weird with a space, but they're right in the config. Maybe it's my browser playing tricks on me.
Charter
12-19-2004, 06:05 PM
The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found and remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file if you want.
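The effect of the trailing $ can be checked with a small sketch. It uses preg_match() as a stand-in for eregi(), a shortened extension list, and assumes the string being tested still carries the query string (the thread does not show exactly what PhpDig passes in).

```php
<?php
// Shortened version of FORBIDDEN_EXTENSIONS, with and without the trailing $.
$anchored   = '~\.(cgi|php|asp|pl)$~i';
$unanchored = '~\.(cgi|php|asp|pl)~i';

// File part of the guestbook link quoted earlier in the thread.
// Assumption: the tested string includes the query string.
$file = 'book.cgi?url=anything&mode=show&refresh=yes';

// Anchored: no match, because the string ends in "yes", not ".cgi".
$anchoredHit = preg_match($anchored, $file);
// Unanchored: match, because ".cgi" appears somewhere in the string.
$unanchoredHit = preg_match($unanchored, $file);

printf("anchored=%d unanchored=%d\n", $anchoredHit, $unanchoredHit);
```

Dropping the $ is a blunt fix: it also catches any link that merely contains ".cgi" or ".pl" in the middle.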
Slider
12-20-2004, 01:57 PM
I tried $regs[2] as the replacement as suggested, and it resulted in no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your scripts. This is something I really need to get working if at all possible. Thank you for taking the time to answer my questions.
I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed?
Slider
12-21-2004, 01:43 PM
It needs the functionality to ban the url that contains a directory/path name you wish to ban. This thread was started and intended to be in the Mod Request part of the forums. This is a Mod Request since PhpDig does not have this function.
I am still left with no solution and in the wrong part of the forums.
Could you possibly move this back to its original placement?
rAdoN
12-30-2004, 01:04 PM
config does already - not mod request
!eregi(BANNED,$regs[2]) work for ban keywords
learn regex - use FORBIDDEN_EXTENSIONS
// no cgi
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|ta r|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)');
// no guestbook
define('FORBIDDEN_EXTENSIONS','(guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf| css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)');
pick one - take out space - delete guestbook links from admin update - index - not hard
Slider
12-30-2004, 01:49 PM
Thanks rAdoN,
I'm sure that will help later when I get to filenames and types.
I wanted to ban certain PATHS as in
/links/
/guestbook/
/forum/
/cgi-bin/
/webring/
/affiliates/
There are a lot of common paths websites use that could be ignored and greatly reduce the size of the MySql database.
I would think that many people would find this a useful addition to PhpDig
Your help covered FILENAMES and FILETYPES. That wasn't the question.
A path would be everything up to a file and not including the file.
Path : The way to get from here to there . Not the destination
rAdoN
12-30-2004, 02:15 PM
this already exist - not addition
// no links with guestbook forum cgi-bin webring affiliates
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(guestbook|forum|cgi-bin|webring|affiliates|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tg z|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)');
remove space
why you not want .cgi .php .asp .pl - dynamic pages
FORBIDDEN_EXTENSIONS can be more than extensions
config is for you to config - make regex you want
Slider
12-30-2004, 03:45 PM
www.domain.com/links/file.ext
Note the part of the url in bold (/links/) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path
I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.
I honestly do appreciate your help. An answer of any kind is better than no answer at all.
rAdoN
12-30-2004, 04:02 PM
awk - you not understand - make the FORBIDDEN_EXTENSIONS regex you want :bang:
// no links with /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)');
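To see why wrapping the directory names in slashes matters, here is a sketch with preg_match() standing in for eregi(). The pattern is a shortened form of the one above with /links/ added from Slider's list, the sample strings are hypothetical, and it assumes the string tested contains the directory part of the link.

```php
<?php
// Shortened form of the pattern above, plus /links/ from Slider's list.
$forbidden = '~(/guestbook/|/forum/|/cgi-bin/|/links/|\.(cgi|zip|exe)$)~i';

// Hypothetical path+file strings, for illustration only.
$candidates = [
    '/links/file.html',
    '/cgi-bin/guestbook/book.cgi',
    '/articles/useful-links.html',
    '/downloads/setup.exe',
];

$verdicts = [];
foreach ($candidates as $c) {
    // A match means the crawler should skip the link.
    $verdicts[$c] = preg_match($forbidden, $c) ? 'skipped' : 'indexed';
    echo $c, ' => ', $verdicts[$c], "\n";
}
```

Only /articles/useful-links.html is indexed: it contains the word links but not a directory named /links/.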
you need to go admin update - delete links you no want - run the cleans - edit FORBIDDEN_EXTENSIONS - relax - index after that
ps - what is not clear
vinyl-junkie
12-30-2004, 06:07 PM
www.domain.com/links/file.ext
Note the part of the url in bold (/links/) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path
I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.
I honestly do appreciate your help. An answer of any kind is better than no answer at all.
Just use your robots.txt file to do that, like so:
User-agent: Phpdig
Disallow: /links/
jmitchell
12-30-2004, 06:26 PM
what if you are indexing other sites?
rAdoN
12-30-2004, 06:30 PM
use admin update - "Click on the noway sign to exclude from future indexings" - "Click on the cross to delete the branch" - "Click on the cross to delete a document" - that delete for links indexed not wanted - use FORBIDDEN_EXTENSIONS to prevent for sites - run the cleans - index
ps - no listen :bang:
Slider
12-31-2004, 06:07 AM
Hello rAdoN,
I apologize for being such a pain. :)
You really know your stuff and I will never doubt what I hear from you again.
Thank you so much for being here. Maybe I can return the favor in some way in the future.
Slider
12-31-2004, 01:49 PM
I added this line to the config:
define('FORBIDDEN_PATH','(guestbook|forum|cgi-bin|webring|affiliates|links|webrings|banners)');
I added this code to spider.php (the addition is the final && !eregi(FORBIDDEN_PATH,$temp_path) condition)
// test content-type of this page if not excluded
$result_test_http = '';
if (!phpdigReadRobots($exclude,$temp_path) && !eregi(FORBIDDEN_EXTENSIONS,$temp_file) && !eregi(FORBIDDEN_PATH,$temp_path)) {
    $result_test_http = phpdigTestUrl($url_indexing,'date',$cookies);
}
I tried the code you gave and even tried variations of it, and was never able to get it to ignore a path or directory. This code should be added to the next PhpDig version. It's a necessity if you want a little more control over the content that is being indexed and want to reduce the size of the MySQL database.
rAdoN
01-01-2005, 12:56 PM
hoorah - instead use book.cgi you make mod - good for path - i mod your mod :smoke:
// test content-type of this page if not excluded
$result_test_http = '';
if (!phpdigReadRobots($exclude,$temp_path.$temp_file) && !eregi(FORBIDDEN_EXTENSIONS,$temp_path.$temp_file)) {
    $result_test_http = phpdigTestUrl($url_indexing,'date',$cookies);
}
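The point of this change is which string the regex gets to see. A minimal sketch, with preg_match() standing in for eregi() and assumed values for $temp_path and $temp_file (the thread does not show their exact format in PhpDig):

```php
<?php
// Shortened pattern mixing directory names and extensions.
$forbidden = '~(/links/|/guestbook/|\.(cgi|zip)$)~i';

// Assumed values, for illustration only.
$temp_path = '/links/';
$temp_file = 'file.html';

// File name alone: the directory is never seen by the regex.
$fileOnly = preg_match($forbidden, $temp_file);
// Path and file joined: the directory is now visible to the same regex.
$joined = preg_match($forbidden, $temp_path . $temp_file);

printf("file only: %d, path+file: %d\n", $fileOnly, $joined);
```

With the joined string, a single constant can hold both directory and extension rules, so a separate FORBIDDEN_PATH constant is not needed.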
Slider
01-01-2005, 04:12 PM
I'm not familiar with the book.cgi you were talking about.
The new code you posted would have made it work for both the path and the filename. Congrats!
Thank you very much