PhpDig.net

12-19-2004, 09:24 AM   #1
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
Ban features

When indexing a huge list of sites, I may add the list without looking through it, simply because it is so large.

In the config:
- AutoBan certain domains, e.g. freeservers, Geocities and others.

- AutoBan universal directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all that are crawled). I understand this is already in PhpDig, but I have yet to get it working.

- AutoBan certain domains by extension: .biz .info .this .that (if you run an English-only site, this would be very helpful to weed out sites like .ru and others).

This would be a great feature to reduce the size of a bloated MySQL database.
12-19-2004, 09:31 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Set the words you want to ban in the BANNED constant in the config file.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
12-19-2004, 11:51 AM   #3
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
If I want to use a . would it have to be escaped, as in about\.com, or can I just use about.com?
12-19-2004, 12:36 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Basically a . matches any character whereas a \. matches a period.
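A minimal sketch of that difference, assuming a modern PHP where `preg_match` (PCRE) stands in for the `eregi` call PhpDig used at the time (the `ereg` family was removed in PHP 7.0). The helper name `is_banned` and the sample hosts are made up for illustration; for a pattern this simple, both regex flavors treat the dot the same way.

```php
<?php
// Hypothetical helper (not PhpDig code): true when $host matches $pattern,
// case-insensitively, anywhere in the string.
function is_banned(string $pattern, string $host): bool
{
    return preg_match('/' . $pattern . '/i', $host) === 1;
}

// Placeholder host names for illustration.
$hosts = ['about.com', 'aboutxcom.net', 'roundabout.com'];

foreach ($hosts as $host) {
    printf(
        "%-16s unescaped=%s escaped=%s\n",
        $host,
        var_export(is_banned('about.com', $host), true),   // . matches any character
        var_export(is_banned('about\.com', $host), true)   // \. matches only a period
    );
}
```

Neither pattern is anchored, so both also match `roundabout.com`; only the unescaped form matches `aboutxcom.net`.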
12-19-2004, 01:26 PM   #5
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
Thanks Charter

Your help greatly reduced the size of the MySQL database by keeping out all the crud pages that I should not have allowed to be crawled.
I also found an excellent tutorial at this place
It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look. php.net wasn't as helpful as it usually is.
12-19-2004, 01:45 PM   #6
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
ok it kept me from asking you 9 more questions maybe......

http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes

I added guestbook to BANNED so it would be ignored (it was a no go for some reason)
I added cgi to FORBIDDEN_EXTENSIONS to have the cgi filetype ignored (it was a no go also)

This is after a fresh crawl with nothing else done.

Code:
// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|guestbook|geocities|8m|directory|affiliate|groups|');

// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');
Any ideas?

The code box above made the geocities and tar file look weird with a space, but it's right in the config. Maybe it's my browser playing tricks on me.
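One detail in the BANNED line above may be worth a look: it ends with a `|`, which leaves an empty alternative at the end of the list. The sketch below shows what that does under PCRE (`preg_match`), where an empty alternative matches every string. How the older `eregi` treated it is not confirmed here, so take this as a reason to try the constant without the trailing `|`, not as a diagnosis. The helper name, the shortened keyword list, and the host are placeholders.

```php
<?php
// Hypothetical helper (not PhpDig code): case-insensitive PCRE test.
function matches_banned(string $pattern, string $host): bool
{
    return preg_match('#' . $pattern . '#i', $host) === 1;
}

// Shortened versions of the BANNED list, for illustration only.
const WITH_TRAILING_BAR    = '^ad\.|banner|links|forum|guestbook|';
const WITHOUT_TRAILING_BAR = '^ad\.|banner|links|forum|guestbook';

// Placeholder host that contains none of the keywords.
$host = 'www.example.com';

var_dump(matches_banned(WITH_TRAILING_BAR, $host));    // the empty alternative matches
var_dump(matches_banned(WITHOUT_TRAILING_BAR, $host)); // nothing in the list matches
```

Under PCRE the first call returns true for any host at all, so a list written this way would no longer discriminate between sites.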
12-19-2004, 06:05 PM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found and remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file if you want.
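The effect of the `$` can be checked against the guestbook link itself. This is a rough sketch using `preg_match` and `parse_url` in place of PhpDig's own `eregi` and `$regs` handling, which is not reproduced here. The extension list is shortened, and the `QUERY_AWARE` pattern and the `www.example.pl` URL are assumptions added for comparison, not something proposed in the thread.

```php
<?php
// Shortened extension list, for illustration; the thread's constant is longer.
const ANCHORED    = '#\.(cgi|php|asp|pl)$#i';       // extension must end the string
const UNANCHORED  = '#\.(cgi|php|asp|pl)#i';        // extension may appear anywhere
const QUERY_AWARE = '#\.(cgi|php|asp|pl)(\?|$)#i';  // assumption: extension ends the path

// The link quoted earlier in the thread.
const GUESTBOOK_URL = 'http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes';

// Hypothetical helper (not PhpDig code).
function hits(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$parts = parse_url(GUESTBOOK_URL);

var_dump(hits('#guestbook#i', $parts['host']));  // false: keyword is not in the host
var_dump(hits('#guestbook#i', GUESTBOOK_URL));   // true: keyword is in the path
var_dump(hits(ANCHORED, GUESTBOOK_URL));         // false: the link ends in "refresh=yes"
var_dump(hits(UNANCHORED, GUESTBOOK_URL));       // true: ".cgi" appears mid-link
var_dump(hits(QUERY_AWARE, GUESTBOOK_URL));      // true: ".cgi" is followed by "?"

// Placeholder URL showing a side effect of dropping the anchor.
$other = 'http://www.example.pl/index.html';
var_dump(hits(UNANCHORED, $other));   // true: ".pl" in the host name matches
var_dump(hits(QUERY_AWARE, $other));  // false
```

Dropping the `$` makes the `.cgi` link match, but it also lets `.pl` match inside a host name; whether that trade-off matters depends on the sites being crawled.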
12-20-2004, 01:57 PM   #8
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
I tried $regs[2] as the replacement, as suggested, and it resulted in no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your scripting. This is something I really need to get working if at all possible. Thank you for taking the time to answer my questions.

I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed?
12-21-2004, 01:43 PM   #9
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
It needs the functionality to ban the url that contains a directory/path name you wish to ban. This thread was started and intended to be in the Mod Request part of the forums. This is a Mod Request since PhpDig does not have this function.

I am still left with no solution and in the wrong part of the forums.
Could you possibly move this back to its original placement?
12-30-2004, 01:04 PM   #10
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
config does already - not mod request

!eregi(BANNED,$regs[2]) work for ban keywords

learn regex - use FORBIDDEN_EXTENSIONS

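PHP Code:
// Option 1 - no cgi: without the trailing $, a listed extension is banned anywhere in the link
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)');
// Option 2 - no guestbook: bans links containing "guestbook", plus links ending in a listed extension
define('FORBIDDEN_EXTENSIONS','(guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)');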
pick one - take out space - delete guestbook links from admin update - index - not hard
__________________
rAdoN was here
12-30-2004, 01:49 PM   #11
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
Thanks rAdoN,
I'm sure that will help later when I get to filenames and types.
I wanted to ban certain PATHS as in
/links/
/guestbook/
/forum/
/cgi-bin/
/webring/
/affiliates/

There are a lot of common paths websites use that could be ignored, which would greatly reduce the size of the MySQL database.
I would think that many people would find this a useful addition to PhpDig.


Your help covered FILENAMES and FILETYPES. That wasn't the question.
A path would be everything up to a file and not including the file.
Path: the way to get from here to there. Not the destination.

Last edited by Slider; 12-30-2004 at 02:05 PM.
12-30-2004, 02:15 PM   #12
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
this already exist - not addition

PHP Code:
// no links with guestbook forum cgi-bin webring affiliates
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(guestbook|forum|cgi-bin|webring|affiliates|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 
remove space

why you not want .cgi .php .asp .pl - dynamic pages

FORBIDDEN_EXTENSIONS can be more than extensions

config is for you to config - make regex you want
12-30-2004, 03:45 PM   #13
Slider
Orange Mole
 
Join Date: Jan 2004
Posts: 30
www.domain.com/links/file.ext

Note the /links/ part of the URL (shown in bold in the original post) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path

I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.
12-30-2004, 04:02 PM   #14
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
awk - you not understand - make the FORBIDDEN_EXTENSIONS regex you want

PHP Code:
// no links with /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 
you need to go admin update - delete links you no want - run the cleans - edit FORBIDDEN_EXTENSIONS - relax - index after that

ps - what is not clear
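To see what the slashes add, here is a small sketch comparing the bare keywords from the earlier post with the slash-delimited directory names above. It uses `preg_match` with `#` as the delimiter (so the slashes need no escaping) rather than PhpDig's `eregi`; the constant names, helper, and URLs are placeholders, and the extension part of the pattern is left out for brevity.

```php
<?php
// Bare keywords: match anywhere in the link, including inside file names.
const BARE = '#(guestbook|forum|cgi-bin|webring|affiliates)#i';
// Slash-delimited: match only a directory with exactly that name.
const DIRS = '#(/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/)#i';

// Hypothetical helper (not PhpDig code).
function hits(string $pattern, string $url): bool
{
    return preg_match($pattern, $url) === 1;
}

// Placeholder URLs for illustration.
$urls = [
    'http://www.example.com/forum/index.html',
    'http://www.example.com/articles/forum-etiquette.html',
    'http://www.example.com/about.html',
];

foreach ($urls as $url) {
    printf(
        "%s\n  bare=%s dirs=%s\n",
        $url,
        var_export(hits(BARE, $url), true),
        var_export(hits(DIRS, $url), true)
    );
}
```

The second URL is caught by the bare keyword but not by the directory form, which is the distinction between banning a word and banning a path.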

Last edited by rAdoN; 12-30-2004 at 04:14 PM.
12-30-2004, 06:07 PM   #15
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Quote:
Originally Posted by Slider
www.domain.com/links/file.ext

Note the /links/ part of the URL (shown in bold in the original post) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path

I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.
Just use your robots.txt file to do that, like so:
Code:
User-agent: Phpdig
Disallow: /links/
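For comparison with the regex approach, a `Disallow` line is conventionally a path prefix, and robots.txt is read from the site being crawled, so this route fits sites you control. A minimal sketch of prefix matching follows; `is_disallowed` is a hypothetical helper, and how closely PhpDig's own robots.txt handling follows this convention is not confirmed here.

```php
<?php
// Hypothetical helper: prefix test in the style of a robots.txt Disallow rule.
// An empty rule is treated as "nothing disallowed".
function is_disallowed(string $rule, string $path): bool
{
    return $rule !== '' && strncmp($path, $rule, strlen($rule)) === 0;
}

var_dump(is_disallowed('/links/', '/links/file.ext'));        // true: path starts with /links/
var_dump(is_disallowed('/links/', '/pages/links/file.ext'));  // false: /links/ is not at the start

// An unanchored regex, by contrast, finds /links/ at any depth.
var_dump(preg_match('#/links/#', '/pages/links/file.ext') === 1); // true
```

So the prefix rule covers a top-level /links/ directory only, while a pattern such as `/links/` in FORBIDDEN_EXTENSIONS would match the directory wherever it appears in the path.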
All times are GMT -8.

