#1 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
Ban features
When I index a huge list of sites, I may add the list without looking through it, simply because it is so large.
In the config:
- AutoBan certain domains. Example: freeservers, Geocities and others.
- AutoBan universal directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all that are crawled). I understand that this is already in PhpDig, but I have yet to get it working.
- AutoBan certain domains by extension: .biz, .info, .this, .that (if you run an English-only site, this would be very helpful to weed out sites like .ru and others).
This would be a great feature to reduce the size of a bloated MySQL database.
#2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Set the words you want to ban in the BANNED constant in the config file.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
#3 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
If I want to use a ., does it have to be escaped, as in about\.com, or can I just use about.com?
#4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Basically, in a regular expression an unescaped . matches any single character, whereas \. matches a literal period.
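The difference can be checked with a minimal sketch. It assumes PHP's preg_match() (PCRE) as a stand-in for the eregi() call PhpDig uses, since eregi() was removed in PHP 7; the helper name host_matches and the sample host names are hypothetical.

```php
<?php
// Illustration only: preg_match() stands in for PhpDig's eregi() call.
// "." and "\." mean the same thing in both regex flavours.

// Hypothetical helper: true when $pattern matches anywhere in $host, ignoring case.
function host_matches(string $pattern, string $host): bool
{
    return preg_match('~' . $pattern . '~i', $host) === 1;
}

// Unescaped dot: also matches hosts that merely resemble about.com.
var_dump(host_matches('about.com', 'www.about.com'));      // bool(true)
var_dump(host_matches('about.com', 'www.aboutxcom.net'));  // bool(true)

// Escaped dot: only a literal period is accepted.
var_dump(host_matches('about\.com', 'www.about.com'));     // bool(true)
var_dump(host_matches('about\.com', 'www.aboutxcom.net')); // bool(false)
```

So about.com still bans the intended domain, but it can also catch unrelated hosts; about\.com is the stricter form.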
#5 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
Thanks Charter
Your help greatly reduced the MySQL database by keeping out all the crud pages that I should not have allowed to be crawled. I also found an excellent tutorial at this place. It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look. php.net wasn't as helpful as it usually is.
#6 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
OK, it kept me from asking you 9 more questions, maybe......
http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes
I added guestbook to BANNED to have it ignored (it was a no go for some reason). I also added cgi to FORBIDDEN_EXTENSIONS to have the cgi file type ignored (it was a no go also). This is after a fresh crawl with nothing else done.
Code:
// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|guestbook|geocities|8m|directory|affiliate|groups|');
// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');
The code box above made geocities and tar look weird with a space, but it's right in the config. Maybe it's my browser playing tricks on me.
#7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found and, if you want, remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file.
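Both points can be seen on the URL from the earlier post. This is only a sketch under assumptions: preg_match() and parse_url() stand in for PhpDig's eregi() call and its $regs array, whose exact contents are not shown in this thread, and the helper name matches is hypothetical.

```php
<?php
// Illustration only: not PhpDig code. parse_url() approximates the
// "domain name only" versus "entire link" distinction described above.

$url = 'http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes';

// Hypothetical helper: case-insensitive search for $pattern in $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match('~' . $pattern . '~i', $subject) === 1;
}

$host = parse_url($url, PHP_URL_HOST); // "www.horse-riding.net"
$path = parse_url($url, PHP_URL_PATH); // "/cgi-bin/guestbook/book.cgi"

// "guestbook" appears in the path, not in the host name.
var_dump(matches('guestbook', $host)); // bool(false)
var_dump(matches('guestbook', $url));  // bool(true)

// "$" anchors the match to the end of the string, and the full link
// ends in "refresh=yes", not ".cgi".
var_dump(matches('\.(cgi|php)$', $url));  // bool(false)
var_dump(matches('\.(cgi|php)', $url));   // bool(true)
var_dump(matches('\.(cgi|php)$', $path)); // bool(true)
```

One trade-off of dropping the $ is that short alternatives such as pl or z can then match anywhere a period is followed by those letters, including inside a host name, so the unanchored pattern bans more than file extensions.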
#8 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
I tried $regs[2] as the replacement, as suggested, and it resulted in no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your scripts. This is something I really need to work, if at all possible. Thank you for taking the time to answer my questions.
I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed?
#9 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
PhpDig needs the functionality to ban any URL that contains a directory/path name you wish to ban. This thread was started in, and intended for, the Mod Request part of the forums. It is a Mod Request, since PhpDig does not have this function.
I am still left with no solution and in the wrong part of the forums. Could you possibly move this back to its original placement?
#10 |
Green Mole
Join Date: Oct 2004
Posts: 27
|
The config already does this, so it is not a mod request.
!eregi(BANNED,$regs[2]) works for banning keywords. Learn regex and use FORBIDDEN_EXTENSIONS.
PHP Code:
#11 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
Thanks rAdoN,
I'm sure that will help later when I get to filenames and types. I wanted to ban certain PATHS, as in /links/ /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/. There are a lot of common paths that websites use that could be ignored, which would greatly reduce the size of the MySQL database. I would think that many people would find this a useful addition to PhpDig.
Your help covered FILENAMES and FILETYPES; that wasn't the question. A path is everything up to a file, not including the file. Path: the way to get from here to there, not the destination.
Last edited by Slider; 12-30-2004 at 02:05 PM.
#12 |
Green Mole
Join Date: Oct 2004
Posts: 27
|
This already exists, so it is not an addition.
PHP Code:
Why do you not want .cgi, .php, .asp, .pl? Those are dynamic pages. FORBIDDEN_EXTENSIONS can hold more than extensions. The config is there for you to configure, so write the regex you want.
#13 |
Orange Mole
Join Date: Jan 2004
Posts: 30
|
www.domain.com/links/file.ext
Note that the /links/ part of the URL is the part I need banned. Not a file or an extension. A DIRECTORY leading to a file, also called the path. I need the directory called LINKS to be a banned directory: if the spider gets to that directory, it doesn't follow it any further. I honestly do appreciate your help. An answer of any kind is better than no answer at all.
#14 |
Green Mole
Join Date: Oct 2004
Posts: 27
|
awk - you do not understand. Make FORBIDDEN_EXTENSIONS the regex you want.
PHP Code:
P.S. What is not clear?
Last edited by rAdoN; 12-30-2004 at 04:14 PM.
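A regex that targets a directory rather than a file or extension could look like the sketch below. It is an assumption-laden illustration, not PhpDig code: preg_match() stands in for eregi(), the function names and directory list are hypothetical, and whether PhpDig applies BANNED or FORBIDDEN_EXTENSIONS to the full link depends on the installed version and on the $regs index discussed earlier in the thread.

```php
<?php
// Hypothetical list of directory names to skip, taken from the paths
// mentioned in this thread.
$bannedDirs = ['links', 'guestbook', 'forum', 'cgi-bin', 'webring', 'affiliates'];

// Builds a pattern that matches any banned name as a complete path
// segment, i.e. with a slash on both sides.
function banned_path_pattern(array $dirs): string
{
    $quoted = array_map(function ($d) {
        return preg_quote($d, '~');
    }, $dirs);
    return '~/(' . implode('|', $quoted) . ')/~i';
}

// True when $url contains one of the banned directories.
function is_banned_path(string $url, array $dirs): bool
{
    return preg_match(banned_path_pattern($dirs), $url) === 1;
}

var_dump(is_banned_path('www.domain.com/links/file.ext', $bannedDirs));      // bool(true)
var_dump(is_banned_path('www.domain.com/hyperlinks/file.ext', $bannedDirs)); // bool(false)
var_dump(is_banned_path('www.domain.com/links.html', $bannedDirs));          // bool(false)
```

Requiring the slashes on both sides is what separates a directory named links from a file such as links.html or a directory such as hyperlinks. The same idea written as a plain alternation would be /(links|guestbook|forum|cgi-bin|webring|affiliates)/, with no trailing | at the end, because an empty alternative matches the empty string in PCRE and is handled differently by other regex engines.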
#15 | |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Quote:
Code:
User-agent: Phpdig
Disallow: /links/