PhpDig.net

03-26-2004, 06:50 AM   #1
cybercox
Green Mole
 
Join Date: Jan 2004
Location: Italy
Posts: 11
Forking when spidering

Hi all,
I have finished a small script for the spider process.
It forks while spidering. It is only experimental code,
so if someone wants to improve it, let me know!
The base code (the part that processes a URL) is taken from Charter's spider.php. My code only wraps the base code.
Currently the program forks up to 20 times when spidering, and it needs a site to be already present in the db.
So: no records in the tempspider table, no spidering.
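
For readers unfamiliar with the pattern, a capped forking loop might look roughly like the sketch below. This is not the removed attachment: it is written in Python rather than PHP (where pcntl_fork and pcntl_waitpid would play the roles of os.fork and os.waitpid), the cap of 20 is the only detail taken from the post, and process_url and spider_forking are hypothetical names. It needs a Unix-like system.

```python
import os
from collections import deque

MAX_CHILDREN = 20  # cap taken from the post; everything else is assumed


def process_url(url):
    """Hypothetical placeholder for the per-URL work; returns an exit code."""
    return 0


def spider_forking(urls, worker=process_url, max_children=MAX_CHILDREN):
    """Fork one child per URL, with at most max_children alive at once.

    Returns a dict mapping each child's pid to its exit code.
    """
    running = deque()  # pids of children that have not been reaped yet
    exit_codes = {}

    def reap_oldest():
        # Wait for the oldest child; simple, though not the most eager policy.
        pid = running.popleft()
        _, status = os.waitpid(pid, 0)
        exit_codes[pid] = os.WEXITSTATUS(status)

    for url in urls:
        if len(running) >= max_children:
            reap_oldest()  # block until a slot frees up
        pid = os.fork()
        if pid == 0:
            # Child: a real spider would open its OWN database connection
            # here instead of reusing the parent's.
            code = 1
            try:
                code = int(worker(url))
            finally:
                os._exit(code)  # never fall back into the parent's loop
        running.append(pid)

    while running:
        reap_oldest()
    return exit_codes
```

Exiting the child with os._exit keeps it from continuing the parent's loop and forking children of its own, which is one way a forking spider can run away.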
It doesn't implement the same logic as spider.php, for several reasons:
1: It doesn't lock the site that is being spidered.
2: It uses one connection to MySQL in the parent process and separate ones in the children (see the comments) to avoid the "LOST CONNECTION" problem.
3: It has a very basic clean-up routine called every 30 minutes. This is because duplicate inserts in the temporary table are still possible. I have no idea how to avoid this; suggestions are very appreciated!
4: NOT ALL the fantastic features that make PhpDig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that). It is very far from being a good program!
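
On point 3, one common way to stop several processes from queueing the same URL is to let the database reject duplicates through a unique constraint. The sketch below is only an illustration of that idea, not something from the mod: it uses Python's sqlite3 as a stand-in for MySQL, and the two-column schema and the names open_queue and enqueue are assumptions. In MySQL, a unique index combined with INSERT IGNORE would have a similar effect.

```python
import sqlite3


def open_queue():
    """Create an in-memory stand-in for the tempspider table (schema assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tempspider ("
        "site_id INTEGER NOT NULL, "
        "url TEXT NOT NULL, "
        "UNIQUE (site_id, url))"
    )
    return conn


def enqueue(conn, site_id, url):
    """Queue a URL; return True if it was new, False if already queued."""
    try:
        conn.execute(
            "INSERT INTO tempspider (site_id, url) VALUES (?, ?)",
            (site_id, url),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a duplicate row.
        conn.rollback()
        return False
```

With the constraint enforced by the database, two children that discover the same link at the same moment cannot both insert it, so a periodic clean-up pass would have less to do.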


With this script I've indexed up to 1000 documents in 2 hours. It is very fast, and very dangerous too. It could fill up your disk very quickly.

USE IT WITH CARE!!! IT LOOPS......

Place the file in your admin directory and rename it to spider_fork.php.

Create a new column in the tempspider table called HASH, of type timestamp or varchar(250).

To call it, you must first set up the records in the database and then type from the shell:

php -f spider_fork.php

Enjoy and let me know what you think

Bye

Simone Capra

capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it
03-26-2004, 07:38 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. I've downloaded the code and will take a closer look when I get a chance. However, I've also removed the attachment, and here's why.

>> NOT ALL the fantastic features that make PhpDig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that). It is very far from being a good program!

Users who run PhpDig on their own sites would likely account for personal preference and take care not to exceed their bandwidth allotment or server resources. Care should also be taken not to disregard someone else's preference or adversely affect someone else's machine or pocketbook.

Also, the attachment since removed may get PhpDig placed on bad bot lists, especially because the user agent in the since removed code lists the PhpDig.net robot information page which says PhpDig should obey a robots.txt file.

This isn't a personal slam or anything like that. It's just a note to let PhpDig users know that, in order to keep PhpDig a benevolent and viable open source project, PhpDig and modifications thereto need to be as "Net Friendly" as they can be.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
03-27-2004, 01:59 AM   #3
cybercox
Green Mole
 
Join Date: Jan 2004
Location: Italy
Posts: 11
Well, Charter!
I totally agree with you. Perhaps you might share it with those who are really interested in forking and in developing this technique.
I'm running it and it is very dangerous:
it downloads everything and does not use the bandwidth in a "clever" manner.

>USE IT WITH CARE!!! IT LOOPS......

This could be a problem for end users who don't really know what it means. In fact, the code loops over every site...

But I think it is a good start to show you that IT COULD BE A WAY to speed up the spider and to use all your resources (I'm thinking of a company environment: my resources are what I spent to buy them... :-))

Anyway: I was looking for an efficient way to use ALL the bandwidth and ALL the processor :-)


Personally I think I owe PhpDig something, and I really don't take it as a "personal slam", since the GPL says to share modifications, and what is done with the modifications is up to the project owner. Well, this is my personal mod: it's very dangerous, but it works, and it is not so complicated. So, people out there who want to fork when spidering: you can ask Charter to share my code.

I FORGOT: PHP must be compiled with
--enable-pcntl
for forking to work.

If I could, I would make all this suitable for end users, but time is what I don't have... and not everybody can recompile PHP to get the pcntl functions.

Sharing the code with you is what I wanted, and what I got.

Regards, and let me know something!
Simone Capra


09-27-2004, 12:36 PM   #4
rockyourbody
Green Mole
 
Join Date: Sep 2004
Posts: 5
I need this script...

I run PhpDig on my internal network to search documents within our organisation. I could really do with a method of speeding up the spider, as it currently takes me over 30 hours to index all of the pages available.

Could you post this change please?
09-28-2004, 04:30 AM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
The file should not be redistributed for reasons already mentioned.
09-29-2004, 12:53 AM   #6
rockyourbody
Green Mole
 
Join Date: Sep 2004
Posts: 5
Sorry Charter, but that simply isn't good enough. This is an open source project, yet you are censoring other people's work because it doesn't fit in with your ideals, regardless of how it may actually benefit some of us.

From where I'm standing, this runs contrary to the whole point of open source and I'm very disappointed, not to mention put out, by your stance.

Last edited by rockyourbody; 09-29-2004 at 12:56 AM. Reason: Spelling
09-29-2004, 07:18 AM   #7
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Quote:
Originally Posted by rockyourbody
Sorry Charter, but that simply isn't good enough.
If Charter could be sure that no one would use this mod to abuse the bandwidth on someone else's server, I'm sure she would be more than willing to have this redistributed en masse. However, you know and I know that no such guarantee could ever be made.

Any misuse of phpdig - and consuming mass quantities of bandwidth on someone else's server would clearly be a misuse of this software - would reflect badly on phpdig. Do you really want to go there?

You don't have to care about phpdig's reputation, but Charter does. If there's a chance that reputation could be tarnished, then I think it would be a dangerous thing to allow this mod to be redistributed.

Just my $0.02.
09-29-2004, 08:21 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Yes, vinyl-junkie, what you state in your post, plus right from the GNU GPL FAQs...
09-29-2004, 09:22 AM   #9
rockyourbody
Green Mole
 
Join Date: Sep 2004
Posts: 5
Well I'm unhappy.
09-30-2004, 05:00 AM   #10
cybercox
Green Mole
 
Join Date: Jan 2004
Location: Italy
Posts: 11
Well, since the mod is mine...
I choose not to distribute it, basically out of respect for Charter's work.
The mod is very powerful. It no longer loops, but it still has a lot
of problems, such as reading robots.txt (to respect the standard) each time it fetches
a page.
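
The cost of re-reading robots.txt for every page can usually be avoided by parsing it once per host and reusing the result. The sketch below shows that idea with Python's standard urllib.robotparser; it is not taken from the mod, and the RobotsCache class, the fetch_lines callback and the "PhpDig" user agent string are assumptions made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Parse each host's robots.txt once and answer later checks from memory."""

    def __init__(self, user_agent, fetch_lines):
        self.user_agent = user_agent
        # fetch_lines is a caller-supplied function: robots.txt URL -> list of lines
        self._fetch_lines = fetch_lines
        self._parsers = {}
        self.fetches = 0  # how many times robots.txt was actually downloaded

    def allowed(self, url):
        """Return True if the cached rules let user_agent fetch url."""
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(host)
        if parser is None:
            parser = RobotFileParser()
            parser.parse(self._fetch_lines(host + "/robots.txt"))
            self.fetches += 1
            self._parsers[host] = parser
        return parser.can_fetch(self.user_agent, url)
```

In a forking spider each child would hold its own copy of such a cache, so sharing the parsed rules between processes (for example through the database) would be a separate design question.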

So, Charter has her own copy. If she likes, she can do whatever she wants with it.
Regards and many thanks to Charter,
Simone Capra
09-30-2004, 11:56 PM   #11
rockyourbody
Green Mole
 
Join Date: Sep 2004
Posts: 5
Such a waste. Killed by the GPL nazis.
10-12-2004, 09:40 PM   #12
shamu
Green Mole
 
Join Date: Oct 2004
Posts: 4
So is there no way to fix the code to respect robots.txt?

I'm just wondering.