View Full Version : Forking when spidering


cybercox
03-26-2004, 06:50 AM
Hi all,
I have completed a little script for the spider process.
The code forks while spidering a site. It is only experimental code.
So if someone wants to improve it... let me know!
The base code (the part that processes the URL) is taken from Charter's spider.php. My code only wraps the base code.
At the moment the program forks up to 20 times when spidering, and it needs a site to already be present in the db.
So: no records in the tempspider table, no spidering...
It doesn't implement the same logic as spider.php, for several reasons.
1: It doesn't lock the site it is spidering.
2: It uses one connection to MySQL in the parent process and separate ones in the children (see the comments) to avoid the "LOST CONNECTION" problem.
3: It has a very basic clean-up routine called every 30 minutes... this is because duplicate inserts into the temporary table are still possible. I have no idea how to avoid this... suggestions are very appreciated!
4: NOT ALL the fantastic features that make phpdig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that)... it is very far from being a good program!
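
For anyone trying to picture the wrapping logic described above, here is a minimal sketch under stated assumptions: fork one child per URL, keep at most 20 children alive at once, and let each child open its own database connection instead of reusing the parent's. This is not the attachment from this thread. It is written in Python rather than PHP purely for illustration, and the names spider_forking and process_url are hypothetical.

```python
import os

MAX_CHILDREN = 20  # the cap mentioned in the post


def _reap_one():
    """Block until any child exits; return 1 if it exited with status 0, else 0."""
    _, status = os.wait()
    return int(os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0)


def spider_forking(urls, process_url, max_children=MAX_CHILDREN):
    """Run process_url(url) in a forked child for every URL.

    Never keeps more than max_children children alive at once.
    process_url is a hypothetical callable; it should open its OWN database
    connection inside the child, because a connection inherited across
    fork() is shared with the parent and tends to break.
    Returns the number of children that finished without an error.
    """
    running = 0
    succeeded = 0
    for url in urls:
        if running >= max_children:
            # Pool is full: wait for one child before forking another.
            succeeded += _reap_one()
            running -= 1
        pid = os.fork()
        if pid == 0:
            # Child process: do the work, then leave without returning
            # into the parent's loop.
            exit_code = 1
            try:
                process_url(url)
                exit_code = 0
            except Exception:
                exit_code = 1
            finally:
                os._exit(exit_code)
        running += 1
    # Collect the children that are still running.
    while running:
        succeeded += _reap_one()
        running -= 1
    return succeeded
```

The parent only counts and waits here; everything that touches the network or the database happens in the children. A real spider would still need the politeness features listed in point 4 before being pointed at anyone else's server.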


With this script I've indexed up to 1000 documents in 2 hours. It is very fast, and very dangerous too. It could fill up your disk very quickly.

USE IT WITH CARE!!! IT LOOPS......

Place the file in your admin directory and rename it to spider_fork.php

Create a new column in the tempspider table called HASH, of type timestamp or varchar(250)

To call it you must first set up the records in the database and then type from the shell:
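
The post does not say what the HASH column holds. Purely as an assumption, if it stored a digest of the URL, a unique key on it would be one possible way to rule out the duplicate inserts mentioned in point 3. The sketch below shows that idea in Python with SQLite standing in for MySQL; the two-column table layout and the names open_queue and queue_url are hypothetical, not PhpDig's real schema.

```python
import hashlib
import sqlite3


def open_queue(path=":memory:"):
    """Open a stand-in for the tempspider table (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tempspider ("
        " url TEXT NOT NULL,"
        " hash TEXT NOT NULL UNIQUE)"  # the unique key is what blocks duplicates
    )
    return conn


def queue_url(conn, url):
    """Queue a URL once; return True if it was new, False if already queued."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cur = conn.execute(
        # OR IGNORE silently skips a row that would violate the unique key.
        "INSERT OR IGNORE INTO tempspider (url, hash) VALUES (?, ?)",
        (url, digest),
    )
    conn.commit()
    return cur.rowcount == 1
```

With the database enforcing uniqueness, several children can try to queue the same link and only the first insert lands, so a periodic clean-up pass has less to do. MySQL offers the same pattern through a UNIQUE index combined with INSERT IGNORE.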

php -f spider_fork.php

Enjoy and let me know what you think

Bye

Simone Capra

capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it

Charter
03-26-2004, 07:38 PM
Hi. I've downloaded the code and will take a closer look when I get a chance. However, I've also removed the attachment, and here's why.

>> NOT ALL the fantastic features that make phpdig what it is are implemented (no excludes, no robots.txt, no levels no this, no that)... is very far to be a good program!

Users who run PhpDig on their own sites would likely account for personal preference and take care not to exceed their bandwidth allotment or server resources. Care should also be taken not to disregard someone else's preference or adversely affect someone else's machine or pocketbook.

Also, the attachment since removed may get PhpDig placed on bad bot lists, especially because the user agent in the since removed code lists the PhpDig.net robot information page which says PhpDig should obey a robots.txt file.

This isn't a personal slam or anything like that. It's just a note to let PhpDig users know that, in order to keep PhpDig a benevolent and viable open source project, PhpDig and modifications thereto need to be as "Net Friendly" as they can be.

cybercox
03-27-2004, 01:59 AM
Well charter!
I totally agree with you. Perhaps you might share it with the ones who are really interested in forking and in developing this technology...
I'm running it and it is very dangerous...
It downloads everything and does not use the bandwidth in a "clever" manner.

>USE IT WITH CARE!!! IT LOOPS......

That could be a problem for end users who don't really know what it means. At the moment the code loops on every site...

But I think it is a good start, to let you know IT COULD BE A WAY to speed up the spider and to use all your resources (I'm thinking of a company environment... my resources are what I spent to buy them... :-))

Anyway: I was looking for an efficient way to use ALL the bandwidth and ALL the processor :-)


Personally, I think I owe phpdig something... and I really don't take it as a "personal slam", since the GPL says to share modifications, and what is done with the modifications is up to the project owner... Well, this is my personal mod; it's very dangerous but it works... and it is not so complicated. So, people out there who want to fork when spidering: you can get my code from Charter by asking.

I FORGOT php must be compiled with
--enable-pcntl
for forking to function....

If I could, I would make all this technology suitable for end users, but time is what I don't have, so.... And not everybody can recompile PHP to get the pcntl functions....

Sharing the code with you is what I wanted and what I got.

Regards and let me know something!!!!
Simone Capra



rockyourbody
09-27-2004, 12:36 PM
I run phpDig on my internal network for searching documents within our organisation. I could really do with a method of speeding up the spider, as it currently takes me over 30 hours to index all of the pages available.

Could you post this change please?

Charter
09-28-2004, 04:30 AM
The file should not be redistributed for reasons already mentioned.

rockyourbody
09-29-2004, 12:53 AM
Sorry Charter, but that simply isn't good enough. This is an open source project, yet you are censoring other people's work because it doesn't fit in with your ideals, regardless of how it may actually benefit some of us.

From where I'm standing, this runs contrary to the whole point of open source and I'm very disappointed, not to mention put out by your stance.

vinyl-junkie
09-29-2004, 07:18 AM
>> Sorry Charter, but that simply isn't good enough.

If Charter could be sure that no one would use this mod to abuse the bandwidth on someone else's server, I'm sure she would be more than willing to have this redistributed en masse. However, you know and I know that no such guarantee could ever be made.

Any misuse of phpdig - and consuming mass quantities of bandwidth on someone else's server would clearly be a misuse of this software - would reflect badly on phpdig. Do you really want to go there?

You don't have to care about phpdig's reputation, but Charter does. If there's a chance that reputation could be tarnished, then I think it would be a dangerous thing to allow this mod to be redistributed.

Just my $0.02.

Charter
09-29-2004, 08:21 AM
Yes, vinyl-junkie, what you state in your post, plus right from the GNU GPL FAQs...

If I know someone has a copy of a GPL-covered program, can I demand he give me a copy? (http://www.gnu.org/licenses/gpl-faq.html#CanIDemandACopy)

No. The GPL gives him permission to make and redistribute copies of the program if he chooses to do so. He also has the right not to redistribute the program, if that is what he chooses.

rockyourbody
09-29-2004, 09:22 AM
Well I'm unhappy. :(

cybercox
09-30-2004, 05:00 AM
Well, since the mod is mine.....
I choose not to distribute it, basically to respect Charter's work.
The mod is very powerful; it actually doesn't loop anymore, but it still has a lot of problems. For example, to respect the standard it reads robots.txt each time it fetches a page.
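
On the robots.txt point, one common way to avoid re-reading the file for every page is to parse it once per host and keep the parsed rules in memory. The following is a rough sketch of that idea, not code from the mod; it uses Python's standard urllib.robotparser, and the names RobotsCache and fetch_robots are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots(robots_url):
    """Default fetcher: return the robots.txt body, or '' if it can't be read.

    Simplification (an assumption of this sketch): any failure is treated
    as "no rules"; a stricter spider might treat some errors as disallow-all.
    """
    try:
        with urlopen(robots_url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


class RobotsCache:
    """Parse each host's robots.txt once and answer later checks from memory."""

    def __init__(self, user_agent="*", fetch=fetch_robots):
        self.user_agent = user_agent
        self.fetch = fetch
        self._parsers = {}  # (scheme, host) -> RobotFileParser

    def allowed(self, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        parser = self._parsers.get(key)
        if parser is None:
            # First URL seen for this host: fetch and parse its rules once.
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self.fetch(robots_url).splitlines())
            self._parsers[key] = parser
        return parser.can_fetch(self.user_agent, url)
```

In a forking spider the cache lives in one process, so it probably makes most sense for the parent to call allowed() before handing a URL to a child; otherwise every child would fetch robots.txt again.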

So, Charter has her own copy. If she likes, she can do whatever she wants with it.
Regards and many thanks to charter,
Simone Capra

rockyourbody
09-30-2004, 11:56 PM
Such a waste :( Killed by the GPL nazis

shamu
10-12-2004, 09:40 PM
So is there no way to fix the code to respect robots.txt?

I'm just wondering.