multiple crawlers


xibalba
03-14-2004, 12:47 PM
Does anyone think this is a good way to get phpDig to run multiple crawlers for the sites in the database? When I ran spider.php I noticed that it would only do one site at a time, and when it came across a site such as dmoz.org it would take days, if not weeks, to index all of it.

http://rbhs.ath.cx/~reza/phpdig/wrapper.php

Charter
03-14-2004, 03:54 PM
<?php
// These two lines only display the source and stop; keep them commented
// out to actually run the crawler.
// show_source("wrapper.php");
// exit;
error_reporting(E_ERROR);

$db = "phpdig";
$threads = 6;

//Make connection
$link = mysql_connect('localhost', 'root', '');
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

//Select DB
$db_selected = mysql_select_db($db, $link);
if (!$db_selected) {
    die("Can't use $db : " . mysql_error());
}

//How many sites are currently being crawled?
$SQL = "select locked from sites where locked = 1";

$result = mysql_query($SQL);
if (!$result) {
    die('Invalid query: ' . mysql_error());
}

$num_rows = mysql_num_rows($result);

//If as many sites are being crawled as threads we want to run, exit.
if ($num_rows >= $threads)
{
    //Drop out, we don't need to initiate any more crawlers
    echo "Doing nothing...";
    exit;
}
else
{
    //The number of threads we should start
    $threads = $threads - $num_rows;

    //Find sites currently not being crawled
    $SQL2 = "select site_url from sites where locked = 0";
    $result2 = mysql_query($SQL2);
    if (!$result2)
    {
        die('Invalid query: ' . mysql_error());
    }
    else
    {
        $free_sites = mysql_num_rows($result2);
        for ($i = 0; $i < $threads && $free_sites > 0; $i++)
        {
            //rand() is inclusive at both ends, so the last valid row is $free_sites - 1
            $rowNum = rand(0, $free_sites - 1);
            mysql_data_seek($result2, $rowNum);
            $tmp = mysql_fetch_row($result2);

            $url = parse_url($tmp[0]);
            //Quote both values so the shell cannot misread them
            $cmd = "screen -A -m -d -S ".escapeshellarg($url['host']."_phpdig")." php -f spider.php ".escapeshellarg($tmp[0]);
            system($cmd);
        }
    }
}

mysql_close($link);

?>

Hi. I copied the code here for ease of reading. One thing I noticed is that $rowNum is random, so it might be possible to get the same $rowNum more than once in the loop.
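
One possible way around that, as a minimal sketch: shuffle the list of row indices once and take the first few, so no index can come up twice. The helper name pick_distinct_rows is made up for illustration and is not part of phpDig or wrapper.php.

```php
<?php
// Hypothetical helper (not part of phpDig): choose up to $want distinct
// row indices from a result set holding $numRows rows.
function pick_distinct_rows($numRows, $want)
{
    if ($numRows <= 0 || $want <= 0) {
        return array();
    }
    $indices = range(0, $numRows - 1); // every valid row index, once each
    shuffle($indices);                 // random order, so no repeats
    return array_slice($indices, 0, min($want, $numRows));
}

// Possible use inside the wrapper's loop (assumes $result2 and $threads
// as in the code above):
// foreach (pick_distinct_rows(mysql_num_rows($result2), $threads) as $rowNum) {
//     mysql_data_seek($result2, $rowNum);
//     $tmp = mysql_fetch_row($result2);
//     ...
// }
```

Because every index appears exactly once in the shuffled list, this also caps the number of crawlers at the number of unlocked sites.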

xibalba
03-14-2004, 08:11 PM
I made the following fix.


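$rowNum = -1;
for($i = 0; $i < $threads; $i++)
{
    //rand() is inclusive, so the highest valid row is num_rows - 1
    $tmp = rand(0, mysql_num_rows($result2) - 1);
    //Duplicate of $rowNum: run one crawler fewer
    if($tmp == $rowNum)
        $threads--;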

marb
03-15-2004, 07:59 AM
I think it's a nice option.
But where should the file be installed?
And how do I get it to work?

Marten :)

xibalba
03-15-2004, 08:37 AM
Hey, I usually just run it in the same directory as spider.php

%pwd
/usr/home/reza/public_html/phpdig/admin
%php -f wrapper.php
screen -A -m -d -S freebsd.org_phpdig php -f spider.php http://freebsd.org/
screen -A -m -d -S openbsd.org_phpdig php -f spider.php http://openbsd.org/

%screen -list
There are screens on:
13219.daily.daemonnews.org_phpdig (Detached)
10700.freebsd.org_phpdig (Detached)
13241.openbsd.org_phpdig (Detached)
88053.staff.daemonnews.org_phpdig (Detached)
88057.seclists.org_phpdig (Detached)
6 Sockets in /tmp/screens/S-reza.

%

You can also add it to crontab to have it run whenever you want.
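
For anyone adapting the script, here is a rough sketch of how one of the screen commands shown above could be put together with the shell arguments quoted. The function name build_crawl_command is hypothetical and not something wrapper.php defines; the screen flags are the ones from the listing above.

```php
<?php
// Hypothetical helper (not in the original wrapper.php): build the
// "screen" command line for one site, quoting every value that ends up
// in the shell.
function build_crawl_command($siteUrl)
{
    $parts = parse_url($siteUrl);
    if ($parts === false || empty($parts['host'])) {
        return null; // no host, so no session name can be derived
    }
    $session = $parts['host'] . '_phpdig';
    return 'screen -A -m -d -S ' . escapeshellarg($session)
         . ' php -f spider.php ' . escapeshellarg($siteUrl);
}

// Example: print the command for one site instead of running it.
echo build_crawl_command('http://freebsd.org/'), "\n";
```

Printing the command first, rather than passing it straight to system(), makes it easy to check what would run before putting the wrapper into crontab.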

jmitchell
01-06-2005, 06:18 PM
I guess I'm not totally understanding how this works...

If I'm correct, you install this script, and run it via cron jobs, and it will see if there are sites to be indexed, and add multiple spiders to handle it - right?
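
Going by the wrapper code posted earlier in the thread, the decision made on each run could be summarized roughly as below. The function name and parameters are illustrative only, and the sketch assumes that spider.php is what sets the locked flag on a site, which the thread does not confirm.

```php
<?php
// Hypothetical summary of the wrapper's decision: how many new spiders
// one run should start, given the configured limit, the number of sites
// already locked (being crawled), and the number still unlocked.
function spiders_to_start($maxThreads, $lockedSites, $unlockedSites)
{
    $free = max(0, $maxThreads - $lockedSites); // open slots under the limit
    return min($free, $unlockedSites);          // never more than sites waiting
}
```

With a limit of 6 and 2 sites already locked, a run would start up to 4 more spiders; once 6 are locked it would start none.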

jmitchell

jmitchell
01-06-2005, 06:21 PM
Also, I don't see anything at this link: http://rbhs.ath.cx/~reza/phpdig/wrapper.php

jmitchell