PhpDig.net

xibalba 03-14-2004 12:47 PM

multiple crawlers
 
Does anyone think this is a good way to get phpDig to run multiple crawlers on sites in the database? When I ran spider.php, I noticed that it would only do one site at a time, and when it came across a site such as dmoz.org it would take days, if not weeks, to index all of it.

http://rbhs.ath.cx/~reza/phpdig/wrapper.php

Charter 03-14-2004 03:54 PM

PHP Code:

<?php
// Shows this file's source and stops; remove these two lines to actually run the wrapper.
show_source("wrapper.php");
exit;
// E_FATAL is not a PHP constant; E_ERROR reports fatal run-time errors only.
error_reporting(E_ERROR);

$db = "phpdig";
$threads = 6;

//Make connection (the mysql_* functions used here were removed in PHP 7.0)
$link = mysql_connect('localhost', 'root', '');
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

//Select DB
$db_selected = mysql_select_db($db, $link);
if (!$db_selected) {
    die("Can't use $db: " . mysql_error());
}

//How many sites are currently being crawled?
$SQL = "select locked from sites where locked = 1";

$result = mysql_query($SQL);
if (!$result) {
    die('Invalid query: ' . mysql_error());
}

$num_rows = mysql_num_rows($result);

//If num_rows (sites currently being crawled) has reached the number of threads we want to run, exit.
if ($num_rows >= $threads) {
    //Drop out, we don't need to initiate any more crawlers
    echo "Doing nothing...";
    exit;
} else {
    //The number of threads we should start
    $threads = $threads - $num_rows;

    //Find sites currently not being crawled
    $SQL2 = "select site_url from sites where locked = 0";
    $result2 = mysql_query($SQL2);
    if (!$result2) {
        die('Invalid query: ' . mysql_error());
    } else {
        $available = mysql_num_rows($result2);
        //Nothing to start if every site is already locked
        for ($i = 0; $i < $threads && $available > 0; $i++) {
            //rand() is inclusive at both ends, so the last valid row is $available - 1
            $rowNum = rand(0, $available - 1);
            mysql_data_seek($result2, $rowNum);
            $tmp = mysql_fetch_row($result2);

            $url = parse_url($tmp[0]);
            //Quote both arguments so an odd URL cannot break the shell command
            $cmd = "screen -A -m -d -S " . escapeshellarg($url['host'] . "_phpdig")
                 . " php -f spider.php " . escapeshellarg($tmp[0]);
            system($cmd);
        }
    }
}

mysql_close($link);

?>

Hi. I copied the code here for ease of reading. One thing I noticed is that $rowNum is random, so it might be possible to get the same $rowNum more than once in the loop.
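
One way to avoid picking the same row twice would be to shuffle the unlocked URLs and take the first few. This is only a minimal sketch, assuming the URLs have already been fetched into a plain array; `pick_distinct_sites` and the sample array are hypothetical and are not part of wrapper.php or phpDig.

```php
<?php
// Hypothetical helper (not part of phpDig): return up to $count distinct
// entries from $urls, in random order.
function pick_distinct_sites(array $urls, $count)
{
    // Drop repeated URLs and reindex so shuffle() works on a clean list.
    $urls = array_values(array_unique($urls));
    // Random order; every URL now appears exactly once.
    shuffle($urls);
    // Take at most $count of them (fewer if the list is shorter).
    return array_slice($urls, 0, max(0, (int) $count));
}

// Sample data standing in for the result of
// "select site_url from sites where locked = 0".
$unlocked = array(
    'http://freebsd.org/',
    'http://openbsd.org/',
    'http://seclists.org/',
);

foreach (pick_distinct_sites($unlocked, 2) as $siteUrl) {
    echo $siteUrl, "\n";
}
```

Because the list is shuffled once and then sliced, no URL can be handed to two crawlers in the same run, and asking for more crawlers than there are unlocked sites simply returns every site once.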

xibalba 03-14-2004 08:11 PM

fix
 
I made the following fix.

PHP Code:

$rowNum = -1;
    for($i = 0; $i < $threads; $i++)
    {

       $tmp = rand(0, mysql_num_rows($result2));
       if($tmp == $rowNum);
        $threads--;


marb 03-15-2004 07:59 AM

I think it's a nice option.
But where must the file be installed?
And how do I get it to work?

Marten :)

xibalba 03-15-2004 08:37 AM

how to work?
 
Hey, I usually just run it in the same directory as spider.php

%pwd
/usr/home/reza/public_html/phpdig/admin
%php -f wrapper.php
screen -A -m -d -S freebsd.org_phpdig php -f spider.php http://freebsd.org/
screen -A -m -d -S openbsd.org_phpdig php -f spider.php http://openbsd.org/

%screen -list
There are screens on:
13219.daily.daemonnews.org_phpdig (Detached)
10700.freebsd.org_phpdig (Detached)
13241.openbsd.org_phpdig (Detached)
88053.staff.daemonnews.org_phpdig (Detached)
88057.seclists.org_phpdig (Detached)
6 Sockets in /tmp/screens/S-reza.

%

and you can add it to crontab to have it run whenever you want

jmitchell 01-06-2005 06:18 PM

I guess I'm not totally understanding how this works...

If I'm correct, you install this script, and run it via cron jobs, and it will see if there are sites to be indexed, and add multiple spiders to handle it - right?

jmitchell

jmitchell 01-06-2005 06:21 PM

and, I don't see anything at this link - http://rbhs.ath.cx/~reza/phpdig/wrapper.php

jmitchell


All times are GMT -8.