PhpDig.net

Old 03-14-2004, 12:47 PM   #1
xibalba
Green Mole
 
Join Date: Mar 2004
Posts: 9
multiple crawlers

Does anyone think this is a good way to get phpDig to run multiple crawlers on sites from the database? When I ran spider.php, I noticed that it would only do one site at a time, and when it came across a site such as dmoz.org it would take days, if not weeks, to index all of it.

http://rbhs.ath.cx/~reza/phpdig/wrapper.php
Old 03-14-2004, 03:54 PM   #2
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
PHP Code:
<?php
//Uncomment these two lines to display this file's source instead of running it.
//show_source("wrapper.php");
//exit;
error_reporting(E_ERROR);

$db = "phpdig";
$threads = 6;

//Make connection
$link = mysql_connect('localhost', 'root', '');
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

//Select DB
$db_selected = mysql_select_db($db, $link);
if (!$db_selected) {
    die("Can't use $db: " . mysql_error());
}

//How many sites are currently being crawled?
$SQL = "select locked from sites where locked = 1";

$result = mysql_query($SQL);
if (!$result) {
    die('Invalid query: ' . mysql_error());
}

$num_rows = mysql_num_rows($result);

//If the number of sites currently being crawled has reached the number of threads we want, exit.
if ($num_rows >= $threads) {
    //Drop out, we don't need to initiate any more crawlers
    echo "Doing nothing...";
    exit;
} else {
    //The number of threads we should start
    $threads = $threads - $num_rows;

    //Find sites currently not being crawled
    $SQL2 = "select site_url from sites where locked = 0";
    $result2 = mysql_query($SQL2);
    if (!$result2) {
        die('Invalid query: ' . mysql_error());
    }
    $total = mysql_num_rows($result2);
    for ($i = 0; $i < $threads && $total > 0; $i++) {
        //rand() is inclusive at both ends, so the last valid row is $total - 1
        $rowNum = rand(0, $total - 1);
        mysql_data_seek($result2, $rowNum);
        $tmp = mysql_fetch_row($result2);

        $url = parse_url($tmp[0]);
        //Quote both arguments so a URL cannot inject shell syntax
        $cmd = "screen -A -m -d -S " . escapeshellarg($url['host'] . "_phpdig")
             . " php -f spider.php " . escapeshellarg($tmp[0]);
        system($cmd);
    }
}

mysql_close($link);

?>
Hi. I copied the code here for ease of reading. One thing I noticed is that $rowNum is random, so the same $rowNum could come up more than once in the loop.
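One way to avoid picking the same site twice, sketched under assumptions: fetch the unlocked URLs into an array first, then shuffle and take the first few. The helper name `pick_distinct_sites` and the sample URLs are hypothetical and are not part of PhpDig or the posted wrapper; only standard PHP array functions are used.

```php
<?php
// Hypothetical helper, not part of PhpDig or the posted wrapper.
// Returns up to $count distinct entries from $urls in random order.
function pick_distinct_sites(array $urls, $count)
{
    // Drop repeated URLs and reindex so shuffle() sees a clean list.
    $urls = array_values(array_unique($urls));
    shuffle($urls);
    // array_slice() simply returns fewer items when the list is short.
    return array_slice($urls, 0, max(0, $count));
}

// Made-up sample data standing in for the "locked = 0" query result.
$unlocked = array('http://freebsd.org/', 'http://openbsd.org/', 'http://seclists.org/');
$picked = pick_distinct_sites($unlocked, 2);
echo count($picked), " sites picked\n";
```

Because each URL appears at most once in the shuffled list, no site can be handed to two crawlers in the same run, and asking for more sites than exist just returns all of them.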
Old 03-14-2004, 08:11 PM   #3
xibalba
Green Mole
 
Join Date: Mar 2004
Posts: 9
fix

I made the following fix.

PHP Code:
$rowNum = -1;
    for($i = 0; $i < $threads; $i++)
    {
       $tmp = rand(0, mysql_num_rows($result2) - 1);
       //Same row as the previous pick: start one crawler fewer
       if($tmp == $rowNum)
        $threads--;
Old 03-15-2004, 07:59 AM   #4
marb
Green Mole
 
Join Date: Mar 2004
Posts: 19
I think it's a nice option.
But where should the file be installed?
And how do I get it to work?

Marten
Old 03-15-2004, 08:37 AM   #5
xibalba
Green Mole
 
Join Date: Mar 2004
Posts: 9
how to work?

Hey, I usually just run it in the same directory as spider.php

%pwd
/usr/home/reza/public_html/phpdig/admin
%php -f wrapper.php
screen -A -m -d -S freebsd.org_phpdig php -f spider.php http://freebsd.org/
screen -A -m -d -S openbsd.org_phpdig php -f spider.php http://openbsd.org/

%screen -list
There are screens on:
13219.daily.daemonnews.org_phpdig (Detached)
10700.freebsd.org_phpdig (Detached)
13241.openbsd.org_phpdig (Detached)
88053.staff.daemonnews.org_phpdig (Detached)
88057.seclists.org_phpdig (Detached)
6 Sockets in /tmp/screens/S-reza.

%

You can also add it to crontab to have it run whenever you want.
Old 01-06-2005, 06:18 PM   #6
jmitchell
Orange Mole
 
Join Date: Dec 2004
Location: Tennessee
Posts: 60
I guess I'm not totally understanding how this works...

If I'm correct, you install this script, and run it via cron jobs, and it will see if there are sites to be indexed, and add multiple spiders to handle it - right?

jmitchell
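The behaviour described above can be sketched as a small calculation: count the sites already locked, compare that with the crawler limit, and start only as many new spiders as there are free slots and waiting sites. This is a simplified reading of the wrapper posted earlier in the thread, not its exact arithmetic; the function name `crawlers_to_start` and its parameters are hypothetical.

```php
<?php
// Hypothetical helper that mirrors the wrapper's decision; the name and
// parameters are illustrative and do not come from PhpDig.
function crawlers_to_start($limit, $locked, $unlocked)
{
    // Free slots: the configured limit minus crawls already running.
    $free = max(0, $limit - $locked);
    // Never start more crawlers than there are sites waiting.
    return min($free, $unlocked);
}

echo crawlers_to_start(6, 4, 10), "\n"; // 2 free slots, plenty of sites
echo crawlers_to_start(6, 6, 10), "\n"; // limit reached, start nothing
echo crawlers_to_start(6, 0, 3), "\n";  // only 3 sites are waiting
```

Run from cron, each invocation would top the running crawlers back up to the limit and do nothing when the limit is already reached.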
Old 01-06-2005, 06:21 PM   #7
jmitchell
Orange Mole
 
Join Date: Dec 2004
Location: Tennessee
Posts: 60
Also, I don't see anything at this link: http://rbhs.ath.cx/~reza/phpdig/wrapper.php

jmitchell