JWSmythe
05-12-2004, 11:14 PM
I hope this question isn't terribly redundant. If it is, feel free to point me towards a message with the right answer. : ) (I know it'll happen anyways)
I just ran across PhpDig, and am very impressed with it. I've taken a few attempts at writing my own search engines over the years, and have never been very pleased with my results. We're using ht://dig with another of our sites, and I'm not exactly pleased with it. For this application, it was pretty much unusable.
And before anyone freaks out, I'm going to have some porn site names in this message. I'm not advertising. Don't go to them if you don't want to. I'm honestly looking for information.
proadult.com has something like 60,000 sites listed with it. Our search is rather archaic. So, I'm giving PhpDig a shot.
With that many sites, I may spend the rest of my life waiting for all the sites to spider. So I decided to have a shot at having multiple spiders running at once. It seems to be working, I see plenty of results showing up in the search, but I want to be sure to get all the sites in there.
The search page is at http://search.proadult.com . No one is looking at it yet. Other than a few people on staff, and readers here, no one knows about it yet, so if I really messed anything up, there's no real harm in dumping the whole database and starting over.
Right now, the admin page shows:
DataBase status
Hosts : 3118 Entries
Pages : 1751 Entries
Index : 156759 Entries
Keywords : 15533 Entries
Temporary table : 19999 Entries
I put these scripts together to split up the spidering tasks.. This first script takes all the sites, and makes $searchthreads (20) lists to work with.
--- collect.masterlist.pl
#!/usr/bin/perl
use DBI;
$searchthreads = "20";
$masterlist = "/host/users/search.proadult.com/masterlist/masterlist.txt";
$spiderbase = "/host/users/search.proadult.com/masterlist/spiderfile";
system (`rm $spiderbase*.list`);
$source_db = DBI->connect("DBI:mysql:x:x", x, 'x') || die "$!";
$source_query = "SELECT blah
FROM blah
WHERE ACTIVE = 1";
$source = $source_db->prepare("$source_query") || die "error on source prepare\n";
$source->execute || print "Error on source execute\n";
#pushing into an array, so we have something to play with.
while (@curarray = $source->fetchrow_array){
$curline = $curarray[0];
push @sitesarray, $curline;
};
open (OUT, ">$masterlist.new.test");
#open all our spider files
$spidercount = 0;
while ($spidercount < $searchthreads){
$spidercount++;
open ("SPIDER$spidercount", ">$spiderbase.$spidercount.list");
};
$spidercount = 0;
foreach $curline (@sitesarray){
print OUT "$curline\n";
$spidercount++;
$curoutfile = "SPIDER$spidercount";
print $curoutfile "$curline\n";
if ($spidercount >= $searchthreads){
$spidercount = 0;
};
};
close (OUT);
--- end collect.masterlist
Then I run runspiders.sh, which gets the ball rolling. It assigns each list to another task, which does the spidering.
--- begin runspiders.sh
#!/bin/tcsh
cd /host/users/search.proadult.com/masterlist
rm /host/users/search.proadult.com/masterlist/logs/spider.log
foreach x (`ls -1 spiderfile*`)
cd /host/users/search.proadult.com/htdocs
admin/spider.php /host/users/search/proadult.com/masterlist/$x > /host/users/search.proadult.com/masterlist/log/$x.log\n"
/host/users/search.proadult.com/masterlist/spidertask.sh $x &
end
--- end runspiders.sh
This is the real meat of it. There are $searchthreads (20) of these running right now.
--- begin spidertask.sh
#!/bin/tcsh
echo "Running list $1";
foreach x (`cat /host/users/search.proadult.com/masterlist/$1`)
/usr/local/bin/php -f admin/spider.php $x >> /host/users/search.proadult.com/masterlist/log/spider.log
end
--- end spidertask.sh
Our site structure is kind of funny. There is a mixture of sites that live under http://[something].domain.com/[userdirectory] , and others that are under their own domains. 99% of these are on our servers, so there's no harm in beating hard on the server.
I'll see a whole bunch of what look like good runs roll by, then I'll see a few screenfulls of this:
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
Is this bad?? Does it mean that those tasks aren't being allowed to run, or are they just lingering til another task completes?
On one of the previous attempts, I had 20 tasks running long lists of sites (2000+ URL's per list), which never actually added anything to the search, but I think they were dumped into the temporary table.
Is there a better way to do this?
I saw someone else had code for reducing the locks, and I've already added that to spider.php, which may or may not have been the right thing to do..
In the time it took to write this message, the stats got up to:
Hosts : 3118 Entries
Pages : 2125 Entries
Index : 197074 Entries
Keywords : 16600 Entries
Temporary table : 20573 Entries
If the locks are ok, can I crank up the number of simultanious tasks the spider machine starts running low on memory or CPU time, or is there a practical limitation I may run into first?
Thanks in advance for your time and advice.
I just ran across PhpDig, and am very impressed with it. I've taken a few attempts at writing my own search engines over the years, and have never been very pleased with my results. We're using ht://dig with another of our sites, and I'm not exactly pleased with it. For this application, it was pretty much unusable.
And before anyone freaks out, I'm going to have some porn site names in this message. I'm not advertising. Don't go to them if you don't want to. I'm honestly looking for information.
proadult.com has something like 60,000 sites listed with it. Our search is rather archaic. So, I'm giving PhpDig a shot.
With that many sites, I may spend the rest of my life waiting for all the sites to spider. So I decided to have a shot at having multiple spiders running at once. It seems to be working, I see plenty of results showing up in the search, but I want to be sure to get all the sites in there.
The search page is at http://search.proadult.com . No one is looking at it yet. Other than a few people on staff, and readers here, no one knows about it yet, so if I really messed anything up, there's no real harm in dumping the whole database and starting over.
Right now, the admin page shows:
DataBase status
Hosts : 3118 Entries
Pages : 1751 Entries
Index : 156759 Entries
Keywords : 15533 Entries
Temporary table : 19999 Entries
I put these scripts together to split up the spidering tasks.. This first script takes all the sites, and makes $searchthreads (20) lists to work with.
--- collect.masterlist.pl
#!/usr/bin/perl
use DBI;
$searchthreads = "20";
$masterlist = "/host/users/search.proadult.com/masterlist/masterlist.txt";
$spiderbase = "/host/users/search.proadult.com/masterlist/spiderfile";
system (`rm $spiderbase*.list`);
$source_db = DBI->connect("DBI:mysql:x:x", x, 'x') || die "$!";
$source_query = "SELECT blah
FROM blah
WHERE ACTIVE = 1";
$source = $source_db->prepare("$source_query") || die "error on source prepare\n";
$source->execute || print "Error on source execute\n";
#pushing into an array, so we have something to play with.
while (@curarray = $source->fetchrow_array){
$curline = $curarray[0];
push @sitesarray, $curline;
};
open (OUT, ">$masterlist.new.test");
#open all our spider files
$spidercount = 0;
while ($spidercount < $searchthreads){
$spidercount++;
open ("SPIDER$spidercount", ">$spiderbase.$spidercount.list");
};
$spidercount = 0;
foreach $curline (@sitesarray){
print OUT "$curline\n";
$spidercount++;
$curoutfile = "SPIDER$spidercount";
print $curoutfile "$curline\n";
if ($spidercount >= $searchthreads){
$spidercount = 0;
};
};
close (OUT);
--- end collect.masterlist
Then I run runspiders.sh, which gets the ball rolling. It assigns each list to another task, which does the spidering.
--- begin runspiders.sh
#!/bin/tcsh
cd /host/users/search.proadult.com/masterlist
rm /host/users/search.proadult.com/masterlist/logs/spider.log
foreach x (`ls -1 spiderfile*`)
cd /host/users/search.proadult.com/htdocs
admin/spider.php /host/users/search/proadult.com/masterlist/$x > /host/users/search.proadult.com/masterlist/log/$x.log\n"
/host/users/search.proadult.com/masterlist/spidertask.sh $x &
end
--- end runspiders.sh
This is the real meat of it. There are $searchthreads (20) of these running right now.
--- begin spidertask.sh
#!/bin/tcsh
echo "Running list $1";
foreach x (`cat /host/users/search.proadult.com/masterlist/$1`)
/usr/local/bin/php -f admin/spider.php $x >> /host/users/search.proadult.com/masterlist/log/spider.log
end
--- end spidertask.sh
Our site structure is kind of funny. There is a mixture of sites that live under http://[something].domain.com/[userdirectory] , and others that are under their own domains. 99% of these are on our servers, so there's no harm in beating hard on the server.
I'll see a whole bunch of what look like good runs roll by, then I'll see a few screenfulls of this:
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
Is this bad?? Does it mean that those tasks aren't being allowed to run, or are they just lingering til another task completes?
On one of the previous attempts, I had 20 tasks running long lists of sites (2000+ URL's per list), which never actually added anything to the search, but I think they were dumped into the temporary table.
Is there a better way to do this?
I saw someone else had code for reducing the locks, and I've already added that to spider.php, which may or may not have been the right thing to do..
In the time it took to write this message, the stats got up to:
Hosts : 3118 Entries
Pages : 2125 Entries
Index : 197074 Entries
Keywords : 16600 Entries
Temporary table : 20573 Entries
If the locks are ok, can I crank up the number of simultanious tasks the spider machine starts running low on memory or CPU time, or is there a practical limitation I may run into first?
Thanks in advance for your time and advice.