Big dig database - spidering question [Archive]

JWSmythe

05-12-2004, 11:14 PM

I hope this question isn't terribly redundant. If it is, feel free to point me towards a message with the right answer. : ) (I know it'll happen anyways)

I just ran across PhpDig, and am very impressed with it. I've taken a few attempts at writing my own search engines over the years, and have never been very pleased with my results. We're using ht://dig with another of our sites, and I'm not exactly pleased with it. For this application, it was pretty much unusable.

And before anyone freaks out, I'm going to have some porn site names in this message. I'm not advertising. Don't go to them if you don't want to. I'm honestly looking for information.

proadult.com has something like 60,000 sites listed with it. Our search is rather archaic. So, I'm giving PhpDig a shot.

With that many sites, I may spend the rest of my life waiting for all the sites to spider. So I decided to have a shot at having multiple spiders running at once. It seems to be working: I see plenty of results showing up in the search, but I want to be sure to get all the sites in there.

The search page is at http://search.proadult.com . No one is looking at it yet. Other than a few people on staff, and readers here, no one knows about it yet, so if I really messed anything up, there's no real harm in dumping the whole database and starting over.

Right now, the admin page shows:

DataBase status
Hosts : 3118 Entries
Pages : 1751 Entries
Index : 156759 Entries
Keywords : 15533 Entries
Temporary table : 19999 Entries

I put these scripts together to split up the spidering tasks. This first script takes all the sites, and makes $searchthreads (20) lists to work with.

--- collect.masterlist.pl
#!/usr/bin/perl

use DBI;

$searchthreads = "20";
$masterlist = "/host/users/search.proadult.com/masterlist/masterlist.txt";
$spiderbase = "/host/users/search.proadult.com/masterlist/spiderfile";
# remove the lists left over from the previous run
system ("rm -f $spiderbase*.list");

$source_db = DBI->connect("DBI:mysql:x:x", x, 'x') || die $DBI::errstr;

$source_query = "SELECT blah
FROM blah
WHERE ACTIVE = 1";

$source = $source_db->prepare("$source_query") || die "error on source prepare\n";
$source->execute || print "Error on source execute\n";

#pushing into an array, so we have something to play with.
while (@curarray = $source->fetchrow_array){
$curline = $curarray[0];
push @sitesarray, $curline;
};

open (OUT, ">$masterlist.new.test");

#open all our spider files
$spidercount = 0;

while ($spidercount < $searchthreads){
$spidercount++;
open ("SPIDER$spidercount", ">$spiderbase.$spidercount.list");
};

$spidercount = 0;
foreach $curline (@sitesarray){
print OUT "$curline\n";
$spidercount++;
$curoutfile = "SPIDER$spidercount";
print $curoutfile "$curline\n";
if ($spidercount >= $searchthreads){
$spidercount = 0;
};
};
close (OUT);
--- end collect.masterlist

Then I run runspiders.sh, which gets the ball rolling. It assigns each list to another task, which does the spidering.

--- begin runspiders.sh
#!/bin/tcsh

cd /host/users/search.proadult.com/masterlist

rm -f /host/users/search.proadult.com/masterlist/log/spider.log

foreach x (`ls -1 spiderfile*`)
cd /host/users/search.proadult.com/htdocs
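# Disabled: the unmatched quote stops tcsh from parsing this line, and spidertask.sh below already runs the spider.
# admin/spider.php /host/users/search.proadult.com/masterlist/$x > /host/users/search.proadult.com/masterlist/log/$x.log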

/host/users/search.proadult.com/masterlist/spidertask.sh $x &

end
--- end runspiders.sh

This is the real meat of it. There are $searchthreads (20) of these running right now.

--- begin spidertask.sh
#!/bin/tcsh

echo "Running list $1";

foreach x (`cat /host/users/search.proadult.com/masterlist/$1`)
/usr/local/bin/php -f admin/spider.php $x >> /host/users/search.proadult.com/masterlist/log/spider.log
end
--- end spidertask.sh

Our site structure is kind of funny. There is a mixture of sites that live under http://[something].domain.com/[userdirectory] , and others that are under their own domains. 99% of these are on our servers, so there's no harm in beating hard on the server.

I'll see a whole bunch of what look like good runs roll by, then I'll see a few screenfuls of this:

*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*
*http://fans.domain.com/ Locked*

Is this bad? Does it mean that those tasks aren't being allowed to run, or are they just lingering until another task completes?

On one of the previous attempts, I had 20 tasks running long lists of sites (2000+ URLs per list), which never actually added anything to the search, but I think they were dumped into the temporary table.

Is there a better way to do this?

I saw someone else had code for reducing the locks, and I've already added that to spider.php, which may or may not have been the right thing to do.

In the time it took to write this message, the stats got up to:

Hosts : 3118 Entries
Pages : 2125 Entries
Index : 197074 Entries
Keywords : 16600 Entries
Temporary table : 20573 Entries

If the locks are ok, can I crank up the number of simultaneous tasks until the spider machine starts running low on memory or CPU time, or is there a practical limitation I may run into first?

Thanks in advance for your time and advice.

vinyl-junkie

05-13-2004, 03:28 AM

Welcome to the forum, JWSmythe. :D

Check out this thread (http://www.phpdig.net/showthread.php?s=&threadid=357) for an approach to spidering your site.

bloodjelly

05-13-2004, 11:13 AM

I think it is a problem right now running multiple instances of spider.php. I'm trying to do the same thing, and I used the wrapper in this thread: http://www.phpdig.net/showthread.php?threadid=662 but it still has problems. Hopefully (hint hint) charter will make it easy to do this in the next version.:)

JWSmythe

05-14-2004, 10:52 PM

Thanks for the replies.

I just let it run, and it didn't seem to break anything. I really have no clue if it got all of the sites. I kinda sorta accidentally killed the tasks, after like 3 days, but now the stats are:

Hosts : 3118 Entries
Pages : 31824 Entries
Index : 2421863 Entries
Keywords : 183779 Entries
Temporary table : 162490 Entries

I may change the threading scheme so that it simply monitors how many tasks are running. That would give me a definite stopping point, instead of only knowing that it got partway through each of many files. Then I can add a restart, so if I kill it again, it can resume from the last record run rather than starting from the top. I'll post what I do here, so if anyone's sadistic (err, interested..), they can play with it a bit, and try it themselves. :)
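For anyone who wants to experiment before then, here is a rough sketch of that scheme. It is plain sh rather than the tcsh used above, and it is untested against PhpDig itself. The function names (pending_urls, spider_pool) and the "done file" are made up for illustration. It also assumes an xargs that supports -P (GNU and BSD xargs do), and that the spider command exits with status 0 when a site was handled, which I have not verified for spider.php.

```shell
#!/bin/sh
# Sketch only: pending_urls, spider_pool and the done file are hypothetical
# helpers, not part of PhpDig.

# Print the URLs in LIST that are not yet recorded in DONE_FILE.
pending_urls() {
    list=$1
    done_file=$2
    touch "$done_file"
    # -v invert match, -x whole lines only, -F fixed strings, -f patterns from file
    grep -vxF -f "$done_file" "$list"
}

# Run CMD once per pending URL, at most MAX_JOBS at a time. A URL is appended
# to DONE_FILE only after CMD exits successfully, so a killed run can be
# restarted and will skip whatever already finished.
spider_pool() {
    list=$1
    done_file=$2
    max_jobs=$3
    shift 3
    pending_urls "$list" "$done_file" |
        DONE_FILE=$done_file xargs -P "$max_jobs" -I '{}' \
            sh -c 'url=$1; shift; "$@" "$url" && echo "$url" >> "$DONE_FILE"' sh '{}' "$@"
}

# Example call (paths as in the scripts above, run from the htdocs directory):
#   spider_pool /host/users/search.proadult.com/masterlist/masterlist.txt \
#       /host/users/search.proadult.com/masterlist/spider.done 20 \
#       /usr/local/bin/php -f admin/spider.php
```

With this layout there is one master list instead of 20 split files, and the number of simultaneous tasks is just the third argument, so it can be raised or lowered between runs without regenerating anything.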

I made an interesting change, displaying thumbnails of the resulting pages, which looks rather nice, or at least I think.. I'll start another thread to put that code into.