Problem with indexing from text file


bloodjelly
04-12-2004, 05:28 PM
Hi -

I'm using the hack I made to limit the number of links spidered (this post (http://www.phpdig.net/showthread.php?s=&threadid=300)), so this might be a problem of my own making, and if it is, sorry. :angel:

Anyway, when I try to index a URI through a text file, PhpDig spiders it, limiting the links to the number I set, but then goes on to spider the rest of the already-indexed URIs in the DB without the limit. Right now, PhpDig seems to be stuck in a loop spidering wetcanvas.com (logfile: http://www.tjhdesign.com/spider.log). This makes me think it's almost certainly because of an error in my code changes from the post linked above.

Regardless of my code changes, though, should PhpDig re-spider URIs outside of the ones given to it by the text file? Any idea why it's looping? The limit code works very well when spidering through a browser; it just seems to have problems when using a text file. Thanks :)

bloodjelly
04-13-2004, 09:39 AM
Well, I searched before I posted, but apparently not well enough. It seems the answer to the text-file question has to do with the tempspider table not being emptied between runs. I'll try that, but I still don't know why it's looping. Hmm...

Charter
04-13-2004, 07:59 PM
Hi. I looked at http://www.tjhdesign.com/spider.log but I'm not sure what I'm supposed to see. Is it that the wetcanvas.com index results show up twice, including the stuff from the robots.txt file?

bloodjelly
04-13-2004, 10:48 PM
Thanks -

Yes, you'll see that it says indexing is complete after 300 links and then starts over, and it keeps indexing indefinitely after that, rather than adhering to the limit I set. This only seems to happen when I spider from a text file of URLs. I did make modifications, though, so maybe I should start from scratch or try the mod that limits links; it's probably better than mine.

Charter
04-13-2004, 10:53 PM
Hi. Which code change are you using from this (http://www.phpdig.net/showthread.php?threadid=300) thread?

bloodjelly
04-13-2004, 10:58 PM
My code - the last two posts. I also changed this line of robot_functions.php:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))

to this:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))

I removed the ! in front of isset($common_words[$key]) so that only words in that file get indexed, for a project I'm doing.
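(To show what that change does in isolation - just a throwaway standalone example, not PhpDig code, with a made-up word list: with the ! removed, the common_words check acts as a whitelist instead of a blacklist.)

<?php
// Standalone sketch only - the word list and keys here are hypothetical.
// With isset() instead of !isset(), only words that appear in the
// common_words array pass the check and get indexed.
$common_words = array('paint' => 1, 'canvas' => 1);
$keys = array('paint', 'easel', 'canvas', 'brush');
foreach ($keys as $key) {
    if (isset($common_words[$key])) {
        echo "indexed: $key\n"; // prints only 'paint' and 'canvas'
    }
}
?>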

Charter
04-14-2004, 03:32 AM
Hi. I've figured it out; here's the deal.

Short version: Empty the tempspider table between runs.

Long version: When wetcanvas.com was previously indexed, level two was reached before the cap was met. Once the cap was met, the remaining level-two-plus links were not deleted from the tempspider table. Afterwards, graphics.com was indexed using a file, and because there was level-two-plus data left over from wetcanvas.com, and possibly info from other sites, the $list_sites variable in spider.php was set to array(graphics.com info, wetcanvas.com info,...,wetcanvas.com info, domain.com info,...,domain.com info).

Because of the join query used to form the $list_sites variable, the wetcanvas.com info was entered into the array once for every wetcanvas.com level-two-plus link remaining in the tempspider table. That is why your logfile shows 'no links, level 1..., no links, level 2..., links' for wetcanvas.com, and why the wetcanvas.com index starts over, indexing links from the tempspider table.

If the tempspider table had been empty between runs, the join query would have produced just array(graphics.com info) for the $list_sites variable, assuming graphics.com was the only site in the file. Just be sure to empty the tempspider table between runs and you should be fine. To empty the tempspider table from the admin panel, click the delete button without selecting a site.
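If you'd rather clear it from a script or a cron job instead of the admin panel, a plain DELETE on the table does the same thing. Untested sketch - replace phpdig_ with whatever PHPDIG_DB_PREFIX is set to in your config:

DELETE FROM phpdig_tempspider;

or from PHP, once the PhpDig connection/config include has been loaded:

// Sketch only - assumes PHPDIG_DB_PREFIX is defined and a MySQL link is open.
mysql_query("DELETE FROM ".PHPDIG_DB_PREFIX."tempspider");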

bloodjelly
04-14-2004, 02:08 PM
Awesome, Charter, nice detective work. I have many websites to crawl, though - won't this cause a problem, since the tempspider table will never really be empty when I have 5 instances of the spider running at once?

Charter
04-15-2004, 01:32 PM
Hi. Part of this could be solved by adding DISTINCT to the following query in the spider.php file (or by rewriting it as a join query):

$query = "SELECT DISTINCT(".PHPDIG_DB_PREFIX."sites.site_id),".PHPDIG_DB_PREFIX."sites.site_url,"
.PHPDIG_DB_PREFIX."sites.username as user,".PHPDIG_DB_PREFIX."sites.password as pass,"
.PHPDIG_DB_PREFIX."sites.port FROM ".PHPDIG_DB_PREFIX."sites,".PHPDIG_DB_PREFIX."tempspider WHERE "
.PHPDIG_DB_PREFIX."sites.site_id = ".PHPDIG_DB_PREFIX."tempspider.site_id";

This should make it so that if file1 contains domainA and domainB, the bot1 array will only contain one instance of each domain. I say partly solved because once bot1 runs on domainA and domainB, there will be leftover rows in the tempspider table, so when bot2 runs file2 containing domainC and domainD, the bot2 array will be domainA, domainB, domainC, domainD.

I suppose AND ".PHPDIG_DB_PREFIX."sites.locked = 0" could be added to the WHERE part of the above query, but that still doesn't guarantee unique arrays across bots unless you make sure that each bot gets a chance to lock its sites before the next bot is fired up, and before any of those bots unlock their sites. Even then, the tempspider table would still need to be emptied after all bots are done.
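For reference, with that clause tacked onto the query above, it would look something like this (untested sketch):

$query = "SELECT DISTINCT(".PHPDIG_DB_PREFIX."sites.site_id),".PHPDIG_DB_PREFIX."sites.site_url,"
.PHPDIG_DB_PREFIX."sites.username as user,".PHPDIG_DB_PREFIX."sites.password as pass,"
.PHPDIG_DB_PREFIX."sites.port FROM ".PHPDIG_DB_PREFIX."sites,".PHPDIG_DB_PREFIX."tempspider WHERE "
.PHPDIG_DB_PREFIX."sites.site_id = ".PHPDIG_DB_PREFIX."tempspider.site_id"
." AND ".PHPDIG_DB_PREFIX."sites.locked = 0";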

bloodjelly
04-19-2004, 03:56 PM
Thanks, Charter - I added the code with the AND locked = 0 clause, but it didn't seem to work. In fact, now I'm getting runaway spidering with every method, even after I set the spider code back to normal and started with an empty tempspider table. I have a feeling my other tweaks are the cause of the problem. Really, the only three things I want to do are:

1) limit the number of links found per site (done)

2) invert the common_words check so that only words in that file are indexed (also done)

3) perform updates using multiple spiders to index the whole lot faster, with the link limit in place for updates as well (buggy - text file or not)

I think maybe I'll start over, then make changes slowly, one at a time, and see when the problem starts. I appreciate the help, though - awesome customer service you can't even get when you pay people for it! :D