QUESTION: How to Spider Multiple URLs, Not Just One at a Time


2wheelin
05-22-2004, 06:00 PM
Is there any way to spider multiple URLs instead of all the pages in a single URL?
:confused: :bang:
I need to index subject-specific web sites, but only the top (index) page of each site. If phpDig can do this, please tell me how or, better yet, direct me to an existing thread with this info.

THANKS!

bloodjelly
05-23-2004, 05:07 PM
Multiple spidering seems to be one of the most requested features. As of right now, to do what you want, you could try the wrapper on this board (search "wrapper") together with the mod I posted that limits the total number of links per site (search "limit"). Charter's current 1.8.1 alpha version apparently can do this as well, but running multiple spiders at once isn't supported yet.
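
For anyone wondering what the "wrapper plus link limit" idea amounts to, here is a rough sketch in Python. It is not the wrapper or the mod from this board and it does not call phpDig at all. The names top_pages, build_jobs and max_links are made up for illustration. It only shows how a list of URLs could be reduced to one top page per site before being handed to a spider.

```python
from urllib.parse import urlsplit


def top_pages(urls):
    """Reduce each URL to its site root and drop duplicates, keeping order."""
    seen = set()
    roots = []
    for url in urls:
        parts = urlsplit(url.strip())
        if not parts.scheme or not parts.netloc:
            continue  # skip entries that are not absolute URLs
        root = f"{parts.scheme}://{parts.netloc.lower()}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


def build_jobs(urls, max_links_per_site=1):
    """Pair every site root with a per-site link limit (hypothetical job format)."""
    return [{"url": root, "max_links": max_links_per_site}
            for root in top_pages(urls)]


if __name__ == "__main__":
    for job in build_jobs(["http://Example.com/a/b.html",
                           "http://example.com/",
                           "https://example.org/x"]):
        print(job)
```

Each job could then be passed to whatever actually runs the spider, one site at a time.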

2wheelin
05-24-2004, 06:53 AM
Thanks bloodjelly!

I'll check it out.

misterbearcom
06-13-2004, 08:31 PM
I was wondering if you could simply create several copies of phpDig and run them simultaneously as separate spiders, but, before you start digging, set the site_ids and spider_ids in each copy to start at numbers that will not overlap each other.

So, if you had database one with the following:

site_id starts at 1
spider_id starts at 1

Then database 2:

site_id starts at 10,000,000.
spider_id starts at 10,000,000.

etc. until you've made as many spiders as you wanted.

Then run them all simultaneously.

Afterwards, dump the info for all of these databases into one final version of the phpdig database and transfer all the txt_content .txt files into the final site folder.
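
To make the idea concrete, here is a minimal sketch in Python of the ID ranges and the merge step. It is untested against phpDig itself. The block size of 10,000,000 comes from the example above, the function names id_start and merge are made up, and each "database" is just a dict of site_id to URL standing in for the real tables.

```python
ID_BLOCK = 10_000_000  # size of each copy's private ID range (from the example above)


def id_start(copy_index, block=ID_BLOCK):
    """First ID for a copy: copy 0 starts at 1, copy 1 at 10,000,000, and so on."""
    return 1 if copy_index == 0 else copy_index * block


def merge(databases):
    """Merge per-copy {site_id: url} tables, refusing any overlapping ID."""
    merged = {}
    for table in databases:
        for site_id, url in table.items():
            if site_id in merged:
                raise ValueError(f"site_id {site_id} appears in more than one database")
            merged[site_id] = url
    return merged


if __name__ == "__main__":
    # Two stand-in databases, each numbering its sites from its own start value.
    db1 = {id_start(0) + i: u for i, u in enumerate(["http://a.example/", "http://b.example/"])}
    db2 = {id_start(1) + i: u for i, u in enumerate(["http://c.example/"])}
    print(merge([db1, db2]))
```

The merge only stays safe while no copy indexes more sites than fit in its block, which is why the check for a repeated site_id is there.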

Would that work? I haven't personally tried it, so I guess I should try it before suggesting it. In any case, hopefully there will be further modifications to phpDig in the future that will remove this issue.

(My apologies if I sound ignorant. I'm the first one who will probably admit that I am. :D)

bloodjelly
06-13-2004, 10:42 PM
Hi misterbear -

We definitely need a way to run multiple spiders, but I think running separate versions on separate databases is sort of like hiring two slow typists and paying them double instead of just hiring a fast one and paying him the regular wage. What I mean to say is, it doesn't really solve the problem. I think the solution will involve one database and multiple running spider processes that are smart enough to know which sites are already being spidered. In this way, we can limit redundant data and tables, and simplify the process for people who are allowed only one database by their host (for example, freeservers). But thanks for trying to work out solutions!