
Indexing Directories


emcclary
01-17-2004, 10:03 AM
Hi,

I'm trying to spider my website and need to index a lot of pages (easily over 120,0000). Most of the spidering is done, but it's taking longer and longer to index. The site is static (a newspaper archive) and is added to daily. The pages are broken down like this:

www.foo.com/years/xxxx/mmdd/ (xxxx being the year; the issues are in mmdd format, e.g. 0112 for January 12th)

I've told phpdig to spider the site by typing out the year URL, for example www.foo.com/years/2003/

The problems are that a) it always shows up as indexed site www.foo.com in the control panel, not the individual years,
and b) it always wants to look through all the previous years to index.

Do I have to actually create subdomains (i.e. 1996.foo.com) to have separate directories indexed, or is there some other way?

I basically want to make a static search database and don't need to reindex anything but the current day's additions. Thanks in advance if you have any ideas.

Eric McClary
www.recordernews.com

Charter
01-18-2004, 08:39 AM
Hi. What happens when you click a site, click the update button, and then click a green check mark for a specific directory?

emcclary
01-28-2004, 06:07 PM
Sorry about the late reply,

I can't even open that screen (it's too big). I run out of virtual memory on my computer (besides the minimum 15 minutes it takes to open).
Like I said the site is HUUUGGGEEE.

Anyway, I'm playing around with a robots.txt file but that doesn't seem to work. Even though I told it to exclude everything, it still seems to take a look at the pages I already did.

So a couple of questions:
A) What's the excludes table? Can I place parts I don't want reindexed in this table?
B) Is there a way to make it not recheck the stuff I already did?
C) Last but not least, the only solution I can see is multiple installs of phpdig (each with its own database). Of course I don't like this answer, and if I did this, is there a way to have phpdig still search through these databases and give one result page?

I know I'm asking a lot, but I'm hoping there is a solution to searching my huge archaic website.

Thanks

Eric McClary
www.recordernews.com

emcclary
01-28-2004, 06:11 PM
Also, on a quick note: how about I modify all the update fields to some time in the far future (like 2080 or something)? Would that make it skip checking them (i.e. does it only look at items by current date)?

Thanks Again
Eric

emcclary
01-28-2004, 08:12 PM
Just tried using a txt file (via command line) and got the same problem: it updates everything (or at least checks it). I just want to add to the database, not update the database.

Charter
01-29-2004, 07:28 PM
Hi. Perhaps increase LIMIT_DAYS in the config file. Also, you might try version 1.8.0 and a text file via command line, making sure tempspider is empty between runs and that SPIDER_MAX_LIMIT, SPIDER_DEFAULT_LIMIT, and RESPIDER_LIMIT are all set to zero in the config file, so that just the one page gets indexed and no links are followed.
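
For the text-file approach, something like the rough Python sketch below could build the list of new issue URLs each day. It is only an illustration: the www.foo.com/years/yyyy/mmdd/ layout is taken from the first post, while the file name new_issues.txt, the helper names, and the one-URL-per-line format are assumptions, so check them against what your phpdig version expects.

```python
from datetime import date

# Placeholder host from the thread; swap in the real archive root.
BASE_URL = "http://www.foo.com/years"


def issue_url(day: date) -> str:
    """Return the directory URL for one day's issue (year, then mmdd)."""
    return f"{BASE_URL}/{day.year:04d}/{day.month:02d}{day.day:02d}/"


def write_url_list(days, path="new_issues.txt"):
    """Write one issue URL per line (assumed format) and return the URLs."""
    urls = [issue_url(d) for d in days]
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(urls) + "\n")
    return urls


if __name__ == "__main__":
    # Only today's issue goes in the list, so older years are never named.
    for url in write_url_list([date.today()]):
        print(url)
```

With the spider limits at zero, only the URLs in that file should get indexed. Since no links are followed, each article page of the issue would need its own line if the issue directory is only an index page.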