
Information to customize spider


noel
10-30-2005, 04:48 PM
Hello Charter,

I would like to know your opinion on these questions:


1°) I am thinking of indexing about 2,500 sites. Do you think that is realistic, or is it not possible with PhpDig?

2°) If it is possible, how many days would you set before reindexing a site? I would use 30 days; what about you?

3°) In order to do something "realistic", what do you think of these settings:

Search depth: 20
Links per depth: 50 ?? Too many or too few?
I tried unlimited links, but it took too much time to index.

4°) One problem I cannot find an answer to: while it is spidering/crawling, can I add a new link, or do I have to wait until it stops crawling?


5°) Is it possible to run more than one spider with the shell command? If so, what do I have to do?

6°) I have a big problem when it is spidering forums: it keeps finding 100 links that are already indexed, then one new link, then another 100 already-indexed links, then one new link, and so on. What can I do about that? It spends a lot of time for nothing.

7°) When using the shell command, are all of the settings in the config file used by the shell spider? :cool: Sorry for my English ;)



Thank you !

Noël

Charter
11-01-2005, 03:35 PM
1°) I am thinking of indexing about 2,500 sites. Do you think that is realistic, or is it not possible with PhpDig?
I have not had 2500 sites indexed at one time, but check out this (http://www.phpdig.net/forum/showthread.php?t=1794) thread for some numbers.

2°) If it is possible, how many days would you set before reindexing a site? I would use 30 days; what about you?
For the online demo, I leave LIMIT_DAYS at zero, but for a 'real' site I think 30 days is fine. As the number of sites grows, you'll of course want to consider what and when to index.
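As a sketch of where this lives, the reindex interval is the LIMIT_DAYS define mentioned above; the file path and surrounding comments are assumptions based on a typical PhpDig install, so verify against your own config file:

```php
<?php
// includes/config.php (path may vary by install)
// Pages indexed within the last LIMIT_DAYS days are skipped on respider.
// 0 = always reindex (as on the online demo); 30 = wait about a month.
define('LIMIT_DAYS', 30);
```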

3°) In order to do something "realistic", what do you think of these settings:

Search depth: 20
Links per depth: 50 ?? Too many or too few?
I tried unlimited links, but it took too much time to index.
The maximum number of pages found per site is ((depth * links) + 1) when links is greater than zero; for example, a depth of 20 with 50 links per depth yields at most (20 * 50) + 1 = 1001 pages per site. Decide how many pages per site you would like to find, then set depth and links accordingly.
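For reference, a minimal sketch of the corresponding config values; the define names below are recalled from memory of PhpDig's config file and should be checked against your own includes/config.php before use:

```php
<?php
// Assumed define names; verify in your includes/config.php.
define('SPIDER_MAX_LIMIT', 20);  // maximum search depth offered in the admin panel
define('LINKS_MAX_LIMIT', 50);   // maximum links followed per depth level
// Upper bound on pages found per site: (depth * links) + 1,
// here (20 * 50) + 1 = 1001 pages.
```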

4°) One problem I cannot find an answer to: while it is spidering/crawling, can I add a new link, or do I have to wait until it stops crawling?
It would be better to wait until the crawling is complete, as PhpDig locks when indexing to let you know it is busy.

5°) Is it possible to run more than one spider with the shell command? If so, what do I have to do?
Having more than one spider at a time would still use the same tables and slow the process down, but there is a thread here (http://www.phpdig.net/forum/showthread.php?t=1646) about multiple spiders.
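For context, a single shell run looks roughly like the following; spider.php is PhpDig's shell entry point, but the exact paths and arguments here are assumptions to verify against your install's documentation:

```shell
# Run from the PhpDig admin directory (paths and arguments assumed; check your docs).
cd /path/to/phpdig/admin
php -f spider.php http://www.example.com > spider.log   # index one site
```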

6°) I have a big problem when it is spidering forums: it keeps finding 100 links that are already indexed, then one new link, then another 100 already-indexed links, then one new link, and so on. What can I do about that? It spends a lot of time for nothing.
Does 'duplicate of an existing document' appear onscreen? If so, use PHPDIG_SESSID_VAR in the config file, especially for links that contain session IDs.
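The PHPDIG_SESSID_VAR define is named in the answer above; the value format shown here (a comma-separated list of session variable names to strip from URLs) is an assumption, so check the comment next to the define in your own config file:

```php
<?php
// Assumed value format; confirm against the comment in your config file.
// Session variables listed here are stripped from URLs before comparison,
// so the same forum page is not re-indexed under many session IDs.
define('PHPDIG_SESSID_VAR', 'PHPSESSID,sid,s');
```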

7°) When using the shell command, are all of the settings in the config file used by the shell spider?
All of the index-related settings in the config file are used when indexing from the shell, except for RESPIDER_LIMIT and RELINKS_LIMIT and maybe a couple of others.

BTW, your English is fine. :)