View Full Version : Indexation problems

02-10-2004, 08:23 AM

I am running into three different problems:

1) Most of my URLs look like this:

the script that is called is always index.php.
For each page found, the spider also tries to index index.php without the query string, which makes no sense. The complete URIs are indexed correctly, but the URL "index.php" (without a query string) is fetched every time as well and identified as a duplicate. Is it possible to change this?

2) I have some listing pages where navigation works through links between the different pages. Because only links to the first, last, next, or previous page are provided, the number of levels required to visit all the items of the list may be very large (perhaps greater than the phpDig limit). Is there a solution to this problem?

3) I used the phpDig HTML comments to exclude and include some parts of the HTML code. However, I saw that some links which should have been excluded were still visited. That does not seem normal to me. Does the exclude comment stop both the content from being indexed and the links from being followed?

Thanks for your help, and sorry to ask three questions at once.



02-11-2004, 12:58 AM
hello rpiel,

1) config.php, line 97: define('PHPDIG_DEFAULT_INDEX',false);
Change false to true.
2) config.php, lines 84-86: define('SPIDER_MAX_LIMIT',100);
Raise the limit, e.g. to 100 or more.
3) You have to use the expressions set on lines 92 and 94 of
config.php (by default <!-- phpdigExclude --> and <!-- phpdigInclude -->).
In your HTML code, wrap the parts to exclude, e.g.:

text to be searched....
<!-- phpdigExclude -->
text not to be searched....
<!-- phpdigInclude -->

Hope this helps a little.

02-11-2004, 03:49 AM
Hello Tomas,

Thank you for your answer.

1) Setting PHPDIG_DEFAULT_INDEX to true didn't solve my problem: now the URL http://resodorm/services/ is indexed x times.
Each attempt takes a few seconds, after which the spider notices that it is a duplicate. In any case, in my system the script "index.php" is not a page by itself; it only works through dynamic inclusion of other scripts and templates. It makes no sense to crawl it on its own.

2) OK, setting the limit very high seems to be a solution.
However, when I do this the same pages are indexed many times, and the whole process takes several hours when it should take about 15 minutes...
I think a more reliable solution would be to maintain (or generate) a page containing all the links to index. That would also solve the problem of needing many levels to walk through the items of a list.

3) I did use <!-- phpdigExclude --><!-- phpdigInclude --> comments in my pages, but sometimes the result was not what I expected.



02-11-2004, 04:24 AM
About <!-- phpdigExclude --><!-- phpdigInclude --> comments :

They appear to work as far as indexing is concerned: the words in excluded parts of the document are not indexed.
However, it seems to me that the links in those parts are still followed. That is exactly what I want to avoid!

Has anyone dealt with this issue?

Thanks in advance,


02-11-2004, 11:44 AM
Hi. For one, try the code in this (http://www.phpdig.net/showthread.php?postid=1946#post1946) post but replace:

//exclude if specific variable set
if (strpos($link['file'],'print=y')) {
    $link['ok'] = 0;
}

with the following:

//exclude if specific link
if (eregi('index.php$',$link['file'])) {
    $link['ok'] = 0;
}

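A side note for anyone applying this on a current PHP installation: eregi() belongs to the old ereg extension, which was deprecated in PHP 5.3 and removed in PHP 7, so the same case-insensitive suffix check would now be written with preg_match(). A minimal sketch of the equivalent test (the sample URLs are made up for illustration):

```php
<?php
// Case-insensitive check for links ending in "index.php",
// equivalent to the old eregi('index.php$', $link['file']).
function endsInIndexPhp(string $url): bool
{
    return preg_match('/index\.php$/i', $url) === 1;
}

// A bare index.php link would be excluded...
var_dump(endsInIndexPhp('http://resodorm/services/index.php'));        // bool(true)
// ...while a link carrying a query string is still spidered.
var_dump(endsInIndexPhp('http://resodorm/services/index.php?page=2')); // bool(false)
```

On PHP 4, as used in this thread, the eregi() version above is the one to keep.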
For two, it's probably faster to index your site in pieces, or start the index process using a different page.

For three, this (http://www.phpdig.net/showthread.php?threadid=383) thread may help.

02-11-2004, 10:52 PM
Hi Charter,

Thank you for your answer.
I got the best results by building a special page for the indexing process.

This page has a robots meta tag with "noindex, follow" content.

Because the parts of the site I want to index can be retrieved by a simple query on my database, building this special page was easy.
Now I only need a spidering depth of 1, and the process runs very fast.
Each page I want indexed is retrieved exactly once.

I think that, whenever possible, building special pages for indexing the site is a good solution. The spidering path is simplified, and it avoids testing a lot of pages to see whether they are duplicates.
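For anyone who wants to reproduce this approach, here is a minimal sketch of such a spider entry page as a PHP function: a bare list of links under a "noindex, follow" robots meta tag, so the spider follows every link at depth 1 but never indexes the list page itself. The function and the sample URLs are hypothetical; in practice the $links array would come from the database query over the pages you want indexed (the schema is site-specific, so it is not shown here).

```php
<?php
// Build the spider entry page: a plain list of links with a
// "noindex, follow" robots meta tag, so phpDig follows every link
// at depth 1 but does not index the list page itself.
function buildSpiderPage(array $links): string
{
    $items = '';
    foreach ($links as $url => $title) {
        $items .= '<li><a href="' . htmlspecialchars($url) . '">'
                . htmlspecialchars($title) . "</a></li>\n";
    }
    return "<html><head><title>Site index</title>\n"
         . "<meta name=\"robots\" content=\"noindex, follow\">\n"
         . "</head><body><ul>\n" . $items . "</ul></body></html>\n";
}

// Example with made-up URLs; replace with the result of your DB query.
echo buildSpiderPage([
    'http://resodorm/services/index.php?page=news' => 'News',
    'http://resodorm/services/index.php?page=help' => 'Help',
]);
```

Pointing the spider at this page with a depth of 1 reaches every listed URL exactly once, which is what makes the indexing run fast.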

For three, I still have to run some tests.