PhpDig.net

Morphea 12-28-2004 01:59 PM

indexing for the 1st time but getting "duplicate of existing doc" msg with some files
 
Hello everyone,

I just installed phpdig and I'm in the process of indexing my website. It's working great, except for a little problem. I'm indexing from directory index pages, listing the contents of the directory dynamically, with a search depth of 1. All the links are detected, and most of the pages are searched and indexed.

But somehow, with some of the pages I get the message "duplicate of an existing document". Do any of you have any idea why that would happen, since I haven't indexed these pages before?

When I go to the update form, the "duplicate" files don't appear in their directory. When I try to index the page by specifying the full URL, I still get the "duplicate" message, even when I set "Use values from Update sites table if present and use default values if values absent from table" to "no".

Is there any way to "force" the indexing of these pages?

Any idea, anyone?

redlock 12-28-2004 10:42 PM

Sorry, I can't help you. I have the same problem too, but only since version 1.8x.

Morphea 12-29-2004 03:47 AM

Getting weirder...
 
Just tried something else:

I created a plain .htm file with links to all the pages that wouldn't be indexed the first time around.
This file is placed in subdirectory alpha/
It links to pages placed in subdirectories alpha/b to alpha/z (relative links)
The pages in alpha/a were indexed correctly the first time, so I didn't link to any of them.

Now I try to index my htm file with search depth set to 1: it detects the many links in the page (plenty of + + + + +), but it doesn't go on to level 1. Instead, it tries to index files in alpha/a (even though my indexing page doesn't link to any of these pages) before stopping suddenly after the 12th file in that subdirectory (no more activity in the browser, but still no [back to admin] link).

I'm probably doing something wrong.
Could someone please help?

Morphea 12-29-2004 01:00 PM

Still working on the problem... Searched the forum's archives and didn't find anything useful...

I uploaded phpdig in another directory, installed it with another table prefix, and tried to index the plain htm file I described above: this time it managed to index most of the linked files (still not all of them though).
So now I have one set of tables with the biggest part of my website indexed in it, and another set of tables with (most of) the missing files indexed. At the moment I'm working on a script to join both sets of tables, for lack of a better solution...

Anyone with a quicker and easier idea is more than welcome!

Morphea 12-30-2004 07:20 AM

Just in case anyone is still following this thread, I'll let you know that my script worked: the pages and associated keywords were correctly added to the tables. The pages also show up in the search page, which is good!

The only tiny remaining problem: when one of these pages is in the results of the search, the snippet is the beginning of the file, and not the part featuring the (highlighted) keyword.
So now I'm having a look at the search_function.php script to see how it works and why it wouldn't show a correct snippet for the files I added "manually".

rAdoN 12-30-2004 11:08 AM

you make too much work for nothing - tsk tsk - search on duplicate - see

http://www.phpdig.net/forum/showthre...ight=duplicate

Morphea 12-30-2004 01:06 PM

I *did* search on duplicate... But I didn't look too far back in time in case it was due to the version or something...
But thanks for the link, I'll try that!

rAdoN 12-30-2004 01:43 PM

be not afraid - if doubt - compare code - if different - use newest code

Morphea 12-30-2004 01:48 PM

At first I edited my last post, but seeing you replied in the meantime I thought I'd post it in a new message.

So I increased the CHUNK_SIZE (doubled it to 2048), but it still wouldn't work.
My pages are simple php files, with no variables called in the URL like in the example in the link provided. They're like entries of a dictionary or an encyclopedia, so each page is quite different from any other (maybe not filesize-wise though).
After changing the CHUNK_SIZE, and trying to index a specific page with its full URL, with depth to 0, the spider didn't even seem to try indexing this one page, but instead began trying to index files in another directory... (and not even the root directory)

I'm sure there HAS to be some way to fix this... But right now I'm quite discouraged...

rAdoN 12-30-2004 03:03 PM

remove mods - do virgin 1.8.6 install - use CHUNK_SIZE 2048 or 4096 - use dynamic title - use LIMIT_TO_DIRECTORY false - use SPIDER_MAX_LIMIT 100 - use LINKS_MAX_LIMIT 100 - use search depth 100 - use links per 0 - links in iframe or heavy javascript not followed - index after setting config - expect duplicate of an existing document at high depth - may be real duplicate - same link is duplicate - see

http://www.phpdig.net/forum/showthread.php?t=1139

relax - know not what more
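
For anyone following along, the config values suggested above might look roughly like this. It is only a sketch: it assumes the constants live in includes/config.php of a stock 1.8.6 install under exactly the names used in this thread, so check your own copy before changing anything. The "search depth" and "links per" values are not shown because they appear to be chosen on the admin indexing form rather than in the config file (also an assumption).

```php
<?php
// Sketch only: values taken from the post above, constant names assumed
// to match a stock PhpDig 1.8.6 includes/config.php.

define('CHUNK_SIZE', 2048);           // or 4096, as suggested above
define('LIMIT_TO_DIRECTORY', false);  // do not restrict the spider to the start directory
define('SPIDER_MAX_LIMIT', 100);      // upper bound for the search depth selectable when indexing
define('LINKS_MAX_LIMIT', 100);       // upper bound for the links-per-level setting
```

These constants should already be defined in an existing config file, so edit the values in place instead of pasting the lines in a second time, then index again once the config is saved.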


All times are GMT -8.