PhpDig.net

Old 12-28-2004, 01:59 PM   #1
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
Indexing for the 1st time but getting "duplicate of existing doc" msg with some files

Hello everyone,

I just installed phpdig and I'm in the process of indexing my website. It's working great, except for a little problem. I'm indexing from directory index pages, listing the contents of the directory dynamically, with a search depth of 1. All the links are detected, and most of the pages are searched and indexed.

But somehow, with some of the pages I get the message "duplicate of an existing document". Do any of you have any idea why that would happen, since I haven't indexed these pages before?

When I go to the update form, the "duplicate" files don't appear in their directory. When I try to index the page by specifying the full URL, I still get the "duplicate" message, even when I set "Use values from Update sites table if present and use default values if values absent from table" to "no".

Is there any way to "force" the indexing of these pages?

Any idea, anyone?
Old 12-28-2004, 10:42 PM   #2
redlock
Green Mole
 
Join Date: Sep 2003
Location: Germany
Posts: 7
Sorry, I can't help you. I have the same problem too, but only since version 1.8x.
Old 12-29-2004, 03:47 AM   #3
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
Getting weirder...

Just tried something else:

I created a plain .htm file with links to all the pages that wouldn't be indexed the first time around.
This file is placed in subdirectory alpha/
It links to pages placed in subdirectories alpha/b to alpha/z (relative links)
The pages in alpha/a were indexed correctly the first time, so I didn't link to any of them.

Now I try to index my htm file with search depth set to 1: it detects the many links in the page (plenty of + + + + +), but it doesn't go on to level 1. Instead, it tries to index files in alpha/a (even though my indexing page has no links to any of those pages) before stopping suddenly after the 12th file in that subdirectory (no more activity in the browser, but still no [back to admin] link).

I'm probably doing something wrong.
Could someone please help?
Old 12-29-2004, 01:00 PM   #4
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
Still working on the problem... Searched the forum's archives and didn't find anything useful...

I uploaded phpdig in another directory, installed it with another table prefix, and tried to index the plain htm file I described above: this time it managed to index most of the linked files (still not all of them though).
So now I have one set of tables with the biggest part of my website indexed, and another set with (most of) the missing files indexed. At the moment I'm working on a script to join both sets of tables, for lack of a better solution...

Anyone with a quicker and easier idea is more than welcome!
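
Joining two separately built indexes generally means renumbering the second index's page and keyword IDs so they don't collide with the first. The sketch below shows that idea in Python on plain in-memory structures. The layout (`pages`, `keywords`, `postings`) and the function `merge_indexes` are hypothetical simplifications for illustration, not PhpDig's actual table schema and not the script used in this thread.

```python
# Hypothetical, simplified index layout (NOT PhpDig's real schema):
#   pages:    page_id -> URL
#   keywords: key_id  -> keyword
#   postings: list of (page_id, key_id, weight)


def merge_indexes(main, extra):
    """Return a new index holding main plus extra, with extra's IDs remapped."""
    pages = dict(main["pages"])
    keywords = dict(main["keywords"])
    postings = list(main["postings"])

    known_urls = set(pages.values())
    key_by_word = {word: key_id for key_id, word in keywords.items()}
    next_page_id = max(pages, default=0) + 1
    next_key_id = max(keywords, default=0) + 1

    # Give each new page a fresh ID; skip URLs the main index already has.
    page_map = {}
    for old_id, url in extra["pages"].items():
        if url in known_urls:
            continue
        page_map[old_id] = next_page_id
        pages[next_page_id] = url
        known_urls.add(url)
        next_page_id += 1

    # Reuse the main index's ID when the keyword already exists there.
    key_map = {}
    for old_id, word in extra["keywords"].items():
        if word not in key_by_word:
            key_by_word[word] = next_key_id
            keywords[next_key_id] = word
            next_key_id += 1
        key_map[old_id] = key_by_word[word]

    # Copy postings only for pages that were actually added.
    for page_id, key_id, weight in extra["postings"]:
        if page_id in page_map:
            postings.append((page_map[page_id], key_map[key_id], weight))

    return {"pages": pages, "keywords": keywords, "postings": postings}


# Made-up sample data: both indexes use ID 1 for different things.
main = {
    "pages": {1: "alpha/a/one.php"},
    "keywords": {1: "apple"},
    "postings": [(1, 1, 3)],
}
extra = {
    "pages": {1: "alpha/b/two.php"},
    "keywords": {1: "banana", 2: "apple"},
    "postings": [(1, 1, 2), (1, 2, 1)],
}
merged = merge_indexes(main, extra)
print(merged["pages"])
print(merged["postings"])
```

A real merge would also have to carry over whatever else the search page reads for each document, which this sketch leaves out.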
Old 12-30-2004, 07:20 AM   #5
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
Just in case anyone is still following this thread, I'll let you know that my script worked: the pages and associated keywords were correctly added to the tables. The pages also show up in the search page, which is good!

The only tiny remaining problem: when one of these pages is in the results of the search, the snippet is the beginning of the file, and not the part featuring the (highlighted) keyword.
So now I'm having a look at the search_function.php script to see how it works and why it wouldn't show a correct snippet for the files I added "manually".
Old 12-30-2004, 11:08 AM   #6
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
you make too much work for nothing - tsk tsk - search on duplicate - see

http://www.phpdig.net/forum/showthre...ight=duplicate
__________________
rAdoN was here
Old 12-30-2004, 01:06 PM   #7
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
I *did* search on duplicate... But I didn't look too far back in time in case it was due to the version or something...
But thanks for the link, I'll try that!

Last edited by Morphea; 12-30-2004 at 01:48 PM.
Old 12-30-2004, 01:43 PM   #8
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
be not afraid - if doubt - compare code - if different - use newest code
Old 12-30-2004, 01:48 PM   #9
Morphea
Green Mole
 
Join Date: Dec 2004
Posts: 6
At first I edited my last post, but seeing you replied in the meantime I thought I'd post it in a new message.

So I increased the CHUNK_SIZE (doubled it to 2048), but it still wouldn't work.
My pages are simple php files, with no variables passed in the URL like in the example in the link provided. They're like entries of a dictionary or an encyclopedia, so each page is quite different from any other (maybe not filesize-wise though).
After changing the CHUNK_SIZE, and trying to index a specific page with its full URL, with depth set to 0, the spider didn't even seem to try indexing this one page, but instead began trying to index files in another directory... (and not even the root directory)

I'm sure there HAS to be some way to fix this... But right now I'm quite discouraged...
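
One way to picture why a larger CHUNK_SIZE gets suggested for this symptom: if a spider fingerprints only the leading chunk of each page, pages that share a long common header hash identically, and the later ones are reported as duplicates. The sketch below is a Python illustration of that idea only. Whether PhpDig 1.8.x actually detects duplicates this way is an assumption, and `fingerprint`, `find_duplicates`, and the sample pages are hypothetical.

```python
import hashlib


def fingerprint(text, chunk_size):
    """Hash only the first chunk_size characters of a page (assumed scheme)."""
    return hashlib.md5(text[:chunk_size].encode("utf-8")).hexdigest()


def find_duplicates(pages, chunk_size):
    """Return URLs whose fingerprint matches that of an earlier page."""
    seen = {}
    duplicates = []
    for url, text in pages.items():
        digest = fingerprint(text, chunk_size)
        if digest in seen:
            duplicates.append(url)
        else:
            seen[digest] = url
    return duplicates


# Made-up pages: the same header of roughly 1500 characters, different bodies.
header = "<html><head>" + "x" * 1500 + "</head><body>"
pages = {
    "alpha/b/entry1.php": header + "Entry about badgers",
    "alpha/b/entry2.php": header + "Entry about beetles",
}

# With a 1024-character chunk the two pages look identical.
print(find_duplicates(pages, 1024))
# A 2048-character chunk reaches the bodies, which differ.
print(find_duplicates(pages, 2048))
```

In this thread, doubling the value to 2048 did not clear the message, so a shared header may not be the whole story for every site.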
Old 12-30-2004, 03:03 PM   #10
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
remove mods - do virgin 1.8.6 install - use CHUNK_SIZE 2048 or 4096 - use dynamic title - use LIMIT_TO_DIRECTORY false - use SPIDER_MAX_LIMIT 100 - use LINKS_MAX_LIMIT 100 - use search depth 100 - use links per 0 - links in iframe or heavy javascript not followed - index after setting config - expect duplicate of an existing document at high depth - may be real duplicate - same link is duplicate - see

http://www.phpdig.net/forum/showthread.php?t=1139

relax - know not what more
All times are GMT -8.