Documents disappear


kzant
07-29-2005, 08:23 AM
I have two sites on which I'm experiencing the same issue. Both sites contain a large number of PDF and DOC files - one site has 300, the other about 500. Every time a new document is added to the site, I manually add that single document to the index.

However, it appears that certain documents stop coming up in searches after a period of time. If I just re-submit the document, it says that it's already there, of course.

But if I delete all documents and rebuild the entire index, the documents show up again. Then they stop being returned in searches after a period of time.

I am at a loss as to why this is happening; any advice is appreciated.

Charter
07-29-2005, 08:34 AM
What version of PhpDig are you using? Do you use a cron job or the PhpDig admin panel?

kzant
07-29-2005, 08:37 AM
v.1.8.7
Admin panel. I could never get the cron to work correctly.

Charter
07-29-2005, 09:11 AM
Do you run any of the "cleans" prior to experiencing this issue?

kzant
07-29-2005, 09:16 AM
No - should I?

Charter
07-30-2005, 06:51 AM
No, you do not have to run the "cleans." What about space; are you running out of space? Maybe adding a document is wiping a previous document, or when you reindex, do all (new and old) documents show up in the search results?

kzant
07-30-2005, 07:11 AM
Space isn't an issue. Each document is uniquely named -- I trap for that to ensure nothing is getting wiped out.

The documents still exist in the spider table. The keywords still exist in the keywords table. But the connection between the two disappears from engine.

So, when I try to resubmit the document, it says it's already there. But it's not coming up in searches because the keyword connection is gone. If I delete that document from the system (using the admin interface) and re-submit it, then it's fine.

But of course, I can't tell what's been axed and what's okay when I get hollered at, so I wipe out the whole thing and re-index the whole thing again. And then that seems to make things better.

Of course, I'd like to preserve the original index. But if there is something going on that precludes that, can you suggest a way I could re-index the site (300/500 docs) w/o my intervention? Something I could run nightly that wouldn't time out?

I really appreciate any advice. This has happened a few times and I really don't like the testy calls from passive-aggressive clients.
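
To tell which documents have lost their keyword connection without wiping everything, one option might be to list the spider rows that have no matching rows in engine. The following is only a rough sketch: it assumes the tables are named spider, keywords and engine with no prefix, that engine links the two through spider_id and key_id columns, and it uses an in-memory SQLite database with made-up rows purely so the example runs on its own (a real PhpDig install uses MySQL, so the same SELECT would be run there instead). The function name is hypothetical.

```python
import sqlite3


def find_unlinked_spider_ids(conn):
    """Return spider_id values that have no row at all in the engine table."""
    rows = conn.execute(
        "SELECT s.spider_id "
        "FROM spider AS s "
        "LEFT JOIN engine AS e ON e.spider_id = s.spider_id "
        "WHERE e.spider_id IS NULL "
        "ORDER BY s.spider_id"
    ).fetchall()
    return [row[0] for row in rows]


# Stand-in data (assumed column layout, not copied from a real install).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE spider (spider_id INTEGER PRIMARY KEY, file TEXT);
    CREATE TABLE keywords (key_id INTEGER PRIMARY KEY, keyword TEXT);
    CREATE TABLE engine (spider_id INTEGER, key_id INTEGER, weight INTEGER);
    INSERT INTO spider VALUES (1, 'a.pdf'), (2, 'b.doc'), (3, 'c.pdf');
    INSERT INTO keywords VALUES (10, 'report');
    -- Document 2 is in spider but has lost its engine rows.
    INSERT INTO engine VALUES (1, 10, 5), (3, 10, 2);
    """
)

unlinked = find_unlinked_spider_ids(conn)
print(unlinked)  # [2]
```

Only the documents reported this way would then need to be deleted and re-submitted through the admin interface, rather than the whole index.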

Charter
07-30-2005, 07:26 AM
Is CONTENT_TEXT set to 1 in the config file? If so, is there ever a case where a TXT file in the TEXT_CONTENT_PATH directory is manually removed? The text files in the TEXT_CONTENT_PATH directory are named spider_id.txt (spider_id is a number from the spider table). For a cron job, do as in the documentation, and also make the change shown in this (http://www.phpdig.net/forum/showthread.php?t=2005) thread.
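
A quick way to check the second question might be to compare the spider_id values against the files actually present in the TEXT_CONTENT_PATH directory. This is just a sketch under assumptions: the helper name is hypothetical, the list of ids would come from the spider table, and a temporary directory stands in for the real TEXT_CONTENT_PATH so the example runs by itself.

```python
import os
import tempfile


def missing_text_files(spider_ids, text_dir):
    """Return the spider ids whose <spider_id>.txt file is absent from text_dir."""
    return [
        sid
        for sid in spider_ids
        if not os.path.isfile(os.path.join(text_dir, f"{sid}.txt"))
    ]


# Demo: a temporary directory plays the role of TEXT_CONTENT_PATH.
with tempfile.TemporaryDirectory() as text_dir:
    for sid in (1, 3):
        with open(os.path.join(text_dir, f"{sid}.txt"), "w") as fh:
            fh.write("cached document text")
    # Pretend the spider table holds ids 1, 2 and 3.
    missing = missing_text_files([1, 2, 3], text_dir)

print(missing)  # [2]
```

Any id reported here would point to a text file that was removed or never written for that document.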