PhpDig.net - View Single Post - Long post, possible issues, and maybe the data to solve some?

Paul D. Buck · 02-03-2005, 07:17 AM

Quote:

Originally Posted by Charter

Q: I see "exclude Paths" when I start the spider. How do I set that, and what does it do?

A: Use a robots.txt file or exclude content from the admin panel. It excludes content from index.

I am looking at the v1.8.7 admin panel and there is no option to exclude content from the site. I only have one site listed, and it is now listed as locked. That is a new issue: what the heck is locked, how did it get locked, and how do I unlock it?

Hmm, found a way to unlock it ...

The reason that I posted all of these questions, by the way, is that if these are answered and then inserted into the documentation as part of the operational section, it would save new people from having to try to figure this stuff out on their own.

Quote:

Originally Posted by Charter

Q: How can I tell if one of those time out errors occurs?

A: Check that safe_mode is off, review your server error logs, or ask your host if the process is killed.

Safe mode is off. I am not sure I can do either of the others, but I will look into it.

Quote:

Originally Posted by Charter

Q: Does it hurt anything if I just stop the spider in the middle of things? Are my pages still correctly processed if I do that?

A: No, not if you use the stop spider link. Yes, for documents already indexed.

I do use the spider link and it makes several passes before it says it is done.

And over time I do *seem* to get an indexed site.

Quote:

Originally Posted by Charter

Q: I see a list of words that I can "purge". Should I add more words? Does adding words make my database less useful for searches?

A: Add more words if you want. It depends on the words you add.

Quote:

Originally Posted by Charter

Q: How do I get my site reindexed after I have made the first pass?

A: Use the admin panel text box or spider from shell.

If I select the top end and do a re-index, is it supposed to locate the pages that have been updated and index those? I re-arranged a folder, dropping some pages and adding others, but it does not look like it is really finding the new pages. Well, I will play with it some more ...

Quote:

Originally Posted by Charter

Q: Should I run the delete processes before restarting the indexing?

A: If you want, but it is not necessary.

Ok, I ran them anyway.

It would be nice to have a fuller explanation of what each process does and what its intent is ...

Quote:

Originally Posted by Charter

Q: You mention a parameter that will prevent "early" reindexing, what value is it set to and how can I change it?

A: It is set to zero. Look for define('LIMIT_DAYS',0); in the config file, or set revisit-after META tags.

So, if limit days is 0, I can reindex at will?
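For anyone following along, this is roughly what that setting looks like in the config file. The constant name and the zero value come from Charter's answer. The commented-out value of 7 is only my illustration, and my reading of what the number means is an assumption until someone confirms it.

```php
<?php
// From Charter's answer above: the shipped value is zero.
// Assumption (unconfirmed): the number is the minimum count of days
// before a page may be indexed again, so 0 would mean no waiting period.
define('LIMIT_DAYS', 0);

// Illustrative alternative only (the 7 is made up for this example):
// define('LIMIT_DAYS', 7);
```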

Quote:

Originally Posted by Charter

Q: On the update page you say depth trumps links, does that mean values 0, 0 will do my entire site?

A: No, set LIMIT_TO_DIRECTORY to false, choose no, set a large search depth, set links per to zero.

Ok, I will try this. I think I already did and it did not work, but I will try again.

Quote:

Originally Posted by Charter

Q: What does "No link in temporary table" mean? Is this a good thing? Or a bad thing?

A: The tempspider table is empty. It is good.

This is one of those that SHOULD be in your document. While mousing about here, I saw that this is a common question.

Quote:

Q: What does "Duplicate of an existing document" mean?

A: The document looks like an already indexed document.

Ok, so it is more of a warning that it is passing over the document.

Quote:

Originally Posted by Charter

Q: Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content with some factors controlled by a randomly generated value?

A: No, the values in the update sites table are currently being used.

I think we talked past each other on this one and the last one. I will pass till then.

Quote:

Originally Posted by Charter

Q: Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?

A: Yes, review your server access logs when indexing your site.

I was thinking about catching it at the page when it is called, maybe to have the page make itself more static, but that will miss content.
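What I had in mind is something like the sketch below. It assumes the spider announces itself with a User-Agent header that contains "phpdig". I have not confirmed the exact string, so check the access logs, as Charter suggests, and adjust the marker. The names SPIDER_UA_MARKER and is_spider_request are mine, not part of phpDig.

```php
<?php
// Assumption: the spider's User-Agent contains this marker.
// Confirm the real string in your server access logs first.
define('SPIDER_UA_MARKER', 'phpdig');

// Hypothetical helper: true when the User-Agent contains the marker
// (case-insensitive match).
function is_spider_request($userAgent)
{
    return stripos((string) $userAgent, SPIDER_UA_MARKER) !== false;
}

// At the top of a page: read the header if the client sent one.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_spider_request($ua)) {
    // Here the page could skip its randomly generated parts.
}
```

The drawback is the one already mentioned: whatever the page leaves out for the spider never gets indexed.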

Quote:

Originally Posted by Charter

Q: Would changing the Primary key to be filename + file title be a better index?

A: No, primary keys must be unique.

Yes, my point exactly. I have a page, say "profile-edit.php", this page is called by the viewer in any one of 6 different renderings. The page has content that is driven by the Project that the person is interested in. So, you go to that page and ask for a page titled: "Profile Edit - SETI@Home ..." as profile-edit.php; whereas I ask for "Profile Edit - LHC@Home ..." as profile-edit.php ... we both pull the same page, but get different content based on the project we wish to see.

This means that the one FILE name is NOT unique in the way that it is viewed. You cannot index the page fully as only one version, because some of the content is never delivered in your view, and other content is never delivered in mine.

With the concatenated key of page FILE NAME and PAGE TITLE my pages are now uniquely identifiable in the tables.
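Here is a small sketch of the difference, using plain PHP arrays in place of the real tables. The page_key helper and the "|" separator are made up for this example, and the titles are shortened from the ones above.

```php
<?php
// Hypothetical helper: join file name and page title into one key.
// The "|" separator is an arbitrary choice for this sketch.
function page_key($fileName, $pageTitle)
{
    return $fileName . '|' . $pageTitle;
}

// Two renderings of the same file (titles shortened for the example).
$renderings = array(
    array('profile-edit.php', 'Profile Edit - SETI@Home'),
    array('profile-edit.php', 'Profile Edit - LHC@Home'),
);

$byFile = array();          // keyed on file name only
$byFileAndTitle = array();  // keyed on file name + title

foreach ($renderings as $rendering) {
    list($file, $title) = $rendering;
    $byFile[$file] = $title;  // second rendering overwrites the first
    $byFileAndTitle[page_key($file, $title)] = $title;  // both are kept
}

echo count($byFile), "\n";          // prints 1
echo count($byFileAndTitle), "\n";  // prints 2
```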

++++++++++++++
New questions
Q. Would adding the project as a passed parameter make the page table entries based on that identifier?

Q. I have tried settings to force a single-page-only look, so that I could drop the site and then feed in a list of pages, but I have not been able to get that to work reliably.

Q. If I have a static page, how do I run phpDig so that it only indexes pages that have been detected as changed? Or is the MD5 signature stored for some other purpose?
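To show what I mean by that last question, here is a minimal sketch of change detection with an MD5 signature. It is only my guess at how a stored signature could be used, not a description of what phpDig does. content_changed is a hypothetical helper.

```php
<?php
// Hypothetical helper: compare the MD5 of the current page content
// with the signature saved at the last indexing pass.
function content_changed($content, $storedMd5)
{
    return md5($content) !== $storedMd5;
}

// Signature saved when the page was last indexed.
$stored = md5('<html>old body</html>');

// Same content: the page could be skipped.
echo content_changed('<html>old body</html>', $stored) ? 'reindex' : 'skip', "\n"; // prints skip

// Different content: the page needs another pass.
echo content_changed('<html>new body</html>', $stored) ? 'reindex' : 'skip', "\n"; // prints reindex
```

On my highly dynamic pages a check like this would report a change on every pass, so it would only help with the static ones.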