Old 02-02-2005, 06:41 AM   #2
Paul D. Buck
Green Mole
 
Join Date: Jan 2005
Location: Sacramento
Posts: 8
continued

=================
Could the "dynamic" behavior I have been seeing be caused by my pages being highly dynamic? They constantly change parts of their content, with some factors controlled by a randomly generated value.

Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?
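For what it's worth, here is a minimal sketch of how that detection could look, assuming the spider sends a recognizable User-Agent header. The marker string "ExampleSpider" is hypothetical; the real value would have to come from the spider's configuration or from the web server log. Python is used only for illustration. In PHP the same check would read the header from $_SERVER['HTTP_USER_AGENT'].

```python
def is_spider_request(headers, marker="ExampleSpider"):
    """Return True if the User-Agent header contains the crawler marker.

    `marker` is a hypothetical placeholder, not the spider's real name.
    """
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    # Case-insensitive substring match against the assumed marker.
    return marker.lower() in user_agent.lower()
```

If the check returns True, the page could skip the randomly generated parts and serve the same content on every fetch, so the spider sees a stable page.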

File name + title is a unique identifier on my site in almost all cases.

Example: page file name is "account-data.php"

with a page title of:
"Account Data" Page - ClimatePrediction.net - Web Site Owner's Manual
"Account Data" Page - Einstein@Home - Web Site Owner's Manual
"Account Data" Page - LHC@Home - Web Site Owner's Manual
"Account Data" Page - Predictor@Home - Web Site Owner's Manual
"Account Data" Page - SETI@Home - Web Site Owner's Manual

Would changing the primary key to file name + page title make a better index? I know that on most simpler sites the number of entries would be smaller, as there would be no true difference in the pages indexed. In my case I would be getting 4-6 times the number of pages, but with this change I would be able to track the pages that are missing or need indexing, and would not be reindexing the same pages in error.

This would obviously mean changes to the "Modify page" also.

Oh, one other positive thing: the MD5 values could then be used as they are intended, to see whether a page is truly different.
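To make the idea concrete, here is a rough sketch of an index keyed on (file name, title) with an MD5 digest stored per entry. This is not how the spider stores its data today; the dictionary and the function name needs_reindex are made up for illustration, and Python stands in for whatever the real code would use.

```python
import hashlib


def needs_reindex(index, filename, title, content):
    """Return True if the page is new or its content changed.

    `index` is a hypothetical in-memory stand-in for the database table,
    mapping (filename, title) to the MD5 hex digest of the page content.
    """
    key = (filename, title)  # composite key: file name + page title
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if index.get(key) == digest:
        return False  # same page, same content: nothing to do
    index[key] = digest  # new page or changed content: record it
    return True
```

With this key, "account-data.php" titled for SETI@Home and "account-data.php" titled for LHC@Home are separate entries. Note that a page with randomly generated parts would still produce a different MD5 on every fetch unless those parts are removed before hashing.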

=================
I don't know the significance of this, but monitoring the tempspider table showed that most of my "freezes" occurred when the spider had made 30 entries in the table and about 18 were flagged as "1" (indexed). I know that the table can grow beyond this because I once saw it at up to 168 pages ... also frozen. I let the spider continue to run with no changes observed. Of course, if you are not doing immediate commits then more work could have been pending, but after 30 minutes it seemed to be time to quit.

Re-running the analysis may or may not have frozen at the same point/page.