02-02-2005, 06:40 AM
Paul D. Buck
Long post, possible issues, and maybe the data to solve some?

OK, I seem to have phpDig installed and operational. But here are some of my observations based on the last couple of days' experience.

=============
The documentation is weak. About 80% of it is dedicated to installation, with some on configuration, but there are only a few paragraphs on how to operate the software. Looking through the forums, there are a lot of repeated questions, and they are all variations on "How do I ...". Putting more of this into the documentation would be a great step forward.

Most tellingly, people are only going to be installing it once and updating it on occasion, but operating it every day ...

=============
Much of the configuration information is explained only to the extent that a person already familiar with your tool will understand the explanation.

Picking a section, just for example:
// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

The BANNED constant means to ban external links in index, meaning that those links do not show up as keys in search results. The FORBIDDEN_EXTENSIONS constant means to ban certain links from being indexed. Don't let the name fool you. A regex can be set in the FORBIDDEN_EXTENSIONS constant to ban various types of links from even being indexed. Again, BANNED is to ban keys from search results, and FORBIDDEN_EXTENSIONS is to ban the index of links.

This explanation does not tell me how to change this setting or, for the define above, how to write what appears at first glance to be a regular expression. If it is one, then the way that it works should be explained.
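
To make the question concrete, here is a minimal sketch of how a pattern like this could be tried against URLs outside of phpDig. It is not phpDig's own code: the helper name is_forbidden_url is hypothetical, and the use of preg_match with case-insensitive matching is an assumption about how the constant is applied.

```php
<?php
// Same pattern as in the config excerpt above, kept on one line.
define('FORBIDDEN_EXTENSIONS', '\.(rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

// Hypothetical helper (not part of phpDig): true when the URL's path
// ends in one of the forbidden extensions.
function is_forbidden_url(string $url): bool
{
    // Look only at the path so a query string cannot hide the extension.
    $path = (string) parse_url($url, PHP_URL_PATH);
    // '~' is used as the delimiter because the pattern does not contain it;
    // the 'i' flag (case-insensitive) is an assumption.
    return preg_match('~' . FORBIDDEN_EXTENSIONS . '~i', $path) === 1;
}

var_dump(is_forbidden_url('http://example.com/files/archive.zip')); // bool(true)
var_dump(is_forbidden_url('http://example.com/docs/index.php'));    // bool(false)
```

Under this reading, forbidding one more extension would mean adding one more alternative inside the parentheses, for example |pdf.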

================
In paragraph 6.2 you mention that you can use "No way" to remove a site, but you don't explain that it can be (or must be) added back with a rescan, or how to do that. I also don't recall whether I got an "Are you sure?" prompt when I accidentally clicked one.

==============
Also in 6.3 "Clean common words - deletes words that appear in the common_words.txt file." does not have advice on how and when this file can be/should be updated, and if you do update it what words should be added in ...

+++++++++++++++++++++++
Unanswered questions.

These that follow are a combination of guess work, database snooping, and conjecture about the engine, how it works and what is going on, and why the spider may not be indexing my site correctly. Most of this is presented as questions I had hoped would have been answered in the documentation ...

Site description:
Paul's Web Site
Top level consists of one "Index" page
there are "../" and "../../" levels below with almost all content residing in the "../../" level with some additional indexes in the "../" levels

The site consists of about 250 individual pages by file name, along with, now, 3 PHP tools in their own structures (phpMyFAQ, phpMyAdmin, and phpDig) in addition to the base content.

The internal link structure is a huge tangle of back-and-forth links. Depending on the link tester, I have 2,000 to 19,000 links; the difference is between unique links and link references. I have on the order of 150 broken links for material that is being added. Most pages validate to W3C "Strict", with about 75-100 failing validation at this time.

Every page links to the top of the site ...

===========
I see "Exclude paths" when I start the spider. How do I set that, and what does it do?

===========
The system starts to spider and then freezes. I am using Firefox, and I get anywhere from 10-20 pages parsed before the freeze. Letting it sit for long periods does not seem to help.

===========
Does it hurt anything if I just stop the spider in the middle of things? Are my pages still correctly processed if I do that?

===========
How can I tell if one of those timeout errors occurs?

===========
I see a list of words that I can "purge". Should I add more words?
Does adding words make my database less useful for searches?

===========
How do I get my site reindexed after I have made the first pass?

===========
Should I run the delete processes before restarting the indexing?

===========
You mention a parameter that will prevent "early" reindexing. What value is it set to, and how can I change it?

===========
Most of the time the spider does not complete. Is that a bad thing? Only occasionally do I see it finish a crawl and then list the pages completed.

===========
On the update page you say depth trumps links. Does that mean values 0, 0 will do my entire site?

===========
I did a spider crawl on my site, then repeated it from the index page, and then indexed some pages elsewhere in the site. When I restarted from the top, some of the pages that it told me it had successfully indexed were re-indexed.

Since the processes do not stop cleanly I am confused about whether or not my site is correctly indexed. I stop the spider cleanly, but this still does not make sense to me.

============
My site is very recursive, yet I understood the indexing function was supposed to have a delay factor, which I assumed was define('LIMIT_DAYS',0);. That setting implies that I can reindex at will. Yet there are times when the pages are happily reindexed, and other times when they are not. I cannot figure out the pattern ...

Over time, my update form grows to contain the additional pages, and the search function does seem to find the correct pages, but the behavior does not seem to match the documentation.

I would assume that a 0, 0, no setting would prevent exiting the page to enter another, but this does not seem to be the case.

===========
What does "No link in temporary table" mean?
Is this a good thing? Or a bad thing?

If good, "No links in the temporary table, Yea!"; if bad, we need troubleshooting tips.

============
There does not seem to be a clear explanation of how I can do a reindex with the tool only reindexing the pages that need it based on the MD5 values. Of course, I am assuming that you made these numbers for this purpose.
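
To show what I mean by reindexing "only the pages that need it", here is a minimal sketch of the idea. It is an illustration of my assumption, not phpDig's actual logic: the function needs_reindex and the $stored array (URL mapped to the MD5 of the content at the last crawl) are both hypothetical.

```php
<?php
// Hypothetical helper, not phpDig code: decide whether a page needs
// reindexing by comparing the MD5 of its current content with the
// hash recorded at the previous crawl.
function needs_reindex(string $url, string $content, array $storedHashes): bool
{
    // A URL with no stored hash has never been indexed.
    if (!isset($storedHashes[$url])) {
        return true;
    }
    // A known URL is reindexed only when its content hash has changed.
    return $storedHashes[$url] !== md5($content);
}

// Hypothetical stored state: URL => MD5 of the content at the last crawl.
$stored = ['http://example.com/index.php' => md5('<html>old body</html>')];

var_dump(needs_reindex('http://example.com/index.php', '<html>old body</html>', $stored)); // bool(false)
var_dump(needs_reindex('http://example.com/index.php', '<html>new body</html>', $stored)); // bool(true)
var_dump(needs_reindex('http://example.com/new.php', '<html>old body</html>', $stored));   // bool(true)
```

If the tool works along these lines, the documentation should say so, and say how this check interacts with LIMIT_DAYS.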

===========
What does "Duplicate of an existing document" mean?
Doing 3 pages at a time with "yes" 0 0:

1:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:00:06)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:00:25)
5:http://boinc-doc.net/index.php
(time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:38)
Duplicate of an existing document
7:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:44)
8:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:50)
No link in temporary table

====================
Settings 0, 0, yes for both runs.
Pass with 3 new pages:

Spidering in progress... [Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:13)
3:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:26)
5:http://boinc-doc.net/index.php
(time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:38)
7:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:00:43)
8:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
(time : 00:00:49)
9:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
(time : 00:00:55)
10:http://boinc-doc.net/site-boinc/oman-app/app-over.php
(time : 00:01:02)
11:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:01:07)
Duplicate of an existing document
12:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
(time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...u-settings.php
(time : 00:01:19)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php
(time : 00:01:25)
No link in temporary table
links found : 14
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
Optimizing tables...
Indexing complete ! [Back]

------------
Next pass with 3 new pages:

Spidering in progress... [Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...u-settings.php
(time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
(time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:18)
4:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:24)
5:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:30)
6:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:00:36)
7:http://boinc-doc.net/index.php
(time : 00:00:42)
Duplicate of an existing document
8:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:47)
9:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:54)
10:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:01:00)
11:http://boinc-doc.net/site-boinc/oman...r-old-seti.php
(time : 00:01:07)
12:http://boinc-doc.net/site-boinc/oman...-saver-lhc.php
(time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...r-lhc-full.php
(time : 00:01:18)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php
(time : 00:01:24)
15:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
(time : 00:01:30)
Duplicate of an existing document
16:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
(time : 00:01:36)
17:http://boinc-doc.net/site-boinc/oman-app/app-over.php
(time : 00:01:42)
No link in temporary table
links found : 17
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...r-old-seti.php
http://boinc-doc.net/site-boinc/oman...-saver-lhc.php
http://boinc-doc.net/site-boinc/oman...r-lhc-full.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
Optimizing tables...
Indexing complete ! [Back]


================
Removing the "-" character, as suggested, so that pop-up can be searched on seems to work, but the "text" displayed in the search dialog still has the "-" missing. So the database and the source pages have "Pop-Up", and searches for Pop-Up work, but the quoted material contains "pop up" ...

================
Trying to track down the error 403 led me to change the code slightly, and this is the output:

url: HEAD /site-boinc/boinc-projects/ HTTP/1.1 Host: boinc-doc.net Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be Cookie: PageCount=0 Cookie: ProjectAbbr=predictor Accept: */* Accept-Charset: iso-8859-1 Accept-Encoding: identity Connection: close User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

HTTP/1.1 403 - http://boinc-doc.net/site-boinc/boinc-projects//
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.
url: HEAD /site-boinc/boinc-projects/ HTTP/1.1 Host: boinc-doc.net Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be Cookie: PageCount=0 Cookie: ProjectAbbr=predictor Accept: */* Accept-Charset: iso-8859-1 Accept-Encoding: identity Connection: close User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

The code change was in robot_functions.php:

elseif ($regs[1] == 403) {
    // Debug output: dump the raw request, then the status line and the URL.
    echo "url: " . $request . "\r\n <br />\r\n";
    print "<br>\n".$answer." - ".$url."<br>\nSee http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.<br>\n";

It looks to me as though the spider is trying to locate the cookie(?).
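
One way to check whether the cookies have anything to do with the 403 would be to send the same HEAD request with and without them and compare the status lines. The sketch below is only that, a sketch: build_head_request and head_status are hypothetical helpers, not phpDig functions, and the header set is cut down from the dump above.

```php
<?php
// Hypothetical helper: build a HEAD request like the one in the dump
// above, with an optional list of cookies (name => value).
function build_head_request(string $host, string $path, array $cookies = []): string
{
    $lines = ["HEAD $path HTTP/1.1", "Host: $host"];
    // One "Cookie:" line per cookie, mirroring the dumped request.
    foreach ($cookies as $name => $value) {
        $lines[] = "Cookie: $name=$value";
    }
    $lines[] = 'Accept: */*';
    $lines[] = 'Connection: close';
    // A blank line terminates the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

// Hypothetical helper: send the request and return only the status line.
function head_status(string $host, string $request): string
{
    $socket = fsockopen($host, 80, $errno, $errstr, 10);
    if ($socket === false) {
        return "connection failed: $errstr";
    }
    fwrite($socket, $request);
    $status = fgets($socket);
    fclose($socket);
    return $status === false ? 'no response' : trim($status);
}

$plain = build_head_request('boinc-doc.net', '/site-boinc/boinc-projects/');
$withCookie = build_head_request('boinc-doc.net', '/site-boinc/boinc-projects/', ['PageCount' => '0']);
echo $plain;
// To compare against the live server (needs network access):
// echo head_status('boinc-doc.net', $plain), "\n";
// echo head_status('boinc-doc.net', $withCookie), "\n";
```

If both variants return 403, the cookies are probably not the cause. Note also that the URL reported in the output above ends in a double slash ("boinc-projects//").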