PDA

View Full Version : No link in temporary table (yet another one)


renehaentjens
03-30-2004, 05:54 AM
Yet another one:

SITE : http://cordoba.ugent.be/
Exclude paths :
- @NONE@
1:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/
(time : 00:00:06)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://cordoba.ugent.be/LW02AC/document/Refs/Theses/
Optimizing tables...
Indexing complete !


Charter, I give up for now. I've tried a couple of suggestions from other similar posts, and I've re-installed 1.8.0 from scratch. Nothing helps!
This is a site that I've indexed before with no problems. I've changed the PHP script, I admit, and now it won't index...
http://cordoba.ugent.be/LW02AC/document/Refs/Theses/

renehaentjens
03-30-2004, 06:08 AM
Here's the result with print $answer."<br>\n"; added: no Forbidden responses.
I also tried the hosts file modifications (some of them, at least) and the new robot_functions file, and I checked the tempspider table: it's empty...


Spidering in progress...
HTTP/1.1 404 Not Found
HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 14:06:16 GMT
Server: Apache/1.3.27 (Win32) PHP/4.3.3
X-Powered-By: PHP/4.3.3
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Content-Type: text/html

HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 14:06:17 GMT
Server: Apache/1.3.27 (Win32) PHP/4.3.3
X-Powered-By: PHP/4.3.3
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Content-Type: text/html

HTTP/1.1 404 Not Found

--------------------------------------------------------------------------------
SITE : http://cordoba.ugent.be/
Exclude paths :
- @NONE@
HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 14:06:17 GMT
Server: Apache/1.3.27 (Win32) PHP/4.3.3
X-Powered-By: PHP/4.3.3
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Content-Type: text/html

1:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/
(time : 00:00:02)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://cordoba.ugent.be/LW02AC/document/Refs/Theses/
Optimizing tables...
Indexing complete !

renehaentjens
03-30-2004, 06:21 AM
After spidering, there is one file in the text_content directory (not counting keepalive.txt). It contains one line with 7 spaces, nothing else.

It doesn't help if I delete it and start all over...

renehaentjens
03-30-2004, 06:30 AM
My first impression from the Apache access log is that the spider only sends HEAD requests and never actually tries to GET the page content:

157.193.197.26 - - [30/Mar/2004:16:28:51 +0200] "HEAD /robots.txt HTTP/1.1" 404 0
157.193.197.26 - - [30/Mar/2004:16:28:51 +0200] "HEAD /LW02AC/document/Refs/Theses/ HTTP/1.1" 200 0
157.193.197.26 - - [30/Mar/2004:16:28:52 +0200] "HEAD /LW02AC/document/Refs/Theses/ HTTP/1.1" 200 0
157.193.197.26 - - [30/Mar/2004:16:28:52 +0200] "HEAD /robots.txt HTTP/1.1" 404 0
157.193.197.26 - - [30/Mar/2004:16:28:52 +0200] "HEAD /LW02AC/document/Refs/Theses/ HTTP/1.1" 200 0
157.193.197.26 - admin [30/Mar/2004:16:28:52 +0200] "POST /LW02AC/180phpdig/admin/spider.php HTTP/1.1" 200 1880
157.193.197.26 - - [30/Mar/2004:16:28:53 +0200] "GET /LW02AC/180phpdig/admin/yes.gif HTTP/1.1" 304 -

renehaentjens
03-30-2004, 07:30 AM
By "the new robot_functions" in my earlier reply, I mean the version dated 25 Feb 2004 (unchanged).

I also tried setting PHPDIG_SESSID_REMOVE to false; no result.

Charter
03-30-2004, 07:50 AM
Hi. The main page that you are trying to index doesn't appear to contain any redirects; it looks like plain, simple HTML. Below is the start of an index of the page. What happens if you use a fresh install with new tables? What are SPIDER_MAX_LIMIT, SPIDER_DEFAULT_LIMIT, RESPIDER_LIMIT, and LIMIT_DAYS set to in the config file?

SITE : http://cordoba.ugent.be/
Exclude paths :
- @NONE@
1:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/
(time : 00:00:12)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
level 1...
2:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=Unknown
(time : 00:00:57)

3:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=2003
(time : 00:01:05)

4:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=2001
(time : 00:01:13)

5:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=2002
(time : 00:01:20)

6:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=2000
(time : 00:01:28)

7:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=1998
(time : 00:01:36)

8:http://cordoba.ugent.be/LW02AC/document/Refs/Theses/index.php?dirpath=.%2F&row=1&item=1999
(time : 00:01:43)
...

Also, from the admin panel, when you click the site, then the update button, are any of the directories excluded?

renehaentjens
03-30-2004, 10:41 PM
SPIDER_MAX_LIMIT=20, SPIDER_DEFAULT_LIMIT=3, RESPIDER_LIMIT=4, LIMIT_DAYS=7. (I never touched these.) My most recent tests were with a freshly installed 1.8.0, a new database, and new tables. The PHP script indeed generates straightforward, simple HTML with no tricks.
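For reference, a sketch of how these limits might look in the PhpDig config file (the exact file name, define syntax, and comment descriptions are assumptions; only the names and values come from this thread):

```php
<?php
// Sketch of a PhpDig config fragment (constant names from this thread;
// the comments are guesses at each setting's meaning)
define('SPIDER_MAX_LIMIT', 20);    // maximum spidering depth allowed
define('SPIDER_DEFAULT_LIMIT', 3); // default spidering depth
define('RESPIDER_LIMIT', 4);       // depth used when re-spidering a known site
define('LIMIT_DAYS', 7);           // days before a page is spidered again
```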

Nothing on the update page seems to indicate any exclusions. In any case, in my most recent tests I always deleted the site and checked that tempspider and text_content were empty before my next try.

The good news is that you can index my site.

The problem must be with my PhpDig installation: I get the same symptom (No link in temporary table ... links found : 1) if I try to index other sites...

I can't even index a simple text page in the root of my website (on the same PC as PhpDig, cordoba): PhpDig finds it all right but seems to ignore its content altogether. In the spider table, first_words contains only the filename of the text page, with 0 words and a filesize of 0; after "indexing" there are 0 keywords and one text_content file containing 7 spaces. (The text page does contain a few lines of text!)

Most recent changes on my PC (since the last successful indexing with PhpDig): a new virus checker, eTrust 7.0.139, replaced McAfee, and ZoneAlarm was upgraded to 4.5.538.001.

renehaentjens
03-30-2004, 11:46 PM
I think I found it: allow_url_fopen was Off (I had turned it off because of a security problem), and I just spotted the place where PhpDig fetches the page content. Yes, you guessed it, with:
$file_content = @file($uri);
in function phpdigTempFile (robot_functions). With allow_url_fopen Off, file() can't open URLs, and the @ suppresses the warning, so the fetch fails silently.
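In case anyone else hits the same symptom: fetching a URL with file() only works when allow_url_fopen is enabled. A minimal php.ini fragment, assuming you accept the security trade-off of re-enabling it:

```ini
; php.ini -- needed for PhpDig to fetch pages via file($uri)
allow_url_fopen = On
```

Restart the web server afterwards so the new setting takes effect.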

See also:
http://www.phpdig.net/showthread.php?s=&threadid=316&highlight=allow_url_fopen

Is there a checklist somewhere of the requirements for PhpDig to function correctly? Could you add some checks to the code in the next version?