View Full Version : spidering does *nothing* ?


davenewt
07-08-2004, 06:52 AM
Hi folks,

I've installed phpdig, solved the logging-in problem with the help of your forums here, but have come across another.

Namely, that spidering my site doesn't appear to be working, throwing out any errors, or anything.

All I get is:

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://localhost
Exclude paths :
- @NONE@

(I'm running phpdig on the server to spider the site)

...and then, nothing. It just sits there. I've waited a while, wondering if my 20-page site I'm testing it with is maybe taking 5mins per page, but no. Still nothing.


What gives? Any help gratefully appreciated.

Thanks,
Dave.

davenewt
07-08-2004, 06:54 AM
PS I've tried adding /index.php and just / to the path, but still no joy. Help! :)

Charter
07-08-2004, 07:21 AM
Hi. In this (http://www.phpdig.net/showthread.php?threadid=741) thread, although a bit dated, there is talk of various issues. Does anything there help?

davenewt
07-12-2004, 12:05 AM
Thanks Charter, I've looked into both those threads, but still no joy. When I uncomment the //print $answer line in robot_functions.php I get the following output. Does this shed any light for anyone here?


Spidering in progress...
HTTP/1.1 404 Not Found
HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 07:55:07 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 07:55:06 GMT; path=/
Set-Cookie: phpbb2mysql_sid=39f7eac47575e95830dfc26253f49aa6; path=/

HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 07:55:07 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 07:55:07 GMT; path=/
Set-Cookie: phpbb2mysql_sid=0254fab8d05bec6fe0a3b103bd8207eb; path=/

HTTP/1.1 404 Not Found

--------------------------------------------------------------------------------
SITE : http://knet/
Exclude paths :
- @NONE@
HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 07:55:12 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 07:55:12 GMT; path=/
Set-Cookie: phpbb2mysql_sid=22564c9ca04014e3f4dc5dfb837cd46d; path=/


Any ideas?

Thanks,
Dave.

Charter
07-12-2004, 12:15 AM
Hi. Is magic_quotes_runtime On or Off?

davenewt
07-12-2004, 12:20 AM
Off. Why?

Thanks for the quick response :-)

Dave.

Charter
07-12-2004, 12:30 AM
Just a quoting bug when magic_quotes_runtime is on... :(

Anything showing up in your error logs?

Charter
07-12-2004, 12:45 AM
PS: Not using version 1.8.1? On Win? Set USE_IS_EXECUTABLE_COMMAND to 0 in the config.php file.

davenewt
07-12-2004, 02:03 AM
Thanks, have got a little further now. Changed USE_IS_EXECUTABLE_COMMAND to 0 as you suggested, and now get the following output:


Spidering in progress...
HTTP/1.1 404 Not Found
HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 09:54:14 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 09:54:14 GMT; path=/
Set-Cookie: phpbb2mysql_sid=a4c25478069a56787cb47195438c3285; path=/

HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 09:54:14 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 09:54:14 GMT; path=/
Set-Cookie: phpbb2mysql_sid=9066b3688621a9f1c2e32fd7be8d953f; path=/

HTTP/1.1 404 Not Found

--------------------------------------------------------------------------------
SITE : http://knet/
Exclude paths :
- @NONE@
HTTP/1.1 200 OK
Connection: close
Date: Mon, 12 Jul 2004 09:54:19 GMT
Server: Microsoft-IIS/6.0
Content-type: text/html
X-Powered-By: PHP/4.3.6
Set-Cookie: phpbb2mysql_data=a%3A0%3A%7B%7D; expires=Tue, 12-Jul-2005 09:54:19 GMT; path=/
Set-Cookie: phpbb2mysql_sid=dd7951c2b01cc6f4830f7eb1a0dc2589; path=/

[tick symbol]1:http://knet/
(time : 00:00:06)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://knet/
Optimizing tables...
Indexing complete !


Does not seem to go any further than the index page, despite there being plenty of links to other pages.

Glad we're getting somewhere though!

What do I need to do next, to get it to spider the entire site?

Thanks,
Dave.

Charter
07-12-2004, 02:11 AM
Hi. Look in the text_content directory, at the file with the highest number. What is in that file, the contents from the main page or something else? If something else, is it showing a 404 message page? If so, then add a robots.txt page to web root and see if it will go.

simple robots.txt file:

User-agent: *
Disallow:
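Editor's note: an empty Disallow: line permits everything, which is why the file above is a safe "allow all" default. The sketch below checks that with Python's standard urllib.robotparser rather than PhpDig's own parser, so it only illustrates the robots.txt convention. The helper name allowed is made up for this example, and the /newsboard/ rule is borrowed from later in the thread for contrast.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The file from the post above: an empty Disallow blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# A contrasting rule that excludes one directory.
BLOCK_FORUM = "User-agent: *\nDisallow: /newsboard/\n"

print(allowed(ALLOW_ALL, "http://knet/newsboard/index.php"))    # True
print(allowed(BLOCK_FORUM, "http://knet/newsboard/index.php"))  # False
print(allowed(BLOCK_FORUM, "http://knet/index.php"))            # True
```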

davenewt
07-12-2004, 02:20 AM
Hi. There's only a 1.txt file, and it's the content of the index page. This includes text which is a link in the HTML, so it's obviously missing something...

Thanks,
Dave.

Charter
07-12-2004, 02:24 AM
What's in the file, or at least the suspicious looking piece, and how does it compare to the HTML of the index page?

davenewt
07-12-2004, 02:35 AM
In the HTML:

<span class="xhead">Latest News <a href="/newsboard/index.php">[View Archive]</a></span>


In 1.txt:

Latest News [View Archive]

Charter
07-12-2004, 03:36 AM
That's as it should work, strip away the tags and leave the text. PhpDig looks for links prior to that. Is there anything showing in your error logs?
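Editor's note: the order described above (collect links first, then strip tags and keep the text) can be sketched as follows. This uses Python's standard html.parser, not PhpDig's actual code, and the names LinkAndText and links_and_text are invented for the illustration. The input is the HTML fragment quoted earlier in the thread.

```python
from html.parser import HTMLParser


class LinkAndText(HTMLParser):
    """Collect href values from <a> tags and the plain text between tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Links are recorded while the tags are still present.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Only the text outside the tags is kept.
        self.text.append(data)


def links_and_text(html):
    parser = LinkAndText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text reads as one line.
    return parser.links, " ".join("".join(parser.text).split())


html = ('<span class="xhead">Latest News '
        '<a href="/newsboard/index.php">[View Archive]</a></span>')
links, text = links_and_text(html)
print(links)  # ['/newsboard/index.php']
print(text)   # Latest News [View Archive]
```

So a text file containing "Latest News [View Archive]" with no URL in it is expected; whether the link was picked up has to be judged from the spider output, not from the stored text.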

davenewt
07-12-2004, 06:19 AM
I don't see any error log files within the phpdig directories...?

davenewt
07-12-2004, 06:25 AM
Hang on, just tried re-indexing the root and it seems to be working... sorry, not sure how or why, but I'm getting a ton of output, which is a good sign :)

Will report back...

Charter
07-12-2004, 06:30 AM
Server error logs... ;)

davenewt
07-12-2004, 06:56 AM
Server logs... thought that's what you meant... anyway, seems I had to "Dig This" to add the site to start with, then use the "Update Form" part of the Admin interface and click the green tick icon next to "Root" to re-index.

However...

Doing this initially started to pick up all links to the forum directory, so I added a robots.txt file to exclude this directory.

Now I have discovered the main problem. The only navigation on the site so far is via a dynamic javascript menu which is added to the HTML at runtime. There are no links (except to the aforementioned forums) in the body of the HTML.

How can I spider sub-directories which aren't linked to from the root index file? Do I need to add a line of navigation to the bottom of the index page which will make the spider aware of the subdirectories, or can I tell the spider to look for them some other way?

Almost there :) Thanks so much for the help.

Cheers,
Dave.
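Editor's note: one of the options raised above is a plain line of static links at the bottom of the index page, so that a spider which does not run JavaScript still has something to follow. A minimal sketch of generating such a line is below. The function name nav_line and the directory names /docs/ and /reports/ are hypothetical and not taken from the site in question.

```python
from html import escape


def nav_line(paths):
    """Build a single line of static <a> links for the given site paths."""
    anchors = []
    for path in paths:
        # Use the path without slashes as the label, or "home" for the root.
        label = path.strip("/") or "home"
        anchors.append('<a href="{0}">{1}</a>'.format(
            escape(path, quote=True), escape(label)))
    return " | ".join(anchors)


# Hypothetical subdirectories that are otherwise reachable only via the JS menu.
print(nav_line(["/docs/", "/reports/"]))
```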

Charter
07-12-2004, 07:33 PM
>> How can I spider sub-directories which aren't linked to from the root index file?

Check out PhpDig version 1.8.2... :D

davenewt
07-13-2004, 12:15 AM
To update from 1.8.0 can I just copy over all the files, or is there a safer process? Just checking in case I'm about to screw up something else :)

Charter
07-13-2004, 12:03 PM
Hi. Yes, copy over all the files, add the new tables, and then use the new connect.php and config.php files.

davenewt
07-14-2004, 12:40 AM
Okay, am doing that... but I still don't see an easy way to index the entire site when my index.php file contains no static HTML links to my subdirectories' pages.

I found the line:
define(LIMIT_TO_DIRECTORY,true); //limit index to directory, no sub dir, set in URL
in config.php, which I thought might have something to do with it. There were no single quotes around the constant name, so I added them and changed the line to:
define('LIMIT_TO_DIRECTORY',false); //limit index to directory, no sub dir, set in URL
to see if it made any difference, but no.

So even with this latest version, it seems I still need to spider all the subdirectories manually, yes?

davenewt
07-14-2004, 12:53 AM
Hmmm, there is no longer anything being put into my text_content directory. Nor is the spidering process picking up on basic links in the index file AGAIN. It seems we're back to the same stage as the last post on page 1 of this discussion. Which magically fixed itself (see first post of page 2) seemingly without me doing anything (that I remember). So I don't know what to do next. Back to square one :(

Charter
07-14-2004, 01:02 AM
Hi. Set define('LIMIT_TO_DIRECTORY',true); and then index http://www.domain.com/dir1/dir2/ or some such thing, assuming the page at dir1/dir2/ has links to other pages.

davenewt
07-14-2004, 02:02 AM
Ok, now I get:

SITE : http://knet/
Exclude paths :
- newsboard/
1:http://knet/index.php
(time : 00:00:05)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://knet/index.php
Optimizing tables...
Indexing complete !

BUT there's a link to test.php in there which STILL isn't found (this is the problem I referred to above, which 'magically' fixed itself). Forget about subdirectories for a moment, I need it to start picking up bog-standard links again first :(

Thanks again.

Charter
07-14-2004, 02:23 AM
Hi. If you do an update after indexing the main page, does it start up?

davenewt
07-14-2004, 02:30 AM
Nope. Nothing. When I click on the Update Form button (after selecting the site, obviously) I get:

Found tree :
Click on the cross to delete the branch
Click on the green sign to update it
Click on the noway sign to exclude from future indexings
Warning ! Exclude will delete indexed entries


[Back] to admin interface.

(i.e. nothing under "Found tree :" - no documents, nothing :( )

Charter
07-14-2004, 02:32 AM
Is http://knet/ pointed to 127.0.0.1 in the Hosts file?
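Editor's note: the question above concerns the hostname knet, which is what a Hosts file entry maps to an address. A quick way to check is to read the Hosts file and look the name up. The sketch below parses Hosts-style text in Python; parse_hosts is a made-up helper and the SAMPLE content is hypothetical, not the poster's actual file.

```python
def parse_hosts(text):
    """Map each hostname in Hosts-file text to the first address listed for it."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name.
            mapping.setdefault(name.lower(), address)
    return mapping


# Hypothetical Hosts content with knet pointed at the loopback address.
SAMPLE = """
# sample entries
127.0.0.1   localhost
127.0.0.1   knet   # added so the server can reach itself by name
"""

hosts = parse_hosts(SAMPLE)
print(hosts.get("knet"))  # 127.0.0.1
```

On Windows the file normally lives under the system32\drivers\etc directory; its text can be passed to parse_hosts unchanged.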

davenewt
07-14-2004, 02:44 AM
No, only localhost. Maybe I should spider localhost then (seeing as I'm performing all this on the server)?

Charter
07-14-2004, 02:45 AM
Does it work with localhost?

davenewt
07-14-2004, 02:46 AM
...nope, that results in exactly the same result. Spiders the page, finds only a link to that page, and no pages (not even Root) is shown on the "update form" page :(

davenewt
07-14-2004, 02:47 AM
...and my command of the English language is also deteriorating rapidly :)

Charter
07-14-2004, 03:00 AM
Thoughts...

Set http://knet/ to 127.0.0.1 in Hosts file
CD to the admin dir and spider from command line

php -f spider.php http://knet/

Turn off renice command in config

davenewt
07-14-2004, 03:17 AM
nope, nope and nope again :(

I wish I knew what happened to make it work last time!

davenewt
07-14-2004, 03:27 AM
Interestingly, using the previous phpdig install (I renamed the directory "phpdig1" when I installed the new version in "phpdig"), things are very different. Using the admin script, spidering localhost, this is what I get:

SITE : http://localhost/
Exclude paths :
- newsboard/
1:http://localhost/
(time : 00:00:06)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://localhost/
Optimizing tables...
Indexing complete !

Now, after the "- newsboard" line it picks up on the URL of the page and displays the green tick symbol which I don't remember seeing with the new install.

I can then go back to http://localhost using "Update Form" and lo and behold, a "[x] [tick] Root [arrow]" part appears!

When I click on the tick to re-index, it starts picking up the links and spidering properly.

Does that give you any clues as to what might be different with the new install?