PhpDig.net

Old 06-15-2004, 12:27 AM   #1
nathansc
Green Mole
 
Join Date: Jun 2004
Posts: 3
spidering problem

I'm installing PhpDig and I really like it, but I know I'm doing something wrong here. When I go to the admin interface and enter the URL I want to spider, it indexes only the index page and goes no further. I am using a search depth of 5 or 6.

First: it only seems to be spidering the index. It doesn't index the pages that are linked off of the index.

Second: I wonder whether the problem is that my site is itself a subdirectory: http://depts.washington.edu/vei/

Third: My site has several dynamically generated PHP pages (a course catalog). The spider is definitely not indexing these. These pages are about three links down from the index.

Almost all the pages on my site are at least partially dynamic. They all have a dynamic nav bar which lives in a different directory.

Thanks for any help you can give.

Nathan
Old 06-15-2004, 01:55 AM   #2
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Hi Nathan - welcome to the site.

If you search around the forums here (troubleshooting particularly), you will find many answers to this question. Many of them are titled "no links in temporary table", "links found: 0" and similar. Once you've read all of the previous answers, if it still doesn't work, we'll be happy to tackle the new problem. Good luck.
Old 06-16-2004, 12:24 AM   #3
nathansc
Green Mole
 
Join Date: Jun 2004
Posts: 3
Hi, thank you for the prompt response. I'm sorry to keep bugging you about something which I know has been answered millions of times on here, but I just can't seem to find a response that helps me. I spent about 10 hours fiddling with this and reading posts yesterday and today. So, I'll tell you what I've done.

First of all, I'm running:
mysql 4.0.15
PHP version 4.2.0
AIX 4.3.3

The site I'm working on is:
http://depts.washington.edu/vei/

One of the first things that I tried was creating a robots.txt file. That didn't work, and I found another post suggesting that I erase the robots.txt file, so I did that tonight. I also replaced this line

$user_agent = $regs[1];

with

if ($regs[1] == "*") {
    $user_agent = "'$regs[1]'";
} else {
    $user_agent = $regs[1];
}

as another post suggested.

I also tried indexing http://www.php.net as a test, and it didn't work. I also found lots of other posts talking about this same problem, suggesting changes to the php.ini file, but none of those applied, as I already had the correct settings in php.ini. So... I'm sure there's some simple answer that I'm just not getting, so I appreciate the help.
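For anyone wanting to confirm which php.ini values are actually in effect rather than trusting the file, here is a minimal sketch. The helper name report_settings and the directive names in the list are illustrative guesses, not settings named in this thread; swap in whichever directives the other posts mention.

```php
<?php
// Hypothetical helper: report the effective value of each named directive.
// ini_get() returns false when a directive does not exist in this PHP build.
function report_settings(array $names): array
{
    $report = [];
    foreach ($names as $name) {
        $value = ini_get($name);
        $report[$name] = ($value === false) ? '(not available)' : $value;
    }
    return $report;
}

// The directive names below are assumptions chosen for illustration only.
$settings = report_settings(['allow_url_fopen', 'max_execution_time', 'magic_quotes_runtime']);
foreach ($settings as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
```

Running this through the same web server that runs the spider matters, since the command-line PHP binary may read a different php.ini.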

When I tried to index php.net, this is the result I get.

SITE : http://www.php.net/
Exclude paths :
- '*'
- @NONE@
HTTP/1.1 200 OK
Date: Wed, 16 Jun 2004 07:15:59 GMT
Server: Apache/1.3.26 (Unix) mod_gzip/1.3.26.1a PHP/4.3.3-dev
X-Powered-By: PHP/4.3.3-dev
Last-Modified: Wed, 16 Jun 2004 07:11:02 GMT
Content-language: en
Set-Cookie: COUNTRY=USA%2C140.142.16.139; expires=Wed, 23-Jun-04 07:15:59 GMT; path=/; domain=.php.net
Connection: close
Content-Type: text/html;charset=ISO-8859-1

1:http://www.php.net/\1/
(time : 00:00:31)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://www.php.net/\1/
Optimizing tables...
Indexing complete !

When I try to index my site (http://depts.washington.edu/vei/index.php), I get the same result:

SITE : http://depts.washington.edu/
Exclude paths :
- '*'
- @NONE@
HTTP/1.1 200 OK
Date: Wed, 16 Jun 2004 07:23:26 GMT
Server: Apache/1.3.29 (Unix) mod_pubcookie/a5/1.77.2.4 mod_uwa/2.2 Resin/2.1.8 mod_fastcgi/2.2.12 mod_ssl/2.8.16 OpenSSL/0.9.7a
X-Powered-By: PHP/4.2.0
Content-Type: text/html

1:http://depts.washington.edu/vei/
(time : 00:00:09)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://depts.washington.edu/vei/
Optimizing tables...
Indexing complete !

Thanks again.

Nathan
Old 06-17-2004, 04:25 PM   #4
nathansc
Green Mole
 
Join Date: Jun 2004
Posts: 3
Hi... I figured my problem out, and it wasn't something that I saw in the newsgroup, so I'll leave it here in the hopes that someday it will help someone else.

I figured this out by just plain old debugging the code, so there may be a more direct way to do this, but this is what worked for me.

What was happening was that the function phpdigExplore was not returning the URLs contained in my page. The reason was that I have magic_quotes = On, so the eregi call was failing to pick up the links. So before the line:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {

I added this line of code:
$eval = stripslashes($eval);
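
A minimal sketch of why the extra line helps, under some assumptions: the pattern below is a much-simplified stand-in for the href branch of the expression above, not PhpDig's real one; preg_match_all is used because eregi is not available in current PHP; and addslashes simulates what magic quotes would do to the page content. The function name extract_links is hypothetical.

```php
<?php
// Hypothetical, simplified link extractor (NOT the PhpDig pattern).
// It accepts an optional opening quote, then a run of URL characters.
function extract_links(string $html): array
{
    $pattern = '~href\s*=\s*["\']?([:%/?=&;,._a-zA-Z0-9|+-]*)~i';
    preg_match_all($pattern, $html, $matches);
    // Keep only non-empty captures; an empty capture means no usable URL.
    return array_values(array_filter($matches[1], fn($url) => $url !== ''));
}

$clean   = '<a href="courses/list.php?id=3">Courses</a>';
// Simulate magic quotes: every quote gains a leading backslash.
$escaped = addslashes($clean);

// The backslash sits between "=" and the quote, so the URL capture is empty.
var_dump(extract_links($escaped));
// After stripslashes() the quotes are plain again and the URL is captured.
var_dump(extract_links(stripslashes($escaped)));
```

In this sketch the pattern still matches the escaped text but captures an empty URL, which would be consistent with the spider finding no links to follow.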

Hope this helps someone.

Nathan