spidering problem


nathansc
06-14-2004, 11:27 PM
I'm installing phpdig and I really like it, but I know I'm doing something wrong here. When I go to the admin interface and enter the URL I want to spider, it only indexes the index page and goes no further. I am using 5 or 6 for the search depth.

First: it only seems to be spidering the index. It doesn't index the pages that are linked off of the index.

Second: I wonder whether the problem is that my site itself lives in a subdirectory: http://depts.washington.edu/vei/

Third: My site has several dynamically generated PHP pages (a course catalog). The spider is definitely not indexing these. These pages are about three links down from the index.

Almost all the pages on my site are at least partially dynamic. They all have a dynamic nav bar which lives in a different directory.

Thanks for any help you can give.

Nathan

bloodjelly
06-15-2004, 12:55 AM
Hi Nathan - welcome to the site.

If you search around the forums here (troubleshooting particularly), you will find many answers to this question. Many of them are titled "no links in temporary table", "links found: 0" and similar. Once you've read all of the previous answers, if it still doesn't work, we'll be happy to tackle the new problem. Good luck.

nathansc
06-15-2004, 11:24 PM
Hi, thank you for the prompt response. I'm sorry to keep bugging you about something which I know has been answered millions of times on here, but I just can't seem to find a response which helps me. I spent about 10 hours fiddling with this and reading posts yesterday and today. So, I'll tell you what I've done.

First of all, I'm running:
mysql 4.0.15
PHP version 4.2.0
AIX 4.3.3

The site I'm working on is:
http://depts.washington.edu/vei/

One of the first things that I tried was creating a robots.txt file. That didn't work, and I found another post suggesting to erase the robots.txt file, so I did that tonight. I also replaced this line

$user_agent = $regs[1];

with

if ($regs[1] == "*") {
$user_agent = "'$regs[1]'";
} else {
$user_agent = $regs[1];
}

as another post suggested.

I also tried indexing http://www.php.net as a test, and it didn't work. I also found lots of other posts talking about this same problem, suggesting changes to the php.ini file, but none of those applied, as I already had the correct settings in php.ini. So... I'm sure there's some simple answer that I'm just not getting, so I appreciate the help.

When I tried to index php.net, this is the result I get.

SITE : http://www.php.net/
Exclude paths :
- '*'
- @NONE@
HTTP/1.1 200 OK
Date: Wed, 16 Jun 2004 07:15:59 GMT
Server: Apache/1.3.26 (Unix) mod_gzip/1.3.26.1a PHP/4.3.3-dev
X-Powered-By: PHP/4.3.3-dev
Last-Modified: Wed, 16 Jun 2004 07:11:02 GMT
Content-language: en
Set-Cookie: COUNTRY=USA%2C140.142.16.139; expires=Wed, 23-Jun-04 07:15:59 GMT; path=/; domain=.php.net
Connection: close
Content-Type: text/html;charset=ISO-8859-1

1:http://www.php.net/\1/
(time : 00:00:31)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://www.php.net/\1/
Optimizing tables...
Indexing complete !

When I try to index my site (http://depts.washington.edu/vei/index.php), I get the same result:

SITE : http://depts.washington.edu/
Exclude paths :
- '*'
- @NONE@
HTTP/1.1 200 OK
Date: Wed, 16 Jun 2004 07:23:26 GMT
Server: Apache/1.3.29 (Unix) mod_pubcookie/a5/1.77.2.4 mod_uwa/2.2 Resin/2.1.8 mod_fastcgi/2.2.12 mod_ssl/2.8.16 OpenSSL/0.9.7a
X-Powered-By: PHP/4.2.0
Content-Type: text/html

1:http://depts.washington.edu/vei/
(time : 00:00:09)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://depts.washington.edu/vei/
Optimizing tables...
Indexing complete !

Thanks again.

Nathan

nathansc
06-17-2004, 03:25 PM
Hi... I figured my problem out, and it wasn't something that I saw in the newsgroup, so I'll leave it here in the hopes that someday it will help someone else.

I figured this out by just plain old debugging the code, so there may be a more direct way to do this, but this is what worked for me.

What was happening was that the function phpdigExplore was not returning the URLs contained within my page. The reason was that I have magic_quotes = On, so the quotes in the page content were backslash-escaped and the eregi call was failing to pick up the links. So before the line:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {

I added this line of code:
$eval = stripslashes($eval);
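
For anyone who wants to see the effect in isolation, here is a minimal sketch, not phpdig's actual code. It assumes a made-up helper called extractLinks, a made-up sample page, and a much simpler pattern than the one above (using preg_match_all rather than eregi). addslashes() stands in for what magic quotes does to the content.

```php
<?php
// Hypothetical helper, not part of phpdig: pull href targets out of HTML
// with a deliberately simplified pattern.
function extractLinks(string $html): array
{
    preg_match_all('/href\s*=\s*[\'"]?([^\'"\s>]+)/i', $html, $matches);
    return $matches[1];
}

// Sample page content, made up for illustration.
$page = '<a href="courses/list.php">Catalog</a> <a href="about.php">About</a>';

// addslashes() mimics the escaping magic quotes applies:
// every quote in the content gets a backslash in front of it.
$escaped = addslashes($page);

// On the escaped content the pattern meets a backslash where it expects
// a quote, so it captures stray backslashes instead of the URLs.
$broken = extractLinks($escaped);

// stripslashes() undoes the escaping, and the real link targets come back.
$fixed = extractLinks(stripslashes($escaped));

print_r($broken);
print_r($fixed);
```

Note that this sketch needs a current PHP (7 or later) because of the type declarations. The ereg functions and magic quotes were both removed from later PHP versions, so the original fix only matters on older installs like mine.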

Hope this helps someone.

Nathan