PhpDig.net

xibalba 03-09-2004 09:00 AM

0 links found
 
Hi, I applied the patch from the http://www.phpdig.net/showthread.php?threadid=573 thread, and I'm still getting 0 links found. Here is the stdout from the command line.

%php -f spider.php forceall
47472: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://localhost/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
Optimizing tables...
Indexing complete !
%

I'm running FreeBSD 4.9 with Apache/1.3.29 (Unix) and PHP/4.3.4.

Any suggestions?

Charter 03-09-2004 10:15 AM

Hi xibalba, and welcome to PhpDig.net!

Perhaps something in this thread might help.

Below is output using search depth one:

SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
1:http://maggiv8.funpic.de/
(time : 00:00:15)
+ + +
level 1...
2:http://maggiv8.funpic.de/www/
(time : 00:00:27)

3:http://maggiv8.funpic.de/search.php
(time : 00:00:33)

4:http://maggiv8.funpic.de/phpinfo.php
(time : 00:00:41)

No link in temporary table

--------------------------------------------------------------------------------

links found : 4
http://maggiv8.funpic.de/
http://maggiv8.funpic.de/www/
http://maggiv8.funpic.de/search.php
http://maggiv8.funpic.de/phpinfo.php
Optimizing tables...
Indexing complete !


SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
1:http://rbhs.ath.cx/
(time : 00:00:09)
+ + + + +
level 1...
2:http://rbhs.ath.cx/uebimiau/
(time : 00:00:23)

3:http://rbhs.ath.cx/webalizer/
(time : 00:00:29)

4:http://rbhs.ath.cx/moregroupware/
(time : 00:00:35)

5:http://rbhs.ath.cx/phpMyAdmin/
(time : 00:00:41)

6:http://rbhs.ath.cx/phpSysInfo/
(time : 00:00:49)

No link in temporary table

--------------------------------------------------------------------------------

links found : 6
http://rbhs.ath.cx/
http://rbhs.ath.cx/uebimiau/
http://rbhs.ath.cx/webalizer/
http://rbhs.ath.cx/moregroupware/
http://rbhs.ath.cx/phpMyAdmin/
http://rbhs.ath.cx/phpSysInfo/
Optimizing tables...
Indexing complete !

xibalba 03-09-2004 11:23 AM

search depth
 
Should I be careful with how high I set the search depth?

Even with the search depth set to one for both freebsd.org and rbhs.ath.cx, I get the following output.

%php -f spider.php forceall
47723: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://freebsd.org/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
Optimizing tables...
Indexing complete !
%

Perhaps something is wrong in my configuration.
I read over the other thread you linked me to and couldn't find anything in there that would fix this problem.


Weird... it seems to crawl correctly if I add a URI via the command line:
%php -f spider.php http://maggiv8.funpic.de/
47732: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
+1:http://maggiv8.funpic.de/
(time : 00:00:07)
+ + +
level 1...
+2:http://maggiv8.funpic.de/phpinfo.php
(time : 00:00:28)

+3:http://maggiv8.funpic.de/search.php
(time : 00:00:35)
+4:http://maggiv8.funpic.de/www/
(time : 00:00:40)
+ + + + + + + + + + + +
level 2...
.....

Charter 03-09-2004 11:39 AM

Hi. The forceall option is meant to force a reindex of sites that are already indexed, regardless of the default number of days before reindexing. If a site hasn't been indexed before, forceall won't index it.
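
A minimal sketch of that two-step workflow, assuming spider.php is run from the same directory as in the commands earlier in this thread. The function names and the SPIDER_CMD variable are hypothetical, made up for illustration; they are not part of PhpDig.

```shell
#!/bin/sh
# SPIDER_CMD is a hypothetical override so the sketch can be dry-run
# (for example SPIDER_CMD=echo). By default it runs the spider the same
# way as the commands shown in this thread.
SPIDER_CMD="${SPIDER_CMD:-php -f spider.php}"

# First pass: index each new site by passing its URL to the spider.
# Stops at the first URL for which the spider exits with an error.
index_new_sites() {
    for url in "$@"; do
        $SPIDER_CMD "$url" || return 1
    done
}

# Later passes: ask the spider to reindex the sites it already knows.
reindex_all() {
    $SPIDER_CMD forceall
}
```

For example, `index_new_sites http://rbhs.ath.cx/ http://freebsd.org/` would be run once, and `reindex_all` on later runs.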

xibalba 03-09-2004 11:50 AM

Thanks for the help, Charter. On a tangent, is it possible to set up PhpDig in a distributed fashion?

Say I want to crawl a huge domain, www.example.com, with multiple machines crawling it. Is there currently a way to set PhpDig up like this?

Charter 03-09-2004 12:45 PM

Hi. Some users have run spider.php on different (sub)domains at the same time using the same database tables without incident. However, PhpDig doesn't specifically account for multithreading issues.

If you want to try running PhpDig in a distributed fashion on the same domain, perhaps set the following in the config.php file, where X is one or two:
PHP Code:

define('SPIDER_MAX_LIMIT',X);         //max recurse levels in spider
define('SPIDER_DEFAULT_LIMIT',X);     //default value
define('RESPIDER_LIMIT',X);           //recurse limit for update
define('LIMIT_DAYS',0);               //default days before reindex a page 

and try entering the site at different spots, for example:
Code:

prompt> php -f spider.php http://www.domain.com/dir1/ &
prompt> php -f spider.php http://www.domain.com/dir2/ &

The & backgrounds the process and returns you to the shell prompt.
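
If there are more than a couple of entry points, the same idea can be wrapped in a small shell function. This is only a sketch under the assumptions above: the function name and the SPIDER_CMD variable are hypothetical, and PhpDig itself does nothing to coordinate the parallel processes.

```shell
#!/bin/sh
# SPIDER_CMD is a hypothetical override so the sketch can be dry-run
# (for example SPIDER_CMD=echo). By default it runs the spider the same
# way as the commands above.
SPIDER_CMD="${SPIDER_CMD:-php -f spider.php}"

# Start one background spider per entry point, then block until they
# have all exited. Note that a bare `wait` waits for every background
# job of the current shell, not only the ones started here.
spider_in_parallel() {
    for url in "$@"; do
        $SPIDER_CMD "$url" &
    done
    wait
}
```

It would be called as `spider_in_parallel http://www.domain.com/dir1/ http://www.domain.com/dir2/`. The spiders' output is interleaved on the terminal, so redirecting each run to its own log file may be worth adding.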


All times are GMT -8.