0 links found


xibalba
03-09-2004, 09:00 AM
Hi, I applied the patch from this thread (http://www.phpdig.net/showthread.php?threadid=573), and I'm still getting 0 links found. Here is the stdout from the command line:

%php -f spider.php forceall
47472: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://localhost/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
Optimizing tables...
Indexing complete !
%

I'm running FreeBSD 4.9 with Apache/1.3.29 (Unix) and PHP/4.3.4.

Any suggestions?

Charter
03-09-2004, 10:15 AM
Hi xibalba, and welcome to PhpDig.net!

Perhaps something in this (http://www.phpdig.net/showthread.php?threadid=519) thread might help.

Below is the output I get using a search depth of one:

SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
1:http://maggiv8.funpic.de/
(time : 00:00:15)
+ + +
level 1...
2:http://maggiv8.funpic.de/www/
(time : 00:00:27)

3:http://maggiv8.funpic.de/search.php
(time : 00:00:33)

4:http://maggiv8.funpic.de/phpinfo.php
(time : 00:00:41)

No link in temporary table

--------------------------------------------------------------------------------

links found : 4
http://maggiv8.funpic.de/
http://maggiv8.funpic.de/www/
http://maggiv8.funpic.de/search.php
http://maggiv8.funpic.de/phpinfo.php
Optimizing tables...
Indexing complete !


SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
1:http://rbhs.ath.cx/
(time : 00:00:09)
+ + + + +
level 1...
2:http://rbhs.ath.cx/uebimiau/
(time : 00:00:23)

3:http://rbhs.ath.cx/webalizer/
(time : 00:00:29)

4:http://rbhs.ath.cx/moregroupware/
(time : 00:00:35)

5:http://rbhs.ath.cx/phpMyAdmin/
(time : 00:00:41)

6:http://rbhs.ath.cx/phpSysInfo/
(time : 00:00:49)

No link in temporary table

--------------------------------------------------------------------------------

links found : 6
http://rbhs.ath.cx/
http://rbhs.ath.cx/uebimiau/
http://rbhs.ath.cx/webalizer/
http://rbhs.ath.cx/moregroupware/
http://rbhs.ath.cx/phpMyAdmin/
http://rbhs.ath.cx/phpSysInfo/
Optimizing tables...
Indexing complete !

xibalba
03-09-2004, 11:23 AM
Should I be careful with how high I set the search depth?

Even with the search depth set to one for both freebsd.org and rbhs.ath.cx, I get the following output:

%php -f spider.php forceall
47723: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://rbhs.ath.cx/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0

-----------------------------
SITE : http://freebsd.org/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
Optimizing tables...
Indexing complete !
%

Perhaps something is wrong in my configuration.
I read over the other thread you linked me to and couldn't find anything in there that would seem to fix this problem.


Weird... it seems to crawl correctly if I add a URI via the command line:
%php -f spider.php http://maggiv8.funpic.de/
47732: old priority 0, new priority 18

Spidering in progress...
-----------------------------
SITE : http://maggiv8.funpic.de/
Exclude paths :
- @NONE@
+1:http://maggiv8.funpic.de/
(time : 00:00:07)
+ + +
level 1...
+2:http://maggiv8.funpic.de/phpinfo.php
(time : 00:00:28)

+3:http://maggiv8.funpic.de/search.php
(time : 00:00:35)
+4:http://maggiv8.funpic.de/www/
(time : 00:00:40)
+ + + + + + + + + + + +
level 2...
.....

Charter
03-09-2004, 11:39 AM
Hi. The forceall option is meant to force a reindex of sites that are already indexed, regardless of the default number of days before reindexing. If a site hasn't been indexed before, forceall won't index it.
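
As a rough illustration of that order of operations, here is a minimal sketch: index each new site once by URL, then use forceall for later reindexing. The helper names (run, index_new_sites, reindex_all) and the SPIDER_RUN switch are made up for this example and are not part of PhpDig; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: run, index_new_sites, reindex_all and SPIDER_RUN are
# hypothetical names for this example, not PhpDig features.

# Print the command by default; execute it only when SPIDER_RUN=1.
run() {
    if [ "${SPIDER_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# First pass: sites that have never been indexed must be given by URL.
index_new_sites() {
    for url in "$@"; do
        run php -f spider.php "$url"
    done
}

# Later passes: forceall reindexes only sites that are already indexed.
reindex_all() {
    run php -f spider.php forceall
}

index_new_sites http://rbhs.ath.cx/ http://freebsd.org/
reindex_all
```

Run from the directory that holds spider.php, with SPIDER_RUN=1 set, this should perform the initial crawl of both sites and then a forced reindex.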

xibalba
03-09-2004, 11:50 AM
Thanks for the help, Charter. On a tangent, is it possible to set up PhpDig in a distributed fashion?

Say I want to crawl a huge domain, www.example.com, with multiple machines crawling that domain. Is there currently a way to set PhpDig up in this style?

Charter
03-09-2004, 12:45 PM
Hi. Some users have run spider.php on different (sub)domains at the same time using the same database tables without incident. However, PhpDig doesn't specifically account for multithreading (http://www.google.com/search?q=what+is+multithreading) issues.

If you want to try running PhpDig in a distributed fashion on the same domain, perhaps set the following in the config.php file, where X is one or two:

define('SPIDER_MAX_LIMIT',X); //max recurse levels in spider
define('SPIDER_DEFAULT_LIMIT',X); //default value
define('RESPIDER_LIMIT',X); //recurse limit for update
define('LIMIT_DAYS',0); //default days before reindex a page

and try entering the site at different spots, for example:

prompt> php -f spider.php http://www.domain.com/dir1/ &
prompt> php -f spider.php http://www.domain.com/dir2/ &

The & backgrounds (http://www.slackware.com/book/index.php?source=c2628.html) the process and returns you to the shell prompt.
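
The two commands above could be wrapped in a small script along these lines. This is only a sketch: crawl_entry_points and SPIDER_CMD are names invented for the example, and SPIDER_CMD defaults to echoing the command so nothing is actually crawled until you override it.

```shell
#!/bin/sh
# Sketch only: SPIDER_CMD and crawl_entry_points are hypothetical names.
# The default just echoes each command; set SPIDER_CMD="php -f spider.php"
# (from the directory containing spider.php) to really run the spider.
SPIDER_CMD="${SPIDER_CMD:-echo php -f spider.php}"

# Start one background spider per entry point, then wait for all of them.
crawl_entry_points() {
    for url in "$@"; do
        # Left unquoted on purpose so the command and its options split into words.
        $SPIDER_CMD "$url" &
    done
    wait    # returns once every background job started above has exited
}

crawl_entry_points http://www.domain.com/dir1/ http://www.domain.com/dir2/
```

Because the spiders run concurrently, their output can interleave in any order. The same caveat about shared database tables applies here as above.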