Crawling Options

09-26-2003, 08:58 AM
I just finished configuring PhpDig on W2K/Apache 1.3x.
It works great, awesome program. But I have a question about the crawling functionality:
From the admin index page you can enter domains to crawl, but is there a way to have PhpDig crawl randomly? Or go beyond the specified domain?
Thank you,

09-26-2003, 05:38 PM
Hi. PhpDig crawls the links in a page, depending on the number of levels chosen. I'm not sure what you mean by "go beyond the specified domain" though. Do you mean crawl subdirectories?

David J Harmon
09-26-2003, 06:54 PM
Yes, I was thinking the same thing, whether it could do random crawls. I'm building up my db and I would like it to crawl all over the place. The only sites I don't want are porn sites.

By the way, can you tell me more about levels? I don't completely understand them...

David J Harmon
Cappuccino David

09-26-2003, 07:06 PM
Currently PhpDig crawls links from a page.

Levels mean how many links deep to follow from the starting page, looking for more links. For example, level one means to follow only the links found on the starting page, but not the links found on the pages those links lead to.

Confused??? ;)

Level One Example:

- a1.com
-- a11.com
-- a12.com
- a2.com
-- a21.com
-- a22.com

So, indexing a.com at level one will crawl to a1.com and a2.com but no further. The pages one step deeper (a11.com, a12.com, a21.com, a22.com) would only be reached at level two.
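
To make the idea of levels concrete, here is a rough shell sketch of a depth-limited walk over the toy link tree above. It only illustrates the description in this post and is not PhpDig's actual code: links_of and crawl are made-up names, and links_of stands in for "fetch the page and pull out its links".

```shell
#!/bin/sh
# links_of PAGE: hypothetical stand-in for fetching PAGE and extracting links.
# The link tree is the toy example from this post, with a.com as the start page.
links_of() {
    case "$1" in
        a.com)  echo "a1.com a2.com" ;;
        a1.com) echo "a11.com a12.com" ;;
        a2.com) echo "a21.com a22.com" ;;
    esac
}

# crawl PAGE LEVELS: print PAGE, then follow its links while levels remain.
# Only the positional parameters are used, so each recursive call keeps its
# own PAGE and LEVELS values.
crawl() {
    echo "$1"
    [ "$2" -gt 0 ] || return 0
    for link in $(links_of "$1"); do
        crawl "$link" $(( $2 - 1 ))
    done
}

# Level one: the start page plus the pages it links to directly.
crawl a.com 1
```

Running crawl a.com 2 instead also visits a11.com, a12.com, a21.com and a22.com. The sketch has no "already visited" check, which is fine for a tree like this one but would loop on real sites where pages link back to each other.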

David J Harmon
09-26-2003, 07:34 PM
Got it... thanks

I'm still new to PHP, but I see it is a powerful language. I've only used HTML so far, but I'm moving to PHP.

Is there any way to give the spider a list of URLs so he (yes, a he) can go out when I'm not at my computer? I'm trying to build up my db.

David J Harmon
Cappuccino David

09-26-2003, 07:43 PM
Hi. You could set up a cron job or make a text file with URLs and use shell access to crawl.

David J Harmon
09-26-2003, 08:58 PM
I didn't think about that. Could you give me an example?

09-27-2003, 05:59 PM
On *nix, say you want to run a cron job that spiders on the 1st and 15th of every month.

First make a list of URLs, one per line, in a file called cronlist.txt

Then create a file called cronjob.txt that contains the following on one line, editing the paths to php and to spider.php:

0 0 1,15 * * /path/to/php -f /path/to/admin/spider.php cronlist.txt >> spider.log

Put cronlist.txt and cronjob.txt in the same directory as spider.php and then type the following shell command from that same directory:

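crontab cronjob.txt

Note that this replaces any crontab you already have. If you want to delete the cron job, type crontab -r from shell (some cron implementations use crontab -d instead); this removes all of your cron jobs, not just this one.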

A cron tutorial can be found at http://www.linuxhelp.net/guides/cron/.
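
The crontab line can also be built with a small script. This is only a sketch: make_cron_line is a made-up helper name and the paths are the same placeholders used in this post. The cd into the admin directory and the 2>&1 are assumptions added here. Cron normally starts jobs in your home directory, so relative names like cronlist.txt and spider.log may otherwise not resolve where you expect, and 2>&1 sends error output to the log as well. The install commands are left as comments because they change your crontab.

```shell
#!/bin/sh
# make_cron_line PHP_BIN ADMIN_DIR
# Hypothetical helper: prints one crontab line that runs spider.php at 00:00
# on the 1st and 15th of every month.
# Crontab fields: minute hour day-of-month month day-of-week command.
make_cron_line() {
    php_bin=$1
    admin_dir=$2
    # cd first (assumption) so cronlist.txt and spider.log live in admin_dir.
    printf '0 0 1,15 * * cd %s && %s -f spider.php cronlist.txt >> spider.log 2>&1\n' \
        "$admin_dir" "$php_bin"
}

# Preview the line with the placeholder paths.
make_cron_line /path/to/php /path/to/admin

# To install it, pick ONE of these (both modify your crontab):
#   make_cron_line /path/to/php /path/to/admin > cronjob.txt && crontab cronjob.txt
#       replaces your whole crontab with this single job
#   (crontab -l 2>/dev/null; make_cron_line /path/to/php /path/to/admin) | crontab -
#       keeps your existing jobs and appends this one
# Check the result with: crontab -l
```

The second install form is usually the safer choice if you already have other cron jobs set up.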

David J Harmon
09-27-2003, 10:51 PM
Thank you, that should help me out some. I still have a long way to go on my db, but it's coming along. It should be open to the public at the end of this week; from there I hope to build it up.

If there is anything I could help out with, let me know.

09-28-2003, 10:19 PM
I've done the whole cronjob.txt and cronlist.txt thing, and I guess this is a stupid question to ask, but how do you run a shell command on a Linux server? Please give me an example or tell me how to do that.
Need help here, urgent.


David J Harmon
09-28-2003, 10:45 PM
Do you have CPanel? If so, go to Cron Jobs; it has most of it done for you. Just put /path/to/php -f /path/to/admin/spider.php cronlist.txt >> spider.log into the open space and it will set up the job for you. But remember to change the paths to the right ones...


09-29-2003, 04:26 PM
If you don't have CPanel or the like, then you'd need shell access via Telnet/SSH to use crontab, assuming crontab is available on your machine. If your host doesn't allow Telnet/SSH access, there is a CGI-Telnet (http://www.rohitab.com/cgiscripts/cgitelnet.html) script that you could try in order to run non-interactive shell commands from your browser.

09-30-2003, 11:09 AM
I found the answer to my question on another thread, thank you to all that replied.
Here is what I found, but I haven't tested it yet.

"There is a function in robots_functions.php, located in your admin directory, that compares the current URL with the new URL and returns either true or false. Change the false it returns to true and it will follow any domain link or URL it finds. Be careful, though: your database will soon exceed your hosting space! I currently have over 10000 sites indexed and the database is a bit more than 1 GB. The function's name is phpdigCompareDomains($url1,$url2); search for this and, as I said, replace false with true. There is also a flag in your config.php that has to be true as well (StayInDomain or something like this; you'll have to find it)."