PhpDig.net

09-26-2003, 07:58 AM   #1
jimigisme
Green Mole
 
Join Date: Sep 2003
Posts: 10
Crawling Options

I just finished configuring PhpDig on W2K/Apache 1.3.x.
It works great, awesome program. But I have a question about the crawling functionality:
From the admin index page you can enter domains to crawl, but is there a way to have PhpDig crawl randomly? Or go beyond the specified domain?
Thank you,
jimigisme
09-26-2003, 04:38 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. PhpDig crawls the links in a page, depending on the number of levels chosen. I'm not sure what you mean by "go beyond the specified domain" though. Do you mean crawl subdirectories?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
09-26-2003, 05:54 PM   #3
David J Harmon
Orange Mole
 
Join Date: Sep 2003
Location: Corbin KY
Posts: 45
Yes, I was thinking the same thing: whether it could do random crawls. I'm building up my db and I would like it to crawl all over the place. The only sites I don't want are porn sites.

By the way, can you tell me more about levels? I don't completely understand them.

David J Harmon
Cappuccino David
09-26-2003, 06:06 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Currently PhpDig crawls links from a page.

Levels set how many links deep the spider goes from the starting page, looking for more links. For example, level one means following only the links found on the starting page, but not the links found on the pages those links lead to.

Confused???

Code:
Level One Example:

a.com
  - a1.com
      -- a11.com
      -- a12.com
  - a2.com
      -- a21.com
      -- a22.com
So, indexing a.com at level one will crawl to a1.com and a2.com but no further.
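
To make the level idea concrete, here is a small shell sketch that walks the same toy link tree to a given depth. It is only an illustration of depth-limited link following, not PhpDig's actual spider code; the `links_on` and `crawl` functions are made up for this example, and the link map is hard-coded from the tree above.

```shell
#!/bin/sh
# Toy link map copied from the example tree: prints the links "found" on a page.
links_on() {
  case "$1" in
    a.com)  echo "a1.com a2.com" ;;
    a1.com) echo "a11.com a12.com" ;;
    a2.com) echo "a21.com a22.com" ;;
  esac
}

# crawl PAGE LEVELS: print every page reached from PAGE by following
# at most LEVELS links in a row.
crawl() {
  local page=$1 levels=$2 next
  [ "$levels" -gt 0 ] || return 0
  for next in $(links_on "$page"); do
    echo "$next"
    crawl "$next" $(( levels - 1 ))
  done
}

# Level one from a.com reaches a1.com and a2.com only.
crawl a.com 1
```

Calling `crawl a.com 2` instead would also list a11.com, a12.com, a21.com and a22.com.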
09-26-2003, 06:34 PM   #5
David J Harmon
Orange Mole
 
Join Date: Sep 2003
Location: Corbin KY
Posts: 45
Got it... thanks

I'm still new to PHP, but I can see it is a powerful language. I've only used HTML so far but am moving to PHP.

Is there any way to give the spider a list of URLs so he (yes, a he) can go out crawling when I'm not at my computer? I'm trying to build up my db.

David J Harmon
Cappuccino David
09-26-2003, 06:43 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. You could set up a cron job or make a text file with URLs and use shell access to crawl.
09-26-2003, 07:58 PM   #7
David J Harmon
Orange Mole
 
Join Date: Sep 2003
Location: Corbin KY
Posts: 45
I didn't think about that. Could you give me an example?
09-27-2003, 04:59 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
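On *nix, say you want to run a cron job that spiders at midnight on the 1st and 15th of every month.

First make a list of URLs, one per line, in a file called cronlist.txt.

Then create a file called cronjob.txt that contains the following on one line, editing the paths to php and to spider.php:
Code:
0 0 1,15 * * /path/to/php -f /path/to/admin/spider.php cronlist.txt >> spider.log
Put cronlist.txt and cronjob.txt in the same directory as spider.php and then type the following shell command from that same directory:
Code:
crontab cronjob.txt
Note that this replaces any crontab you already have, so check with crontab -l first. Cron usually runs commands from your home directory rather than from the directory where you installed the job, so if cronlist.txt or spider.log is not found, use full paths for them in cronjob.txt as well.

If you want to delete the cron job, type crontab -r from shell (some cron implementations use crontab -d instead). This removes your whole crontab, not just this one job.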

A cron tutorial can be found at http://www.linuxhelp.net/guides/cron/.
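
The two files described above can be put together from the shell roughly like this. This is a sketch under assumptions: it works in a scratch directory so nothing real is touched, the example.com and example.org URLs are placeholders, and the /path/to/... parts still have to be edited for your own server. The install step is left commented out on purpose.

```shell
#!/bin/sh
# Scratch directory for the demo; in practice this would be the
# directory that holds spider.php.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# cronlist.txt: one URL per line (placeholder URLs).
cat > cronlist.txt <<'EOF'
http://www.example.com/
http://www.example.org/
EOF

# cronjob.txt: the five time fields are
#   minute  hour  day-of-month  month  day-of-week
# so "0 0 1,15 * *" means 00:00 on the 1st and 15th of every month.
cat > cronjob.txt <<'EOF'
0 0 1,15 * * /path/to/php -f /path/to/admin/spider.php cronlist.txt >> spider.log
EOF

# Review the job line before installing it.
cat cronjob.txt

# Installing replaces the current user's whole crontab, so it is not run here:
# crontab -l             # show what is installed now
# crontab cronjob.txt    # install the job
```

After installing, `crontab -l` should show the same single line that is in cronjob.txt.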
09-27-2003, 09:51 PM   #9
David J Harmon
Orange Mole
 
Join Date: Sep 2003
Location: Corbin KY
Posts: 45
Thank you, that should help me out some. I still have a long way to go on my db, but it's coming along. It should be open to the public by the end of this week; there I hope to build it up.

If there is anything I could help out with, let me know.
09-28-2003, 09:19 PM   #10
sid
Former Member
 
Join Date: Sep 2003
Posts: 34
Hi,
I've done the whole cronjob.txt and cronlist.txt thing. I guess this is a stupid question, but how do you run a shell command on a Linux server? Please give me an example or tell me how to do that.
Need help here, urgent.

Cheers
09-28-2003, 09:45 PM   #11
David J Harmon
Orange Mole
 
Join Date: Sep 2003
Location: Corbin KY
Posts: 45
Do you have CPanel? If so, go to Cron Jobs; it has most of it done for you. Just put /path/to/php -f /path/to/admin/spider.php cronlist.txt >> spider.log into the open space and it will set up the job for you. But remember to change the paths to the right ones...

David
09-29-2003, 03:26 PM   #12
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
If you don't have CPanel or the like, then you'd need shell access via Telnet/SSH to use crontab, assuming crontab is available on your machine. If your host doesn't allow Telnet/SSH access, there is a CGI-Telnet script that you could try in order to run non-interactive shell commands from your browser.
09-30-2003, 10:09 AM   #13
jimigisme
Green Mole
 
Join Date: Sep 2003
Posts: 10

I found the answer to my question on another thread. Thank you to all who replied.
Here is what I found, but I haven't tested it yet.

"There is a function in robots_functions.php located in your admin directory. This function compares the current URL with the new URL. It returns either true or false; set the false to true and it will follow any domain link or URL it finds. Be careful, though: your database will soon exceed your hosting space! I currently have over 10,000 sites indexed and the database is a bit more than 1 GB. The function's name is phpdigCompareDomains($url1,$url2); search for it and, as I said, replace false with true. There is also a flag in your config.php that has to be true as well (StayInDomain or something like that; you'll have to find it)."
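
Since the quote leaves the exact file and flag names a bit vague, one way to track them down from the shell is to search the PhpDig directory. This is only a sketch: `locate_domain_check` is a helper name made up here, PHPDIG_DIR is a placeholder for your install path, and the loose search for "domain" is a guess at how the flag might be spelled. It only lists matches and changes nothing.

```shell
#!/bin/sh
# PHPDIG_DIR is a placeholder; point it at your own PhpDig install.
PHPDIG_DIR=${PHPDIG_DIR:-/path/to/phpdig}

# locate_domain_check DIR: list where the function from the quote is
# mentioned, and which config.php lines mention "domain".
locate_domain_check() {
  dir=$1
  # File name and line number of every mention of the function.
  grep -rn "phpdigCompareDomains" "$dir"
  # Case-insensitive search in any config.php under DIR; /dev/null is
  # passed so grep prints the file name even for a single file.
  find "$dir" -name config.php -exec grep -ni "domain" /dev/null {} +
}

# Only search if the directory actually exists.
if [ -d "$PHPDIG_DIR" ]; then
  locate_domain_check "$PHPDIG_DIR"
fi
```

Back up any file before editing it, so the original domain check can be restored if the database grows faster than expected.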
All times are GMT -8.