PhpDig.net

04-03-2004, 12:24 AM   #1
peterpeter
Green Mole
 
Join Date: Mar 2004
Posts: 7
Speed up spidering by skipping internal page links

As I explained in this thread, spidering can be very slow when pages contain many internal page links, such as the <A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A> pair. Since such links serve no purpose for the spider, I suggest skipping them during spidering.
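
To make the request concrete, here is a minimal sketch of the idea in Python. It is not PhpDig code; the class and function names (LinkCollector, collect_links) are made up for illustration. It assumes an "internal page link" is any href that starts with #, and that named anchors simply have no href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values, skipping links that only point within the same page."""

    def __init__(self):
        super().__init__()
        self.links = []    # hrefs a spider would still have to follow
        self.skipped = 0   # count of same-page links that were ignored

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # e.g. <A NAME="..."> has no href, so there is nothing to follow
        if href.startswith("#"):
            self.skipped += 1  # fragment-only link: same page, nothing new to spider
            return
        self.links.append(href)


def collect_links(html):
    """Return (links_to_follow, number_of_skipped_same_page_links)."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links, parser.skipped
```

With a page that holds thousands of such pairs, only the links in the first list would need to be queued; whether that is where the time actually goes is an assumption that would need measuring.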

Peter
04-10-2004, 03:42 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Perhaps try removing the # symbol from the two pieces of code shown in this post.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
04-14-2004, 01:37 AM   #3
peterpeter
Green Mole
 
Join Date: Mar 2004
Posts: 7
Charter,

Thanks for the reply, but unfortunately it didn't help. Any other ideas?
04-14-2004, 03:36 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Did you do a fresh index or a reindex?
04-15-2004, 01:54 AM   #5
peterpeter
Green Mole
 
Join Date: Mar 2004
Posts: 7
Both, with the same result.
04-15-2004, 02:45 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Keeping that removed # out, now try adding [^#]+

In the while loop of phpdigExplore: ([^#]+(([[a-z]{3,5}://

In the while loop of phpdigIndexFile: ([^#]+((http://

The <A NAME="1070721880"></A> shouldn't be matched regardless.
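
As a rough illustration of what requiring [^#]+ at the start of the captured link does, here is a sketch in Python. The pattern is a hypothetical, heavily simplified stand-in and not the expression used in phpdigExplore or phpdigIndexFile; the names HREF_RE and extract_targets are invented for this example.

```python
import re

# Hypothetical, simplified pattern (NOT PhpDig's expression).
# The capture group must start with at least one character that is not '#',
# so an href consisting only of a fragment cannot match at all.
HREF_RE = re.compile(r'''<a\s[^>]*?href\s*=\s*["']([^#"']+)''', re.IGNORECASE)


def extract_targets(html):
    """Return the link targets matched by the simplified pattern."""
    return HREF_RE.findall(html)
```

Under this sketch, <A HREF="#1070721880"> yields no match, <A NAME="1070721880"> yields no match because it has no href, and a link such as tree.html#p2 is captured as tree.html with the fragment dropped.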
04-18-2004, 07:17 AM   #7
peterpeter
Green Mole
 
Join Date: Mar 2004
Posts: 7
Charter,

It still doesn't do the job. The main reason is the size of the files I spider and the number of HTML tags they contain. And since I have a genealogy site, adding newly found ancestors and their descendants will only increase the file sizes. For the time being I have chosen to generate my genealogy files in two flavours: one with complete functionality and one without any HTML tags. I use the latter for spidering and afterwards replace it with the full version.
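
For anyone wanting to try the same two-flavour workaround, the tag-free version could be produced along these lines. This is only a sketch in Python using the standard html.parser module; it is not how my files are actually generated, and the names TextOnly and strip_tags are invented for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps only the text between tags; every tag and attribute is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of html with whitespace collapsed to single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words from adjacent elements apart,
    # at the cost of a stray space where a tag splits a single word.
    return " ".join(" ".join(parser.parts).split())
```

The spider then sees no anchors at all, so this removes every link from the file, not just the internal ones; that is fine here only because the file is swapped back afterwards.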

I believe a final solution might be to spider locally and then copy the local database to the database on the remote server. But this will need some investigation.

For now, thanks for your help.

Peter