Speed up spidering by skipping internal page links
As I explained in this thread, spidering can be very slow when pages contain many internal page links, such as the <A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A> pair. Since such links serve no purpose for the spider, I suggest skipping them during spidering.
Peter |
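To illustrate the suggestion above, here is a minimal sketch of how a spider could drop internal page links before queueing URLs. It is written in Python for illustration only and is not PhpDig code; the names `LinkCollector` and `links_to_spider` and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect links worth spidering, ignoring same-page anchors (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        # The current page's URL with any fragment stripped.
        self.page_url = urldefrag(base_url)[0]
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # e.g. <A NAME="1070721880"></A> has no href at all
        # Resolve relative links, then cut off the "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute == self.page_url:
            return  # link points back into the page being spidered
        if absolute not in self.links:
            self.links.append(absolute)


def links_to_spider(html, base_url):
    """Return the de-duplicated, fragment-free links found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

With this approach a page containing thousands of `#...` anchors contributes nothing to the spider queue, and `other.html#top` and `other.html` count as one URL.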
Hi. Perhaps try removing the # symbol from the two pieces of code shown in this post.
|
Charter,
Thanks for the reply, but unfortunately it didn't help. Any other ideas? |
Hi. Did you do a fresh index or a reindex?
|
Both, with the same result.
|
Hi. Leave the # removed, and now try adding [^#]+ as follows:
In the while loop of phpdigExplore: ([^#]+(([[a-z]{3,5}://
In the while loop of phpdigIndexFile: ([^#]+((http://
The <A NAME="1070721880"></A> shouldn't be matched regardless. |
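For readers unfamiliar with the `[^#]+` idea, the sketch below shows the general effect of a "one or more non-# characters" class in a link-matching pattern. It is a simplified Python illustration, not the actual PhpDig expression (which is only quoted in part above); `HREF_RE` and `hrefs_without_fragments` are hypothetical names.

```python
import re

# The capture group accepts one or more characters that are not '#', a quote,
# whitespace, or '>'. An href that starts with '#' cannot match at all, and any
# other href is captured only up to its fragment.
HREF_RE = re.compile(
    r'''<a\s[^>]*?href\s*=\s*["']?([^#"'\s>]+)''',
    re.IGNORECASE,
)


def hrefs_without_fragments(html):
    """Return href values with fragments cut off; fragment-only links are skipped."""
    return HREF_RE.findall(html)


sample = (
    '<A HREF="#1070721880">xxx</A> <A NAME="1070721880"></A> '
    '<a href="page.html#sec">s</a> <a href="other.html">o</a>'
)
print(hrefs_without_fragments(sample))  # ['page.html', 'other.html']
```

Note that a pattern like this only changes which links are extracted. It does not reduce the cost of scanning a very large file.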
Charter,
It still doesn't do the job. The main reason is the size of the files I spider and the number of HTML tags they contain. And since I have a genealogy site, adding newly found ancestors and their descendants will only increase the file sizes. For the time being I have chosen to generate my genealogy files in two flavours: one with complete functionality and one without any HTML tags. I use the latter for spidering and afterwards replace it with the full version. I believe a final solution might be to spider locally and then copy the local database to the database on the remote server. But this will need some investigation :mad: . For now, thanks for your help. Peter |