Speed up spidering by skipping internal page links


peterpeter
04-03-2004, 12:24 AM
As I explained in this (http://www.phpdig.net/showthread.php?s=&threadid=702) thread, spidering can be very slow when pages contain many internal page links, such as the <A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A> pair. Since such links serve no purpose for the spider, I suggest skipping them during spidering.
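To illustrate the idea, here is a minimal sketch in Python, not phpDig's own code. It assumes a hypothetical helper, collect_links, that gathers the links on a page, drops the part after the # and skips anything that points back to the page itself. The names and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects link targets, ignoring same-page fragment links (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # e.g. <A NAME="1070721880"></A> has no href at all
        # Resolve against the page URL, then drop everything after '#'.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute == self.base_url or absolute in self._seen:
            return  # internal page link, or a page already queued
        self._seen.add(absolute)
        self.links.append(absolute)


def collect_links(html, base_url):
    base_url = urldefrag(base_url)[0]
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

With this approach a page with thousands of <A HREF="#..."> entries adds nothing to the list of URLs to visit, and links such as other.html#top and other.html are queued only once.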

Peter

Charter
04-10-2004, 03:42 PM
Hi. Perhaps try removing the # symbol from the two pieces of code shown in this (http://www.phpdig.net/showthread.php?postid=2141#post2141) post.

peterpeter
04-14-2004, 01:37 AM
Charter,

Thanks for the reply, but unfortunately it didn't help. Any other ideas?

Charter
04-14-2004, 03:36 AM
Hi. Did you do a fresh index or a reindex?

peterpeter
04-15-2004, 01:54 AM
Both, with the same result.

Charter
04-15-2004, 02:45 PM
Hi. Leave the # removed, and now try adding [^#]+

In the while loop of phpdigExplore: ([^#]+(([[a-z]{3,5}://

In the while loop of phpdigIndexFile: ([^#]+((http://

The <A NAME="1070721880"></A> shouldn't be matched regardless.
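As a rough illustration of why [^#]+ helps, here is a simplified pattern in Python. It is an assumption for demonstration only, not the actual phpDig regex, which is much longer. The capture group requires at least one character that is not a #, so a link target that starts with # cannot match, and a target such as page.html#sec is captured only up to the #.

```python
import re

# Simplified, hypothetical pattern: capture an href value up to (not including) '#'.
# [^#"']+ needs at least one non-'#' character, so href="#123" yields no match.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^#"\']+)', re.IGNORECASE)


def extract_targets(html):
    """Return href targets with any fragment cut off; pure '#...' links are skipped."""
    return HREF_RE.findall(html)
```

The <A NAME="1070721880"></A> tag is never matched here either, because it contains no href attribute.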

peterpeter
04-18-2004, 07:17 AM
Charter,

It still doesn't do the job. The main reason is the size of the files I spider and the number of HTML tags they contain. And since I have a genealogy site, adding newly found ancestors and their descendants will only lead to increasing file sizes. For the time being I have chosen to generate my genealogy files in two flavours: one with complete functionality and the other without any HTML tags. I use the latter for spidering and afterwards replace it with the complete one.
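For anyone wanting to try the same workaround, a tag-free flavour could be produced along these lines. This is only a sketch in Python under my own assumptions (the strip_tags name is hypothetical and it is not how my genealogy program does it). It keeps the text between tags and collapses whitespace.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps only the text between tags (script/style contents are not filtered out)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the text pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())
```

The stripped file then contains no links at all, so the spider has nothing to follow within the page.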

I believe a final solution might be to spider locally and then copy the local database to the database on the remote server. But this will need some investigation :mad: .

For now, thanks for your help.

Peter