PhpDig.net - Speed up spidering by skipping internal page links

Thread: http://www.phpdig.net/forum/showthread.php?t=771 (PhpDig.net forum, Mod Requests)

peterpeter

04-03-2004 12:24 AM

As I explained in this thread, spidering can be very slow when pages contain many internal page links, such as the <A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A> pair. Since such links only point to another spot on the same page and give the spider nothing new to index, I suggest skipping them during spidering.
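To illustrate the request, here is a minimal sketch in Python. It is not PhpDig code, and the function name `links_to_spider` and the example.com URLs are made up for the example. It resolves each href against the page URL, drops the `#fragment` part, and skips anything that then points back to the page itself or to a URL already collected.

```python
from urllib.parse import urldefrag, urljoin


def links_to_spider(page_url, hrefs):
    """Return the distinct URLs worth fetching from the hrefs found on page_url.

    Hypothetical helper for illustration only; not part of PhpDig.
    """
    page_base = urldefrag(page_url).url
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links, then drop the "#fragment" part.
        target = urldefrag(urljoin(page_url, href)).url
        # A link that differed only by its fragment points back to this page.
        if target == page_base or target in seen:
            continue
        seen.add(target)
        result.append(target)
    return result


if __name__ == "__main__":
    hrefs = ["#1070721880", "tree.html#2", "other.html#5", "other.html"]
    print(links_to_spider("http://example.com/tree.html", hrefs))
```

With this approach, a page holding thousands of `#...` links would contribute only its links to other pages.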

Peter

Charter

04-10-2004 03:42 PM

Hi. Perhaps try removing the # symbol from the two pieces of code shown in this post.

peterpeter

04-14-2004 01:37 AM

Charter,

Thanks for the reply, but unfortunately it didn't help. Any other ideas?

Charter

04-14-2004 03:36 AM

Hi. Did you do a fresh index or a reindex?

peterpeter

04-15-2004 01:54 AM

Both, with the same result.

Charter

04-15-2004 02:45 PM

Hi. Leaving that # removed, now try adding [^#]+ as follows:

In while of phpdigExplore: ([^#]+(([[a-z]{3,5}://

In while of phpdigIndexFile: ([^#]+((http://

The <A NAME="1070721880"></A> shouldn't be matched regardless.
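The idea behind `[^#]+` can be shown with a small standalone sketch in Python. The pattern below is a simplified, hypothetical one written for this example, not the expression used in phpdigExplore or phpdigIndexFile. It captures an href value only up to the first `#`, so a link consisting solely of a fragment yields no match, and a tag with only a NAME attribute is never matched.

```python
import re

# Hypothetical, simplified pattern (not PhpDig's): capture the href value
# up to, but not including, any '#'. The '+' requires at least one
# non-'#' character, so href="#1070721880" produces no match at all.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\'#>]+)', re.IGNORECASE)


def extract_hrefs(html):
    """Return href values with any fragment cut off; fragment-only links are skipped."""
    return HREF_RE.findall(html)


if __name__ == "__main__":
    sample = (
        '<A HREF="#1070721880">xxx</A> '
        '<A NAME="1070721880"></A> '
        '<a href="page2.html#top">next</a>'
    )
    print(extract_hrefs(sample))
```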

peterpeter

04-18-2004 07:17 AM

Charter,

It still doesn't do the job. The main reason is the size of the files I spider and the number of HTML tags they contain. And since I have a genealogy site, adding newly found ancestors and their descendants will only increase the file sizes. For the time being I have chosen to generate my genealogy files in two flavours: one with complete functionality and one without any HTML tags. I use the latter for spidering and afterwards replace it with the complete one.
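For anyone who cannot generate a second flavour directly, a tag-free copy could in principle be derived from the complete file. The following is only a rough sketch in Python using the standard `html.parser` module; the names `_TextOnly` and `strip_tags` are made up here, and this is not how my files are actually produced.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of html with every tag removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    page = '<p><A HREF="#1070721880">xxx</A> and <A NAME="1070721880"></A>more</p>'
    print(strip_tags(page))
```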

I believe a final solution might be to spider locally and then copy the local database to the database on the remote server. But this will need some investigation :mad:

For now, thanks for your help.

Peter

All times are GMT -8.