Yet another indexing problem


sbhikes
04-24-2004, 09:52 PM
I have searched and searched and can't find an answer.

I have lots of pages where I use the same page, such as 'index.php' and add query strings to it.

It seems to get stuck at not being able to tell that a query string is different from another.

For example, I have over 200 items in a database, so my pages will be something like 'index.php?id=1', 'index.php?id=2', etc.

But phpdig has been able to get only
'index.php?display_table'
'index.php?display_list'
'index.php?display_thumbnails'
'index.php?id=86'

I can't get it to be able to see that id=1, id=2, id=3... are different pages. It's like it can only tell the difference if the query strings have different letters, not different numbers.

What can I do?

Oh, and there are no 404 problems or redirects or any of the other things in all the other posts I've looked into. All the links to all the ?id=n pages are all listed on the first page.

sbhikes
04-25-2004, 04:03 PM
I tried some more things, but no matter what, I cannot index any pages with similar query strings beyond the ones with ?display_all, ?display_table, ?display_list, and only one with a longer query string, whichever one it gets to first.

Odd thing is that if the page is a .shtml page and not a .php page I can index everything.

Why is that? Is there anything I can do about that?

vinyl-junkie
04-25-2004, 04:55 PM
One possible way around this, assuming you're on Linux, is to rewrite your URLs so that your dynamic content appears to be static.

I have a whole lot of dynamic content on my own website like this. For example, instead of displaying album content like this:
www.napathon.net/TrackList.php?AlbumID=1530
I use my .htaccess file to rewrite this URL like so:
www.napathon.net/AlbumID1530.php
The rewrite code in my .htaccess file looks like this:
RewriteEngine On
RewriteRule ^AlbumID([0-9]+)\.php$ TrackList.php?AlbumID=$1 [L]
I hasten to add that I'm not the world's foremost expert on writing regular expressions, which is what this seeming gibberish is, so I might not necessarily be able to help you write something for your application. However, perhaps someone else can help with that if you're interested in pursuing it as a solution to your problem.
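
For what it's worth, here is an untested sketch of how the same idea might look for the index.php?id=N pages from the first post. The static-looking name item-N.php is made up purely for illustration, and this assumes Apache with mod_rewrite enabled. The links on the site would also have to point at the item-N.php form for the spider to see them as separate pages.

```apache
RewriteEngine On
# Hypothetical mapping: a request for item-86.php is served by index.php?id=86
RewriteRule ^item-([0-9]+)\.php$ index.php?id=$1 [L]
```

The backslash before .php makes the dot match a literal dot, and the trailing $ stops the rule from matching longer names that merely start the same way.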

drywall
05-17-2004, 11:06 AM
I'm coming across the same problem as sbhikes -- it grabs page.php?id=1 but doesn't grab 2, 3, etc...

Vinyl-junkie's workaround sounds like it should work, but I'd prefer not to have to go in and find every GET reference like that and rewrite it into the phpdig-friendly version (only to have it get rewritten back again with Apache behind the scenes).

Seems like this is a genuine bug in phpdig's spidering process (that happens to have an Apache workaround). I don't suppose some kind soul familiar enough with the phpdig spidering code could try to fix this for real?

drywall
05-17-2004, 11:20 AM
I'd like to expand on this problem a little bit, in case anyone feels like tackling it. I'm indexing a reasonably complicated site and I've noticed that in some cases it's managing to index dynamic pages with different numbers in their GET string, but not others.

I'm not sure about this, but it appears to only be able to grab one per page. For example, on http://www.freepress.net/news/releases.php, it will only spider the first release on the list (ID 17). However, it appears to be spidering several news article pages (which have URLs of the form news/article.php?id=XXXX), because it's finding them via separate pages, rather than on a single page as with the press releases.

Or maybe it's dying simply because it stops looking at the releases once it hits the Word doc? Not sure... but it's fishy, and frustrating.