phpdig seems to guess some urls and spider them


manute
04-26-2004, 12:10 PM
hi!

my urls look like this: domain.com/dir1/dir2/something
now phpdig spiders them fine, all right. but it also seems to "guess" new urls. i saw it spidering domain.com/dir1/dir2/ although that isn't linked anywhere.
why is that and how can i stop this?

manute
04-27-2004, 11:29 AM
doesn't anyone have an idea about that? it gives me a huge number of duplicate urls, and that really sucks.
is there any way i can tell phpdig to spider only the urls it actually finds in links, without "guessing" any?

vinyl-junkie
04-27-2004, 05:35 PM
Have you verified that the URLs that you think phpDig is "guessing" really don't exist? If so, perhaps you could post the specific URL that you are trying to spider and an excerpt from your spider log of one or two of these bogus URLs.

There's no absolute guarantee that someone will have an answer for you, but posting a little more information might help.

Best wishes. :)

manute
04-28-2004, 03:59 AM
hi!

that's not what i wrote. these urls do exist, but they aren't linked anywhere. and yes, i'm sure about that. :D
here's an example:

http://www.fussball24.de/fussball/115/frauen -> original url linked on the site, spidered well, all right

http://www.fussball24.de/fussball/115 -> url guessed by phpdig, does exist but is exactly the same as the one above.

any ideas?

vinyl-junkie
04-28-2004, 04:28 AM
Do you have any rewrite rules in your .htaccess file that would translate the one URL into the other? Any kind of redirect from one to the other?

While it's true that the pages are identical, the URLs are not. phpDig does not compare pages to each other to see if they have the same content. It only looks for different URLs, makes sure there is no robots exclusion to obey, and indexes them.
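
To illustrate the difference between comparing URLs and comparing page content, here is a minimal Python sketch. It is not phpDig's code: the function names and the sample page body are hypothetical, and it only assumes the behavior described above (a new URL string counts as a new page).

```python
import hashlib

def index_by_url(pages):
    """Keep every page whose URL string has not been seen before."""
    seen_urls = set()
    indexed = []
    for url, _body in pages:
        if url not in seen_urls:
            seen_urls.add(url)
            indexed.append(url)
    return indexed

def index_by_content(pages):
    """Keep only the first URL for each distinct page body (hash comparison)."""
    seen_hashes = set()
    indexed = []
    for url, body in pages:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            indexed.append(url)
    return indexed

# Hypothetical sample: two different URLs that serve the same body.
pages = [
    ("http://www.fussball24.de/fussball/115/frauen", "<html>same page</html>"),
    ("http://www.fussball24.de/fussball/115", "<html>same page</html>"),
]
```

With URL comparison both entries end up in the index; with a content hash only the first one does.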

manute
04-28-2004, 05:02 AM
no, there's no mod_rewrite and there are no redirects, but i do use forcetype for url rewriting.
and i just wonder where phpdig gets the url from! in my example the last one isn't linked anywhere, so it must have "guessed" it.
does the spider take urls like domain.com/dir1/dir2, cut off the last dir and spider domain.com/dir1?
that's what it looks like to me, and i don't want it. how can i stop it?
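
The behavior suspected here, trimming the last path segment to get a "parent" URL, could look roughly like the Python sketch below. This only illustrates the idea; whether phpDig actually does this is not confirmed in this thread, and `parent_urls` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_urls(url):
    """Return the ancestor URLs of `url`, nearest first, by trimming path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one more trailing segment on each pass, ending at the site root.
    for end in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:end])
        if end > 0:
            path += "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents
```

For http://www.fussball24.de/fussball/115/frauen this yields the /fussball/115/ directory first, then /fussball/, then the site root. A spider that queued these would visit URLs that no page links to.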

vinyl-junkie
04-28-2004, 05:03 PM
I'm not familiar with ForceType. I had never heard of it until you mentioned it, so I did a little research to familiarize myself with it. It's possible that something in the way you're using it is causing the problem, but I don't know.

Someone else may have a different opinion, but I don't personally see how phpDig could be guessing this URL. What I would do is take a hard look at the way the code is written that references this page and see if there is something in it that would cause this URL to appear two different ways.
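
One way a URL can "appear two different ways" without being typed out anywhere is a relative link, which a spider resolves against the URL of the page it is on. The Python sketch below shows that resolution with the standard library. The sample markup is hypothetical and not taken from the site, and `LinkCollector` and `resolved_links` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(urljoin(self.base_url, value))

def resolved_links(base_url, html):
    """Return the absolute URLs that the links in `html` point to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

# Hypothetical markup: a "." link points at the directory containing the page.
html = '<a href=".">overview</a> <a href="frauen">frauen</a>'
links = resolved_links("http://www.fussball24.de/fussball/115/frauen", html)
```

Here the "." link resolves to the /fussball/115/ directory, so a page can lead a spider there even though that exact URL never appears in the HTML. Running something like this over the site's pages and searching the results for the unexpected URL would show which page, if any, produces it.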

Also, and this is just a guess since I'm not familiar with the site, but I would try to analyze the spider log and see if I could trace just how you ended up with the same page twice in your index.

I wish I could be of more help. Perhaps someone else will come along with another idea that might solve your problem.

manute
04-29-2004, 01:49 AM
unfortunately i'm not a real php pro, so i'd rather not dig into phpdig's source code too much. ;)
still, thanks for your efforts, pat, and if anyone else has any ideas, let me know! :D