indexing dynamic pages [Archive]

attriel

01-14-2005, 11:23 AM

So, I tossed phpdig onto my dev server, figuring I'd see how it goes before worrying about how to hack it onto the deployment servers and their Frankensteinian convolutions.

But I immediately run into a problem. I tell it to start indexing at:

http://site.name.here/

And it starts, and it finds the links on that page, (there are 28, I believe) all of the form:
http://site.name.here/view.php?id=13672

Unfortunately, it appears to be tossing the # out, and just going to
http://site.name.here/view.php?id=

Since there are roughly 15000 various IDs involved in different sections, indexing 3 pages is suboptimal :/ (index, and 2 variations on the url)

I thought it might be the PHPSESSID, but I flipped that off in the config and it continues stripping, so ... What variable do I want to tune to make it retain those #'s? B/c they're mildly important :o

All links are relative, but I don't imagine that should matter

Thanks

--attriel

(I can't give the link, since it's still a development server and not publicly available anywhere)

attriel

01-14-2005, 01:21 PM

OK, I just spent a while tracing through the code (gotta love print statements). As near as I can tell, this is due to an error in the Transfer-Encoding: chunked handling.

2d
"><div id="leftimg"><a href="view_rec.php?id=
4
6315
c
"><img src="

This is supposed to be handled (per http://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.4.6) as:

0x2d (45) bytes of data in the next chunk, followed by <crlf>
"><div id="leftimg"><a href="view_rec.php?id= is 0x2d bytes (the trailing <crlf> isn't counted), check! add it
0x4 (4) bytes in next chunk
6315 is 4 bytes (again not counting the <crlf>), check! add it
0xC (12) bytes in next chunk
"><img src=" is 12 bytes, check! add it.

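For anyone following along, here's a rough sketch of that count-the-bytes decoding, in Python rather than PHP just to keep it short. It is not phpdig's code; `decode_chunked` is a made-up name, and the terminating `0` chunk at the end of `raw` is my assumption, since the dump above only shows a slice from the middle of the response.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body by counting bytes, not by splitting lines."""
    out = bytearray()
    pos = 0
    while True:
        # The size line ends at the next CRLF; anything after ';' is a chunk extension.
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            break  # last chunk; trailers (if any) are ignored in this sketch
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return bytes(out)


# The three chunks from the dump, plus an assumed terminating zero-size chunk.
raw = (
    b'2d\r\n"><div id="leftimg"><a href="view_rec.php?id=\r\n'
    b'4\r\n6315\r\n'
    b'c\r\n"><img src="\r\n'
    b'0\r\n\r\n'
)
decoded = decode_chunked(raw)
print(decoded.decode("ascii"))
```

Because the data is taken by length, a chunk whose contents happen to look like a hex number (like 6315) is never confused with a size line.
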
But what the code seems to be doing (in phpdigGetUrl) is:
2d ; chunk separator, trim the <crlf> off the previous line
"><div id="leftimg"><a href="view_rec.php?id= ; add it
4 ; chunk separator, trim
6315 ; chunk separator, trim (there goes the ID)
c ; chunk separator, trim
"><img src=" ; add it

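To make the failure mode concrete, here's a small Python sketch of a line-based heuristic that behaves the way the trace suggests. This is a guess at the behaviour, not the actual phpdigGetUrl code; `naive_dechunk` and `HEX_LINE` are hypothetical names, and the trailing `0` chunk in `raw` is assumed.

```python
import re

# Any line made up only of hex digits is assumed to be a chunk-size line.
HEX_LINE = re.compile(rb"^[0-9a-fA-F]+$")


def naive_dechunk(body: bytes) -> bytes:
    """Hypothetical heuristic: split on CRLF and drop every hex-looking line."""
    kept = [line for line in body.split(b"\r\n") if not HEX_LINE.match(line)]
    return b"".join(kept)


raw = (
    b'2d\r\n"><div id="leftimg"><a href="view_rec.php?id=\r\n'
    b'4\r\n6315\r\n'
    b'c\r\n"><img src="\r\n'
    b'0\r\n\r\n'
)
result = naive_dechunk(raw)
print(result.decode("ascii"))
```

The ID chunk is all digits, so it gets thrown away along with the real size lines, and the link comes out with an empty `id=`.
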
Gonna work on fixing up that code some over the weekend. I'll post up a patch for someone to double-check, probably Monday (unless I decide to sleep finally :o)

--attriel

Charter

01-14-2005, 07:06 PM

The addition of a little counting might be faster than reading and processing the chunks. Try the attached code, for use with v.1.8.6, in place of the phpdigGetUrl function, and let me know how it works.