07-29-2004, 06:49 PM | #1 |
Green Mole
Join Date: Jul 2004
Posts: 8
|
PhpDig "clipping" links while spidering
OK, I consider myself moderately skilled at PHP, but this is something I just don't understand. As PhpDig spiders my site, it tries to follow links that are clipped versions of links that are already there. (This additional processing really slows the script down.) I have attached the results from the most recent spidering so that you can all see and maybe help. Unfortunately, this is still a test site, and for security reasons it is open only to employees of the place where I work until we solve some authorization issues (in other words, you can't view the code to see why this is happening); however, I can assure you that the links PhpDig is trying to follow show up neither in the source code nor in the generated HTML (the entire site is dynamic).
Anyway, on with the problem... In the txt file (and all references are to the txt file), the first error of this kind shows up in the first two 404 errors after spidered page #3. http://uuu.cae.wisc.edu/si does not exist, but is a part of uuu.cae.wisc.edu/site, which is the entryway into the rest of the site. Similar errors appear in the last two 404 errors of spidered page #3 (should be /wikiutils/), the first two 404s of page #5 (again, should be /site/), the first two 404s of page #7 (should be /site/public/), the 404 of page #11 (should be .php), the 404s in the middle of page #15 (should be /help/h****uts/, not /help/han), and in many, many other places. In fact, in the final results over 50 clipped links were "found." (It is a wiki-based system, and any page that doesn't exist returns a dynamically generated error page offering to help you create the requested page.)
I know that I've been a little verbose, but the final site will contain 8000+ pages and I would like to be able to squash this error. I just can't figure it out! Could someone please help me? Thank you so much!
-jinkas
P.S. - I cut a chunk out of the middle of the file to make it the right size for uploading. You can see at the end that the clipping seems to happen with greater and greater frequency (every 404 from at least page 201 to 297 is caused by this link clipping).
P.P.S. - The link clipping doesn't seem to cause PhpDig to skip real links; all the real links seem to be spidered. It just makes it go much slower.
Last edited by jinkas; 07-29-2004 at 06:52 PM. |
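The pattern described above (a 404 URL that is a shortened form of a real URL) can be checked mechanically against a spider log. The following is only a minimal sketch in Python, not PhpDig code: the function name `find_clipped` is hypothetical, the URL lists are typed in by hand rather than parsed from the attached txt file, and `missing.html` is an invented example of an ordinary 404.

```python
def find_clipped(failed_urls, good_urls):
    """Map each failed URL to the real URLs it is a strict prefix of."""
    clipped = {}
    for bad in failed_urls:
        # A "clipped" link is shorter than a real link and matches its start.
        matches = [good for good in good_urls
                   if good != bad and good.startswith(bad)]
        if matches:
            clipped[bad] = matches
    return clipped


# URLs that were spidered successfully (taken from the post's description).
good = [
    "http://uuu.cae.wisc.edu/site/",
    "http://uuu.cae.wisc.edu/site/public/index.php",
]
# URLs that returned 404; the second one is a made-up ordinary 404.
failed = [
    "http://uuu.cae.wisc.edu/si",
    "http://uuu.cae.wisc.edu/missing.html",
]

result = find_clipped(failed, good)
for bad, sources in result.items():
    print(bad, "looks clipped from", sources)
```

Running a check like this over the full log would separate the clipped-link 404s from genuine broken links, which makes it easier to see whether the clipping always hits the same few URLs.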
07-29-2004, 06:53 PM | #2 |
Green Mole
Join Date: Jul 2004
Posts: 8
|
Attachment
Here's the attachment... I accidentally deleted it from the original post.
(Well, OK, I deleted it on purpose, not knowing that I couldn't reupload it.) |
07-30-2004, 11:12 AM | #3 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Check this thread, although it is a bit dated, and edit the code indicated so that PhpDig finds the links that match your regular expression. Perhaps take out the space and parentheses, as those tend to form links from JavaScript even though they aren't real links.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
07-30-2004, 12:07 PM | #4 |
Green Mole
Join Date: Jul 2004
Posts: 8
|
Thanks. Sorry I was so verbose. I'll give this a try at work on Monday and let you know how it works out.
-jinkas |
08-02-2004, 02:53 AM | #5 |
Green Mole
Join Date: Jul 2004
Posts: 8
|
OK, I was never very good with regular expressions... I've tried adding my link format, but just can't seem to get it. Could I get a hand?
Links on my site are of the form: http://host/site/section/index.php?title=filename
- Host is uuu.cae.wisc.edu for now, but will shortly be changing to www.cae.wisc.edu
- Section can be any number of things; right now the only options are "public" and "admin" (will grow to an indefinite number)
- Filename can be any number of things; right now there are ~100 test files on the site (will grow to 8000+)
Thanks for your help, guys! I really appreciate this!
-jinkas
P.S. - I don't even know if changing those eregi's will work, since PhpDig isn't skipping over any links. It finds all the links on a page, but also finds roughly 2-6 nonexistent links, because it clips some number of characters off the end of some links. |
08-02-2004, 03:04 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Personally, I don't think the latest 1.8.3 version of PhpDig is 'clipping' links. PhpDig can try to follow non-existent links, but this is likely due to JavaScript. Some people want spaces and parentheses allowed in their links, so JavaScript can then come into play. An earlier 1.8.3 version didn't quite deal with chunk encoding, so links in that version did get messed up. Perhaps the issue you are experiencing has something to do with this earlier version of 1.8.3, but to be sure, take a read through this thread.
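To illustrate how unhandled chunk encoding could produce links that look clipped: in HTTP/1.1 chunked transfer encoding, the body arrives as pieces, each preceded by a hexadecimal size line. If a client parses the raw bytes as HTML without removing those size lines, a URL that straddles a chunk boundary is cut in two. The Python sketch below shows the mechanism in general; it is an assumption that this matches what the earlier 1.8.3 build did, and the helper names `encode_chunked` and `decode_chunked` and the page title `Example` are hypothetical.

```python
def encode_chunked(parts):
    """Build a chunked body: hex size, CRLF, data, CRLF, then a zero-size chunk."""
    out = b""
    for part in parts:
        out += format(len(part), "x").encode("ascii") + b"\r\n" + part + b"\r\n"
    return out + b"0\r\n\r\n"


def decode_chunked(raw):
    """Reassemble a chunked body (chunk extensions ignored, trailers not handled)."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii").strip(), 16)
        if size == 0:
            break
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return body


html = b'<a href="/site/public/index.php?title=Example">x</a>'
# Pretend the server happened to end the first chunk in the middle of the URL.
raw = encode_chunked([html[:12], html[12:]])

print(raw)                  # the href is interrupted right after "/si"
print(decode_chunked(raw))  # the original markup, with the link intact
```

Because chunk boundaries fall wherever the server's output buffer happens to end, the cut point would vary from page to page, which would be consistent with links losing a different number of characters each time.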