PhpDig.net

not correct link collecting (http://www.phpdig.net/forum/showthread.php?t=1736)

zaartix 01-19-2005 03:08 AM

Quote:

So + works for your type of [ ] links, right? I'm not sure if you are still having a problem with [ ] type links, but remember to use + in those two regexs.
I think that in PhpDig 1.8.7 everything should work?

It works only if the link contains a single pair of [] :(
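The two PhpDig regexes mentioned above are not quoted in this thread, so the following is only a sketch of how this kind of symptom can arise. It assumes two hypothetical link patterns (SINGLE_PAIR and ANY_PAIRS are made-up names, not PhpDig code) that differ only in the quantifier on the bracket group. When the group may occur at most once, a link with two [ ] pairs is skipped entirely.

```python
import re

# Characters allowed in the non-bracket part of a link (illustrative choice).
PLAIN = r"[\w./:?=&%-]"

# Hypothetical pattern: the [..] group may appear at most once.
SINGLE_PAIR = re.compile(rf'href="({PLAIN}+(?:\[\w*\]{PLAIN}*)?)"')

# Same pattern with the group repeatable, so any number of pairs matches.
ANY_PAIRS = re.compile(rf'href="({PLAIN}+(?:\[\w*\]{PLAIN}*)*)"')

if __name__ == "__main__":
    page = (
        '<a href="index.php-razdel=about&mach[2]=news&mach[3]=79.htm">a</a> '
        '<a href="index.php-razdel=price&mach[2]=23.htm">b</a>'
    )
    print("single pair:", SINGLE_PAIR.findall(page))
    print("any pairs:  ", ANY_PAIRS.findall(page))
```

If the real regexes behave like the first pattern, changing the quantifier on the bracket part is the kind of fix to look for.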

zaartix 01-19-2005 03:12 AM

The first regexp isn't needed because the site doesn't have frames.

Charter 01-19-2005 04:09 AM

>> It works only if the link contains a single pair of []

So it works in the example but not with PhpDig? Can you post a link to a page containing multiple [ ] in its links?

>> The first regexp isn't needed because the site doesn't have frames.

Other people might have frames though. ;)

RFC 2732 (Format for Literal IPv6 Addresses in URL's) states in part:
Quote:

Code:

  (3) Add "[" and "]" to the set of 'reserved' characters:

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | "," | "[" | "]"

  and remove them from the 'unwise' set:

      unwise      = "{" | "}" | "|" | "\" | "^" | "`"


Using reserved characters in links for anything other than their intended purpose can sometimes cause problems, as was the case in this thread (a colon was used outside of its <user>:<pass>@<host>:<port> meaning, so the PHP parse_url function did not understand it).

You might want to consider encoding your URIs according to this rather than using literal square brackets in your links.
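As a minimal sketch of that suggestion, the square brackets can be percent-encoded as %5B and %5D everywhere except the host part, where RFC 2732 reserves them for IPv6 literals. The example is in Python rather than PHP, and the function name encode_square_brackets is made up for illustration; it is not part of PhpDig.

```python
from urllib.parse import urlsplit, urlunsplit


def _encode(component: str) -> str:
    """Percent-encode literal square brackets in one URL component."""
    return component.replace("[", "%5B").replace("]", "%5D")


def encode_square_brackets(url: str) -> str:
    """Encode [ and ] in the path, query and fragment, leaving the host alone."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # untouched: brackets here may delimit an IPv6 literal
        _encode(parts.path),
        _encode(parts.query),
        _encode(parts.fragment),
    ))


if __name__ == "__main__":
    print(encode_square_brackets(
        "http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm"
    ))
```

Whether the target site then resolves the encoded form to the same page depends on the server, so it is worth checking one link by hand first.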

zaartix 01-19-2005 07:24 PM

>> So it works in the example but not with PhpDig? Can you post a link to a page containing multiple [ ] in its links?
Yep.
Just try to dig this page:
http://zaartix.ru/krit

Sorry for the Russian on that page.

Charter 01-19-2005 07:40 PM

That page contains tons of links to 404 pages.

zaartix 01-19-2005 09:01 PM

They all go to 404 :)
So PhpDig does not extract all the links from the main page.

zaartix 01-19-2005 09:08 PM

I haven't uploaded the other pages, only one page.
Why would the other pages be needed? If PhpDig finds all the links that are on that page and all the links are correct, then the extracting regexp is working right. Is that so?

Charter 01-20-2005 03:07 AM

PhpDig tests links, and if PhpDig gets a 404 from a link, then PhpDig does not index that link. The + works in the example, so maybe try setting up an online demo with a few links.

zaartix 01-20-2005 03:23 AM

So PhpDig tries to open each link while it is parsing the page, in the first step? I thought that PhpDig extracts all the links and puts them in the tempspider table, and then tries to open each link in the next step.
Am I wrong?

Charter 01-20-2005 03:53 AM

Nope, that is not how it works. PhpDig does not insert server response 404s in the tempspider table. With all the links currently returning 404s, the only thing inserted into the tempspider table is the zaartix.ru/krit/ page.
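The behaviour described here can be sketched roughly as follows. This is not PhpDig's code (PhpDig is written in PHP); it is an illustrative Python sketch, and http_status and links_to_queue are made-up names. Each extracted link is tested first, and only links that do not answer 404 are kept for the next spidering level.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers arrive as exceptions; the code is what we want.
        return err.code


def links_to_queue(
    links: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> List[str]:
    """Keep only the links whose server response is not 404."""
    return [url for url in links if get_status(url) != 404]


if __name__ == "__main__":
    # Stand-in status table, so the sketch runs without network access.
    fake = {"http://example.com/ok": 200, "http://example.com/missing": 404}
    print(links_to_queue(fake, fake.get))
```

With every link on a page returning 404, the queue in this sketch stays empty, which matches a spider run that lists only the start page.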

zaartix 01-20-2005 08:51 PM

Now you can try to dig http://zaartix.ru/krit
Please help to solve this problem.

Charter 01-20-2005 09:31 PM

There are no regular links with more than one set of [ ] square brackets in them. :confused:
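One way to check a page for such links without running the spider is sketched below. It assumes the page's HTML is already available as a string; HrefCollector and links_with_multiple_pairs are hypothetical helper names, and the check only looks at the href of <a> tags.

```python
import re
from html.parser import HTMLParser
from typing import List

# One [..] pair with no nested brackets inside it.
BRACKET_PAIR = re.compile(r"\[[^\[\]]*\]")


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_with_multiple_pairs(html: str) -> List[str]:
    """Return the hrefs that contain more than one [..] pair."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if len(BRACKET_PAIR.findall(h)) > 1]


if __name__ == "__main__":
    sample = (
        '<a href="index.php-razdel=price&amp;mach[2]=23.htm">one pair</a>'
        '<a href="index.php-razdel=about&amp;mach[2]=news&amp;mach[3]=79.htm">two</a>'
    )
    print(links_with_multiple_pairs(sample))
```

An empty result for the start page would agree with the observation above; the links with two pairs may only appear on deeper pages.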

zaartix 01-21-2005 03:05 AM

There are many levels of pages. Just try to dig all the available pages; there are many different types of links :)
http://zaartix.ru/krit

Charter 01-21-2005 04:33 AM

Here's a one-page test...

Spider:

http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm

Results:

Spidering in progress... [Stop spider]
SITE : http://zaartix.ru/
Exclude paths :
- @NONE@
1:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm
(time : 00:00:09)
No link in temporary table
links found : 1
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm
Optimizing tables...
Indexing complete ! [Back] to admin interface.

Charter 01-21-2005 04:55 AM

Here's a multi-page test...

Spider:

http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news.htm

Results:

Spidering in progress... [Stop spider]
SITE : http://zaartix.ru/
Exclude paths :
- @NONE@
1:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news.htm
(time : 00:00:10)
+ + + + + + + + + + + + + + + + + + + + + +
level 1...
2:http://zaartix.ru/krit/index.php-razdel=price&mach[2]=23.htm
(time : 00:00:34)

3:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=24.htm
(time : 00:00:46)

4:http://zaartix.ru/krit/index.php-razdel=price.htm
(time : 00:01:04)

5:http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=34.htm
(time : 00:01:13)

6:http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=19.htm
(time : 00:01:23)

Duplicate of an existing document
7:http://zaartix.ru/krit/index.php-razdel=price&mach[2]=view.htm
(time : 00:01:40)

8:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=22.htm
(time : 00:01:50)

9:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=21.htm
(time : 00:01:59)

10:http://zaartix.ru/krit/index.htm
(time : 00:02:08)

11:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=20.htm
(time : 00:02:17)

12:http://zaartix.ru/krit/index.php-razdel=price&mach[2]=ost.htm
(time : 00:02:25)

13:http://zaartix.ru/krit/index.php-razdel=price&mach[2]=tech.htm
(time : 00:02:34)

14:http://zaartix.ru/krit/index.php-razdel=price&mach[2]=sert.htm
(time : 00:02:43)

15:http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=27.htm
(time : 00:02:51)

16:http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=32.htm
(time : 00:03:00)

17:http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=33.htm
(time : 00:03:09)

18:http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=16.htm
(time : 00:03:17)

19:http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=17.htm
(time : 00:03:26)

20:http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=vacancies.htm
(time : 00:03:35)

21:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm
(time : 00:03:43)

22:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=78.htm
(time : 00:03:51)

23:http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=2.htm
(time : 00:04:01)

No link in temporary table
links found : 23
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news.htm
http://zaartix.ru/krit/index.php-razdel=price&mach[2]=23.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=24.htm
http://zaartix.ru/krit/index.php-razdel=price.htm
http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=34.htm
http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=19.htm
http://zaartix.ru/krit/index.php-razdel=price&mach[2]=view.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=22.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=21.htm
http://zaartix.ru/krit/index.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=20.htm
http://zaartix.ru/krit/index.php-razdel=price&mach[2]=ost.htm
http://zaartix.ru/krit/index.php-razdel=price&mach[2]=tech.htm
http://zaartix.ru/krit/index.php-razdel=price&mach[2]=sert.htm
http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=27.htm
http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=32.htm
http://zaartix.ru/krit/index.php-razdel=quality&mach[2]=33.htm
http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=16.htm
http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=17.htm
http://zaartix.ru/krit/index.php-razdel=contact&mach[2]=vacancies.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=79.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=78.htm
http://zaartix.ru/krit/index.php-razdel=about&mach[2]=news&mach[3]=2.htm
Optimizing tables...
Indexing complete ! [Back] to admin interface.


All times are GMT -8.