PhpDig.net


zaartix 01-13-2005 09:02 PM

Links not collected correctly

Links on my site look like this:
/index.php?razdel=about&mach[2]=20

But the spider only picks up /index.php?razdel=about&mach

How can I fix this?

Charter 01-13-2005 10:22 PM

There are two regexes in robot_functions.php to edit:

- One
Code:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
- Two
Code:

while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
You need to add the [ and ] characters to the following character classes.

- One
Code:

[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]
- Two
Code:

[:%/?=&;\\,._a-zA-Z0-9 ()~-]
Note, though, that once brackets are allowed, more things that are not links, such as JavaScript fragments, may be picked up as links.

zaartix 01-13-2005 11:39 PM

Thanks, man!
- Two
Code:

[:%/?=&;\\,._a-zA-Z0-9 ()~-]
doesn't work :( I tried each of these:
PHP Code:

eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\[\] ()~-]*))[#\'\" ]?)",$line,$regs);

or

PHP Code:

eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9[\] ()~-]*))[#\'\" ]?)",$line,$regs);

or

PHP Code:

eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9[] ()~-]*))[#\'\" ]?)",$line,$regs);


Charter 01-14-2005 01:08 AM

Not working as in it throws an error?

zaartix 01-14-2005 02:47 AM

No, the spider still gets only /index.php?razdel=about&mach, without the [ ] part.

Charter 01-14-2005 12:44 PM

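Okay, I see. The right bracket doesn't like being in a character class: in the POSIX syntax that eregi uses, a backslash does not escape ] inside a bracket expression, so the class ends at the first ].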

To get PhpDig to accept [ and ] in links, incorporate the following:
PHP Code:

$link = "http://www.domain.com/dir/index.php?razdel=about&mach[2]=20";
$no_one = "[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*";
$no_two = "[:%/?=&;\\,._a-zA-Z0-9 ()~-]*";

if (eregi("($no_one\[?$no_one\]?$no_one)",$link,$regs)) {
    echo $regs[1];
}
if (eregi("($no_two\[?$no_two\]?$no_two)",$link,$regs)) {
    echo $regs[1];
}

// both print http://www.domain.com/dir/index.php?razdel=about&mach[2]=20

For example, you can probably replace both character classes with:
Code:

[:%/?=&;\\,._a-zA-Z0-9|+ ()~-]
And then assign a variable like so:
PHP Code:

$no_brackets = "[:%/?=&;\\,._a-zA-Z0-9|+ ()~-]*";

And then use the following:
Code:

($no_brackets\[?$no_brackets\]?$no_brackets)
in place of:
Code:

([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*)
and in place of:
Code:

([:%/?=&;\\,._a-zA-Z0-9 ()~-]*)
in the two regexes.
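
For anyone on PHP 7.0 or later, where the ereg functions (eregi included) have been removed, the same check has to go through preg_match(). A minimal sketch, assuming PCRE syntax: there an escaped [ or ] is accepted inside a character class, so the three-part workaround is not needed. The class below is a simplified stand-in, not PhpDig's actual pattern, and $link_chars is a hypothetical name.

```php
<?php
// Simplified stand-in class with "[" and "]" added.
// Assumption: PCRE (preg_*), where "\[" and "\]" are literal inside a class.
$link_chars = '[a-zA-Z0-9:%/?=&;,._|+~\[\]-]*';

$link = 'http://www.domain.com/dir/index.php?razdel=about&mach[2]=20';

// "#" is used as the delimiter because the class itself contains "/" and "~".
if (preg_match('#^(' . $link_chars . ')#i', $link, $regs)) {
    echo $regs[1], "\n"; // prints the whole link, brackets included
}
```

This only illustrates the character class; the surrounding href and window.open parts of the two regexes would still need porting to PCRE separately.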

zaartix 01-14-2005 10:01 PM

Thanks, man, for the excellent support.

zaartix 01-18-2005 08:12 PM

It works for links like
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20

But what if the links look like this?
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[2]=20&mach[2]=20&mach[2]=20

PHP Code:

$link = "http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[2]=20&mach[2]=20&mach[2]=20";
$no_one = "[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*";
$no_two = "[:%/?=&;\\,._a-zA-Z0-9 ()~-]*";

if (eregi("($no_one\[?$no_one\]?$no_one)",$link,$regs)) {
    echo $regs[1];
}
if (eregi("($no_two\[?$no_two\]?$no_two)",$link,$regs)) {
    echo $regs[1];
}

// both print http://www.domain.com/dir/index.php?razdel=about&mach[2]=20

But both return only this:
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach

Charter 01-18-2005 09:27 PM

($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)+
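
A minimal sketch of why the trailing + matters. It assumes a modern PHP where eregi() is no longer available, so preg_match() is used instead, and the class is a simplified stand-in for $allowed_link_chars rather than PhpDig's exact pattern ($chars and $group are hypothetical names).

```php
<?php
// Simplified stand-in for $allowed_link_chars (assumption: PCRE syntax).
$chars = '[a-zA-Z0-9:%/?=&;,._|+~-]*';

// One group: allowed chars with at most one "[" and one "]".
$group = $chars . '\[?' . $chars . '\]?' . $chars;

$link = 'http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01';

// Without "+": the match stops just before the second "[".
preg_match('#^(' . $group . ')#i', $link, $m_single);
echo $m_single[1], "\n";
// http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach

// With "+": the group repeats, so every bracket pair is consumed.
preg_match('#^((?:' . $group . ')+)#i', $link, $m_repeated);
echo $m_repeated[1], "\n";
// http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01
```

Each pass through the group can absorb one [ ... ] pair, which is why a single group cuts the link off at the second bracket.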

zaartix 01-19-2005 01:55 AM

I'll make a small example.

zaartix 01-19-2005 02:08 AM

PHP Code:

<?php
// Four sample lines: no brackets, then one, two and three bracket pairs.
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about">test1</a>';
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20">test2</a><table><tr><td></td></tr></table>';
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01">test3</a><table><tr><td></td></tr></table>';
$line[] = '<table><tr><td></td></tr></table><a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01&mach[4]=02">test4</a>';
$i=0;
$allowed_link_chars = "[:%/?=&;\\,._a-zA-Z0-9|+~-]*";
while (isset($line[$i])) {
    // bracket group used once, no quantifier
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars))(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example null<br>";
    }
    // bracket group with ?
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)?)(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example ?<br>";
    }
    // bracket group with *
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)*)(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example *<br>";
    }
    // bracket group with +
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*(($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)+))(#[.a-zA-Z0-9-]*)?[\'\" ])",$line[$i],$regs)) {
        echo $regs[2]." - example +<br>";
    }
    $i++;
}
?>

Only the '*' and '+' examples give correct results.

zaartix 01-19-2005 02:14 AM

Another problem:

PhpDig picks up links from this code:
PHP Code:

<script language='Javascript'>
function showDetail(code,type)
{
    if (type=='all') {
        width=600;
        height=400;
    } else {
        width=600;
        height=200;
    }
    window.open('/detail.php?mach[1]=ost&mach[2]='+code+'&mach[3]='+type,'_blank','scrollbars, resizable, width='+width+',height='+height+', left=200, top=200');
}
function cartAdd(code)
{
    width=600;
    height=350;

    window.open('/detail.php?mach[1]=cart&mach[2]=add&mach[3]='+code,'_blank','scrollbars, resizable, width='+width+',height='+height+', left=200, top=200');
}
</script>

and from this:
PHP Code:

<noindex>
<a href="http://www.domain.com/dir/index.php?razdel=about">test1</a>
</noindex>


Charter 01-19-2005 03:02 AM

  • So + works for your type of [ ] links, right? I'm not sure whether you are still having a problem with [ ] links, but remember to use + in both regexes.
  • PhpDig tries to follow simple window.location and window.open JavaScript links, even if the links are like those you posted. There is no nice and simple way to deal with JavaScript, as people can script in different ways. If you don't want PhpDig to deal with JavaScript, then either remove the related window.whatever stuff from the regex, edit the $allowed_link_chars variable, or use the FORBIDDEN_EXTENSIONS constant to exclude links.
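
To illustrate the first option (dropping the window.open part of the regex), here is a rough sketch. It assumes a modern PHP with preg_match_all() and uses heavily simplified, hypothetical patterns ($with_js, $without_js, $url and collect_links() are not PhpDig names); the real regex in robot_functions.php is longer.

```php
<?php
// Hypothetical, simplified link triggers: with and without the JavaScript alternative.
$with_js    = '(?:href\s*=|window\.open\s*\()';
$without_js = '(?:href\s*=)';

// Simplified URL part: optional quote, then everything up to a quote, space or ">".
$url = '\s*[\'"]?([^\'"\s>]+)';

$html = <<<'HTML'
<a href="/index.php?razdel=about">test1</a>
<script>window.open('/detail.php?mach[1]=cart&mach[2]=add','_blank');</script>
HTML;

// Returns every URL captured after one of the given triggers.
function collect_links(string $trigger, string $url, string $html): array {
    preg_match_all('#' . $trigger . $url . '#i', $html, $m);
    return $m[1];
}

print_r(collect_links($with_js, $url, $html));    // two links: the href and the window.open target
print_r(collect_links($without_js, $url, $html)); // one link: the href only
```

Removing the trigger is blunt: it skips every window.open link, including ones you might want indexed, so excluding specific links may suit some sites better.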


All times are GMT -8.