PhpDig.net

Old 01-13-2005, 09:02 PM   #1
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
Links collected incorrectly

Links on my site look like this:
/index.php?razdel=about&mach[2]=20

But the spider picks up only /index.php?razdel=about&mach

How can I fix this?
Old 01-13-2005, 10:22 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
There are two regexes in robot_functions.php to edit:

- One
Code:
while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
- Two
Code:
while(eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9 ()~-]*))[#\'\" ]?)",$line,$regs)) {
You need to add the [ and ] characters to the following character classes.

- One
Code:
[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]
- Two
Code:
[:%/?=&;\\,._a-zA-Z0-9 ()~-]
Note, though, that once brackets are allowed, more text that is not a link (JavaScript and the like) may be picked up as one.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-13-2005, 11:39 PM   #3
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
Thanks, man!
- Two
Code:
[:%/?=&;\\,._a-zA-Z0-9 ()~-]
doesn't work. I tried:
PHP Code:
eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\[\] ()~-]*))[#\'\" ]?)",$line,$regs);
or

PHP Code:
eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9[\] ()~-]*))[#\'\" ]?)",$line,$regs);
or

PHP Code:
eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9[] ()~-]*))[#\'\" ]?)",$line,$regs);

Last edited by zaartix; 01-13-2005 at 11:48 PM.
Old 01-14-2005, 01:08 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Not working as in it throws an error?
Old 01-14-2005, 02:47 AM   #5
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
No, the spider still gets only /index.php?razdel=about&mach, without the [ ] part.
Old 01-14-2005, 12:44 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
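Okay, I see. The right bracket doesn't like being in a character class: in the POSIX bracket expressions that eregi uses, a backslash does not escape it, so a ] that is not the first character in the list simply closes the class.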

To get PhpDig to accept [ and ] in links, incorporate the following:
PHP Code:
$link = "http://www.domain.com/dir/index.php?razdel=about&mach[2]=20";
$no_one = "[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*";
$no_two = "[:%/?=&;\\,._a-zA-Z0-9 ()~-]*";

if (eregi("($no_one\[?$no_one\]?$no_one)",$link,$regs)) {
    echo $regs[1];
}
if (eregi("($no_two\[?$no_two\]?$no_two)",$link,$regs)) {
    echo $regs[1];
}

// both print http://www.domain.com/dir/index.php?razdel=about&mach[2]=20
For example, you can probably replace both character classes with:
Code:
[:%/?=&;\\,._a-zA-Z0-9|+ ()~-]
And then assign a variable like so:
PHP Code:
$no_brackets = "[:%/?=&;\\,._a-zA-Z0-9|+ ()~-]*";
And then use the following:
Code:
($no_brackets\[?$no_brackets\]?$no_brackets)
in place of:
Code:
([:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*)
and in place of:
Code:
([:%/?=&;\\,._a-zA-Z0-9 ()~-]*)
in the two regexes.
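For anyone reading this on a current PHP: eregi was removed in PHP 7, so the following is only a rough sketch of the same idea with preg_match, not PhpDig's actual code. PCRE, unlike the POSIX syntax above, accepts \[ and \] inside a character class, so the three-part workaround is not needed there. The function name extract_href and the shortened pattern are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of PhpDig: pull the href value out of one <a> tag.
// PCRE (unlike POSIX eregi) accepts \[ and \] inside a character class.
function extract_href(string $html): ?string
{
    // Allowed link characters from the thread, plus the two brackets.
    $chars = '[:%\/?=&;\\\\,._a-zA-Z0-9|+ ()~\[\]-]';
    // Shortened stand-in for regex Two: <a ... href = optional quote, then the link.
    $pattern = '/<a[^>]*href\s*=\s*[\'"]?(' . $chars . '*)/i';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

echo extract_href('<a href="/index.php?razdel=about&mach[2]=20">x</a>'), "\n";
```

With the brackets in the class, the captured link should keep its [2] part instead of stopping at mach.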
Old 01-14-2005, 10:01 PM   #7
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
Thanks, man, for the excellent support.
Old 01-18-2005, 08:12 PM   #8
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
It works for links like
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20

But what if the links look like
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[2]=20&mach[2]=20&mach[2]=20

PHP Code:
$link = "http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[2]=20&mach[2]=20&mach[2]=20";
$no_one = "[:%/?=&;\\,._a-zA-Z0-9\|+ ()~-]*";
$no_two = "[:%/?=&;\\,._a-zA-Z0-9 ()~-]*";

if (eregi("($no_one\[?$no_one\]?$no_one)",$link,$regs)) {
    echo $regs[1];
}
if (eregi("($no_two\[?$no_two\]?$no_two)",$link,$regs)) {
    echo $regs[1];
}

// expected: both print the full link
Both return only this:
http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach

Last edited by zaartix; 01-18-2005 at 08:15 PM.
Old 01-18-2005, 09:27 PM   #9
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Put a + after the group so that it can repeat, one pass per bracket pair:
($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)+
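
A small sketch of why the + matters, written with preg_match because eregi no longer exists in PHP 7 and later. The variable names and the test link are assumptions for illustration; the group itself is the one above, with the character class taken from earlier in the thread.

```php
<?php
// Building block from the thread: link characters with no brackets in the class.
$allowed = '[:%\/?=&;\\\\,._a-zA-Z0-9|+ ()~-]*';
// One optional [ and one optional ] per pass through the group.
$group = '(?:' . $allowed . '\[?' . $allowed . '\]?' . $allowed . ')';

$link = 'http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01&mach[4]=02';

preg_match('/^' . $group . '/', $link, $once);       // group used once
preg_match('/^' . $group . '+/', $link, $repeated);  // group repeated with +

echo $once[0], "\n";      // stops just before the second [
echo $repeated[0], "\n";  // the whole link
```

Used once, the group can consume only one [ and one ], so the match ends at the second opening bracket. With +, each further bracket pair is picked up by another pass through the group.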
Old 01-19-2005, 01:55 AM   #12
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
I'll make a small example.
Old 01-19-2005, 02:08 AM   #13
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
PHP Code:
<?php
$line = array();
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about">test1</a>';
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20">test2</a><table><tr><td></td></tr></table>';
$line[] = '<a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01">test3</a><table><tr><td></td></tr></table>';
$line[] = '<table><tr><td></td></tr></table><a href="http://www.domain.com/dir/index.php?razdel=about&mach[2]=20&mach[3]=01&mach[4]=02">test4</a>';
$i = 0;
$allowed_link_chars = "[:%/?=&;\\,._a-zA-Z0-9|+~-]*";
// isset() stops the loop cleanly after the last test line
while (isset($line[$i])) {
    // bracket group with no quantifier
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars))(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example null<br>";
    }
    // bracket group followed by ?
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)?)(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example ?<br>";
    }
    // bracket group followed by *
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)*)(#[.a-zA-Z0-9-]*)?[\'\" ]?)",$line[$i],$regs)) {
        echo $regs[2]." - example *<br>";
    }
    // bracket group followed by +
    if (eregi("<a([^>]*href[[:blank:]]*=[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*(($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars)+))(#[.a-zA-Z0-9-]*)?[\'\" ])",$line[$i],$regs)) {
        echo $regs[2]." - example +<br>";
    }
    $i++;
}
?>
Only the '*' and '+' examples give correct results.
Old 01-19-2005, 02:14 AM   #14
zaartix
Orange Mole
 
Join Date: May 2004
Location: russia, samara
Posts: 56
Another problem:

PhpDig picks up links from this code:
PHP Code:
<script language='Javascript'>
function showDetail(code,type)
{
    if (type=='all') {
        width=600;
        height=400;
    } else {
        width=600;
        height=200;
    }
    window.open('/detail.php?mach[1]=ost&mach[2]='+code+'&mach[3]='+type,'_blank','scrollbars, resizable, width='+width+',height='+height+', left=200, top=200');
}
function cartAdd(code)
{
    width=600;
    height=350;

    window.open('/detail.php?mach[1]=cart&mach[2]=add&mach[3]='+code,'_blank','scrollbars, resizable, width='+width+',height='+height+', left=200, top=200');
}
</script>
and from this:
PHP Code:
<noindex>
<a href="http://www.domain.com/dir/index.php?razdel=about">test1</a>
</noindex>

Last edited by zaartix; 01-19-2005 at 02:21 AM.
Old 01-19-2005, 03:02 AM   #15
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
  • So + works for your type of [ ] links, right? I'm not sure whether you are still having a problem with [ ] type links, but remember to use + in those two regexes.
  • PhpDig tries to follow simple window.location and window.open JavaScript links, even if the links are like those you posted. There is no nice and simple way to deal with JavaScript, as people can script in different ways. If you don't want PhpDig to deal with JavaScript, then either remove the related window.whatever stuff from the regex, edit the $allowed_link_chars variable, or use the FORBIDDEN_EXTENSIONS constant to exclude links.
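
As a rough illustration of the first option (removing the window.whatever alternatives), here is a sketch using preg_match rather than the removed eregi. The two prefix patterns are simplified stand-ins for the alternation in the first regex, not PhpDig's real code, and all variable names are assumptions.

```php
<?php
// Simplified stand-ins (assumptions) for the prefix alternation in regex One.
$with_js    = '(?:href\s*=|window[.]location\s*=|window[.]open\s*[(])';
$without_js = '(?:href\s*=)';
// Optional quote, then link characters including brackets (PCRE allows \[ \]).
$target = '\s*[\'"]?([:%\/?=&;,._a-zA-Z0-9|+~\[\]-]+)';

$js = "window.open('/detail.php?mach[1]=cart&mach[2]=add&mach[3]='+code,'_blank');";

$hit_with    = preg_match('/' . $with_js . $target . '/i', $js, $m);
$hit_without = preg_match('/' . $without_js . $target . '/i', $js);

echo "with window.* prefixes: ", $hit_with ? $m[1] : '(no match)', "\n";
echo "href only: ", $hit_without ? 'match' : '(no match)', "\n";
```

With the window.open alternative present, the scripted call yields a truncated link that ends where the string literal is closed; with only href left, the script line is not matched at all.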
All times are GMT -8.