PhpDig.net

Old 10-18-2003, 06:01 AM   #1
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
exclude doesn't really work?

hi!

i just found some text on my site that shouldn't be indexed.
i put it between <!-- phpdigExclude --> and <!-- phpdigInclude --> and then reindexed the page.
but the search still finds that page, although the text should have been excluded. does anyone know this problem?
Old 10-18-2003, 07:40 PM   #2
Rolandks
Purple Mole
 
Join Date: Sep 2003
Location: Kassel, Germany
Posts: 119
Try this:
In the file robot_functions.php, add the statement "continue;" at line 777.

Read this Thread:

PHP Code:
...
    ...
    else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
        $exclude = false;
        continue;
    }
    ...
    ...
-Roland-
Old 10-19-2003, 06:30 AM   #3
GeminiHB
Green Mole
 
Join Date: Oct 2003
Posts: 2
I have the same problem:
Excluded content has been indexed.

The "continue"-bug is fixed. I deleted and reinstalled the whole PhpDig database. To test the exclude function, I excluded this example:

PHP Code:
<!-- phpdigExclude -->langestestwort<!-- phpdigInclude --> 
But the string "langestestwort" is in the keyword table after the next reindexing process.

Maybe someone can help me,

thanks,

Holger
Old 10-19-2003, 06:44 AM   #4
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
same problem with me, the "continue"-thing doesn't seem to fix it.
Old 10-19-2003, 07:28 AM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
PHP Code:
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        // and so forth
Hi. The above code in robot_functions.php looks at each line in the file for the PhpDig exclude and include comments. Perhaps try the following.

Instead of having the PhpDig exclude and include comments on one line like so:
Code:
<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->
try putting the PhpDig exclude and include comments on their own separate lines like so:
Code:
<!-- phpdigExclude -->
some stuff
<!-- phpdigInclude -->
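
To illustrate why the one-line form is not recognized, here is a minimal, self-contained sketch of the line-by-line matching described above. It is not the actual PhpDig code: the function name filterExcluded() is made up for this example, and the two constants are defined locally on the assumption that they hold the comment strings used in this thread.

```php
<?php
// Assumed values; in PhpDig these constants come from the config file.
define('PHPDIG_EXCLUDE_COMMENT', '<!-- phpdigExclude -->');
define('PHPDIG_INCLUDE_COMMENT', '<!-- phpdigInclude -->');

// Hypothetical helper: returns the trimmed lines that would remain indexable.
function filterExcluded(array $file_content) {
    $exclude = false;
    $kept = array();
    foreach ($file_content as $line) {
        // A marker is only recognized when the whole trimmed line equals it.
        if (trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        if (!$exclude && trim($line)) {
            $kept[] = trim($line);
        }
    }
    return $kept;
}

// Comments on their own lines: "secret" is dropped.
$separate = array(
    "visible\n",
    "<!-- phpdigExclude -->\n",
    "secret\n",
    "<!-- phpdigInclude -->\n",
    "more\n",
);
// Everything on one line: no line equals either comment, so nothing is excluded.
$sameLine = array(
    "visible\n",
    "<!-- phpdigExclude -->secret<!-- phpdigInclude -->\n",
    "more\n",
);

print_r(filterExcluded($separate));
print_r(filterExcluded($sameLine));
```

In the second call the whole line, including "secret", stays in the result, because the comparison is against the complete trimmed line rather than a substring.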
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 10-19-2003, 07:36 AM   #6
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
they are not in one line on my site. that can't be the reason.
Old 10-19-2003, 08:51 AM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Can you check to make sure that your phpdigTestUrl function in robot_functions.php is the same as the one that comes with PhpDig version 1.6.2?
Old 10-19-2003, 09:12 AM   #8
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
function phpdigTestUrl($url,$mode='simple',$cookies=array()) {

    $components = parse_url($url);
    $lm_date = '';
    $status = 'NOFILE';
    $auth_string = '';
    $redirs = 0;
    $stop = false;

    if (isset($components['host'])) {
        $host = $components["host"];
        if (isset($components['user']) && isset($components['pass']) &&
            $components['user'] && $components['pass']) {
            $auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";
        }
    }
    else {
        $host = '';
    }

    if (isset($components['port'])) {
        $port = (int)$components["port"];
    }
    else {
        $port = 80;
    }

    if (isset($components['path'])) {
        $path = $components["path"];
    }
    else {
        $path = '';
    }

    if (isset($components['query'])) {
        $query = $components["query"];
    }
    else {
        $query = '';
    }

    $fp = @fsockopen($host,$port);

    if ($port != 80) {
        $sport = ":".$port;
    }
    else {
        $sport = "";
    }

    if (!$fp) {
        //host domain not found
        $status = "NOHOST";
    }
    else {
        if ($query) {
            $path .= "?".$query;
        }

        $cookiesSendString = phpDigMakeCookies($cookies,$path);

        //complete get
        $request =
            "HEAD $path HTTP/1.1\n"
            ."Host: $host$sport\n"
            .$cookiesSendString
            .$auth_string
            ."Accept: */*\n"
            ."Accept-Charset: ".PHPDIG_ENCODING."\n"
            ."Accept-Encoding: identity\n"
            ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";

        fputs($fp,$request);

        //test return code
        while (!$stop && !feof($fp)) {
            $answer = fgets($fp,8192);

            //print $answer."<br>\n";

            if (isset($req1) && $req1) {
                //close, and open a new connection
                //on the new location
                fclose($fp);
                $fp = fsockopen($host,$port);
                if (!$fp) {
                    //host domain not found
                    $status = "NOHOST";
                    break;
                }
                else {
                    fputs($fp,$req1);
                    unset($req1);
                    $answer = fgets($fp,8192);
                }
            }

            if (ereg("HTTP/[0-9.]+ (([0-9])[0-9]{2})", $answer,$regs)) {
                if ($regs[2] == 2 || $regs[2] == 3) {
                    $code = $regs[2];
                }
                elseif ($regs[1] >= 401 && $regs[1] <= 403) {
                    $status = "UNAUTH";
                    break;
                }
                else {
                    $status = "NOFILE";
                    break;
                }
            }
            else if (eregi("^ *location: *(.*)",$answer,$regs) && $code == 3) {
                if ($redirs > 4) {
                    $stop = true;
                    $status = "LOOP";
                }
                $newpath = trim($regs[1]);
                $newurl = parse_url($newpath);
                //search if relocation is absolute or relative
                if (!isset($newurl["host"])
                    && isset($newurl["path"])
                    && !ereg('^/',$newurl["path"])) {
                    $path = dirname($path).'/'.$newurl["path"];
                }
                else {
                    $path = $newurl["path"];
                }
                if (!isset($newurl['host']) || !$newurl['host'] || $host == $newurl['host']) {

                    $cookiesSendString = phpDigMakeCookies($cookies,$path);
                    $req1 = "HEAD $path HTTP/1.1\n"
                        ."Host: $host$sport\n"
                        .$cookiesSendString
                        .$auth_string
                        ."Accept: */*\n"
                        ."Accept-Charset: ".PHPDIG_ENCODING."\n"
                        ."Accept-Encoding: identity\n"
                        ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";
                }
                else {
                    $stop = true;
                    $status = "NEWHOST";
                    $host = $newurl['host'];
                }
            }
            //parse cookies
            elseif (eregi("Set-Cookie: *(([^=]+)=[^; ]+) *(; *path=([^; ]+))* *(; *domain=([^; ]+))*",$answer,$regs)) {
                $cookies[$regs[2]] = array('string'=>$regs[1],'path'=>$regs[4],'domain'=>$regs[6]);
            }
            //Parse content-type header
            elseif (eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
                if ($regs[1] == "text") {
                    switch ($regs[2]) {
                        case 'plain':
                            $status = 'PLAINTEXT';
                            break;
                        case 'html':
                            $status = 'HTML';
                            break;
                        default :
                            $status = "NOFILE";
                            $stop = true;
                    }
                }
                else if ($regs[1] == "application") {
                    if ($regs[2] == 'msword' && PHPDIG_INDEX_MSWORD == true) {
                        $status = "MSWORD";
                    }
                    else if ($regs[2] == 'pdf' && PHPDIG_INDEX_PDF == true) {
                        $status = "PDF";
                    }
                    else if ($regs[2] == 'vnd.ms-excel' && PHPDIG_INDEX_MSEXCEL == true) {
                        $status = "MSEXCEL";
                    }
                    else {
                        $status = "NOFILE";
                        $stop = true;
                    }
                }
                else {
                    $status = "NOFILE";
                    $stop = true;
                }
            }
            elseif (eregi('Last-Modified: *([a-z0-9,: ]+)',$answer,$regs)) {
                //search last-modified header
                $lm_date = $regs[1];
            }

            if (!eregi('[a-z0-9]+',$answer)) {
                $stop = true;
            }

        }
        @fclose($fp);
    }

    //returns variable or array
    if ($mode == 'date') {
        return compact('status', 'lm_date', 'path', 'host', 'cookies');
    }
    else {
        return $status;
    }
}


that's it. and i haven't changed anything on it, so i guess it should be the correct one.
Old 10-19-2003, 11:58 AM   #9
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Yep, that looks correct. Let's try echoing out some stuff.

In robot_functions.php, right before:
PHP Code:
foreach ($file_content as $num => $line) { 
put the following:
PHP Code:
// start echo stuff
$ijk_cnt = 0;
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            echo "Content type is HTML and PhpDig exclude comment found.<br>";
        }
        elseif (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            echo "PhpDig include comment found.<br>";
        }
        else {
            if ($ijk_cnt == 0) {
                echo "Content type is: " . $content_type . ".<br>";
                $ijk_cnt++;
            }
        }
    }
    else {
        echo "Trim line is false.<br><br>";
    }
}
exit();
// end echo stuff
Now try to crawl a demo page with PhpDig exclude and include comments. What are your results for this?
Old 10-19-2003, 02:41 PM   #10
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
okay. that's what i got:

Content type is: HTML.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

PhpDig include comment found.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.


does that help you at all?
Old 10-19-2003, 02:57 PM   #11
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
The results of the echo test tell me that the PhpDig include comment is being found, but that the PhpDig exclude comment before that is not being found.

Assuming that the PhpDig exclude comment is on one line by itself, maybe there is a typo in the config file. Can you check what PHPDIG_EXCLUDE_COMMENT is set to in the config file? Does this match what is being used in the files that you want to crawl?
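
One way to run this check is to compare the configured string with the line from the page byte by byte, since a character such as a non-breaking space looks like an ordinary space in an editor but is not removed by trim(). The sketch below is only an illustration: explainMismatch() and squashSpaces() are hypothetical helpers, not part of PhpDig, and the expected value is assumed to be the comment used in this thread.

```php
<?php
// Hypothetical helper: removes whitespace and the Latin-1 non-breaking
// space (byte 0xA0), which trim() leaves in place by default.
function squashSpaces($s) {
    return preg_replace('/[\s\xA0]+/', '', $s);
}

// Hypothetical helper: says why a page line does not equal the expected comment.
function explainMismatch($line, $expected) {
    $trimmed = trim($line);
    if ($trimmed === $expected) {
        return 'match';
    }
    if (strcasecmp($trimmed, $expected) == 0) {
        return 'differs only in letter case';
    }
    if (squashSpaces($trimmed) === squashSpaces($expected)) {
        // The hex dump makes invisible bytes visible.
        return 'differs only in whitespace: ' . bin2hex($trimmed);
    }
    return 'different text: ' . bin2hex($trimmed);
}

// Assumed config value for PHPDIG_EXCLUDE_COMMENT.
$expected = '<!-- phpdigExclude -->';

echo explainMismatch("  <!-- phpdigExclude -->\r\n", $expected), "\n";
echo explainMismatch('<!-- PHPDigExclude -->', $expected), "\n";
echo explainMismatch("<!-- phpdigExclude\xA0-->", $expected), "\n";
```

A line that reports anything other than "match" would not be recognized by a strict equality comparison like the one quoted earlier in the thread.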
Old 10-19-2003, 03:27 PM   #12
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
yeah i checked that already. there's no mistake in this.
and the other funny thing is: if it really were the case that it only finds the exclude but not the include afterwards, then why has it indexed anything at all? strange.
Old 10-19-2003, 06:41 PM   #13
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
This is strange. It finds the include, so things should get indexed, but it doesn't find the exclude. With exclude before include, and each on their own line, it seems that:
PHP Code:
if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) { 
is coming out false, even though the content type is HTML and there is no typo. How about changing:
PHP Code:
if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) { 
to the following:
PHP Code:
if (trim($line) == PHPDIG_EXCLUDE_COMMENT) { 
and try to crawl a demo page again.
Old 10-19-2003, 07:02 PM   #14
manute
Orange Mole
 
Join Date: Oct 2003
Location: hamburg, germany
Posts: 52
same thing again. it only finds the include. i got that wrong in my last post, so it's actually not that strange: it only finds the include, and that already explains why it doesn't exclude what i want excluded.
and yes, the include and exclude comments are not on the same line.

now i tried another thing. i made a file test.php with nothing but

<!-- phpdigExclude -->
word
<!-- phpdigInclude -->

in it. and now the result of the spider:

Content type is HTML and PhpDig exclude comment found.
Content type is: HTML.
PhpDig include comment found.

now that seems to be the way it should be - right?
the one i tried before was quite a big page. and something in that must have caused the error. now that could become quite a needle in a haystack...
any ideas? wanna see the html?
Old 10-19-2003, 07:40 PM   #15
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Yep, that test.php file is the way it should be, so maybe it is something with the big page like you say. You are right that the include and exclude should not be on the same line.

Also, the include and exclude need to be on lines by themselves. Maybe try editing the big page in a text only editor to make sure that the include and exclude comments are on lines by themselves and no soft wrapping is going on there.

If you want, just post a snippet of the html around the exclude comment, like +/- 10 or so lines.
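
For a large page, a small script can narrow the search. The sketch below uses a hypothetical helper (findNearMisses() is not a PhpDig function) to list lines that mention one of the markers without consisting of exactly that comment, which the line-equality check quoted earlier in the thread would not recognize. The marker strings are assumed to be the ones used in this thread.

```php
<?php
// Hypothetical helper: returns [line number => trimmed line] for lines that
// contain a marker comment but are not exactly equal to it after trim().
function findNearMisses(array $lines, array $markers) {
    $found = array();
    foreach ($lines as $num => $line) {
        $trimmed = trim($line);
        foreach ($markers as $marker) {
            // stripos() is case-insensitive, so case variants are flagged too.
            if (stripos($trimmed, $marker) !== false && $trimmed !== $marker) {
                $found[$num + 1] = $trimmed;
                break;
            }
        }
    }
    return $found;
}

// Assumed marker strings.
$markers = array('<!-- phpdigExclude -->', '<!-- phpdigInclude -->');

// Made-up sample page for illustration.
$page = array(
    "<p>intro</p>\n",
    "<td><!-- phpdigExclude -->\n",   // marker shares the line with a tag
    "secret\n",
    "<!-- phpdigInclude -->\n",       // alone on its line, so not reported
);

print_r(findNearMisses($page, $markers));
```

For a real page, the array returned by PHP's file() function could be passed in as $lines. Any line reported here is a candidate for moving the comment onto a line of its own.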