PhpDig.net


manute 10-18-2003 05:01 AM

exclude doesn't really work?
 
hi!

i just found some text that shouldn't be indexed on my site.
i put it into <!-- phpdigExclude --> and <!-- phpdigInclude --> and then reindexed the page.
but now it still finds that page, although the text should have been excluded. does anyone know that problem?

Rolandks 10-18-2003 06:40 PM

Try this:
In the file robot_functions.php, add the statement "continue;" at line 777.

Read this Thread:

PHP Code:

...
    ...
    else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
        $exclude = false;
        continue;
    }
    ...
    ...

-Roland-

GeminiHB 10-19-2003 05:30 AM

I have the same problem:
Excluded content has been indexed.

The "continue"-bug is fixed. I deleted and reinstalled the whole phpdig database. I excluded this example to test the exclude function:

PHP Code:

<!-- phpdigExclude -->langestestwort<!-- phpdigInclude --> 

But the string "langestestwort" is in the keyword table after the next reindexing process.

Maybe someone can help me,

thanks,

Holger

manute 10-19-2003 05:44 AM

same problem with me, the "continue"-thing doesn't seem to fix it.

Charter 10-19-2003 06:28 AM

PHP Code:

foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        // and so forth

Hi. The above code in robot_functions.php looks at each line in the file for the PhpDig exclude and include comments. Perhaps try the following.

Instead of having the PhpDig exclude and include comments on one line like so:
Code:

<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->

try putting the PhpDig exclude and include comments on their own separate lines like so:
Code:

<!-- phpdigExclude -->
some stuff
<!-- phpdigInclude -->
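
To see why the placement matters, here is a minimal standalone sketch of the line-by-line comparison described above. It is not the PhpDig source: the helper name filterExcluded is made up, the two constants are defined here only as stand-ins (the real values come from the PhpDig config file), and the loop is simplified so that it only tracks the exclude flag.

```php
<?php
// Stand-in values; in PhpDig these constants are set in the config file.
define('PHPDIG_EXCLUDE_COMMENT', '<!-- phpdigExclude -->');
define('PHPDIG_INCLUDE_COMMENT', '<!-- phpdigInclude -->');

// Hypothetical helper: returns the (trimmed) lines that would still be indexed.
function filterExcluded(array $lines)
{
    $exclude = false;
    $kept = array();
    foreach ($lines as $line) {
        $trimmed = trim($line);
        if ($trimmed === '') {
            continue; // skip blank lines
        }
        // A marker only matches when it is the whole trimmed line.
        if ($trimmed == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
            continue;
        }
        if ($trimmed == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        if (!$exclude) {
            $kept[] = $trimmed;
        }
    }
    return $kept;
}

$inline = array(
    "<p>keep</p>\n",
    "<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->\n",
);
$separate = array(
    "<p>keep</p>\n",
    "<!-- phpdigExclude -->\n",
    "some stuff\n",
    "<!-- phpdigInclude -->\n",
);

print_r(filterExcluded($inline));   // the inline line is kept, markers and all
print_r(filterExcluded($separate)); // only "<p>keep</p>" is kept
```

With both comments on one line, the trimmed line never equals either constant, so the flag is never set and the text between the comments stays in.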


manute 10-19-2003 06:36 AM

they are not in one line on my site. that can't be the reason.

Charter 10-19-2003 07:51 AM

Hi. Can you check to make sure that your phpdigTestUrl function in robot_functions.php is the same as the one that comes with PhpDig version 1.6.2?

manute 10-19-2003 08:12 AM

function phpdigTestUrl($url,$mode='simple',$cookies=array()) {

    $components = parse_url($url);
    $lm_date = '';
    $status = 'NOFILE';
    $auth_string = '';
    $redirs = 0;
    $stop = false;

    if (isset($components['host'])) {
        $host = $components["host"];
        if (isset($components['user']) && isset($components['pass']) &&
            $components['user'] && $components['pass']) {
            $auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";
        }
    }
    else {
        $host = '';
    }

    if (isset($components['port'])) {
        $port = (int)$components["port"];
    }
    else {
        $port = 80;
    }

    if (isset($components['path'])) {
        $path = $components["path"];
    }
    else {
        $path = '';
    }

    if (isset($components['query'])) {
        $query = $components["query"];
    }
    else {
        $query = '';
    }

    $fp = @fsockopen($host,$port);

    if ($port != 80) {
        $sport = ":".$port;
    }
    else {
        $sport = "";
    }

    if (!$fp) {
        //host domain not found
        $status = "NOHOST";
    }
    else {
        if ($query) {
            $path .= "?".$query;
        }

        $cookiesSendString = phpDigMakeCookies($cookies,$path);

        //complete get
        $request =
            "HEAD $path HTTP/1.1\n"
            ."Host: $host$sport\n"
            .$cookiesSendString
            .$auth_string
            ."Accept: */*\n"
            ."Accept-Charset: ".PHPDIG_ENCODING."\n"
            ."Accept-Encoding: identity\n"
            ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";

        fputs($fp,$request);

        //test return code
        while (!$stop && !feof($fp)) {
            $answer = fgets($fp,8192);

            //print $answer."<br>\n";

            if (isset($req1) && $req1) {
                //close, and open a new connection
                //on the new location
                fclose($fp);
                $fp = fsockopen($host,$port);
                if (!$fp) {
                    //host domain not found
                    $status = "NOHOST";
                    break;
                }
                else {
                    fputs($fp,$req1);
                    unset($req1);
                    $answer = fgets($fp,8192);
                }
            }

            if (ereg("HTTP/[0-9.]+ (([0-9])[0-9]{2})", $answer,$regs)) {
                if ($regs[2] == 2 || $regs[2] == 3) {
                    $code = $regs[2];
                }
                elseif ($regs[1] >= 401 && $regs[1] <= 403) {
                    $status = "UNAUTH";
                    break;
                }
                else {
                    $status = "NOFILE";
                    break;
                }
            }
            else if (eregi("^ *location: *(.*)",$answer,$regs) && $code == 3) {
                if ($redirs > 4) {
                    $stop = true;
                    $status = "LOOP";
                }
                $newpath = trim($regs[1]);
                $newurl = parse_url($newpath);
                //search if relocation is absolute or relative
                if (!isset($newurl["host"])
                    && isset($newurl["path"])
                    && !ereg('^/',$newurl["path"])) {
                    $path = dirname($path).'/'.$newurl["path"];
                }
                else {
                    $path = $newurl["path"];
                }
                if (!isset($newurl['host']) || !$newurl['host'] || $host == $newurl['host']) {

                    $cookiesSendString = phpDigMakeCookies($cookies,$path);
                    $req1 = "HEAD $path HTTP/1.1\n"
                        ."Host: $host$sport\n"
                        .$cookiesSendString
                        .$auth_string
                        ."Accept: */*\n"
                        ."Accept-Charset: ".PHPDIG_ENCODING."\n"
                        ."Accept-Encoding: identity\n"
                        ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";
                }
                else {
                    $stop = true;
                    $status = "NEWHOST";
                    $host = $newurl['host'];
                }
            }
            //parse cookies
            elseif (eregi("Set-Cookie: *(([^=]+)=[^; ]+) *(; *path=([^; ]+))* *(; *domain=([^; ]+))*",$answer,$regs)) {
                $cookies[$regs[2]] = array('string'=>$regs[1],'path'=>$regs[4],'domain'=>$regs[6]);
            }
            //Parse content-type header
            elseif (eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
                if ($regs[1] == "text") {
                    switch ($regs[2]) {
                        case 'plain':
                            $status = 'PLAINTEXT';
                            break;
                        case 'html':
                            $status = 'HTML';
                            break;
                        default :
                            $status = "NOFILE";
                            $stop = true;
                    }
                }
                else if ($regs[1] == "application") {
                    if ($regs[2] == 'msword' && PHPDIG_INDEX_MSWORD == true) {
                        $status = "MSWORD";
                    }
                    else if ($regs[2] == 'pdf' && PHPDIG_INDEX_PDF == true) {
                        $status = "PDF";
                    }
                    else if ($regs[2] == 'vnd.ms-excel' && PHPDIG_INDEX_MSEXCEL == true) {
                        $status = "MSEXCEL";
                    }
                    else {
                        $status = "NOFILE";
                        $stop = true;
                    }
                }
                else {
                    $status = "NOFILE";
                    $stop = true;
                }
            }
            elseif (eregi('Last-Modified: *([a-z0-9,: ]+)',$answer,$regs)) {
                //search last-modified header
                $lm_date = $regs[1];
            }

            if (!eregi('[a-z0-9]+',$answer)) {
                $stop = true;
            }

        }
        @fclose($fp);
    }

    //returns variable or array
    if ($mode == 'date') {
        return compact('status', 'lm_date', 'path', 'host', 'cookies');
    }
    else {
        return $status;
    }
}


that's it. and i haven't changed anything on it, so i guess it should be the correct one.

Charter 10-19-2003 10:58 AM

Yep, that looks correct. Let's try echoing out some stuff.

In robot_functions.php, right before:
PHP Code:

foreach ($file_content as $num => $line) { 

put the following:
PHP Code:

// start echo stuff
$ijk_cnt = 0;
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            echo "Content type is HTML and PhpDig exclude comment found.<br>";
        }
        elseif (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            echo "PhpDig include comment found.<br>";
        }
        else {
            if ($ijk_cnt == 0) {
                echo "Content type is: " . $content_type . ".<br>";
                $ijk_cnt++;
            }
        }
    }
    else {
        echo "Trim line is false.<br><br>";
    }
}
exit();
// end echo stuff

Now try to crawl a demo page with PhpDig exclude and include comments. What are your results for this?

manute 10-19-2003 01:41 PM

okay. that's what i got:

Content type is: HTML.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

PhpDig include comment found.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.


does that help you at all?

Charter 10-19-2003 01:57 PM

The results of the echo test tell me that the PhpDig include comment is being found, but that the PhpDig exclude comment before that is not being found.

Assuming that the PhpDig exclude comment is on one line by itself, maybe there is a typo in the config file. Can you check what PHPDIG_EXCLUDE_COMMENT is set to in the config file? Does this match what is being used in the files that you want to crawl?
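
One way to check this without eyeballing the page is a small script that lists every line mentioning the exclude marker and reports whether it would pass the exact-match test. This is only a sketch: findMarkerLines is a made-up helper, not part of PhpDig, and the marker value is assumed to be the `<!-- phpdigExclude -->` string used earlier in the thread; substitute whatever your config file defines.

```php
<?php
// Hypothetical helper, not part of PhpDig. $marker should be the value of
// PHPDIG_EXCLUDE_COMMENT from your config file.
function findMarkerLines(array $lines, $marker)
{
    // Loose needle: the marker without comment delimiters and whitespace,
    // e.g. "phpdigExclude", so near misses are also reported.
    $needle = trim($marker, "<>!- \t");
    $report = array();
    foreach (array_values($lines) as $num => $line) {
        if (stripos($line, $needle) === false) {
            continue;
        }
        $report[] = array(
            'line'  => $num + 1,                   // 1-based line number
            'exact' => (trim($line) == $marker),   // same test as the crawler loop
            'raw'   => rtrim($line, "\r\n"),
        );
    }
    return $report;
}

// Sample input; for a real page you could pass the result of file($path).
$page = array(
    "<p>intro</p>\n",
    "<!-- phpdigExclude -->\n",
    "  <!--phpdigExclude-->\n",
    "<td><!-- phpdigExclude --></td>\n",
);
foreach (findMarkerLines($page, '<!-- phpdigExclude -->') as $hit) {
    echo $hit['line'], ': ', ($hit['exact'] ? 'exact match' : 'NOT matched'),
         ' | ', $hit['raw'], "\n";
}
```

If the page is generated by a script, it is probably more useful to run this against the HTML as served (for example, saved from the browser's view-source) than against the source file, since the served output is what the spider reads.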

manute 10-19-2003 02:27 PM

yeah i checked that already. there's no mistake in this.
and the other funny thing is: if it would really be this way, that it only finds the exclude but afterwards not the include, then why has it indexed anything at all? strange.

Charter 10-19-2003 05:41 PM

This is strange. It finds the include, so things should get indexed, but it doesn't find the exclude. With exclude before include, and each on their own line, it seems that:
PHP Code:

if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) { 

is coming out false, even though the content type is HTML and there is no typo. How about changing:
PHP Code:

if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) { 

to the following:
PHP Code:

if (trim($line) == PHPDIG_EXCLUDE_COMMENT) { 

and try to crawl a demo page again.

manute 10-19-2003 06:02 PM

same thing again. it only finds the include. i'd mistaken that in my last post, so it's actually not that strange, because it does only find the include and that's already a good explanation for why it doesn't exclude what i want to get excluded.
and yes, the include- and exclude-comments are not in the same line.

now i tried another thing. i made a file test.php with nothing but

<!-- phpdigExclude -->
word
<!-- phpdigInclude -->

in it. and now the result of the spider:

Content type is HTML and PhpDig exclude comment found.
Content type is: HTML.
PhpDig include comment found.

now that seems to be the way it should be - right?
the one i tried before was quite a big page. and something in that must have caused the error. now that could become quite a needle in a haystack...
any ideas? wanna see the html?

Charter 10-19-2003 06:40 PM

Yep, that test.php file is the way it should be, so maybe it is something with the big page like you say. You are right that the include and exclude should not be on the same line.

Also, the include and exclude need to be on lines by themselves. Maybe try editing the big page in a text only editor to make sure that the include and exclude comments are on lines by themselves and no soft wrapping is going on there.

If you want, just post a snippet of the html around the exclude comment, like +/- 10 or so lines.
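
If it helps, a snippet like that can be pulled out with a few lines of PHP that also make invisible characters (tabs, carriage returns, newlines) show up as escapes, so stray characters on the comment line are easier to spot. This is a rough sketch: contextAround is a made-up helper, not part of PhpDig, and the needle "phpdigExclude" is assumed from the comment text used in this thread.

```php
<?php
// Hypothetical helper, not part of PhpDig: returns the lines within $radius
// of the first line that mentions $needle, prefixed with line numbers and
// with control characters shown as visible escapes.
function contextAround(array $lines, $needle, $radius = 10)
{
    $lines = array_values($lines); // make sure keys are 0..n-1
    foreach ($lines as $num => $line) {
        if (stripos($line, $needle) === false) {
            continue;
        }
        $start = max(0, $num - $radius);
        $end = min(count($lines) - 1, $num + $radius);
        $out = array();
        for ($i = $start; $i <= $end; $i++) {
            // addcslashes() escapes every character in the range \0..\37,
            // so a tab is printed as \t, a carriage return as \r, and so on.
            $out[] = sprintf('%4d: %s', $i + 1, addcslashes($lines[$i], "\0..\37"));
        }
        return $out;
    }
    return array(); // needle not found
}

// Sample input; for a real page you could pass the result of file($path).
$page = array(
    "<table>\n",
    "<tr>\r\n",
    "\t<!-- phpdigExclude -->\r\n",
    "word\n",
    "<!-- phpdigInclude -->\n",
);
echo implode("\n", contextAround($page, 'phpdigExclude', 1)), "\n";
```

The default radius of 10 matches the +/- 10 lines asked for above; the sample call uses 1 only to keep the output short.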

