exclude doesn't really work?
Hi!
I just found some text on my site that shouldn't be indexed. I put it between <!-- phpdigExclude --> and <!-- phpdigInclude --> and then reindexed the page. But the search still finds that page, although the text should have been excluded. Does anyone know this problem? |
Try this:
In the file robot_functions.php, add the statement "continue;" at line 777. Read this thread: PHP Code:
|
I have the same problem:
Excluded content has been indexed. The "continue" bug is fixed. I deleted and reinstalled the whole PhpDig database. I excluded this example to test the exclude function: PHP Code:
Maybe someone can help me, thanks, Holger |
Same problem here. The "continue" fix doesn't seem to solve it.
|
PHP Code:
Instead of having the PhpDig exclude and include comments on one line like so:
Code:
<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->
Code:
<!-- phpdigExclude --> |
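To illustrate why the placement of the comments could matter, here is a minimal sketch of a line-based filter. It is not PhpDig's actual code: the function name filterExcludedLines is hypothetical, and the sketch assumes the spider only reacts to a marker when the trimmed line consists of nothing but that marker.

```php
<?php
// Hypothetical helper, NOT taken from PhpDig: assumes a marker only counts
// when the trimmed line is exactly the marker and nothing else.
function filterExcludedLines(array $lines, string $exclude, string $include): array
{
    $indexing = true;   // indexing is on until an exclude marker is seen
    $kept = [];
    foreach ($lines as $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) {
            $indexing = false;  // stop keeping lines
            continue;
        }
        if ($trimmed === $include) {
            $indexing = true;   // resume keeping lines
            continue;
        }
        if ($indexing) {
            $kept[] = $line;
        }
    }
    return $kept;
}

$exclude = '<!-- phpdigExclude -->';
$include = '<!-- phpdigInclude -->';

// Markers on their own lines: the text in between is dropped.
$separate = ['visible before', $exclude, 'secret stuff', $include, 'visible after'];
// Markers on the same line as the text: no line equals a marker, so nothing is dropped.
$oneLine = ['visible before', $exclude . 'secret stuff' . $include, 'visible after'];

print_r(filterExcludedLines($separate, $exclude, $include));
print_r(filterExcludedLines($oneLine, $exclude, $include));
```

Under that assumption, the second call keeps the whole middle line, including the text that was meant to be excluded.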
they are not in one line on my site. that can't be the reason.
|
Hi. Can you check to make sure that your phpdigTestUrl function in robot_functions.php is the same as the one that comes with PhpDig version 1.6.2?
|
function phpdigTestUrl($url,$mode='simple',$cookies=array()) {
    $components = parse_url($url);
    $lm_date = '';
    $status = 'NOFILE';
    $auth_string = '';
    $redirs = 0;
    $stop = false;

    if (isset($components['host'])) {
        $host = $components["host"];
        if (isset($components['user']) && isset($components['pass']) && $components['user'] && $components['pass']) {
            $auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";
        }
    } else {
        $host = '';
    }
    if (isset($components['port'])) {
        $port = (int)$components["port"];
    } else {
        $port = 80;
    }
    if (isset($components['path'])) {
        $path = $components["path"];
    } else {
        $path = '';
    }
    if (isset($components['query'])) {
        $query = $components["query"];
    } else {
        $query = '';
    }

    $fp = @fsockopen($host,$port);

    if ($port != 80) {
        $sport = ":".$port;
    } else {
        $sport = "";
    }

    if (!$fp) {
        //host domain not found
        $status = "NOHOST";
    } else {
        if ($query) {
            $path .= "?".$query;
        }

        $cookiesSendString = phpDigMakeCookies($cookies,$path);

        //complete get
        $request = "HEAD $path HTTP/1.1\n"
            ."Host: $host$sport\n"
            .$cookiesSendString
            .$auth_string
            ."Accept: */*\n"
            ."Accept-Charset: ".PHPDIG_ENCODING."\n"
            ."Accept-Encoding: identity\n"
            ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";

        fputs($fp,$request);

        //test return code
        while (!$stop && !feof($fp)) {
            $answer = fgets($fp,8192);
            //print $answer."<br>\n";

            if (isset($req1) && $req1) {
                //close, and open a new connection
                //on the new location
                fclose($fp);
                $fp = fsockopen($host,$port);
                if (!$fp) {
                    //host domain not found
                    $status = "NOHOST";
                    break;
                } else {
                    fputs($fp,$req1);
                    unset($req1);
                    $answer = fgets($fp,8192);
                }
            }

            if (ereg("HTTP/[0-9.]+ (([0-9])[0-9]{2})", $answer,$regs)) {
                if ($regs[2] == 2 || $regs[2] == 3) {
                    $code = $regs[2];
                } elseif ($regs[1] >= 401 && $regs[1] <= 403) {
                    $status = "UNAUTH";
                    break;
                } else {
                    $status = "NOFILE";
                    break;
                }
            } else if (eregi("^ *location: *(.*)",$answer,$regs) && $code == 3) {
                if ($redirs > 4) {
                    $stop = true;
                    $status = "LOOP";
                }
                $newpath = trim($regs[1]);
                $newurl = parse_url($newpath);
                //search if relocation is absolute or relative
                if (!isset($newurl["host"]) && isset($newurl["path"]) && !ereg('^/',$newurl["path"])) {
                    $path = dirname($path).'/'.$newurl["path"];
                } else {
                    $path = $newurl["path"];
                }
                if (!isset($newurl['host']) || !$newurl['host'] || $host == $newurl['host']) {
                    $cookiesSendString = phpDigMakeCookies($cookies,$path);
                    $req1 = "HEAD $path HTTP/1.1\n"
                        ."Host: $host$sport\n"
                        .$cookiesSendString
                        .$auth_string
                        ."Accept: */*\n"
                        ."Accept-Charset: ".PHPDIG_ENCODING."\n"
                        ."Accept-Encoding: identity\n"
                        ."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";
                } else {
                    $stop = true;
                    $status = "NEWHOST";
                    $host = $newurl['host'];
                }
            }
            //parse cookies
            elseif (eregi("Set-Cookie: *(([^=]+)=[^; ]+) *(; *path=([^; ]+))* *(; *domain=([^; ]+))*",$answer,$regs)) {
                $cookies[$regs[2]] = array('string'=>$regs[1],'path'=>$regs[4],'domain'=>$regs[6]);
            }
            //Parse content-type header
            elseif (eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
                if ($regs[1] == "text") {
                    switch ($regs[2]) {
                        case 'plain':
                            $status = 'PLAINTEXT';
                            break;
                        case 'html':
                            $status = 'HTML';
                            break;
                        default :
                            $status = "NOFILE";
                            $stop = true;
                    }
                } else if ($regs[1] == "application") {
                    if ($regs[2] == 'msword' && PHPDIG_INDEX_MSWORD == true) {
                        $status = "MSWORD";
                    } else if ($regs[2] == 'pdf' && PHPDIG_INDEX_PDF == true) {
                        $status = "PDF";
                    } else if ($regs[2] == 'vnd.ms-excel' && PHPDIG_INDEX_MSEXCEL == true) {
                        $status = "MSEXCEL";
                    } else {
                        $status = "NOFILE";
                        $stop = true;
                    }
                } else {
                    $status = "NOFILE";
                    $stop = true;
                }
            } elseif (eregi('Last-Modified: *([a-z0-9,: ]+)',$answer,$regs)) {
                //search last-modified header
                $lm_date = $regs[1];
            }
            if (!eregi('[a-z0-9]+',$answer)) {
                $stop = true;
            }
        }
        @fclose($fp);
    }
    //returns variable or array
    if ($mode == 'date') {
        return compact('status', 'lm_date', 'path', 'host', 'cookies');
    } else {
        return $status;
    }
}
that's it. and i haven't changed anything on it, so i guess it should be the correct one. |
Yep, that looks correct. Let's try echoing out some stuff.
In robot_functions.php, right before: PHP Code:
PHP Code:
|
okay. that's what i got:
Content type is: HTML. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. PhpDig include comment found. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. Trim line is false. does that help you anything? |
The results of the echo test tell me that the PhpDig include comment is being found, but that the PhpDig exclude comment before that is not being found.
Assuming that the PhpDig exclude comment is on one line by itself, maybe there is a typo in the config file. Can you check what PHPDIG_EXCLUDE_COMMENT is set to in the config file? Does this match what is being used in the files that you want to crawl? |
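One way to run that check outside the spider is a small script that counts how often a marker string shows up in the page. This is only a sketch: markerReport is a hypothetical helper, not part of PhpDig, and the marker is passed in as a plain string, so you would hand it whatever your config file defines (for example the value of PHPDIG_EXCLUDE_COMMENT).

```php
<?php
// Hypothetical diagnostic helper, not part of PhpDig.
// Reports how often $marker appears in $html:
//   anywhere         - exact, case-sensitive matches anywhere in the page
//   own_line         - lines that contain nothing but the marker (after trim)
//   case_insensitive - matches when upper/lower case is ignored
function markerReport(string $html, string $marker): array
{
    $ownLine = 0;
    foreach (preg_split('/\r\n|\r|\n/', $html) as $line) {
        if (trim($line) === $marker) {
            $ownLine++;
        }
    }
    return [
        'anywhere'         => substr_count($html, $marker),
        'own_line'         => $ownLine,
        'case_insensitive' => substr_count(strtolower($html), strtolower($marker)),
    ];
}

// Example page with a deliberately mistyped exclude comment (lowercase "e").
$page = "<p>a</p>\n<!-- phpdigexclude -->\nsecret\n<!-- phpdigInclude -->\n";

print_r(markerReport($page, '<!-- phpdigExclude -->'));
print_r(markerReport($page, '<!-- phpdigInclude -->'));
```

If 'case_insensitive' is higher than 'anywhere', the page probably has a capitalization typo. If 'anywhere' is higher than 'own_line', some marker shares its line with other content.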
Yeah, I checked that already. There's no mistake there.
And the other funny thing is: if it really were this way, that it only finds the exclude but not the include afterwards, then why has it indexed anything at all? Strange. |
This is strange. It finds the include, so things should get indexed, but it doesn't find the exclude. With exclude before include, and each on their own line, it seems that:
PHP Code:
PHP Code:
PHP Code:
|
Same thing again. It only finds the include. I got that wrong in my last post, so it's actually not that strange: it only finds the include, and that already explains why it doesn't exclude what I want excluded.
And yes, the include and exclude comments are not on the same line. Now I tried another thing. I made a file test.php with nothing but <!-- phpdigExclude --> word <!-- phpdigInclude --> in it. And now the result of the spider: Content type is HTML and PhpDig exclude comment found. Content type is: HTML. PhpDig include comment found. Now that seems to be the way it should be, right? The one I tried before was quite a big page, and something in it must have caused the error. Now that could become quite a needle in a haystack... Any ideas? Wanna see the HTML? |
Yep, that test.php file is the way it should be, so maybe it is something with the big page like you say. You are right that the include and exclude should not be on the same line.
Also, the include and exclude need to be on lines by themselves. Maybe try editing the big page in a text-only editor to make sure that the include and exclude comments are on lines by themselves and that no soft wrapping is going on there. If you want, just post a snippet of the HTML around the exclude comment, like +/- 10 or so lines. |
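If it helps with hunting through the big page, here is a rough sketch for pulling out the lines around the exclude comment with invisible characters made visible. The function name showContext and its parameters are made up for this example, and it has nothing to do with PhpDig's own code. It uses addcslashes so that tabs, carriage returns and non-ASCII bytes (such as a non-breaking space) show up as escape sequences.

```php
<?php
// Hypothetical diagnostic helper, not part of PhpDig.
// Returns the lines within $radius lines of every line containing $marker,
// with control characters and bytes above 126 shown as escapes (\t, \r, \302 ...).
// If markers sit close together, their windows may overlap and repeat lines.
function showContext(string $html, string $marker, int $radius = 10): array
{
    // Split on "\n" only, so a stray "\r" from Windows line endings stays visible.
    $lines = explode("\n", $html);
    $out = [];
    foreach ($lines as $i => $line) {
        if (strpos($line, $marker) === false) {
            continue;
        }
        $from = max(0, $i - $radius);
        $to = min(count($lines) - 1, $i + $radius);
        for ($j = $from; $j <= $to; $j++) {
            $visible = addcslashes($lines[$j], "\0..\37\177..\377");
            $out[] = sprintf('%4d: %s', $j + 1, $visible);
        }
    }
    return $out;
}

// Example: the marker line starts with a tab and ends with a non-breaking space.
$html = "a\n\t<!-- phpdigExclude -->\u{00A0}\nb";
foreach (showContext($html, '<!-- phpdigExclude -->', 1) as $row) {
    echo $row, "\n";
}
```

For a real page you would read the file with file_get_contents() and pass the result in as $html. Anything odd before or after the comment on its line should then stand out.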