exclude doesn't really work?


manute
10-18-2003, 06:01 AM
hi!

i just found some text on my site that shouldn't be indexed.
i wrapped it in <!-- phpdigExclude --> and <!-- phpdigInclude --> and then reindexed the page.
but the search still finds that page, although the text should have been excluded. does anyone know this problem?

Rolandks
10-18-2003, 07:40 PM
Try this:
In the file robot_functions.php, add the instruction "continue;" at line 777.

Read this Thread: (http://www.phpdig.net/showthread.php?s=&threadid=67)


...
...
else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
    $exclude = false;
    continue;
}
...
...


-Roland-

GeminiHB
10-19-2003, 06:30 AM
I have the same problem:
Excluded content has been indexed.

The "continue" bug is fixed. I deleted and reinstalled the whole PhpDig database. To test the exclude function, I excluded this example:

<!-- phpdigExclude -->langestestwort<!-- phpdigInclude -->

But the string "langestestwort" is in the keyword table after the next reindexing process.

Maybe someone can help me,

thanks,

Holger

manute
10-19-2003, 06:44 AM
same problem with me, the "continue"-thing doesn't seem to fix it.

Charter
10-19-2003, 07:28 AM
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        // and so forth

Hi. The above code in robot_functions.php looks at each line in the file for the PhpDig exclude and include comments. Perhaps try the following.

Instead of having the PhpDig exclude and include comments on one line like so:

<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->

try putting the PhpDig exclude and include comments on their own separate lines like so:

<!-- phpdigExclude -->
some stuff
<!-- phpdigInclude -->
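
A minimal sketch of why the placement matters. It is not the PhpDig code itself: the helper name keptLines and the sample lines are made up for illustration, the content-type check is left out, and the two constants are assumed to hold the comments used in this thread (check your own config file).

```php
<?php
// Assumed values, taken from the comments used in this thread.
define('PHPDIG_EXCLUDE_COMMENT', '<!-- phpdigExclude -->');
define('PHPDIG_INCLUDE_COMMENT', '<!-- phpdigInclude -->');

// Hypothetical helper: returns the trimmed lines that survive the
// exclude/include check, using the same whole-line comparison as above.
function keptLines(array $file_content): array
{
    $exclude = false;
    $kept = array();
    foreach ($file_content as $line) {
        if (trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }
        if (!$exclude && trim($line)) {
            $kept[] = trim($line);
        }
    }
    return $kept;
}

// Comments and text on one line: the whole line never equals either comment.
$oneLine = array(
    "before\n",
    "<!-- phpdigExclude -->some stuff<!-- phpdigInclude -->\n",
    "after\n",
);

// Comments on their own lines: the comparison matches.
$ownLines = array(
    "before\n",
    "<!-- phpdigExclude -->\n",
    "some stuff\n",
    "<!-- phpdigInclude -->\n",
    "after\n",
);

print_r(keptLines($oneLine));  // the "some stuff" line is still kept
print_r(keptLines($ownLines)); // only "before" and "after" are kept
```

Because the trimmed line is compared as a whole, anything else on the same line as a comment keeps the comparison from matching.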

manute
10-19-2003, 07:36 AM
they are not on one line on my site. that can't be the reason.

Charter
10-19-2003, 08:51 AM
Hi. Can you check to make sure that your phpdigTestUrl function in robot_functions.php is the same as the one that comes with PhpDig version 1.6.2?

manute
10-19-2003, 09:12 AM
function phpdigTestUrl($url,$mode='simple',$cookies=array()) {

$components = parse_url($url);
$lm_date = '';
$status = 'NOFILE';
$auth_string = '';
$redirs = 0;
$stop = false;

if (isset($components['host'])) {
$host = $components["host"];
if (isset($components['user']) && isset($components['pass']) &&
$components['user'] && $components['pass']) {
$auth_string = 'Authorization: Basic '.base64_encode($components['user'].':'.$components['pass'])."\n";
}
}
else {
$host = '';
}

if (isset($components['port'])) {
$port = (int)$components["port"];
}
else {
$port = 80;
}

if (isset($components['path'])) {
$path = $components["path"];
}
else {
$path = '';
}

if (isset($components['query'])) {
$query = $components["query"];
}
else {
$query = '';
}

$fp = @fsockopen($host,$port);

if ($port != 80) {
$sport = ":".$port;
}
else {
$sport = "";
}

if (!$fp) {
//host domain not found
$status = "NOHOST";
}
else {
if ($query) {
$path .= "?".$query;
}

$cookiesSendString = phpDigMakeCookies($cookies,$path);

//complete get
$request =
"HEAD $path HTTP/1.1\n"
."Host: $host$sport\n"
.$cookiesSendString
.$auth_string
."Accept: */*\n"
."Accept-Charset: ".PHPDIG_ENCODING."\n"
."Accept-Encoding: identity\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";

fputs($fp,$request);

//test return code
while (!$stop && !feof($fp)) {
$answer = fgets($fp,8192);

//print $answer."<br>\n";

if (isset($req1) && $req1) {
//close, and open a new connection
//on the new location
fclose($fp);
$fp = fsockopen($host,$port);
if (!$fp) {
//host domain not found
$status = "NOHOST";
break;
}
else {
fputs($fp,$req1);
unset($req1);
$answer = fgets($fp,8192);
}
}

if (ereg("HTTP/[0-9.]+ (([0-9])[0-9]{2})", $answer,$regs)) {
if ($regs[2] == 2 || $regs[2] == 3) {
$code = $regs[2];
}
elseif ($regs[1] >= 401 && $regs[1] <= 403) {
$status = "UNAUTH";
break;
}
else {
$status = "NOFILE";
break;
}
}
else if (eregi("^ *location: *(.*)",$answer,$regs) && $code == 3) {
if ($redirs > 4) {
$stop = true;
$status = "LOOP";
}
$newpath = trim($regs[1]);
$newurl = parse_url($newpath);
//search if relocation is absolute or relative
if (!isset($newurl["host"])
&& isset($newurl["path"])
&& !ereg('^/',$newurl["path"])) {
$path = dirname($path).'/'.$newurl["path"];
}
else {
$path = $newurl["path"];
}
if (!isset($newurl['host']) || !$newurl['host'] || $host == $newurl['host']) {

$cookiesSendString = phpDigMakeCookies($cookies,$path);
$req1 = "HEAD $path HTTP/1.1\n"
."Host: $host$sport\n"
.$cookiesSendString
.$auth_string
."Accept: */*\n"
."Accept-Charset: ".PHPDIG_ENCODING."\n"
."Accept-Encoding: identity\n"
."User-Agent: PhpDig/".PHPDIG_VERSION." (PHP; MySql)\n\n";
}
else {
$stop = true;
$status = "NEWHOST";
$host = $newurl['host'];
}
}
//parse cookies
elseif (eregi("Set-Cookie: *(([^=]+)=[^; ]+) *(; *path=([^; ]+))* *(; *domain=([^; ]+))*",$answer,$regs)) {
$cookies[$regs[2]] = array('string'=>$regs[1],'path'=>$regs[4],'domain'=>$regs[6]);
}
//Parse content-type header
elseif (eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
if ($regs[1] == "text") {
switch ($regs[2]) {
case 'plain':
$status = 'PLAINTEXT';
break;
case 'html':
$status = 'HTML';
break;
default :
$status = "NOFILE";
$stop = true;
}
}
else if ($regs[1] == "application") {
if ($regs[2] == 'msword' && PHPDIG_INDEX_MSWORD == true) {
$status = "MSWORD";
}
else if ($regs[2] == 'pdf' && PHPDIG_INDEX_PDF == true) {
$status = "PDF";
}
else if ($regs[2] == 'vnd.ms-excel' && PHPDIG_INDEX_MSEXCEL == true) {
$status = "MSEXCEL";
}
else {
$status = "NOFILE";
$stop = true;
}
}
else {
$status = "NOFILE";
$stop = true;
}
}
elseif (eregi('Last-Modified: *([a-z0-9,: ]+)',$answer,$regs)) {
//search last-modified header
$lm_date = $regs[1];
}

if (!eregi('[a-z0-9]+',$answer)) {
$stop = true;
}

}
@fclose($fp);
}

//returns variable or array
if ($mode == 'date') {
return compact('status', 'lm_date', 'path', 'host', 'cookies');
}
else {
return $status;
}
}


that's it. and i haven't changed anything on it, so i guess it should be the correct one.

Charter
10-19-2003, 11:58 AM
Yep, that looks correct. Let's try echoing out some stuff.

In robot_functions.php, right before:

foreach ($file_content as $num => $line) {

put the following:

// start echo stuff
$ijk_cnt = 0;
foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            echo "Content type is HTML and PhpDig exclude comment found.<br>";
        }
        elseif (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            echo "PhpDig include comment found.<br>";
        }
        else {
            if ($ijk_cnt == 0) {
                echo "Content type is: " . $content_type . ".<br>";
                $ijk_cnt++;
            }
        }
    }
    else {
        echo "Trim line is false.<br><br>";
    }
}
exit();
// end echo stuff

Now try to crawl a demo page with PhpDig exclude and include comments. What are your results for this?

manute
10-19-2003, 02:41 PM
okay. that's what i got:

Content type is: HTML.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

PhpDig include comment found.
Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.

Trim line is false.


does that help you at all?

Charter
10-19-2003, 02:57 PM
The results of the echo test tell me that the PhpDig include comment is being found, but that the PhpDig exclude comment before that is not being found.

Assuming that the PhpDig exclude comment is on one line by itself, maybe there is a typo in the config file. Can you check what PHPDIG_EXCLUDE_COMMENT is set to in the config file? Does this match what is being used in the files that you want to crawl?

manute
10-19-2003, 03:27 PM
yeah i checked that already. there's no mistake in this.
and the other funny thing is: if it really were that way, that it only finds the exclude but not the include afterwards, then why has it indexed anything at all? strange.

Charter
10-19-2003, 06:41 PM
This is strange. It finds the include, so things should get indexed, but it doesn't find the exclude. With exclude before include, and each on their own line, it seems that:

if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {

is coming out false, even though the content type is HTML and there is no typo. How about changing:

if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {

to the following:

if (trim($line) == PHPDIG_EXCLUDE_COMMENT) {

and try to crawl a demo page again.

manute
10-19-2003, 07:02 PM
same thing again. it only finds the include. i got that mixed up in my last post, so it's actually not that strange: it only finds the include, and that already explains why it doesn't exclude what i want excluded.
and yes, the include and exclude comments are not on the same line.

now i tried another thing. i made a file test.php with nothing but

<!-- phpdigExclude -->
word
<!-- phpdigInclude -->

in it. and now the result of the spider:

Content type is HTML and PhpDig exclude comment found.
Content type is: HTML.
PhpDig include comment found.

now that seems to be the way it should be - right?
the one i tried before was quite a big page. and something in that must have caused the error. now that could become quite a needle in a haystack...
any ideas? wanna see the html?

Charter
10-19-2003, 07:40 PM
Yep, that test.php file is the way it should be, so maybe it is something with the big page like you say. You are right that the include and exclude should not be on the same line.

Also, the include and exclude need to be on lines by themselves. Maybe try editing the big page in a text only editor to make sure that the include and exclude comments are on lines by themselves and no soft wrapping is going on there.

If you want, just post a snippet of the html around the exclude comment, like +/- 10 or so lines.
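
To narrow it down on a big page, something like the following rough sketch might help. It lists lines that contain one of the comments but are not made up of the comment alone. The function name findNearMisses and the sample lines are made up for illustration, and the comment strings are assumed to match your config.

```php
<?php
// Hypothetical helper: returns the lines (keyed by 1-based line number) that
// contain a marker comment but do not consist of that comment alone.
function findNearMisses(array $lines, array $markers): array
{
    $hits = array();
    foreach ($lines as $num => $line) {
        foreach ($markers as $marker) {
            // The marker is present, but the trimmed line is not exactly the marker.
            if (strpos($line, $marker) !== false && trim($line) != $marker) {
                $hits[$num + 1] = rtrim($line, "\r\n");
            }
        }
    }
    return $hits;
}

// Assumed comment strings, as used in this thread.
$markers = array('<!-- phpdigExclude -->', '<!-- phpdigInclude -->');

// Made-up sample standing in for the page as the spider receives it.
$lines = array(
    "<html>\n",
    "</div><!-- phpdigExclude -->\n",
    "word\n",
    "<!-- phpdigInclude -->\n",
);

foreach (findNearMisses($lines, $markers) as $num => $line) {
    echo "Line $num: " . htmlspecialchars($line) . "\n";
}
```

It makes sense to run a check like this on the page as it is served, not on the source files, since the spider only sees the final output. If allow_url_fopen is enabled, file() with the page URL returns the served page as an array of lines.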

manute
10-20-2003, 03:20 AM
hmmm, i only use text editors. but now i finally got it!
it turned out that the exclude really was not alone on its line.
that's a little complicated to explain.
my index.php always puts a header file, a main-part file and a footer file together using include.
the exclude looked fine on the first line of one of the main-part files, but after index.php put it together with the header and footer, the last line of the header file ended up on the same line as the exclude comment. that's why it didn't work.
damn, that little mistake cost quite some time.
but thanks a lot for your help charter and roland, i would never have found that alone...
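
For anyone who runs into the same thing, here is a rough sketch of what happens. The header and main-part contents are made-up stand-ins, and plain strings are used in place of real include files. One way this can occur: PHP drops a single line break that directly follows a closing ?> tag, so a header file ending in ?> may output no final line break at all.

```php
<?php
// Made-up stand-ins for the output of the header file and the main-part file.
$header = '<div id="header">...</div>';   // output ends without a line break
$main   = "<!-- phpdigExclude -->\nword\n<!-- phpdigInclude -->\n";

// What index.php effectively sends when it includes one after the other.
$page  = $header . $main;
$lines = explode("\n", $page);
echo $lines[0], "\n";       // header markup and exclude comment share a line

// One possible fix: make sure the header output ends with a line break,
// so that the exclude comment starts a line of its own.
$fixed      = rtrim($header, "\r\n") . "\n" . $main;
$fixedLines = explode("\n", $fixed);
echo $fixedLines[1], "\n";  // the exclude comment alone on its line
```

In the real files the same effect can be had by leaving an empty line at the end of the header file or above the exclude comment in the main-part file.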