PhpDig.net - Update Index taking 11 hours....

In order to eliminate a problem we were having with the way the spider indexed our files, we told PHPDIG to deal with session IDs named "&MoodleSession". This seemed to fix the original problem of having a lot of pages indexed when there were in actuality very few. Another problem has cropped up, however, after changing this setting.

When we run an update-index, it takes about 11 hours to complete and claims to have gone through about 14,000 pages rather than the usual 2,000. (The website has only about 200 unique pages.) Is there a better way to ignore session IDs that are passed through the URI?

Is the Session ID at fault here, or could it be something else?

Hi. Is there a session ID showing up when you crawl? Can you provide a snippet of output from the crawl?

Please look at the attached file. I just ran a GROUP BY query on the spider table and found that, once the MoodleSession addition is removed from the filename, each result appears three times. (The attached file is HTML with a TXT extension.)

Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.

In this function, it is the following query that checks for a duplicate:

PHP Code:

$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";

As you are crawling the same site/folder, it is the $md5 variable that is checking for duplicate results. The $md5 variable is as follows:

PHP Code:

$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;

Briefly, the variables in the $md5 variable are as follows:

$titre_resume // page title
$page_desc['content'] // meta tag description
$text[$max_chunk] // last chunk of page text
$tempfilesize // temp file size

As the pages are creating different $md5 variables, they are not seen as duplicates.
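To illustrate why one changing element on a page defeats the duplicate check, here is a minimal sketch of the fingerprint built from the four values above. The function name page_fingerprint and the sample values are made up for illustration; only the md5(...) expression itself comes from the thread.

```php
<?php
// Hypothetical helper mirroring the $md5 expression quoted above:
// title + meta description + last chunk of text, then the temp file size.
function page_fingerprint($title, $description, $lastChunk, $fileSize)
{
    return md5($title . $description . $lastChunk) . '_' . $fileSize;
}

// Sample values are invented; the third page differs only in its banner text.
$a = page_fingerprint('Help', 'Program help', 'footer text, banner 1', 2048);
$b = page_fingerprint('Help', 'Program help', 'footer text, banner 1', 2048);
$c = page_fingerprint('Help', 'Program help', 'footer text, banner 2', 2048);

var_dump($a === $b); // same content: fingerprints match
var_dump($a === $c); // banner changed: fingerprints differ
```

Because the last chunk of text feeds into the hash, anything that changes there between requests makes the same page look new to the duplicate query.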

Try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values from the config file, each on their own line (with PHP use \n if necessary), and surround the portion of dynamic content in each page like so:

Code:

<!-- phpdigExclude -->
dynamic content, for example, code for a rotating banner, would go in here
<!-- phpdigInclude -->
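As a rough illustration of the effect these markers are meant to have, the sketch below drops everything between the two comments before comparing two versions of a page. This is not PHPDig's own parsing code; the function drop_excluded and the sample pages are assumptions made up for the example, and a single regular expression is used here purely for brevity.

```php
<?php
// Illustration only, not PHPDig's parser: remove the marked-off region,
// markers included, so the remaining text no longer depends on the banner.
function drop_excluded($html)
{
    $pattern = '/<!-- phpdigExclude -->.*?<!-- phpdigInclude -->/s';
    return preg_replace($pattern, '', $html);
}

// Two fetches of the "same" page with a different rotating banner.
$first  = "Intro\n<!-- phpdigExclude -->\nbanner 1\n<!-- phpdigInclude -->\nBody";
$second = "Intro\n<!-- phpdigExclude -->\nbanner 2\n<!-- phpdigInclude -->\nBody";

var_dump(drop_excluded($first) === drop_excluded($second)); // true once the banner is excluded
```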

The problem is not that the pages are seen as duplicates; the problem is that they are NOT seen as duplicates. For some reason, PHPDig is assuming that the "MoodleSession" is part of the file name and is therefore including the same file multiple times.

[site]/help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972
is seen as DIFFERENT from

[site]/help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0

despite the fact that they are the same page with a different MoodleSession string appended to the end.

I am assuming that the reason it sees them as different files is because of the MoodleSession thing. Is there a way to make PHPDig ignore this part of the "file"?

>> As the pages are creating different $md5 variables, they are not seen as duplicates.

Hi. If you are using the following:

PHP Code:

define('PHPDIG_SESSID_REMOVE',true);     // remove SIDS from indexed URLS
define('PHPDIG_SESSID_VAR','MoodleSession'); // name of the SID variable

and the same page shows up multiple times in a crawl, then refresh the page in your browser. Does something on the page change, like a banner?

If so, then try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values to surround the part of the page that is changing upon refresh.

That works just fine for the initial indexing, though it leaves a trailing "&" character at the end of each file name stored. When performing an update, however, after the page prints the "level 1" label, it begins to show the MoodleSessions again and repeats the same files over and over. (It's still running after 10 minutes; the initial indexing took about 4 minutes.)

It is NOT adding entries to the spider table.

Attached are the results as of this writing (it's still going).

It's STILL going...

For some reason, when doing an update, PHPDig still uses the MoodleSession in the path while DISPLAYING the results to the screen, whereas it is not actually writing these to the database. In some of the files, PHPDig is using the file name as the "title", therefore making the MD5 different (I'm assuming). Is there any way to make PHPDig ignore the MoodleSession TOTALLY (including making it the title, etc)?

Edit: I looked at the tempspider DB, and it DOES show the MoodleSession stuff.

Hi. The tempspider table holds links for processing, and the display shows links that are being crawled. This information doesn't necessarily reflect the exact end results. Using fresh database tables, are there session IDs in the sites table or in the search results after a crawl? You can try taking $titre_resume out of the $md5 variable, but it sounds like you are already using the options currently available to handle session IDs.

So we (pardon all the we's and I's; there are two of us working on this) figured out what the problem was.

For some reason, PHPDig was not REALLY getting the MoodleSession out of the URI.

This caused the script to dig for a REALLY long time, not only with updating, but also with the initial index (1.5 hours for ~229 REAL file permutations; 11+ hours for the update).

I made a function called "adamize" that chops out the MoodleSession part of the URI altogether, so that PHPDig does not even know it exists. Ever.

I placed the function below at the bottom of the "robot_functions.php" page, and then made a reference to it at line 172 (in my file) in function phpdigRewriteUrl, under the "settype($eval,'string')" line:

PHP Code:

$eval = adamize($eval);

and for the bottom:

PHP Code:

function adamize($file)
{
    // Find where "MoodleSession" occurs within the given filename
    // (actually a returned URI), if it occurs at all.
    $starts = strpos($file, "MoodleSession");

    // If the filename doesn't have MoodleSession inside it,
    // simply return the file variable as it was.
    if ($starts === false) {
        return $file;
    }

    // Because MoodleSession (for me) ALWAYS occurs at the end of a URI,
    // keep only the beginning of the string up to the occurrence of
    // MoodleSession.
    $file = substr($file, 0, $starts);

    // Get the last character of the new file string. If it is
    // a ? or a &, cut it out of the string as well.
    $ends = substr($file, -1);
    if ($ends == "?" || $ends == "&") {
        $file = substr($file, 0, -1);
    }

    // Return the new "file" variable, with MoodleSession cut out.
    return $file;
}

If there are any questions, post here.

Hi. Thanks, good idea. You might also try the following.

In phpdigRewriteUrl, comment out your adamize function call, and replace the following:

Code:

// parse and remove quotes
$url = preg_replace('/[\0]/is','',$url); // remove null byte
$url = preg_replace('/[\']/is','',$url); // remove single quote
$url = preg_replace('/["]/is','',$url); // remove double quote
$url = preg_replace('/[\\\\]/is','',$url); // remove backslash
$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) {
     $url['path'] = '';
}

with the following:

Code:

// parse and remove quotes
$eval = preg_replace('/[\0]/is','',$eval); // remove null byte
$eval = preg_replace('/[\']/is','',$eval); // remove single quote
$eval = preg_replace('/["]/is','',$eval); // remove double quote
$eval = preg_replace('/[\\\\]/is','',$eval); // remove backslash

if (PHPDIG_SESSID_REMOVE) {
    $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
    $eval = str_replace("&&","&",$eval);
    $eval = eregi_replace("[?][&]","?",$eval);
    $eval = eregi_replace("&$","",$eval);
}

$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) {
     $url['path'] = '';
}

This should do two things: (a) fix the typo of using $url instead of $eval in the function and (b) make it so any PHPDIG_SESSID_VAR is stripped from the URL regardless of placement if PHPDIG_SESSID_REMOVE is set to true.
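For anyone on a newer PHP: the ereg functions were deprecated in PHP 5.3 and removed in PHP 7, so the session-stripping step would need preg_replace there. Below is a minimal sketch of the same idea that also shows the "regardless of placement" behavior. The function name strip_session_id is hypothetical and not part of PHPDig, the sample URLs are made up, and unlike the code above it also trims a bare trailing "?" (as adamize does).

```php
<?php
// Hypothetical helper, not part of PHPDig: removes one session variable
// from a URL wherever it appears in the query string.
function strip_session_id($url, $sessionVar)
{
    // Drop "name=value"; the value pattern is the same [a-z0-9]* as above.
    $url = preg_replace('/' . preg_quote($sessionVar, '/') . '=[a-z0-9]*/', '', $url);
    $url = str_replace('&&', '&', $url); // variable was in the middle
    $url = str_replace('?&', '?', $url); // variable was the first parameter
    return preg_replace('/[?&]$/', '', $url); // variable was the last (or only) parameter
}

// The session variable at the start, in the middle, and at the end.
$samples = array(
    'help.php?MoodleSession=abc123&file=program.html',
    'help.php?file=a&MoodleSession=abc123&module=b',
    'help.php?file=program.html&module=resource&MoodleSession=abc123',
);
foreach ($samples as $sample) {
    echo strip_session_id($sample, 'MoodleSession'), "\n";
}
```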

Is it possible to set more than one session ID to ignore? Like:

define('PHPDIG_SESSID_VAR','MoodleSession,sessid,id');

Hi. Assuming version 1.8.0, use the robot_functions.php file attached in this thread.

This is untested, but in the robot_functions.php file there are two places to edit.

First, replace the following:

PHP Code:

if (PHPDIG_SESSID_REMOVE) {
    $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
    $eval = str_replace("&&","&",$eval);
    $eval = eregi_replace("[?][&]","?",$eval);
    $eval = eregi_replace("&$","",$eval);
}

with the following:

PHP Code:

if (PHPDIG_SESSID_REMOVE) {
    $my_test_comma = stristr(PHPDIG_SESSID_VAR,",");
    if ($my_test_comma !== FALSE) {
        $my_test_comma_array = explode(",",PHPDIG_SESSID_VAR);
        $my_test_comma_count = count($my_test_comma_array);
        for ($i=0; $i<$my_test_comma_count; $i++) {
            $eval = ereg_replace($my_test_comma_array[$i].'=[a-z0-9]*','',$eval);
            $eval = str_replace("&&","&",$eval);
            $eval = str_replace("?&","?",$eval);
            $eval = eregi_replace("&$","",$eval);
        }
    }
    else {
        $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
        $eval = str_replace("&&","&",$eval);
        $eval = str_replace("?&","?",$eval);
        $eval = eregi_replace("&$","",$eval);
    }
}

Second, replace the following:

PHP Code:

if (PHPDIG_SESSID_REMOVE) {
    $file = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$file);
    $file = str_replace("&&","&",$file);
    $file = eregi_replace("[?][&]","?",$file);
    $file = eregi_replace("&$","",$file);
}

with the following:

PHP Code:

if (PHPDIG_SESSID_REMOVE) {
    $my_test_comma = stristr(PHPDIG_SESSID_VAR,",");
    if ($my_test_comma !== FALSE) {
        $my_test_comma_array = explode(",",PHPDIG_SESSID_VAR);
        $my_test_comma_count = count($my_test_comma_array);
        for ($i=0; $i<$my_test_comma_count; $i++) {
            $file = ereg_replace($my_test_comma_array[$i].'=[a-z0-9]*','',$file);
            $file = str_replace("&&","&",$file);
            $file = str_replace("?&","?",$file);
            $file = eregi_replace("&$","",$file);
        }
    }
    else {
        $file = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$file);
        $file = str_replace("&&","&",$file);
        $file = str_replace("?&","?",$file);
        $file = eregi_replace("&$","",$file);
    }
}

Then use define('PHPDIG_SESSID_VAR','ID1,ID2,ID3'); in the config.php file, separating each session ID by a comma if there is more than one.

Of course, you could define a function to avoid repeating the code. In any case, remember to remove any "word" wrapping in the above code.
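A minimal sketch of what such a function might look like follows. Like the edits above it is untested against PHPDig itself. The name strip_session_ids is hypothetical, the sample URL is made up, and preg_replace stands in for ereg_replace, which no longer exists in PHP 7 and later.

```php
<?php
// Hypothetical helper, not part of PHPDig: strips every session variable
// named in a comma-separated list such as 'MoodleSession,sessid'.
function strip_session_ids($url, $sessionVars)
{
    foreach (explode(',', $sessionVars) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue; // ignore empty entries, e.g. from a trailing comma
        }
        $url = preg_replace('/' . preg_quote($name, '/') . '=[a-z0-9]*/', '', $url);
        $url = str_replace('&&', '&', $url);   // variable was in the middle
        $url = str_replace('?&', '?', $url);   // variable was the first parameter
        $url = preg_replace('/&$/', '', $url); // variable was the last parameter
    }
    return $url;
}

// Each of the two edit sites could then shrink to one call, for example:
// if (PHPDIG_SESSID_REMOVE) { $eval = strip_session_ids($eval, PHPDIG_SESSID_VAR); }
echo strip_session_ids('page.php?sessid=abc123&id=7&MoodleSession=ff00', 'MoodleSession,sessid'), "\n";
```

One caveat, which applies to the original pattern as well: a short name such as id also matches the tail of a longer parameter like sessid=..., so a list containing both can mangle URLs unless the pattern is anchored to a preceding ? or &.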

I replaced my old robot_functions.php with the one you mentioned...
but I can't find any of the lines I have to replace :/

Hi. If you downloaded the robot_functions.php file in this thread, the code is in there. Just search on PHPDIG_SESSID_REMOVE to find the two places to edit.