PhpDig.net

Old 01-16-2004, 04:31 PM   #1
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
Update Index taking 11 hours....

In order to eliminate a problem we were having with the way the spider indexed our files, we told PHPDIG to deal with session IDs named "&MoodleSession". This seemed to fix the original problem of having a lot of pages indexed when there were in actuality very few. Another problem has cropped up, however, after changing this setting.

When we run an update-index, it takes about 11 hours to complete, claiming to have gone through about 14,000 pages (rather than the usual 2,000; the website has ~200 unique pages). Is there a better way to ignore session IDs that are passed through the URI?

Is the Session ID at fault here, or could it be something else?
Old 01-18-2004, 07:36 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Is there a session ID showing up when you crawl? Can you provide a snippet of output from the crawl?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-20-2004, 09:17 AM   #3
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
Please look at the attached file. I just did a GROUP BY query on the spider table and found that, after removing the MoodleSession addition from the filename, each result appears three times. (The attached file is HTML with a TXT extension.)
Attached Files
File Type: txt spider_php.txt (172.9 KB, 34 views)
Old 01-20-2004, 09:43 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.

In this function, it is the following query that checks for a duplicate:
PHP Code:
$query_double = "SELECT spider_id FROM ".PHPDIG_DB_PREFIX."spider WHERE site_id='$site_id' AND md5 = '$md5'";
As you are crawling the same site/folder, it is the $md5 variable that is checking for duplicate results. The $md5 variable is as follows:
PHP Code:
$md5 = md5($titre_resume.$page_desc['content'].$text[$max_chunk]).'_'.$tempfilesize;
Briefly, the variables in the $md5 variable are as follows:
  1. $titre_resume // page title
  2. $page_desc['content'] // meta tag description
  3. $text[$max_chunk] // last chunk of page text
  4. $tempfilesize // temp file size
As the pages are creating different $md5 variables, they are not seen as duplicates.
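
As a rough illustration of why a changing banner defeats this check, here is a minimal sketch that rebuilds a signature the same way as the `$md5` expression quoted above. The function name `page_signature` and the sample values are hypothetical and are not part of PhpDig.

```php
<?php
// Hypothetical helper mirroring the quoted $md5 expression:
// md5(title . meta description . last text chunk) . '_' . temp file size
function page_signature($title, $description, $lastChunk, $tempFileSize)
{
    return md5($title . $description . $lastChunk) . '_' . $tempFileSize;
}

// Same page fetched twice, but a rotating banner changed the last text chunk.
$a = page_signature('Help', 'Moodle help page', 'banner one footer', 5120);
$b = page_signature('Help', 'Moodle help page', 'banner two footer', 5120);
// Same page fetched again with identical content.
$c = page_signature('Help', 'Moodle help page', 'banner one footer', 5120);

var_dump($a === $b); // bool(false): treated as two different pages
var_dump($a === $c); // bool(true): treated as a duplicate
```

Any difference in one of the four inputs produces a different signature, so the duplicate query finds no match.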

Try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values from the config file, each on their own line (with PHP use \n if necessary), and surround the portion of dynamic content in each page like so:
Code:
<!-- phpdigExclude -->
dynamic content, for
example, code for a
rotating banner
would go in here
<!-- phpdigInclude -->
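
To show the idea behind the markers, here is a small sketch that drops everything between an exclude line and an include line before two fetches of a page are compared. This is only an illustration and not PhpDig's own parsing code; the function name `drop_excluded_lines` and the sample pages are assumptions.

```php
<?php
// Hypothetical illustration: skip every line between the exclude marker
// and the include marker. Each marker must sit on its own line.
function drop_excluded_lines($html,
                             $exclude = '<!-- phpdigExclude -->',
                             $include = '<!-- phpdigInclude -->')
{
    $kept = array();
    $skipping = false;
    foreach (explode("\n", $html) as $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) { $skipping = true;  continue; }
        if ($trimmed === $include) { $skipping = false; continue; }
        if (!$skipping) {
            $kept[] = $line;
        }
    }
    return implode("\n", $kept);
}

// Two fetches of the same page that differ only in the banner.
$first  = "Title\n<!-- phpdigExclude -->\nbanner A\n<!-- phpdigInclude -->\nBody";
$second = "Title\n<!-- phpdigExclude -->\nbanner B\n<!-- phpdigInclude -->\nBody";

var_dump(drop_excluded_lines($first) === drop_excluded_lines($second)); // bool(true)
```

With the dynamic part removed, both fetches reduce to the same text, so a content-based signature would match.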
Old 01-20-2004, 09:59 AM   #5
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
The problem is not that the pages are seen as duplicated, the problem is that they are NOT seen as duplicates. For some reason, PHPDig is assuming that the "MoodleSession" is part of the file name, therefore it is including the same file multiple times.

[site]/help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972
is seen as DIFFERENT from

[site]/help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0

despite the fact that they are the same page with a different MoodleSession string appended to the end.

I am assuming that the reason it sees them as different files is because of the MoodleSession thing. Is there a way to make PHPDig ignore this part of the "file"?
Old 01-20-2004, 10:11 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
>> As the pages are creating different $md5 variables, they are not seen as duplicates.

Hi. If you are using the following:
PHP Code:
define('PHPDIG_SESSID_REMOVE',true);     // remove SIDS from indexed URLS
define('PHPDIG_SESSID_VAR','MoodleSession'); // name of the SID variable 
and the same page shows up multiple times in a crawl, then refresh the page in your browser. Does something on the page change, like a banner?

If so, then try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values to surround the part of the page that is changing upon refresh.
Old 01-20-2004, 10:44 AM   #7
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
That works just fine when doing the initial indexing, though it's leaving a trailing "&" character at the end of each file stored. When performing an update, however, after the output prints the "level 1" label, it begins to show the MoodleSessions again, and it repeats the same files over. (It's still running after 10 minutes; the initial indexing took about 4 minutes.)

It is NOT adding entries to the spider table.

Attached are the results as of this writing (it's still going).
Attached Files
File Type: txt spider_php.txt (27.7 KB, 22 views)
Old 01-20-2004, 12:24 PM   #8
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
It's STILL going...

For some reason, when doing an update, PHPDig still uses the MoodleSession in the path while DISPLAYING the results to the screen, whereas it is not actually writing these to the database. In some of the files, PHPDig is using the file name as the "title", therefore making the MD5 different (I'm assuming). Is there any way to make PHPDig ignore the MoodleSession TOTALLY (including making it the title, etc)?

Edit: I looked at the tempspider DB, and it DOES show the MoodleSession stuff.

Last edited by tester; 01-20-2004 at 12:29 PM.
Old 01-20-2004, 02:59 PM   #9
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The tempspider table holds links for processing, and the display shows links that are being crawled. This information doesn't necessarily reflect the exact end results. Using fresh database tables, are there session IDs in the sites table or in the search results after a crawl? You can try taking $titre_resume out of the $md5 variable, but it sounds like you are using the options currently available to handle session IDs.
Old 01-21-2004, 10:01 AM   #10
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
SOLUTION FOUND

So we (pardon all the we's and I's; there are two of us working on this) figured out what the problem was.

For some reason, PHPDig was not REALLY getting the MoodleSession out of the URI.

This caused the script to dig for a REALLY long time, not only with updating, but also with the initial index (1.5 hours for ~229 REAL file permutations; 11+ hours for the update).

I made a function called "adamize" that chops out the MoodleSession part of the URI altogether, so that PHPDig does not even know it exists. Ever.

I placed the function below at the bottom of the "robot_functions.php" page, and then made a reference to it at line 172 (in my file) in function phpdigRewriteUrl, under the "settype($eval,'string')" line:

PHP Code:
$eval = adamize($eval);
and for the bottom:

PHP Code:
function adamize($file)
{
    // Determines whether or not "MoodleSession" occurs within
    // a given filename (or actually a returned URI)
    if (strstr($file, "MoodleSession"))
    {
        // The position at which MoodleSession is found.
        $starts = strpos($file, "MoodleSession");

        // Because MoodleSession (for me) ALWAYS will occur at the
        // end of a URI, just get the beginning of the string until
        // the occurrence of MoodleSession.
        $file = substr($file, 0, $starts);

        // Get the last character of the new file string.
        $ends = substr($file, strlen($file) - 1);

        // Figure out if the end of the string is a ? or a &.
        // If it is either one, just cut it out of the string.
        if ($ends == "?" || $ends == "&")
            $file = substr($file, 0, strlen($file) - 1);

        // Return the new "file" variable, with MoodleSession cut out.
        return $file;
    }

    // If the filename doesn't have MoodleSession inside it,
    // simply return the file variable as it was.
    else
        return $file;
}

If there are any questions, post here.
Old 01-21-2004, 10:57 AM   #11
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Thanks, good idea. You might also try the following.

In phpdigRewriteUrl, comment out your adamize function call, and replace the following:
Code:
// parse and remove quotes
$url = preg_replace('/[\0]/is','',$url); // remove null byte
$url = preg_replace('/[\']/is','',$url); // remove single quote
$url = preg_replace('/["]/is','',$url); // remove double quote
$url = preg_replace('/[\\\\]/is','',$url); // remove backslash
$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) {
     $url['path'] = '';
}
with the following:
Code:
// parse and remove quotes
$eval = preg_replace('/[\0]/is','',$eval); // remove null byte
$eval = preg_replace('/[\']/is','',$eval); // remove single quote
$eval = preg_replace('/["]/is','',$eval); // remove double quote
$eval = preg_replace('/[\\\\]/is','',$eval); // remove backslash

if (PHPDIG_SESSID_REMOVE) {
    $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
    $eval = str_replace("&&","&",$eval);
    $eval = eregi_replace("[?][&]","?",$eval);
    $eval = eregi_replace("&$","",$eval);
}

$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) {
     $url['path'] = '';
}
This should do two things: (a) fix the typo of using $url instead of $eval in the function and (b) make it so any PHPDIG_SESSID_VAR is stripped from the URL regardless of placement if PHPDIG_SESSID_REMOVE is set to true.
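
The `ereg_replace` and `eregi_replace` functions were removed in PHP 7.0, so on a newer PHP the same stripping steps would need `preg_replace`. The following is a minimal sketch of that translation, assuming the same four cleanup steps as the `PHPDIG_SESSID_REMOVE` block above. The function name `strip_session_var` is hypothetical and not part of PhpDig.

```php
<?php
// Hypothetical preg-based equivalent of the four cleanup steps.
function strip_session_var($url, $var)
{
    // Remove "name=value"; the value pattern is the one used in the thread.
    $url = preg_replace('/' . preg_quote($var, '/') . '=[a-z0-9]*/', '', $url);
    $url = str_replace('&&', '&', $url);   // session var was in the middle
    $url = str_replace('?&', '?', $url);   // session var was the first parameter
    $url = preg_replace('/&$/', '', $url); // session var was the last parameter
    return $url;
}

$end   = 'help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972';
$start = 'help.php?MoodleSession=6915eacadf8e168266a793884ec15cb0&file=program.html&module=resource';

var_dump(strip_session_var($end, 'MoodleSession') === strip_session_var($start, 'MoodleSession')); // bool(true)
```

Both URLs reduce to `help.php?file=program.html&module=resource`. A URL whose only parameter is the session variable would be left with a trailing `?`, which this sketch, like the code it mirrors, does not remove.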
Old 01-21-2004, 08:37 PM   #12
shinji
Green Mole
 
Join Date: Jan 2004
Location: Hamm/NRW/Germany
Posts: 26
Is it possible to set more than one session ID variable to ignore?
Like:

define('PHPDIG_SESSID_VAR','MoodleSession,sessid,id');
Old 01-22-2004, 12:49 PM   #13
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Assuming version 1.8.0, use the robot_functions.php file attached in this thread.

This is untested, but in the robot_functions.php file there are two places to edit.

First, replace the following:
PHP Code:
if (PHPDIG_SESSID_REMOVE) {
    $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
    $eval = str_replace("&&","&",$eval);
    $eval = eregi_replace("[?][&]","?",$eval);
    $eval = eregi_replace("&$","",$eval);

with the following:
PHP Code:
if (PHPDIG_SESSID_REMOVE) {
    $my_test_comma = stristr(PHPDIG_SESSID_VAR,",");
    if ($my_test_comma !== FALSE) {
        $my_test_comma_array = explode(",",PHPDIG_SESSID_VAR);
        $my_test_comma_count = count($my_test_comma_array);
        for ($i=0; $i<$my_test_comma_count; $i++) {
            $eval = ereg_replace($my_test_comma_array[$i].'=[a-z0-9]*','',$eval);
            $eval = str_replace("&&","&",$eval);
            $eval = str_replace("?&","?",$eval);
            $eval = eregi_replace("&$","",$eval);
        }
    }
    else {
        $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval);
        $eval = str_replace("&&","&",$eval);
        $eval = str_replace("?&","?",$eval);
        $eval = eregi_replace("&$","",$eval);
    }

Second, replace the following:
PHP Code:
if (PHPDIG_SESSID_REMOVE) {
    $file = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$file);
    $file = str_replace("&&","&",$file);
    $file = eregi_replace("[?][&]","?",$file);
    $file = eregi_replace("&$","",$file);

with the following:
PHP Code:
if (PHPDIG_SESSID_REMOVE) {
    $my_test_comma = stristr(PHPDIG_SESSID_VAR,",");
    if ($my_test_comma !== FALSE) {
        $my_test_comma_array = explode(",",PHPDIG_SESSID_VAR);
        $my_test_comma_count = count($my_test_comma_array);
        for ($i=0; $i<$my_test_comma_count; $i++) {
            $file = ereg_replace($my_test_comma_array[$i].'=[a-z0-9]*','',$file);
            $file = str_replace("&&","&",$file);
            $file = str_replace("?&","?",$file);
            $file = eregi_replace("&$","",$file);
        }
    }
    else {
        $file = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$file);
        $file = str_replace("&&","&",$file);
        $file = str_replace("?&","?",$file);
        $file = eregi_replace("&$","",$file);
    }

Then use define('PHPDIG_SESSID_VAR','ID1,ID2,ID3'); in the config.php file, separating each session ID by a comma if there is more than one.

Of course, you could define such a function to avoid the repeat in code. In any case, remember to remove any "word" wrapping in the above code.
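
One way such a shared function might look is sketched below. It is not the code from the attached file: the name `strip_session_vars` is hypothetical, it uses `preg_replace` instead of the `ereg` functions, and it adds a lookbehind so that a short name such as `id` only matches a whole parameter name rather than the tail of `sessid`. The cleanup is also done once after the loop instead of on every pass.

```php
<?php
// Hypothetical helper: strip every variable named in a comma-separated
// list (the format suggested for PHPDIG_SESSID_VAR) from a URL.
function strip_session_vars($url, $varList)
{
    foreach (explode(',', $varList) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue; // ignore empty entries such as a trailing comma
        }
        // Match the name only when it directly follows "?" or "&".
        $pattern = '/(?<=[?&])' . preg_quote($name, '/') . '=[a-z0-9]*/';
        $url = preg_replace($pattern, '', $url);
    }
    $url = preg_replace('/&{2,}/', '&', $url); // collapse runs of "&"
    $url = str_replace('?&', '?', $url);       // first parameter was removed
    return preg_replace('/&$/', '', $url);     // last parameter was removed
}

var_dump(strip_session_vars('page.php?sessid=abc&x=1&id=9', 'MoodleSession,sessid,id'));
// string(12) "page.php?x=1"
```

Both call sites could then pass `$eval` or `$file` together with `PHPDIG_SESSID_VAR`, and a single name without a comma goes through the same path.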
Old 01-23-2004, 04:45 AM   #14
shinji
Green Mole
 
Join Date: Jan 2004
Location: Hamm/NRW/Germany
Posts: 26
I replaced my old robot_functions.php with the one you mentioned...
but I can't find any of the lines I have to replace :/
Old 01-23-2004, 10:10 AM   #15
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. If you downloaded the robot_functions.php file in this thread, the code is in there. Just search on PHPDIG_SESSID_REMOVE to find the two places to edit.