|
01-16-2004, 04:31 PM | #1 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
Update Index taking 11 hours....
In order to eliminate a problem we were having with the way the spider indexed our files, we told PHPDIG to deal with session IDs named "&MoodleSession". This seemed to fix the original problem of having a lot of pages indexed when there were in actuality very few. Another problem has cropped up, however, after changing this setting.
When we run an update-index, it takes about 11 hours to complete, claiming to have gone through about 14,000 pages rather than the usual 2,000 (the website has ~200 unique pages). Is there a better way to ignore session IDs that are passed through the URI? Is the session ID at fault here, or could it be something else? |
01-18-2004, 07:36 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Is there a session ID showing up when you crawl? Can you provide a snippet of output from the crawl?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
01-20-2004, 09:17 AM | #3 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
Please look at the attached file. I just did a GROUP BY query on the spider table and found that, after removing the MoodleSession addition from the filename, each result appears three times. (The attached file is HTML with a TXT extension.)
|
01-20-2004, 09:43 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.
In this function, it is the following query that checks for a duplicate: PHP Code:
PHP Code:
Try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values from the config file, each on their own line (with PHP use \n if necessary), and surround the portion of dynamic content in each page like so: Code:
<!-- phpdigExclude -->
dynamic content, for example, code for a rotating banner would go in here
<!-- phpdigInclude -->
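To see why the markers matter for duplicate detection, here is a minimal sketch, assuming the duplicate check hashes the page content. The helper name stripExcludedRegions is made up for illustration and is not PHPDig's own code; PHPDig's real handling of the markers may differ.

```php
<?php
// Hypothetical helper, NOT part of PHPDig: drop everything between the
// exclude and include markers so that changing content cannot affect a hash.
function stripExcludedRegions(string $html): string
{
    $pattern = '/<!--\s*phpdigExclude\s*-->.*?<!--\s*phpdigInclude\s*-->/s';
    return preg_replace($pattern, '', $html);
}

// Two fetches of the same page that differ only inside the marked region.
$first  = "<p>Help text</p>\n<!-- phpdigExclude -->\nbanner 17\n<!-- phpdigInclude -->\n<p>Footer</p>";
$second = "<p>Help text</p>\n<!-- phpdigExclude -->\nbanner 42\n<!-- phpdigInclude -->\n<p>Footer</p>";

// The raw pages hash differently; the stripped pages hash the same.
var_dump(md5($first) === md5($second));
var_dump(md5(stripExcludedRegions($first)) === md5(stripExcludedRegions($second)));
```

If the part of the page that changes on refresh is not wrapped in the markers, every fetch produces a different hash and the page is never recognised as a duplicate.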
|
01-20-2004, 09:59 AM | #5 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
The problem is not that the pages are seen as duplicated, the problem is that they are NOT seen as duplicates. For some reason, PHPDig is assuming that the "MoodleSession" is part of the file name, therefore it is including the same file multiple times.
[site]/help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972 is seen as DIFFERENT from [site]/help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0 despite the fact that they are the same page with a different MoodleSession string appended to the end. I am assuming that the reason it sees them as different files is because of the MoodleSession thing. Is there a way to make PHPDig ignore this part of the "file"? |
01-20-2004, 10:11 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
>> As the pages are creating different $md5 variables, they are not seen as duplicates.
Hi. If you are using the following: PHP Code:
If so, then try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values to surround the part of the page that is changing upon refresh.
|
01-20-2004, 10:44 AM | #7 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
That works just fine when doing the initial indexing, though it's leaving a trailing "&" character at the end of each file stored. When performing an update, however, after the page labels "level 1", it begins to show the MoodleSessions again, and it repeats the same files over. (It's still running after 10 minutes; the initial indexing took about 4 minutes.)
It is NOT adding entries to the spider table. Attached are the results as of this writing (it's still going). |
01-20-2004, 12:24 PM | #8 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
It's STILL going...
For some reason, when doing an update, PHPDig still uses the MoodleSession in the path while DISPLAYING the results to the screen, whereas it is not actually writing these to the database. In some of the files, PHPDig is using the file name as the "title", therefore making the MD5 different (I'm assuming). Is there any way to make PHPDig ignore the MoodleSession TOTALLY (including making it the title, etc)? Edit: I looked at the tempspider DB, and it DOES show the MoodleSession stuff. Last edited by tester; 01-20-2004 at 12:29 PM. |
01-20-2004, 02:59 PM | #9 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The tempspider table holds links for processing, and the display shows links that are being crawled. This information doesn't necessarily reflect the exact end results. Using fresh database tables, are there session IDs in the sites table or in the search results after a crawl? You can try taking $titre_resume out of the $md5 variable, but it sounds like you are using the options currently available to handle session IDs.
|
01-21-2004, 10:01 AM | #10 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
SOLUTION FOUND
So we (pardon all the we's and I's; there are two of us working on this) figured out what the problem was.
For some reason, PHPDig was not REALLY getting the MoodleSession out of the URI. This caused the script to dig for a REALLY long time, not only with updating, but also with the initial index (1.5 hours for ~229 REAL file permutations; 11+ hours for the update). I made a function called "adamize" that chops out the MoodleSession part of the URI altogether, so that PHPDig does not even know it exists. Ever. I placed the function below at the bottom of the "robot_functions.php" page, and then made a reference to it at line 172 (in my file) in function phpdigRewriteUrl, under the "settype($eval,'string')" line: PHP Code:
PHP Code:
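The code posted at this point is not shown above. As a rough sketch of the idea only, and not the poster's actual adamize function, a helper that chops one query parameter out of a URL could look like the following. The name stripSessionParam and its default parameter are assumptions made for illustration; fragments (#...) are not handled.

```php
<?php
// Hypothetical stand-in, NOT the poster's adamize function.
// Removes one query parameter (MoodleSession by default) from a URL.
function stripSessionParam(string $url, string $param = 'MoodleSession'): string
{
    $queryStart = strpos($url, '?');
    if ($queryStart === false) {
        return $url; // no query string, nothing to strip
    }
    $base = substr($url, 0, $queryStart);
    // Split the query string into an array, drop the session parameter,
    // then rebuild the query from whatever is left.
    parse_str(substr($url, $queryStart + 1), $params);
    unset($params[$param]);
    return $params ? $base . '?' . http_build_query($params, '', '&') : $base;
}

$a = stripSessionParam('help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972');
$b = stripSessionParam('help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0');
var_dump($a === $b); // both URLs now name the same page
```

Rebuilding the query from parsed parameters avoids leaving a stray "&" behind, at the cost of re-encoding the remaining values.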
|
01-21-2004, 10:57 AM | #11 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Thanks, good idea. You might also try the following.
In phpdigRewriteUrl, comment out your adamize function call, and replace the following: Code:
// parse and remove quotes
$url = preg_replace('/[\0]/is','',$url); // remove null byte
$url = preg_replace('/[\']/is','',$url); // remove single quote
$url = preg_replace('/["]/is','',$url); // remove double quote
$url = preg_replace('/[\\\\]/is','',$url); // remove backslash
$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) { $url['path'] = ''; }
with the following: Code:
// parse and remove quotes
$eval = preg_replace('/[\0]/is','',$eval); // remove null byte
$eval = preg_replace('/[\']/is','',$eval); // remove single quote
$eval = preg_replace('/["]/is','',$eval); // remove double quote
$eval = preg_replace('/[\\\\]/is','',$eval); // remove backslash
if (PHPDIG_SESSID_REMOVE) {
    $eval = ereg_replace(PHPDIG_SESSID_VAR.'=[a-z0-9]*','',$eval); // remove the session ID pair
    $eval = str_replace("&&","&",$eval); // collapse the doubled separator left behind
    $eval = eregi_replace("[?][&]","?",$eval); // fix a query that now starts with "?&"
    $eval = eregi_replace("&$","",$eval); // remove a trailing "&"
}
$url = @parse_url(str_replace('\'"','',$eval));
if (!isset($url['path'])) { $url['path'] = ''; }
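The ereg functions used above were removed in PHP 7, so on a current PHP the same four steps need the preg functions instead. The following is a sketch under that assumption, wrapped in a made-up function name (removeSessionId) so it can be run on its own; the two constants are defined here only for illustration and would normally come from the PHPDig config file. Unlike the snippet above, it passes the variable name through preg_quote.

```php
<?php
// Assumed values for illustration; in PHPDig these come from the config file.
define('PHPDIG_SESSID_REMOVE', true);
define('PHPDIG_SESSID_VAR', 'MoodleSession');

// Hypothetical wrapper, not a PHPDig function.
function removeSessionId(string $eval): string
{
    if (!PHPDIG_SESSID_REMOVE) {
        return $eval;
    }
    // 1. Remove the "name=value" pair for the session variable.
    $eval = preg_replace('/' . preg_quote(PHPDIG_SESSID_VAR, '/') . '=[a-z0-9]*/', '', $eval);
    // 2. Collapse the doubled separator left behind in the middle of a query.
    $eval = str_replace('&&', '&', $eval);
    // 3. Fix a query string that now starts with "?&".
    $eval = str_replace('?&', '?', $eval);
    // 4. Remove a trailing "&".
    return preg_replace('/&$/', '', $eval);
}

echo removeSessionId('help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972'), "\n";
```

As with the ereg version, a URL whose only parameter is the session ID is left with a trailing "?".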
|
01-21-2004, 08:37 PM | #12 |
Green Mole
Join Date: Jan 2004
Location: Hamm/NRW/Germany
Posts: 26
|
Is it possible to set more than one sessid to ignore?
Like define('PHPDIG_SESSID_VAR','MoodleSession,sessid,id'); |
01-22-2004, 12:49 PM | #13 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Assuming version 1.8.0, use the robot_functions.php file attached in this thread.
This is untested, but in the robot_functions.php file there are two places to edit. First, replace the following: PHP Code:
PHP Code:
PHP Code:
PHP Code:
Of course, you could define such a function to avoid the repeat in code. In any case, remember to remove any "word" wrapping in the above code.
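The code blocks for this post are not shown above. As one possible shape for such a shared function, here is a hypothetical sketch that accepts a comma-separated list of variable names like the one in the question. The name removeSessionIds and the comma-separated format are assumptions for illustration, not an existing PHPDig option, and this is not the code from the post.

```php
<?php
// Hypothetical helper, NOT PHPDig code. $names is a comma-separated list
// such as 'MoodleSession,sessid,id'.
function removeSessionIds(string $url, string $names): string
{
    $quoted = array_map(
        function ($name) { return preg_quote(trim($name), '/'); },
        explode(',', $names)
    );
    // Match a name only where a parameter starts (right after ? or &), so
    // that "id" does not also match the tail of "sessid". The i flag makes
    // both names and values case-insensitive.
    $pattern = '/(?<=[?&])(?:' . implode('|', $quoted) . ')=[a-z0-9]*/i';
    $url = preg_replace($pattern, '', $url);
    $url = preg_replace('/&{2,}/', '&', $url); // collapse runs of separators
    $url = str_replace('?&', '?', $url);       // query now starting with "?&"
    return rtrim($url, '&?');                  // drop a dangling "&" or "?"
}

echo removeSessionIds('page.php?id=7&sessid=abc&file=x&MoodleSession=ff00', 'MoodleSession,sessid,id'), "\n";
```

Calling one function from both places in the file keeps the two edits in sync. Note that listing a generic name such as id would also strip ordinary page parameters with that name.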
|
01-23-2004, 04:45 AM | #14 |
Green Mole
Join Date: Jan 2004
Location: Hamm/NRW/Germany
Posts: 26
|
I replaced my old robot_functions.php with the one you mentioned...
but I can't find any of the lines I have to replace :/ |
01-23-2004, 10:10 AM | #15 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. If you downloaded the robot_functions.php file in this thread, the code is in there. Just search on PHPDIG_SESSID_REMOVE to find the two places to edit.
|