Update Index taking 11 hours....
To eliminate a problem we were having with the way the spider indexed our files, we told PHPDig to deal with session IDs named "&MoodleSession". This seemed to fix the original problem of many pages being indexed when there were actually very few. After changing this setting, however, another problem has cropped up.
When we run an update-index, it takes about 11 hours to complete and claims to have gone through about 14,000 pages, rather than the usual 2,000 (the website has ~200 unique pages). Is there a better way to ignore session IDs that are passed through the URI? Is the session ID at fault here, or could it be something else? |
Hi. Is there a session ID showing up when you crawl? Can you provide a snippet of output from the crawl?
|
1 Attachment(s)
Please look at the attached file. I just ran a GROUP BY query on the spider table and found that, after removing the MoodleSession addition from the filename, each result appears three times. (The attached file is HTML with a TXT extension.)
|
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.
In this function, it is the following query that checks for a duplicate: PHP Code:
PHP Code:
Try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values from the config file, each on their own line (with PHP use \n if necessary), and surround the portion of dynamic content in each page like so: Code:
<!-- phpdigExclude --> |
The problem is not that the pages are seen as duplicated, the problem is that they are NOT seen as duplicates. For some reason, PHPDig is assuming that the "MoodleSession" is part of the file name, therefore it is including the same file multiple times.
[site]/help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972 is seen as DIFFERENT from [site]/help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0 despite the fact that they are the same page with a different MoodleSession string appended to the end. I am assuming that the reason it sees them as different files is because of the MoodleSession thing. Is there a way to make PHPDig ignore this part of the "file"? |
>> As the pages are creating different $md5 variables, they are not seen as duplicates.
Hi. If you are using the following: PHP Code:
If so, then try using the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT values to surround the part of the page that is changing upon refresh. |
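To illustrate why wrapping the changing part of a page helps, here is a minimal sketch. It is not PHPDig's own code: the helper name `stripExcludedRegions` is hypothetical, the include marker `<!-- phpdigInclude -->` is assumed to be the counterpart of the exclude marker shown above, and the sketch uses a single regular expression rather than whatever line-based handling PHPDig does internally. It only shows that two pages differing inside the excluded region hash identically once that region is dropped.

```php
<?php
// Hypothetical helper, NOT part of PHPDig: remove everything between the
// exclude and include markers so dynamic content does not affect the hash.
function stripExcludedRegions(string $html): string
{
    $exclude = '<!-- phpdigExclude -->'; // marker from the thread
    $include = '<!-- phpdigInclude -->'; // assumed counterpart marker
    $pattern = '/' . preg_quote($exclude, '/') . '.*?' . preg_quote($include, '/') . '/s';
    return preg_replace($pattern, '', $html);
}

// Two copies of the same page; only the session-dependent part differs.
$first  = "<p>Help text</p>\n<!-- phpdigExclude -->\nsession=8784c271\n<!-- phpdigInclude -->\n<p>Footer</p>";
$second = "<p>Help text</p>\n<!-- phpdigExclude -->\nsession=6915eaca\n<!-- phpdigInclude -->\n<p>Footer</p>";

var_dump(md5($first) === md5($second)); // bool(false)
var_dump(md5(stripExcludedRegions($first)) === md5(stripExcludedRegions($second))); // bool(true)
```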
1 Attachment(s)
That works just fine when doing the initial indexing, though it's leaving a trailing "&" character at the end of each file stored. When performing an update, however, after the page labels "level 1", it begins to show the MoodleSessions again, and it repeats the same files over. (It's still running after 10 minutes; the initial indexing took about 4 minutes.)
It is NOT adding entries to the spider table. Attached are the results as of this writing (it's still going). |
It's STILL going...
For some reason, when doing an update, PHPDig still uses the MoodleSession in the path while DISPLAYING the results to the screen, whereas it is not actually writing these to the database. In some of the files, PHPDig is using the file name as the "title", therefore making the MD5 different (I'm assuming). Is there any way to make PHPDig ignore the MoodleSession TOTALLY (including making it the title, etc)? Edit: I looked at the tempspider DB, and it DOES show the MoodleSession stuff. |
Hi. The tempspider table holds links for processing, and the display shows links that are being crawled. This information doesn't necessarily reflect the exact end results. Using fresh database tables, are there session IDs in the sites table or in the search results after a crawl? You can try taking $titre_resume out of the $md5 variable, but it sounds like you are already using the options currently available to handle session IDs.
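The effect of the title on duplicate detection can be sketched as follows. This is only an illustration: `pageFingerprint` is a hypothetical function, and the real expression that builds $md5 in robot_functions.php is not shown in this thread. The sketch assumes the title falls back to the URL, as described above, so each session ID produces a different title.

```php
<?php
// Hypothetical fingerprint function; PHPDig's actual $md5 expression
// is not reproduced here.
function pageFingerprint(string $title, string $body, bool $includeTitle): string
{
    return md5($includeTitle ? $title . $body : $body);
}

$body = 'Program help text';

// Titles that fall back to the URL, each carrying a different session ID.
$titleA = 'help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972';
$titleB = 'help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0';

// With the title in the hash, the same page looks like two different pages.
var_dump(pageFingerprint($titleA, $body, true) === pageFingerprint($titleB, $body, true));   // bool(false)

// Without the title, the two copies are recognised as duplicates.
var_dump(pageFingerprint($titleA, $body, false) === pageFingerprint($titleB, $body, false)); // bool(true)
```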
|
SOLUTION FOUND
So we (pardon all the we's and I's; there are two of us working on this) figured out what the problem was.
For some reason, PHPDig was not REALLY getting the MoodleSession out of the URI. This caused the script to dig for a REALLY long time, not only with updating, but also with the initial index (1.5 hours for ~229 REAL file permutations; 11+ hours for the update). I made a function called "adamize" that chops out the MoodleSession part of the URI altogether, so that PHPDig does not even know it exists. Ever. I placed the function below at the bottom of the "robot_functions.php" page, and then made a reference to it at line 172 (in my file) in function phpdigRewriteUrl, under the "settype($eval,'string')" line: PHP Code:
PHP Code:
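The code originally posted is not reproduced here. As a rough idea of what a function like this could look like, here is a minimal sketch under stated assumptions: `stripSessionParam` is a hypothetical name, it is not the poster's adamize function, and it ignores URL fragments and HTML-encoded ampersands. It also trims the trailing "&" that was mentioned earlier in the thread.

```php
<?php
// Hypothetical sketch, NOT the original adamize() code from this thread.
// Removes one named query parameter (and its value) from a URL.
function stripSessionParam(string $url, string $name = 'MoodleSession'): string
{
    // Match "name=value" only when it directly follows "?" or "&",
    // and swallow the "&" that follows it, if any.
    $pattern = '/(?<=[?&])' . preg_quote($name, '/') . '=[^&#]*&?/';
    $url = preg_replace($pattern, '', $url);

    // Drop a dangling "?" or "&" left at the end of the URL.
    return rtrim($url, '?&');
}

$a = stripSessionParam('help.php?file=program.html&module=resource&MoodleSession=8784c2715aeb4eb5ba81cbd12b475972');
$b = stripSessionParam('help.php?file=program.html&module=resource&MoodleSession=6915eacadf8e168266a793884ec15cb0');

echo $a, "\n";          // help.php?file=program.html&module=resource
var_dump($a === $b);    // bool(true)
```

Calling something like this before the URL is stored means both session variants collapse to one entry, which is the behaviour described above.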
|
Hi. Thanks, good idea. You might also try the following.
In phpdigRewriteUrl, comment out your adamize function call, and replace the following: Code:
// parse and remove quotes Code:
// parse and remove quotes |
Is it possible to set more than one session ID to ignore?
Like define('PHPDIG_SESSID_VAR','MoodleSession,sessid,id'); |
Hi. Assuming version 1.8.0, use the robot_functions.php file attached in this thread.
This is untested, but in the robot_functions.php file there are two places to edit. First, replace the following: PHP Code:
PHP Code:
PHP Code:
PHP Code:
Of course, you could define such a function to avoid repeating the code. In any case, remember to remove any "word" wrapping in the above code. |
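As a sketch of what a shared helper for several session variables might look like (untested against PHPDig itself): the function name `stripSessionParams` is hypothetical, and the comma-separated list format is taken from the question above rather than from anything PHPDig is known to support out of the box.

```php
<?php
// Hypothetical helper, not PHPDig code: strip every query parameter whose
// name appears in a comma-separated list such as "MoodleSession,sessid".
function stripSessionParams(string $url, string $names): string
{
    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url; // no query string, nothing to strip
    }

    $drop = array_map('trim', explode(',', $names));
    $kept = [];
    foreach (explode('&', $parts[1]) as $pair) {
        if ($pair === '') {
            continue; // skip empty pieces, e.g. from a trailing "&"
        }
        $key = explode('=', $pair, 2)[0];
        if (!in_array($key, $drop, true)) {
            $kept[] = $pair;
        }
    }

    return $kept ? $parts[0] . '?' . implode('&', $kept) : $parts[0];
}

echo stripSessionParams(
    'help.php?file=program.html&sessid=123&module=resource&MoodleSession=abc',
    'MoodleSession,sessid'
), "\n"; // help.php?file=program.html&module=resource
```

Be careful with generic names such as id in the list: on many sites that parameter selects the content, so stripping it would merge genuinely different pages into one.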
I replaced my old robot_functions.php with the one you mentioned...
but I can't find any of the lines I have to replace :/ |
Hi. If you downloaded the robot_functions.php file in this thread, the code is in there. Just search on PHPDIG_SESSID_REMOVE to find the two places to edit.
|