Issue Number | 4279 |
---|---|
Summary | CDR Scheduler failing |
Created | 2017-06-15 10:28:53 |
Issue Type | Bug |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2017-07-27 15:39:25 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.210094 |
Error messages vary. The most frequent is:
DBAPIError: (_mssql.MSSQLDriverException) Not connected to any MS SQL server
We met with David Do on Friday afternoon. He was under the impression that the problems with the scheduler were the cause of the publishing slowdown (though I tried to provide evidence against that theory). He has shut down the service. I just attached an annotated version of the scheduler logs from PROD. Here's the high-level summary (this is as of July 3):
===============================================================================
SUMMARY
===============================================================================
Thu Dec 15 Scheduler installed; WCM tries to start multiple instances at once
Sat Dec 17 Scheduler running with a single instance
Sat Dec 17 Scheduler restarts 3 times starting at 7:40 pm; shutdown not logged
Tue Jan 10 Exception thrown: "Not connected to any MS SQL Server"
Sat Jan 21 Scheduler restarted at 9 pm; still no shutdown logging
Tue Mar 07 Scheduler restarted at 3:44 pm; no shutdown logging
Wed Mar 15 Einstein deployed (with zombie cleanup); scheduler restarts itself
Fri Mar 17 Job for translation queue notification (TQN) installed
Mon Mar 20 TQN job runs successfully as scheduled at 5 am
Mon Mar 20 Job to bounce the CDR Publishing service is installed (disabled)
Sat Mar 25 Scheduler restarted twice at 8:40 pm; still no shutdown logging
Mon Mar 27 TQN job runs successfully as scheduled at 5 am
Mon Apr 03 TQN job runs successfully as scheduled at 5 am
Fri Apr 07 Weekly publishing job installed and run; fails (DB unavailable)
Mon Apr 10 TQN job starts but interrupted by scheduler restart; marked FAILED
Mon Apr 17 Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon Apr 17 Scheduler restart fails first admin connection; 2nd attempt succeeds
Sat Apr 22 Scheduler restarts 6:04 pm; no shutdown logging
Sun Apr 23 Scheduler restarts 2:11 am; no shutdown logging
Mon Apr 24 Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon Apr 24 Gov Delivery report installed (during which scheduler restarted)
Mon May 01 Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon May 01 Trial tracking table refresh job installed
Sat May 06 Scheduler restarts 8:11 pm; no shutdown logging
Mon May 08 TQN job runs successfully as scheduled at 5 am
Tue May 09 Scheduler restarts 5:57 pm; no shutdown logging
Tue May 09 Scheduler restarts 2x, starting 7:32 pm; no shutdown logging
Mon May 15 TQN job runs successfully as scheduled at 5 am
Wed May 17 Pub queue check (PQC) installed; runs every 2 minutes
Sat May 20 MSSQLDatabaseException: "SHUTDOWN ..." at 7:16 pm
Sat May 20 Scheduler restarts (twice) and resumes processing (11:14 pm)
Mon May 22 TQN job runs successfully as scheduled at 5 am
Mon May 22 Scheduler logs unknown database failure but continues
Mon May 29 TQN job runs successfully as scheduled at 5 am
Tue May 30 Nightly glossifier job installed; no restarts logged
Thu Jun 01 Scheduler loses db connection and stops processing
Fri Jun 09 CBIIT starts 2 scheduler instances; 1 keeps trying to restart
Mon Jun 12 TQN job runs successfully as scheduled at 5 am
Mon Jun 12 Second instance stops logging restarts every 2-3 seconds
Mon Jun 19 TQN job runs successfully as scheduled at 5 am
Mon Jun 26 TQN job runs successfully as scheduled at 5 am
Wed Jun 28 Scheduler restarts itself; WCM tries to launch 2nd instance
Thu Jun 29 Scheduler loses db connection; WCM keeps trying to launch another
Fri Jun 30 CBIIT shuts down the service
Here's a list of the exceptions thrown by the scheduler on PROD to date:
2017-01-10 10:31:38.653 scheduler_manager.get_jobs() for web front page
2017-04-07 21:04:54.887 "not connected ..." submitting publishing job
2017-04-17 05:00:00.039 "not connected ..." in jobstore.get_due_jobs()
2017-04-17 05:00:00.028 "not connected ..." in jobstore.get_due_jobs()
2017-05-01 05:00:00.019 "not connected ..." in jobstore.get_due_jobs()
2017-05-20 19:16:00.049 "SHUTDOWN ..." in jobstore.get_due_jobs()
2017-06-01 13:44:00.239 "not connected ..." in jobstore.get_due_jobs()
2017-06-09 14:52:03.005 scheduler_manager.get_jobs() for web front page
2017-06-24 19:02:02.061 "not connected ..." in jobstore.get_due_jobs()
2017-06-29 07:42:00.059 "not connected ..." in jobstore.get_due_jobs()
2017-06-30 14:36:18.808 scheduler_manager.get_jobs() for web front page
2017-06-30 14:36:31.801 scheduler_manager.get_jobs() for web front page
Possible paths to a solution include:
find a way to induce sqlalchemy to be more resilient in the face of transient db connection failures
get CBIIT to stop the CDR services before shutting down the DB server and restarting them after the DB server is available again
try another database access module
override jobstore.get_due_jobs() (a partial solution; a rough sketch follows this list)
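As a rough illustration of that last option, here is a minimal sketch of the kind of jobstore override I have in mind. The class name is made up, and it assumes the jobstore ultimately derives from APScheduler's SQLAlchemyJobStore; the idea is to retry get_due_jobs() once when SQLAlchemy reports that the connection was invalidated, instead of letting a single transient failure abort the scheduler tick.

# Hypothetical sketch only: retry get_due_jobs() once after a dropped connection.
from sqlalchemy import exc
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

class ResilientJobStore(SQLAlchemyJobStore):
    def get_due_jobs(self, now):
        try:
            return super(ResilientJobStore, self).get_due_jobs(now)
        except exc.DBAPIError as e:
            if not e.connection_invalidated:
                raise
            # The pool has already discarded the dead connection; a single
            # retry picks up a fresh one if the database is reachable again.
            return super(ResilientJobStore, self).get_due_jobs(now)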
I have added a page to Collaborate describing at an appropriately high level of abstraction how the core of the scheduler works. I also have another page which has way too much detail about the processing flow when the scheduler is running.
I may have come up with a way to make the scheduler survive the database outages. I tried replacing nssm with the Python win32 service support modules, but they actually don't do as good a job as nssm does at detecting when the service has stopped responding (at least not at the present time). There is a patch addressing this problem floating around, submitted to the python-win32 mailing list a few months ago, but I don't see evidence that anyone is doing anything with it. I dug into the source code for nssm, and I have come to the conclusion that it is an excellent tool (ignoring some of the coding style issues, like obscenely long source code lines, some of them over 400 characters wide – who's got a monitor that big?). It actually does take over the functionality of the WCM, with much more finely grained control over the service, particularly in the area of failure recovery.
My plan of attack is multi-pronged, and includes:
Hook into sqlalchemy to detect when the database has gone away and take appropriate action (a sketch of this follows the list)
Upgrade nssm from 2.16 to the current stable release (2.24)
Get the service executable out of the private user area for the CBIIT employee who installed the service
Take advantage of the rich control over the service supported by nssm
Investigate running under the ncicdr service account (as we do for the CDR service) instead of the local service account
Work with CBIIT to ensure that Windows file/directory permissions are correctly configured
Turn off autoreload inside tornado; this appears to be at least part of the cause of the redundant process instantiations
Add more logging to the tornado loop
Work with CBIIT to get them to stop the scheduler service when they're going to take the database off line, and to restart it when the database service has been restored
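For the first item in that list, here is a minimal sketch of the hook I have in mind, adapted from SQLAlchemy's documented "pessimistic disconnect handling" recipe. The connection URL is just a placeholder, and whether pymssql actually flags the "Not connected" error as a disconnect (so that the connection gets invalidated) is something I still have to verify.

# Sketch only: ping each connection when the engine checks it out, and
# retry once if the ping shows the connection has gone stale.
from sqlalchemy import create_engine, event, exc, select

engine = create_engine("mssql+pymssql://cdr:password@dbserver/cdr")  # placeholder URL

@event.listens_for(engine, "engine_connect")
def ping_connection(connection, branch):
    if branch:
        return  # sub-connection of an already-pinged connection
    save_should_close = connection.should_close_with_result
    connection.should_close_with_result = False
    try:
        connection.scalar(select([1]))  # cheap round trip to the server
    except exc.DBAPIError as e:
        if e.connection_invalidated:
            # The dead connection was discarded; this retry gets a fresh one.
            connection.scalar(select([1]))
        else:
            raise
    finally:
        connection.should_close_with_result = save_should_close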
For some of these changes I may need to override more classes/methods than I would like, at least in the short run. For example, the way ndscheduler creates its HTTP handlers, the only way I can suppress the tornado autoreload (which wakes up every 500 milliseconds to check to see if any of the dozens of loaded modules has been modified, and if so to exit the process and start another, completely befuddling the service managers) is to override the constructor for the ndscheduler SchedulerServer class. I have offered to come up with a patch which would let their own constructor accept an optional dictionary of custom tornado settings to avoid this ugliness, but I haven't heard anything back from the ndscheduler maintainers yet.
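To illustrate the tornado half of that (the handler and settings below are stand-ins, not the actual ndscheduler code): debug=True implicitly turns autoreload on unless autoreload=False is passed explicitly in the Application settings, and that explicit setting is what the overridden constructor needs to inject.

# Stand-in example: an explicit autoreload=False suppresses the module
# polling (and process re-exec) that debug=True would otherwise enable.
import tornado.ioloop
import tornado.web

class PingHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("ok")

settings = dict(debug=True, autoreload=False)
application = tornado.web.Application([(r"/ping", PingHandler)], **settings)

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.current().start()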
Here's a first draft of the script I'm planning to use for installing the service (I see JIRA's code formatter is broken again).
REM First draft of the CDR Scheduler service installation script.
SET NSSM=D:\cdr\Bin\nssm.exe

REM Register the service: nssm runs Python with the scheduler script.
%NSSM% install CDRScheduler D:\Python\Python.exe
%NSSM% set CDRScheduler AppDirectory D:\cdr\Scheduler
%NSSM% set CDRScheduler AppParameters D:\cdr\Scheduler\cdr_scheduler.py
%NSSM% set CDRScheduler AppEnvironmentExtra PYTHONPATH=D:\cdr\lib\Python NDSCHEDULER_SETTINGS_MODULE=settings
%NSSM% set CDRScheduler DisplayName "CDR Scheduler"
%NSSM% set CDRScheduler Description "Manages CDR tasks"
%NSSM% set CDRScheduler AppNoConsole 1

REM Failure recovery: restart on any exit, but throttle restarts when the
REM process dies within 60 seconds of starting.
%NSSM% set CDRScheduler AppExit Default Restart
%NSSM% set CDRScheduler AppThrottle 60000

REM Start automatically at boot, after the Cdr service is up.
%NSSM% set CDRScheduler Start SERVICE_AUTO_START
%NSSM% set CDRScheduler DependOnService Cdr

REM Capture stdout/stderr; creation disposition 4 (OPEN_ALWAYS) keeps the
REM existing log file across restarts. Rotate at service start when the log
REM is older than 30 days or larger than ~10 MB.
%NSSM% set CDRScheduler AppStdout D:\cdr\log\tconn-service.log
%NSSM% set CDRScheduler AppStderr D:\cdr\log\tconn-service.log
%NSSM% set CDRScheduler AppStdoutCreationDisposition 4
%NSSM% set CDRScheduler AppStderrCreationDisposition 4
%NSSM% set CDRScheduler AppRotateFiles 1
%NSSM% set CDRScheduler AppRotateOnline 0
%NSSM% set CDRScheduler AppRotateSeconds 2592000
%NSSM% set CDRScheduler AppRotateBytes 10000000
I have tested the core features of my proposed approach on my systems in Arlington, and I will be installing the modifications on DEV over the next couple of days.
I have deployed the modifications to DEV, QA, and STAGE. If they continue running successfully while I am out over the following week, I will work with CBIIT to get them deployed to PROD and get the service turned back on.
This landed on production last night.
File Name | Posted | User |
---|---|---|
scheduler-history.txt | 2017-07-03 14:42:53 | Kline, Bob (NIH/NCI) [C] |