PDQ Issues

Issue Number	4279
Summary	CDR Scheduler failing
Created	2017-06-15 10:28:53
Issue Type	Bug
Submitted By	Kline, Bob (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2017-07-27 15:39:25
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.210094

Description

Error message vary. Most frequent is:

DBAPIError: (_mssql.MSSQLDriverException) Not connected to any MS SQL server

Comment entered 2017-07-03 14:46:28 by Kline, Bob (NIH/NCI) [C]

We met with David Do on Friday afternoon. He was under the impression that the problems with the scheduler are the cause of the publishing slowdown (though I tried to provide evidence against that theory). He has shut down the service. I just attached an annotated version of the scheduler logs from PROD. Here's the high-level summary (this is as of July 3):

===============================================================================
SUMMARY
===============================================================================
Thu Dec 15  Scheduler installed; WSM tries to start multiple instances at once
Sat Dec 17  Scheduler running with a single instance
Sat Dec 17  Scheduler restarts 3 times starting at 7:40 pm; shutdown not logged
Tue Jan 10  Exception thrown: "Not connected to any MS SQL Server"
Sat Jan 21  Scheduler restarted at 9 pm; still no shutdown logging
Tue Mar 07  Scheduler restarted at 3:44 pm; no shutdown logging
Wed Mar 15  Einstein deployed (with zombie cleanup); scheduler restarts itself
Fri Mar 17  Job for translation queue notification (TQN) installed
Mon Mar 20  TQN job runs successfully as scheduled at 5 am
Mon Mar 20  Job to bounce the CDR Publishing service is installed (disabled)
Sat Mar 25  Scheduler restarted twice at 8:40 pm; still no shutdown logging
Mon Mar 27  TQN job runs successfully as scheduled at 5 am
Mon Apr 03  TQN job runs successfully as scheduled at 5 am
Fri Apr 07  Weekly publishing job installed and run; fails (DB unavailable)
Mon Apr 10  TQN job starts but interrupted by scheduler restart; marked FAILED
Mon Apr 17  Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon Apr 17  Scheduler restart fails first admin connection; 2nd attempt succeeds
Sat Apr 22  Scheduler restarts 6:04 pm; no shutdown logging
Sun Apr 23  Scheduler restarts 2:11 am; no shutdown logging
Mon Apr 24  Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon Apr 24  Gov Delivery report installed (during which scheduler restarted)
Mon May 01  Exception "Not connected to any MS SQL Server" at 5 am (TQN job)
Mon May 01  Trial tracking table refresh job installed
Sat May 06  Scheduler restarts 8:11 pm; no shutdown logging
Mon May 08  TQN job runs successfully as scheduled at 5 am
Tue May 09  Scheduler restarts 5:57 pm; no shutdown logging
Tue May 09  Scheduler restarts 2x, starting 7:32 pm; no shutdown logging
Mon May 15  TQN job runs successfully as scheduled at 5 am
Wed May 17  Pub queue check (PQC) installed; runs every 2 minutes
Sat May 20  MSSQLDatabaseException: "SHUTDOWN ..." at 7:16 pm
Sat May 20  Scheduler restarts (twice) and resumes processing (11:14 pm)
Mon May 22  TQN job runs successfully as scheduled at 5 am
Mon May 22  Scheduler logs unknown database failure but continues
Mon May 29  TQN job runs successfully as scheduled at 5 am
Tue May 30  Nightly glossifier job installed; no restarts logged
Thu Jun 01  Scheduler loses db connection and stops processing
Fri Jun 09  CBIIT starts 2 scheduler instances; 1 keeps trying to restart
Mon Jun 12  TQN job runs successfully as scheduled at 5 am
Mon Jun 12  Second instance stops logging restarts every 2-3 seconds
Mon Jun 19  TQN job runs successfully as scheduled at 5 am
Mon Jun 26  TQN job runs successfully as scheduled at 5 am
Wed Jun 28  Scheduler restarts itself; WCM tries to launch 2nd instance
Thu Jun 29  Scheduler loses db connect; WCM keeps trying to launch another
Fri Jun 30  CBIIT shuts down the service

Comment entered 2017-07-03 14:52:14 by Kline, Bob (NIH/NCI) [C]

Here's a list of the exceptions thrown by the scheduler on PROD to date:

2017-01-10 10:31:38.653 scheduler_manager.get_jobs() for web front page
2017-04-07 21:04:54.887 "not connected ..." submitting publishing job
2017-04-17 05:00:00.039 "not connected ..." in jobstore.get_due_jobs()
2017-04-17 05:00:00.028 "not connected ..." in jobstore.get_due_jobs()
2017-05-01 05:00:00.019 "not connected ..." in jobstore.get_due_jobs()
2017-05-20 19:16:00.049 "SHUTDOWN ..." in jobstore.get_due_jobs()
2017-06-01 13:44:00.239 "not connected ..." in jobstore.get_due_jobs()
2017-06-24 19:02:02.061 "not connected ..." in jobstore.get_due_jobs()
2017-06-09 14:52:03.005 scheduler_manager.get_jobs() for web front page
2017-06-29 07:42:00.059 "not connected ..." in jobstore.get_due_jobs()
2017-06-30 14:36:18.808 scheduler_manager.get_jobs() for web front page
2017-06-30 14:36:31.801 scheduler_manager.get_jobs() for web front page

Possible paths to a solution include:

find a way to induce sqlalchemy to be more resilient in the face of transient db connection failures
get CBIIT to stop the CDR services before shutting down the DB server and restarting them after the DB server is available again
try another database access module
override jobstore.get_due_jobs() (a partial solution)

Comment entered 2017-07-03 14:57:24 by Kline, Bob (NIH/NCI) [C]

I have added a page to Collaborate describing at an appropriately high level of abstraction how the core of the scheduler works. I also have another page which has way too much detail about the processing flow when the scheduler is running.

Comment entered 2017-07-05 17:49:26 by Kline, Bob (NIH/NCI) [C]

I may have come up with a way to make the scheduler survive the database outages. I tried replacing nssm with the Python win32 service support modules, but they actually don't do as good a job as nssm does at detecting when the service has stopped responding (at least not at the present time). There is a patch addressing this problem floating around, submitted to the python-win32 mailing list a few months ago, but I don't see evidence that anyone is doing anything with it. I dug into the source code for nssm, and I have come to the conclusion that it is an excellent tool (ignoring some of the coding style issues, line obscenely long source code lines, some of them over 400 characters wide – who's got a monitor that big?). It actually does take over the functionality of the WCM, with much more finely grained control over the service, particularly in the area of failure recovery.

My plan of attack is multi-pronged, and includes:

Hook into sqlalchemy to detect when the database has gone away and take appropriate action
Upgrade nssm from 2.16 to the current stable release (2.24)
Get the service executable out of the private user area for the CBIIT employee who installed the service
Take advantage of the rich control over the service supported by nssm
Investigate running under the ncicdr service account (as we do for the CDR service) instead of the local service account
Work with CBIIT to ensure that Windows file/directory permissions are correctly configured
Turn off autoreload inside tornado; this appears to be at least part of the cause of the redundant process instantiations
Add more logging to the tornado loop
Work with CBIIT to get them to stop the scheduler service when they're going to take the database off line, and to restart it when the database service has been restored

For some of these changes I may need to override more classes/methods than I would like, at least in the short run. For example, the way ndscheduler creates its HTTP handlers, the only way I can suppress the tornado autoreload (which wakes up every 500 milliseconds to check to see if any of the dozens of loaded modules has been modified, and if so to exit the process and start another, completely befuddling the service managers) is to override the constructor for the ndscheduler SchedulerServer class. I have offered to come up with a patch which would let their own constructor accept an optional dictionary of custom tornado settings to avoid this ugliness, but I haven't heard anything back from the ndscheduler maintainers yet.

Here's a first draft of the script I'm planning to use for installing the service (I see JIRA's code formatter is broken again).

SET NSSM=D:\cdr\Bin\nssm.exe
%NSSM% install CDRScheduler D:\Python\Python.exe
%NSSM% set CDRScheduler AppDirectory D:\cdr\Scheduler
%NSSM% set CDRScheduler AppParameters D:\cdr\Scheduler\cdr_scheduler.py
%NSSM% set CDRScheduler AppEnvironmentExtra PYTHONPATH=D:\cdr\lib\Python NDSCHEDULER_SETTINGS_MODULE=settings
%NSSM% set CDRScheduler DisplayName CDR Scheduler
%NSSM% set CDRScheduler Description Manages CDR tasks
%NSSM% set CDRScheduler AppNoConsole 1
%NSSM% set CDRScheduler AppExit Default Restart
%NSSM% set CDRScheduler AppThrottle 60000
%NSSM% set CDRScheduler Start SERVICE_AUTO_START
%NSSM% set CDRScheduler DependOnService Cdr
%NSSM% set CDRScheduler AppStdout D:\cdr\log\tconn-service.log
%NSSM% set CDRScheduler AppStderr D:\cdr\log\tconn-service.log
%NSSM% set CDRScheduler AppStdoutCreationDisposition 4
%NSSM% set CDRScheduler AppStderrCreationDisposition 4
%NSSM% set CDRScheduler AppRotateFiles 1
%NSSM% set CDRScheduler AppRotateOnline 0
%NSSM% set CDRScheduler AppRotateSeconds 2592000
%NSSM% set CDRScheduler AppRotateBytes 10000000

I have tested the core features of my proposed approach on my systems in Arlington, and I will be installing the modifications on DEV over the next couple of days.

Comment entered 2017-07-12 16:41:07 by Kline, Bob (NIH/NCI) [C]

I have deployed the modifications to DEV, QA, and STAGE. If they continue running successfully while I am out over the following week, I will work with CBIIT to get them deployed to PROD and get the service turned back on.

Comment entered 2017-07-27 15:39:25 by Kline, Bob (NIH/NCI) [C]

This landed on production last night.

Attachments

File Name	Posted	User
scheduler-history.txt	2017-07-03 14:42:53	Kline, Bob (NIH/NCI) [C]

Elapsed: 0:00:00.001879

CDR Tickets