Enhanced web-based management tools for the CDR

JIRA OCECDR-3950

1 Introduction
2 Requirements
3 List of Possible/Desirable features

1 Introduction

These notes are a list of ideas gathered from reading web documentation for job scheduling systems. The features listed are "desirable" features, not requirements. We don't yet know which of them are actually required. It's likely that very few are absolutely required, but they give us a potential checklist to use in comparing off-the-shelf scheduling systems.

2 Requirements

2.1 Add software to the CDR to simplify batch job management

2.2 Reduce workload on CBIIT

2.2.1 Currently we

Figure out exactly what to do

Create a script or set of directions

Create an issue for CBIIT

Wait for CBIIT to implement it

Check results, often requiring CBIIT's help

2.2.2 As far as possible, we would like to

Do this ourselves, without the duplicated effort and delays inherent in the existing process.

2.3 Work within existing and future security constraints

3 List of Possible/Desirable features

3.1 Security

3.1.1 Require two-factor authentication to access system?

3.1.2 Log each user who enters or modifies a job

3.1.3 Specify users/groups with permission on jobs

On a per-job basis

On a per-account basis
- e.g., only users in group X can start a job that runs under account Y
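
A minimal sketch of how the per-job and per-account checks might combine. The names (JobPermissions, ACCOUNT_GROUPS, can_start) and the rule that both checks must pass are assumptions for illustration, not features of any particular scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class JobPermissions:
    """Hypothetical permission record for one job."""
    run_as_account: str
    # Groups allowed to start this specific job (per-job rule).
    allowed_groups: set = field(default_factory=set)


# Hypothetical per-account rule: groups allowed to start jobs under an account.
ACCOUNT_GROUPS = {"cdr_publish": {"publishers"}}


def can_start(user_groups, perms, account_groups=ACCOUNT_GROUPS):
    """Return True only if the user passes both the job and account checks."""
    groups = set(user_groups)
    job_ok = bool(groups & perms.allowed_groups)
    account_ok = bool(groups & account_groups.get(perms.run_as_account, set()))
    return job_ok and account_ok
```

With this rule, a user in "developers" could be listed on a job and still be refused because the job runs under an account reserved for "publishers".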

3.1.4 Does not require Administrator privileges

3.2 Licensing

3.2.1 Open-source is best

3.3 Good documentation

3.4 Portability

3.4.1 Linux or Windows

3.4.2 MySQL or SQL Server

3.4.3 Usable on multiple projects, not just the CDR

3.5 Interfaces

3.5.1 Web interface via https

3.5.2 Command line interface

May not be used often, but it is useful because it allows us to script changes

3.6 Capability

3.6.1 Start a job

3.6.2 Launch any kind of job

Executable binary

Python script

C# program

SQL script
- Like sqlc or sqlq in drush

Bash shell script

Windows command file

PHP script?

Java program
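
One way a launcher could cover several of these job types is to map file extensions to interpreter commands. This is only a sketch: the LAUNCHERS table and build_command are hypothetical, the interpreters are assumed to be on the PATH, and SQL scripts are left out because they would need a database client and connection settings.

```python
import sys
from pathlib import Path

# Hypothetical mapping from file extension to the command prefix that runs it.
LAUNCHERS = {
    ".py": [sys.executable],
    ".sh": ["bash"],
    ".php": ["php"],
    ".jar": ["java", "-jar"],
    ".cmd": ["cmd", "/c"],
    ".bat": ["cmd", "/c"],
}


def build_command(path, args=()):
    """Build an argument list; unknown extensions are run as executables."""
    prefix = LAUNCHERS.get(Path(path).suffix.lower(), [])
    return [*prefix, str(path), *args]
```

The resulting list can be passed to subprocess.run(), whose return code gives the success or failure status.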

3.6.3 User can restart a job that failed

Probably requires that the job be written to support restarting

3.6.4 Cancel a job that has not run yet

3.6.5 Suspend a running job

Sends a signal to the job to suspend itself
- We build checks for signals into our jobs
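
A sketch of what a built-in signal check might look like inside one of our Python jobs. The choice of SIGUSR1 is an assumption and exists only on POSIX systems, so a Windows job would need a different mechanism. The function names are hypothetical.

```python
import signal
import threading

# Set by the signal handler; the job checks it between units of work.
suspend_requested = threading.Event()


def _on_suspend(signum, frame):
    suspend_requested.set()


def install_suspend_handler(signum=None):
    """Register the handler. Defaults to SIGUSR1 (assumed, POSIX only)."""
    if signum is None:
        signum = signal.SIGUSR1
    signal.signal(signum, _on_suspend)


def run_job(items, process):
    """Process items in order and return the index to resume from.

    The check happens between items, so an item is never cut off halfway.
    """
    for index, item in enumerate(items):
        if suspend_requested.is_set():
            return index
        process(item)
    return len(items)
```

Returning the resume index lets the job save a checkpoint, which is also what a restartable job (3.6.3) would need.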

3.6.6 Kill a running job

3.6.7 Schedule a job for later execution

3.6.8 Schedule for regular execution

3.6.9 Chain jobs

Link multiple jobs in a chain

Each job returns status information for success or failure

When one completes successfully, next one starts

Chains can include jobs or other chains
- Detect and prevent looping?
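
The chaining rules above can be sketched as follows, assuming a chain is stored as a mapping from chain name to an ordered list of member names, and that any name not in the mapping is a plain job. The structure and function names are hypothetical.

```python
def find_cycle(chains, name, path=()):
    """Return the looping path if chain `name` ends up containing itself."""
    if name in path:
        return path[path.index(name):] + (name,)
    for member in chains.get(name, []):
        cycle = find_cycle(chains, member, path + (name,))
        if cycle:
            return cycle
    return None


def run_chain(chains, name, run_job):
    """Run members in order and stop at the first failure.

    run_job(job_name) is supplied by the caller and returns True on success.
    """
    if name not in chains:  # a plain job, not a chain
        return run_job(name)
    for member in chains[name]:
        if not run_chain(chains, member, run_job):
            return False
    return True
```

Calling find_cycle() when a chain is saved, rather than when it runs, would prevent a looping chain from ever being scheduled.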

3.6.10 Enter/override parameters for a job

3.6.11 Set OS job priority

3.6.12 Control over concurrency

Specify that a job must not be run concurrently
- Maybe that's the default

Limit the number of concurrent batch jobs
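
A sketch of the two concurrency rules as a single admission check. It assumes that "no concurrent runs of the same job" is the default and that jobs must opt in to overlapping. The function and parameter names are hypothetical.

```python
def may_start(job, running, max_concurrent, allow_overlap=frozenset()):
    """Decide whether `job` may start, given the names of running jobs."""
    if len(running) >= max_concurrent:
        return False  # global limit on concurrent batch jobs reached
    if job in running and job not in allow_overlap:
        return False  # another instance of this job is already running
    return True
```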

3.6.13 Run jobs that missed their scheduled times because the system was down?

Ideally, a job control parameter
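
A sketch of how a per-job catch-up parameter might work for a job that runs at a fixed interval. The policy names ("skip", "run_once", "run_all") are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def missed_runs(last_run, interval, now):
    """List the scheduled times after last_run that are not later than now."""
    times = []
    due = last_run + interval
    while due <= now:
        times.append(due)
        due += interval
    return times


def catch_up(last_run, interval, now, policy="run_once"):
    """Return the missed times that should be run, according to the policy."""
    missed = missed_runs(last_run, interval, now)
    if policy == "skip" or not missed:
        return []
    if policy == "run_once":
        return missed[-1:]  # only the most recent missed run
    if policy == "run_all":
        return missed
    raise ValueError(f"unknown catch-up policy: {policy}")
```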

3.6.14 Log rotation

Or we can use FileSweeper

3.6.15 Email status information

Distribution lists
- Custom distribution lists for different jobs
- General list for all jobs?
- Add/delete addresses from list via web interface
- Special lists for failed jobs?
  - e.g., if job fails, notify …
  - If job succeeds notify smaller group, or no one
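
A sketch of how recipients might be chosen from per-job lists with a general fallback. The job name, list layout, and example.org addresses are placeholders, not real distribution lists.

```python
# Hypothetical general lists used when a job has no list of its own.
DEFAULT_LISTS = {"failure": ["cdr-ops@example.org"], "success": []}

# Hypothetical per-job lists, keyed by job name and then by outcome.
JOB_LISTS = {
    "publishing": {
        "failure": ["pub-team@example.org", "cdr-ops@example.org"],
        "success": [],  # notify no one when publishing succeeds
    },
}


def recipients(job, status, job_lists=JOB_LISTS, default_lists=DEFAULT_LISTS):
    """Return the addresses to notify for this job and outcome."""
    lists = job_lists.get(job, {})
    if status in lists:
        return list(lists[status])
    return list(default_lists.get(status, []))
```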

3.6.16 Built-in SFTP client? Server?

Might simplify some efforts

3.6.17 Ability to communicate across multiple servers

Idea is:
Start a job, e.g., in the CDR. When it's done, cause data to be sent to a Linux server. When copy is complete, start a job on the Linux server.

3.6.18 API?

Enables us to script job management

3.6.19 Import/export controls

Ideally
- Generate a schedule control table on one tier
- Export/import it to other tiers
- Allow certain changes to be made globally
  - e.g., replace a server name, db name, etc. globally
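
A sketch of export/import with global replacement, assuming the schedule control table can be represented as JSON. The table layout, server names, and function names are hypothetical.

```python
import json


def retarget(value, replacements):
    """Apply the string replacements to every string value, recursively."""
    if isinstance(value, str):
        for old, new in replacements.items():
            value = value.replace(old, new)
        return value
    if isinstance(value, list):
        return [retarget(v, replacements) for v in value]
    if isinstance(value, dict):
        return {k: retarget(v, replacements) for k, v in value.items()}
    return value  # numbers, booleans, None are left alone


def export_table(table, replacements=None):
    """Serialize a control table, optionally rewriting names for another tier."""
    return json.dumps(retarget(table, replacements or {}), indent=2, sort_keys=True)


def import_table(text):
    """Load a control table exported by export_table()."""
    return json.loads(text)
```

For example, export_table(dev_table, {"cdr-dev": "cdr-prod"}) would produce text that can be imported on the production tier with the server names already changed.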

3.6.20 Job monitoring

See what's running
- Job name, ID, link to control info
- When started, time elapsed
- Parameters passed
- Resources used?
  - CPU
  - Disk reads / writes
  - Network reads / writes
- Current status
  - Succeeded
  - Failed
  - Running
  - Suspended
  - Waiting on dependency
  - Idle
    - Running, but nothing happening
  - Killed

Set limits on job
- If a job runs longer than X
  - Notify user
- If a job runs longer than Y
  - Suspend the job, or
  - Kill the job
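
The two limits could be checked by the monitor on each polling pass. A sketch, assuming X and Y are given in seconds and that the choice between suspending and killing is a per-job setting (the parameter names are hypothetical):

```python
def limit_action(elapsed_seconds, notify_after, stop_after, stop_action="suspend"):
    """Return "notify", the stop action, or None if no limit is exceeded.

    notify_after is X and stop_after is Y; either may be None for "no limit".
    """
    if stop_after is not None and elapsed_seconds > stop_after:
        return stop_action  # "suspend" or "kill"
    if notify_after is not None and elapsed_seconds > notify_after:
        return "notify"
    return None
```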

3.6.21 Dependency checking / job triggering

Dependencies
- Specify minimum disk (or memory?) availability before starting a job
  - Scheduler won't schedule job if resources are insufficient
  - Logs and sends email when job can't start
- Resource locking
  - e.g., Have a way to specify that some resource must be open
  - Example in Windows, find out if some process has a file open
- Specify other jobs that must be complete before starting this
- Specify other jobs that can't be running when starting this
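
A sketch of a pre-start dependency check covering disk space, prerequisite jobs, and conflicting jobs. The job dictionary keys ("min_free_bytes", "requires", "conflicts") are hypothetical. Resource locking is not covered here.

```python
import shutil


def blockers(job, running, completed, path=".", free_bytes=None):
    """Return the reasons the job cannot start now; an empty list means go."""
    reasons = []
    if free_bytes is None:
        # Free space on the volume holding `path`.
        free_bytes = shutil.disk_usage(path).free
    if free_bytes < job.get("min_free_bytes", 0):
        reasons.append("insufficient disk space")
    for name in job.get("requires", []):
        if name not in completed:
            reasons.append(f"waiting on {name}")
    for name in job.get("conflicts", []):
        if name in running:
            reasons.append(f"conflicts with running job {name}")
    return reasons
```

Returning the reasons rather than a plain yes/no gives the scheduler something to put in the log and the email when a job can't start.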

Triggering
Similar to dependency checking but controls what causes a job to start instead of what prevents a job from starting
- Completion of a job starts another
  - Chaining is one way to do this, but not the only way
- Receipt of an email
- Appearance of a file or database row
- Success or failure code from a script
- API call from a job
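
As one example of a trigger, the appearance of a file can be detected by polling. This is a sketch only; a real scheduler might use operating system file notifications instead, and the function name is hypothetical.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout, poll_interval=1.0):
    """Return True as soon as `path` exists, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if Path(path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```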

3.6.22 Crash management

System records what it is doing as it does it

Checks records when starting
- Detects that jobs X, Y, and Z were in progress
- Understands that a crash occurred
- Logs info, emails info
- Allows per job controls over what happens in crash recovery?
  - Restart job automatically?
  - Skip job?
  - Stop all processing until human input?
    - Alert humans, pause for instructions
  - Or just stop one job until human input

Performs cleanups of crash data
- Maybe automatically?
- Maybe only when human says to proceed?
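
A sketch of the record-then-check idea using an append-only journal file with one JSON entry per line. The file format and function names are assumptions, and a production version would probably also force each write to disk.

```python
import json


def record(journal_path, job, event):
    """Append one event ("start" or "end") to the journal."""
    with open(journal_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"job": job, "event": event}) + "\n")


def interrupted_jobs(journal_path):
    """Return jobs that have a "start" entry but no later "end" entry."""
    open_jobs = []
    try:
        with open(journal_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # a crash may leave a partial last line
                if entry["event"] == "start":
                    open_jobs.append(entry["job"])
                elif entry["event"] == "end" and entry["job"] in open_jobs:
                    open_jobs.remove(entry["job"])
    except FileNotFoundError:
        pass  # no journal yet, so nothing was in progress
    return open_jobs
```

At startup, a non-empty result from interrupted_jobs() would indicate that a crash occurred, and the per-job recovery controls would then decide what to do with each job.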

3.6.23 Reports

Kinds of reports
- Individual job run
  - Datetime of start, end
  - Status
  - Parameters used
  - Directory in which it started
  - Environment when it started
  - Account under which it ran
- Job reports
  - By job
  - By date range
  - By category
- Job control tables
  - Lists of jobs in each table
  - Lists of jobs/chains in each chain
  - Links to detailed info about each one
- User reports
  - Distribution lists
  - Number of emails sent?

Information available for reporting
- List of job runs
- Start and end times, duration
- Success
- Statistics
  - Example
    - Publishing ran 26 times in May
    - Average run time = 02:16:21 (H:M:S)
    - Max run time = 15:52:11 on 05/15/2016
    - Min run time = 00:00:41 on 05/13/2016
    - Successes = 24
    - Failures = 1
    - Jobs killed = 1
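
Statistics like those in the example could be computed from the list of job runs. A sketch, assuming each run is stored as a (duration, status) pair and that the status strings are "succeeded", "failed", and "killed":

```python
from datetime import timedelta


def hms(delta):
    """Format a timedelta as H:M:S, e.g. 02:16:21."""
    total = int(delta.total_seconds())
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def summarize(runs):
    """Summarize a non-empty list of (duration, status) pairs."""
    if not runs:
        raise ValueError("no runs to summarize")
    durations = [duration for duration, _ in runs]
    statuses = [status for _, status in runs]
    return {
        "runs": len(runs),
        "average": hms(sum(durations, timedelta()) / len(runs)),
        "max": hms(max(durations)),
        "min": hms(min(durations)),
        "successes": statuses.count("succeeded"),
        "failures": statuses.count("failed"),
        "killed": statuses.count("killed"),
    }
```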