Enhanced web based management tools for the CDR

JIRA OCECDR-3950

1 Introduction

These notes are a list of ideas gathered from web documentation for job scheduling systems. The features listed are "desirable" features, not requirements. We don't yet know which of them are actually required. It's likely that very few are absolutely required, but together they give us a checklist to use when comparing off-the-shelf scheduling systems.

2 Requirements

2.1 Add software to the CDR to simplify batch job management

2.2 Reduce workload on CBIIT

2.2.1 Currently we

  • Figure out exactly what to do
  • Create a script or set of directions
  • Create an issue for CBIIT
  • Wait for CBIIT to implement it
  • Check results, often requiring CBIIT's help

2.2.2 We would like, as much as possible, to be able to

do this ourselves, without the duplicated effort and delays inherent in the existing process.

2.3 Work within existing and future security constraints

3 List of Possible/Desirable features

3.1 Security

3.1.1 Require two-factor authentication to access the system?

3.1.2 Log each user who enters or modifies a job

3.1.3 Specify users/groups with permission on jobs

  • On per job basis
  • On per account basis
    e.g., only users in group X can start a job that runs under account Y
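
One way to picture the two permission levels is the sketch below. It assumes a user must pass both the per-job check and the per-account check; the `Job` class, the `ACCOUNT_GROUPS` table, and the group and account names are hypothetical, not part of any existing CDR code.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    account: str                                       # account the job runs under
    allowed_groups: set = field(default_factory=set)   # per-job permission


# Per-account permission (hypothetical data): groups allowed to start
# jobs that run under each account.
ACCOUNT_GROUPS = {"cdr_batch": {"developers", "operators"}}


def may_start(user_groups, job, account_groups=ACCOUNT_GROUPS):
    """Return True if the user's groups pass both the job and account checks."""
    groups = set(user_groups)
    job_ok = bool(groups & job.allowed_groups)
    account_ok = bool(groups & account_groups.get(job.account, set()))
    return job_ok and account_ok
```

Here a user in "developers" could start a job restricted to that group under "cdr_batch", while a user who is only in "operators" could not, even though the account check passes.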

3.1.4 Does not require Administrator privileges

3.2 Licensing

3.2.1 Open-source is best

3.3 Good documentation

3.4 Portability

3.4.1 Linux or Windows

3.4.2 MySQL or SQL Server

3.4.3 Usable on multiple projects, not just the CDR

3.5 Interfaces

3.5.1 Web interface via https

3.5.2 Command line interface

  • May not be used often, but is useful because it allows us to script changes

3.6 Capability

3.6.1 Start a job

3.6.2 Launch any kind of job

  • Executable binary
  • Python script
  • C# program
  • SQL script
    • Like sqlc or sqlq in drush
  • Bash shell script
  • Windows command file
  • PHP script?
  • Java program

3.6.3 User can restart a job that failed

  • Probably requires that the job be written to support restarting

3.6.4 Cancel a job that has not run yet

3.6.5 Suspend a running job

  • Sends a signal asking the job to suspend itself
    • We build checks for these signals into our jobs
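
A minimal sketch of such a built-in check follows. It assumes the job calls a checkpoint function between units of work; how the scheduler actually delivers the suspend request (OS signal, file, database row) is left open, and the `JobControl` class is hypothetical.

```python
import threading


class JobControl:
    """Cooperative suspend/resume flag that a job polls between work units."""

    def __init__(self):
        self._may_run = threading.Event()
        self._may_run.set()            # jobs start in the running state

    def suspend(self):
        self._may_run.clear()

    def resume(self):
        self._may_run.set()

    def suspended(self):
        return not self._may_run.is_set()

    def checkpoint(self, timeout=None):
        """Block while suspended; return True once the job may continue.

        With a timeout, return False if the job is still suspended
        when the timeout expires.
        """
        return self._may_run.wait(timeout)
```

A job would call `control.checkpoint()` at points where it is safe to pause, for example after finishing each document.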

3.6.6 Kill a running job

3.6.7 Schedule a job for later execution

3.6.8 Schedule for regular execution

3.6.9 Chain jobs

  • Link multiple jobs in a chain
  • Each job returns status information for success or failure
  • When one completes successfully, next one starts
  • Chains can include jobs or other chains
    • Detect and prevent looping?
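
The chaining rules above, including loop detection for chains nested inside chains, could be sketched roughly as follows. The dictionary layout and the `run_job` callback are assumptions made for illustration.

```python
def expand_chain(name, chains, _path=()):
    """Flatten a chain into an ordered list of job names.

    `chains` maps a chain name to its members; any member that is not
    itself a key of `chains` is treated as a plain job.
    Raises ValueError if a chain contains itself, directly or indirectly.
    """
    if name in _path:
        loop = " -> ".join(_path + (name,))
        raise ValueError(f"chain loop detected: {loop}")
    if name not in chains:
        return [name]
    jobs = []
    for member in chains[name]:
        jobs.extend(expand_chain(member, chains, _path + (name,)))
    return jobs


def run_chain(name, chains, run_job):
    """Run the chain's jobs in order, stopping at the first failure.

    `run_job(job_name)` must return True on success, False on failure.
    Returns (job_name, succeeded) pairs for the jobs that actually ran.
    """
    results = []
    for job in expand_chain(name, chains):
        ok = bool(run_job(job))
        results.append((job, ok))
        if not ok:
            break
    return results
```

Because the chain is expanded before anything runs, a loop is reported before the first job starts rather than partway through.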

3.6.10 Enter/override parameters for a job

3.6.11 Set OS job priority

3.6.12 Control over concurrency

  • Specify that a job must not be run concurrently
    • Maybe that's the default
  • Limit the number of concurrent batch jobs
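
Both controls can be expressed with a small amount of bookkeeping. The sketch below is single-threaded and in-memory only; a real scheduler would need locking and persistent state. The `ConcurrencyGate` class is hypothetical.

```python
class ConcurrencyGate:
    """Tracks running jobs and decides whether another one may start."""

    def __init__(self, max_running):
        self.max_running = max_running   # limit on concurrent batch jobs
        self.running = {}                # job name -> running instance count

    def try_start(self, job, allow_concurrent=False):
        """Record the start and return True, or return False if refused."""
        if sum(self.running.values()) >= self.max_running:
            return False                 # global limit reached
        if not allow_concurrent and self.running.get(job, 0) > 0:
            return False                 # this job is already running
        self.running[job] = self.running.get(job, 0) + 1
        return True

    def finish(self, job):
        """Record that one instance of `job` has ended."""
        if self.running.get(job, 0) > 0:
            self.running[job] -= 1
```

Defaulting `allow_concurrent` to False matches the idea that "no concurrent runs" may be the default.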

3.6.13 Run jobs that missed their scheduled times because the system was down?

  • Ideally, a job control parameter
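
As a sketch of what such a job control parameter might look like, the functions below compute which scheduled times were missed and apply a per-job policy. The policy names "skip", "once", and "all" are invented for illustration.

```python
from datetime import datetime, timedelta


def missed_runs(last_run, interval, now):
    """Return the scheduled times after `last_run` that are at or before `now`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    times = []
    t = last_run + interval
    while t <= now:
        times.append(t)
        t += interval
    return times


def catch_up(last_run, interval, now, policy="once"):
    """Decide which missed runs to execute after an outage.

    policy: "skip" runs none, "once" runs only the most recent missed
    time, "all" runs every missed time in order.
    """
    if policy not in ("skip", "once", "all"):
        raise ValueError(f"unknown policy: {policy!r}")
    missed = missed_runs(last_run, interval, now)
    if policy == "skip":
        return []
    if policy == "once":
        return missed[-1:]
    return missed
```

A nightly cleanup job would probably use "once" or "skip", while a job that processes one day's data per run might need "all".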

3.6.14 Log rotation

  • Or we can use FileSweeper

3.6.15 Email status information

  • Distribution lists
    • Custom distribution lists for different jobs
    • General list for all jobs?
    • Add/delete addrs from list via web interface
    • Special lists for failed jobs?
      • e.g., if job fails, notify …
      • If job succeeds notify smaller group, or no one
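
The distribution-list ideas above might reduce to a lookup like the following sketch. The table layout, the "*" entry for the general list, and the addresses are all hypothetical.

```python
def recipients(job, status, lists):
    """Pick the addresses to notify for one job run.

    `lists` maps a job name to {"failure": [...], "success": [...]}.
    The "*" entry is the general list, used when the job has no list
    of its own for this outcome. Any status other than "succeeded"
    is treated as a failure.
    """
    key = "success" if status == "succeeded" else "failure"
    for name in (job, "*"):
        entry = lists.get(name, {})
        if key in entry:
            return list(entry[key])
    return []
```

An empty "success" list for a job expresses "notify no one when it succeeds" without affecting the failure list.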

3.6.16 Built in sftp client? server?

  • Might simplify some efforts

3.6.17 Ability to communicate across multiple servers

  • Idea is:
    Start a job, e.g., in the CDR. When it's done, cause data to be sent to a Linux server. When copy is complete, start a job on the Linux server.

3.6.18 API?

  • Enables us to script job management

3.6.19 Import/export controls

  • Ideally
    • Generate a schedule control table on one tier
    • Export/import it to other tiers
    • Allow certain changes to be made globally
      • e.g., replace a server name, db name, etc. globally
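
A rough sketch of export, import, and global replacement, assuming the schedule control table can be represented as JSON-compatible data. The table layout, host names, and database names in the usage note are made up.

```python
import json


def retarget(value, replacements):
    """Return a copy of `value` with each replacement applied to every
    string value. Dictionary keys and non-string values are left unchanged."""
    if isinstance(value, str):
        for old, new in replacements.items():
            value = value.replace(old, new)
        return value
    if isinstance(value, list):
        return [retarget(v, replacements) for v in value]
    if isinstance(value, dict):
        return {k: retarget(v, replacements) for k, v in value.items()}
    return value


def export_table(table):
    """Serialize a schedule control table to text for transfer to another tier."""
    return json.dumps(table, indent=2, sort_keys=True)


def import_table(text, replacements=None):
    """Load an exported table, applying tier-specific replacements."""
    return retarget(json.loads(text), replacements or {})
```

For example, importing a DEV export with `{"dev-db.example.org": "qa-db.example.org", "cdr_dev": "cdr_qa"}` would rewrite every command line that mentions the DEV server or database.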

3.6.20 Job monitoring

  • See what's running
    • Job name, ID, link to control info
    • When started, time elapsed
    • Parameters passed
    • Resources used?
      • CPU
      • Disk reads / writes
      • Network reads / writes
    • Current status
      • Succeeded
      • Failed
      • Running
      • Suspended
      • Waiting on dependency
      • Idle
        • Running, but nothing happening
      • Killed
  • Set limits on job
    • If the job runs longer than X
      • Notify user
    • If the job runs longer than Y
      • Suspend the job, or
      • Kill the job
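
The run-time limits could be checked by the monitor with something as small as the sketch below, where `warn_after` plays the role of X and `stop_after` the role of Y. The function name and return values are hypothetical.

```python
from datetime import timedelta


def limit_action(elapsed, warn_after, stop_after, on_stop="suspend"):
    """Return the action to take for a job that has run for `elapsed`.

    Returns None (within limits), "notify" (past warn_after), or the
    configured `on_stop` action, "suspend" or "kill" (past stop_after).
    """
    if on_stop not in ("suspend", "kill"):
        raise ValueError(f"unknown stop action: {on_stop!r}")
    if elapsed > stop_after:
        return on_stop
    if elapsed > warn_after:
        return "notify"
    return None
```

Whether exceeding Y suspends or kills the job would be a per-job setting, passed here as `on_stop`.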

3.6.21 Dependency checking / job triggering

  • Dependencies
    • Specify minimum disk (or memory?) availability
      before starting a job
      • Scheduler won't schedule job if resources are insufficient
      • Logs and sends email when job can't start
    • Resource locking
      • e.g., have a way to specify that some resource must be available (not locked by another process)
      • Example: in Windows, find out whether some process has a file open
    • Specify other jobs that must be complete before starting this
    • Specify other jobs that can't be running when starting this
  • Triggering
    Similar to dependency checking, but controls what causes a job to start rather than what prevents it from starting
    • Completion of a job starts another
      • Chaining is one way to do this, but not the only way
    • Receipt of an email
    • Appearance of a file or database row
    • Success or failure code from a script
    • API call from a job
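
The dependency checks listed above (minimum disk space, jobs that must have completed, jobs that must not be running) could be combined roughly as follows. The job dictionary keys are invented for illustration; `shutil.disk_usage` is the standard-library call for free disk space.

```python
import shutil


def free_disk_bytes(path="."):
    """Return the free space, in bytes, on the volume containing `path`."""
    return shutil.disk_usage(path).free


def blocking_reasons(job, completed, running, free_bytes):
    """Return the reasons `job` cannot start now; an empty list means it can.

    `job` is a dict with optional keys "min_free_bytes", "requires"
    (jobs that must be complete) and "conflicts" (jobs that must not
    be running). `completed` and `running` are sets of job names.
    """
    reasons = []
    if free_bytes < job.get("min_free_bytes", 0):
        reasons.append("insufficient disk space")
    for name in job.get("requires", []):
        if name not in completed:
            reasons.append(f"waiting for {name} to complete")
    for name in job.get("conflicts", []):
        if name in running:
            reasons.append(f"{name} is still running")
    return reasons
```

Returning the reasons, rather than a bare yes or no, gives the scheduler something to log and email when a job cannot start.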

3.6.22 Crash management

  • System records what it is doing as it does it
  • Checks records when starting
    • Detects that jobs X, Y, and Z were in progress
    • Understands that a crash occurred
    • Logs info, emails info
    • Allows per job controls over what happens in crash recovery?
      • Restart job automatically?
      • Skip job?
      • Stop all processing until human input?
        • Alert humans, pause for instructions
      • Or just stop one job until human input
  • Performs cleanups of crash data
    • Maybe automatically?
    • Maybe only when human says to proceed?
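
The record-then-check idea can be sketched with an append-only journal file. The journal format (one JSON object per line) and the function names are assumptions; a real system might keep these records in a database table instead.

```python
import json


def record(journal_path, job, event):
    """Append one event ("start" or "end") for `job` to the journal file."""
    with open(journal_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"job": job, "event": event}) + "\n")


def interrupted_jobs(journal_path):
    """Return jobs with a "start" but no matching "end", in start order.

    Called at scheduler startup: a non-empty result suggests that the
    previous run ended in a crash while those jobs were in progress.
    """
    open_jobs = []
    try:
        with open(journal_path, encoding="utf-8") as f:
            for line in f:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue        # e.g. a line truncated by the crash itself
                if entry["event"] == "start":
                    open_jobs.append(entry["job"])
                elif entry["event"] == "end" and entry["job"] in open_jobs:
                    open_jobs.remove(entry["job"])
    except FileNotFoundError:
        pass                        # no journal yet: nothing was in progress
    return open_jobs
```

What to do with each interrupted job (restart, skip, or wait for a human) would then be looked up in that job's own recovery setting.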

3.6.23 Reports

  • Kinds of reports
    • Individual job run
      • Datetime of start, end
      • Status
      • Parameters used
      • Directory in which it started
      • Environment when it started
      • Account under which it ran
    • Job reports
      • By job
      • By date range
      • By category
    • Job control tables
      • Lists of jobs in each table
      • Lists of jobs/chains in each chain
      • Links to detailed info about each one
    • User reports
      • Distribution lists
      • Number of emails sent?
  • Information available for reporting
    • List of job runs
    • Start and end times, duration
    • Success
    • Statistics
      • Example
        • Publishing ran 26 times in May
        • Average run time = 02:16:21 (H:M:S)
        • Max run time = 15:52:11 on 05/15/2016
        • Min run time = 00:00:41 on 05/13/2016
        • Successes = 24
        • Failures = 1
        • Jobs killed = 1
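
Statistics like those in the example could be derived from the list of job runs along these lines. This is only a sketch: the `(duration_seconds, status)` record layout and the status names are assumptions.

```python
def hms(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def summarize(runs):
    """Summarize a non-empty list of (duration_seconds, status) pairs."""
    if not runs:
        raise ValueError("no runs to summarize")
    durations = [d for d, _ in runs]
    counts = {}
    for _, status in runs:
        counts[status] = counts.get(status, 0) + 1
    return {
        "runs": len(runs),
        "average": hms(sum(durations) / len(durations)),
        "max": hms(max(durations)),
        "min": hms(min(durations)),
        "counts": counts,       # e.g. {"succeeded": 24, "failed": 1, "killed": 1}
    }
```

The same run records would also feed the per-job and date-range reports, so storing the raw durations and statuses is more useful than storing precomputed totals.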

Date: 2016-02-22T21:21-0500

Author: Alan Meyer
