Issue Number | 4552 |
---|---|
Summary | [Internal] Clean-up after Publish Job Failure |
Created | 2018-11-26 12:48:28 |
Issue Type | Improvement |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2019-06-10 17:12:44 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.236613 |
When a publishing job is cancelled because it exceeded the maximum
publishing time, we should update the job's status in the pub_proc
table to 'Failure' instead of leaving it as 'In process'.
This change would prevent us from having to manually set the publishing
job status.
The publishing program SubmitPubJob.py has been updated to set the status to 'Failure' automatically if a job doesn't finish within the time allowed.
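The core of the change is a guarded UPDATE: the status is only changed if the job is still 'In process', so a job that has already finished is left alone. A minimal sketch of that idea follows. It uses an in-memory sqlite3 database as a stand-in for the CDR database, and the names mark_job_failed and demo are hypothetical, not part of SubmitPubJob.py.

```python
import sqlite3


def mark_job_failed(conn, job_id):
    """Set a job to 'Failure' only if it is still 'In process'.

    Returns True if a row was changed, False otherwise.
    (Hypothetical helper; sqlite3 stands in for the CDR database.)
    """
    cur = conn.execute(
        "UPDATE pub_proc SET status = 'Failure' "
        "WHERE id = ? AND status = 'In process'",
        (int(job_id),),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    """Build a two-row pub_proc table and try to fail both jobs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pub_proc (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO pub_proc (id, status) VALUES (?, ?)",
        [(1, "In process"), (2, "Success")],
    )
    conn.commit()

    # Job 1 is stuck and gets updated; job 2 already finished and is skipped.
    changed = [mark_job_failed(conn, 1), mark_job_failed(conn, 2)]
    statuses = [
        row[0] for row in conn.execute("SELECT status FROM pub_proc ORDER BY id")
    ]
    conn.close()
    return changed, statuses
```

Checking the row count makes it visible whether the update actually applied, which the committed function does not report.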
diff --git a/Publishing/SubmitPubJob.py b/Publishing/SubmitPubJob.py
index dc8a08f..69e9628 100755
--- a/Publishing/SubmitPubJob.py
+++ b/Publishing/SubmitPubJob.py
@@ -149,6 +149,72 @@ def checkJobStatus(jobId):
return row
+# ---------------------------------------------------------------
+# Function to set the job status to failure
+# On occasion, the publishing job fails to finish (network
+# connectivity issues?) and will get cancelled once the max time
+# allowed is reached.
+# This function sets the job status to 'Failure' so that the job
+# status isn't preventing new jobs from being processed because
+# only one single job (per job type) is allowed to run at a time.
+#
+# If testing this function for a job that *not* actually failed be
+# prepared that the status gets set to 'Failure' but will be
+# updated again - possibly to 'Success' - at the end of the
+# "not-really-failed" publishing job.
+# ---------------------------------------------------------------
+def statusPubJobFailure(jobId):
+    # Defensive programming.
+    tries = MAX_RETRIES
+
+    while tries:
+        try:
+            conn = cdrdb.connect("cdr")
+            cursor = conn.cursor()
+            cursor.execute("""\
+                SELECT id, status, started, completed, messages
+                  FROM pub_proc
+                 WHERE id = %d""" % int(jobId), timeout = 300)
+            row = cursor.fetchone()
+            l.write("Job%d status: %s" % (row[0], row[1]), stdout=True)
+
+            # We can stop trying now, we got it.
+            tries = 0
+
+        except cdrdb.Error, info:
+            l.write("*** Failure connecting to DB ***", stdout=True)
+            l.write("*** Unable to set job status to 'Failure'.", stdout=True)
+            l.write("*** PubJob%d: %s" % ( int(jobId), info[1][0]), stdout=True)
+            waitSecs = (MAX_RETRIES + 1 - tries) * RETRY_MULTIPLIER
+            l.write(" RETRY: %d retries left; waiting %f seconds" % (tries,
+                                                                    waitSecs))
+            time.sleep(waitSecs)
+            tries -= 1
+
+    # Setting the job status to 'Failure' rather than leaving it as
+    # 'In process'. That way a new job won't fail until the job
+    # status has been manually updated.
+    # -------------------------------------------------------------
+    try:
+        cursor.execute("""\
+            UPDATE pub_proc
+               SET status = 'Failure'
+             WHERE id = %d
+               AND status = 'In process'""" % int(jobId), timeout=300)
+
+        conn.commit()
+    except cdrdb.Error, info:
+        l.write("*** Failure updating job status ***", stdout=True)
+        l.write("*** Manually set the job status to 'Failure'.", stdout=True)
+        l.write("*** PubJob%d: %s" % ( int(jobId), info[1][0]), stdout=True)
+
+    if not row:
+        raise Exception("*** (3) Tried to connect %d times. No Pub Job-ID." %
+                        MAX_RETRIES)
+
+    return row
+
+
# ---------------------------------------------------------------
# Function to find the job ID of the push job
# ---------------------------------------------------------------
@@ -370,6 +436,7 @@ The publishing job failed. It did not finish within the maximum time
allowed"""
sendFailureMessage(subject, msgBody)
+ statusPubJobFailure(submit[0])
sys.exit(1)
# Once the publishing job completed with status Success
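The retry loop in statusPubJobFailure waits (MAX_RETRIES + 1 - tries) * RETRY_MULTIPLIER seconds after each failed connection attempt, so the wait grows linearly as tries counts down. The sketch below isolates that schedule in Python 3. The constant values and the names wait_schedule and with_retries are assumptions for illustration; the real values of MAX_RETRIES and RETRY_MULTIPLIER are defined elsewhere in SubmitPubJob.py and are not shown in this ticket. Unlike the committed function, which appears to fall through to the UPDATE with cursor never assigned if every attempt fails, the sketch raises explicitly once the retries are exhausted.

```python
import time

MAX_RETRIES = 3          # hypothetical value, not taken from SubmitPubJob.py
RETRY_MULTIPLIER = 10.0  # hypothetical value, in seconds


def wait_schedule(max_retries, multiplier):
    """Waits produced by (max_retries + 1 - tries) * multiplier
    while tries counts down from max_retries to 1."""
    return [(max_retries + 1 - tries) * multiplier
            for tries in range(max_retries, 0, -1)]


def with_retries(action, max_retries=MAX_RETRIES, multiplier=RETRY_MULTIPLIER,
                 sleep=time.sleep, errors=(Exception,)):
    """Call action() until it succeeds or the retries run out.

    Sleeps after each failed attempt, using the linear schedule above.
    In real code, errors should be narrowed to the database error class.
    """
    last_error = None
    for wait in wait_schedule(max_retries, multiplier):
        try:
            return action()
        except errors as err:
            last_error = err
            sleep(wait)
    # Give up loudly rather than continuing with an unusable connection.
    raise RuntimeError("gave up after %d attempts" % max_retries) from last_error
```

Passing sleep as a parameter keeps the schedule testable without actually waiting.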
The file SubmitPubJob.py located in the directory Publishing has been updated on DEV.
The branch has been created off the new Joule branch: [cdr4552-pub-failure 4fd6cc6]
The changes have been merged into our release branch Joule:
I prefer not to test this change in production because that would require forcing a regular publishing job to fail. The new function is more likely to be triggered on the lower tiers anyway, where publishing failures occur more often.
Closing ticket.