PDQ Issues

Issue Number	4552
Summary	[Internal] Clean-up after Publish Job Failure
Created	2018-11-26 12:48:28
Issue Type	Improvement
Submitted By	Englisch, Volker (NIH/NCI) [C]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2019-06-10 17:12:44
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.236613

Description

When a publishing job is cancelled because it exceeded the maximum publishing time, we should update the job's status in the pub_proc table from 'In process' to 'Failure'.
This change would spare us from having to set the publishing job status manually.
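
The requested status change can be sketched as a guarded UPDATE. This is a minimal illustration, not the CDR code: it uses an in-memory SQLite database as a stand-in for the CDR database, a two-column pub_proc table that is far simpler than the real one, and hypothetical helper names (mark_job_failed, make_demo_db, job_status). The point it shows is the "AND status = 'In process'" guard, which keeps a job that already finished from being overwritten.

```python
import sqlite3


def mark_job_failed(conn, job_id):
    """Flip a stuck job from 'In process' to 'Failure'.

    Returns the number of rows changed (0 if the job was not 'In process').
    """
    cursor = conn.execute(
        "UPDATE pub_proc SET status = 'Failure' "
        "WHERE id = ? AND status = 'In process'",
        (int(job_id),),
    )
    conn.commit()
    return cursor.rowcount


def make_demo_db():
    """Build a throwaway stand-in for pub_proc (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pub_proc (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO pub_proc VALUES (?, ?)",
        [(1, "In process"), (2, "Success")],
    )
    conn.commit()
    return conn


def job_status(conn, job_id):
    """Return the current status string for one job."""
    row = conn.execute(
        "SELECT status FROM pub_proc WHERE id = ?", (int(job_id),)
    ).fetchone()
    return row[0]
```

With the demo data, calling mark_job_failed(conn, 1) changes the stuck job, while calling it for job 2 leaves the 'Success' status untouched and reports zero rows changed.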

Comment entered 2019-04-10 15:59:41 by Englisch, Volker (NIH/NCI) [C]

The publishing program SubmitPubJob.py has been updated to set the status to 'Failure' automatically if a job doesn't finish within the time allowed.

diff --git a/Publishing/SubmitPubJob.py b/Publishing/SubmitPubJob.py
index dc8a08f..69e9628 100755
--- a/Publishing/SubmitPubJob.py
+++ b/Publishing/SubmitPubJob.py
@@ -149,6 +149,72 @@ def checkJobStatus(jobId):
     return row
 
 
+# ---------------------------------------------------------------
+# Function to set the job status to failure
+# On occasion, the publishing job fails to finish (network
+# connectivity issues?) and will get cancelled once the max time
+# allowed is reached.
+# This function sets the job status to 'Failure' so that the job
+# status isn't preventing new jobs from being processed because
+# only one single job (per job type) is allowed to run at a time.
+#
+# If testing this function for a job that has *not* actually failed,
+# be prepared for the status to be set to 'Failure' and then
+# updated again - possibly to 'Success' - at the end of the
+# "not-really-failed" publishing job.
+# ---------------------------------------------------------------
+def statusPubJobFailure(jobId):
+    # Defensive programming.
+    tries = MAX_RETRIES
+
+    while tries:
+        try:
+            conn = cdrdb.connect("cdr")
+            cursor = conn.cursor()
+            cursor.execute("""\
+                SELECT id, status, started, completed, messages
+                  FROM pub_proc
+                 WHERE id = %d""" % int(jobId), timeout = 300)
+            row = cursor.fetchone()
+            l.write("Job%d status: %s" % (row[0], row[1]), stdout=True)
+
+            # We can stop trying now, we got it.
+            tries = 0
+
+        except cdrdb.Error, info:
+            l.write("*** Failure connecting to DB ***", stdout=True)
+            l.write("*** Unable to set job status to 'Failure'.", stdout=True)
+            l.write("*** PubJob%d: %s" % ( int(jobId), info[1][0]), stdout=True)
+            waitSecs = (MAX_RETRIES + 1 - tries) * RETRY_MULTIPLIER
+            l.write("    RETRY: %d retries left; waiting %f seconds" % (tries,
+                                                               waitSecs))
+            time.sleep(waitSecs)
+            tries -= 1
+
+    # Setting the job status to 'Failure' rather than leaving it as
+    # 'In process'.  Otherwise new jobs would fail until the job
+    # status has been manually updated.
+    # -------------------------------------------------------------
+    try:
+        cursor.execute("""\
+            UPDATE pub_proc
+               SET status = 'Failure'
+            WHERE id = %d
+               AND status = 'In process'""" % int(jobId), timeout=300)
+
+        conn.commit()
+    except cdrdb.Error, info:
+        l.write("*** Failure updating job status ***", stdout=True)
+        l.write("*** Manually set the job status to 'Failure'.", stdout=True)
+        l.write("*** PubJob%d: %s" % ( int(jobId), info[1][0]), stdout=True)
+
+    if not row:
+        raise Exception("*** (3) Tried to connect %d times. No Pub Job-ID." %
+                        MAX_RETRIES)
+
+    return row
+
+
 # --------------------------------------------------------------
 # Function to find the job ID of the push job.
 # --------------------------------------------------------------
@@ -370,6 +436,7 @@ The publishing job failed.  It did not finish within the maximum time
 allowed.
 """
                 sendFailureMessage(subject, msgBody)
+                statusPubJobFailure(submit[0])
                 sys.exit(1)
 
         # Once the publishing job completed with status Success
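
The retry loop in the patch waits longer after each failed attempt: (MAX_RETRIES + 1 - tries) * RETRY_MULTIPLIER seconds, with tries counting down. The sketch below isolates that schedule in modern Python. It is an illustration under assumptions, not the committed code: the function names are hypothetical, the values of MAX_RETRIES and RETRY_MULTIPLIER are not shown in this ticket, and a broad Exception stands in for cdrdb.Error. It also returns None when every attempt fails, so a caller can skip the UPDATE; in the patch as written, cursor and row appear to stay unassigned in that case, and the later cursor.execute would raise a NameError that the cdrdb.Error handler does not catch.

```python
import time


def retry_waits(max_retries, multiplier):
    """List the wait (in seconds) applied after each failed attempt."""
    return [
        (max_retries + 1 - tries) * multiplier
        for tries in range(max_retries, 0, -1)
    ]


def call_with_retries(operation, max_retries, multiplier, sleep=time.sleep):
    """Run operation(), retrying with a linearly growing wait.

    Returns the operation's result, or None once all attempts have failed.
    The sleep argument is injectable so the schedule can be checked
    without actually waiting.
    """
    tries = max_retries
    while tries:
        try:
            return operation()
        except Exception as error:  # stand-in for cdrdb.Error
            wait_secs = (max_retries + 1 - tries) * multiplier
            print("RETRY: %d retries left; waiting %.1f seconds (%s)"
                  % (tries - 1, wait_secs, error))
            sleep(wait_secs)
            tries -= 1
    return None
```

For example, retry_waits(3, 2) gives waits of 2, 4 and 6 seconds. A caller would check for None before issuing the status update.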

Comment entered 2019-06-10 17:12:32 by Englisch, Volker (NIH/NCI) [C]

The file SubmitPubJob.py located in the directory Publishing has been updated on DEV.

The branch has been created off the new Joule branch: [cdr4552-pub-failure 4fd6cc6]

Comment entered 2019-06-21 10:40:16 by Englisch, Volker (NIH/NCI) [C]

The changes have been merged into our release branch Joule:

https://github.com/NCIOCPL/cdr-publishing/commit/4fd6cc6

Comment entered 2019-08-29 11:33:51 by Englisch, Volker (NIH/NCI) [C]

I prefer not to test this change in production because doing so would require forcing a regular publishing job to fail. The new function is more likely to be triggered on the lower tiers anyway, where publishing failures are more likely to occur.

Closing ticket.

