CDR Tickets

Issue Number 3675
Summary Failed Hot-fix Blocks Publishing
Created 2013-11-07 11:39:39
Issue Type Bug
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2014-01-08 14:23:49
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.114724
Description

A job submitted to Gatekeeper (GK) receives a status of 'Verifying' in the CDR. The status is typically updated to 'Success' once all documents (or at least one) have finished the data load on GK. If the batch of documents doesn't complete within the time allotted for checking with GK, the job status is marked as 'Stalled'.
We submitted a hot-fix publishing job with a single summary that failed on GK, which set the job status to 'Failure'. The next job from the CDR then compares its last successful job-ID with the last job-ID that Gatekeeper has recorded. When these two job-IDs differ, publishing stops.

The problem is that, after a failed job on Gatekeeper, the last job-ID on GK doesn't match the last successful job-ID in the CDR.

We decided to implement a flag that allows a push job to continue pushing documents to GK despite mismatched job-IDs, once it has been confirmed that it's safe to do so.

Comment entered 2013-12-28 12:31:31 by Englisch, Volker (NIH/NCI) [C]

We ran into another (uncommon) situation that needs to be addressed for future updates.
If a publishing job creates no documents to be pushed to Gatekeeper (as happened as part of the publishing job over Christmas), the push job will finish successfully without pushing a single document. The following publishing job will then fail because Gatekeeper reports Job11324 as the previous successful push job while the last successful push job on the CDR side is reported as Job11326.
In addition, the push job 11326 is not listed as a successful job on the Publishing Activities list.

In light of these problems I will increase the priority of this bug as it's causing problems with publishing.

Comment entered 2014-01-08 14:23:49 by Englisch, Volker (NIH/NCI) [C]

I made changes to the code to fix the two problems we recently ran into:

  1. A publishing job that failed due to a failure on the Gatekeeper would block all following publishing jobs and

  2. A publishing job that didn't send incremental updates to GK would block all following publishing jobs.

Fixing the code took about 45 minutes; testing the result took many hours. :-(

For the first issue I created a new job parameter 'IgnoreGKJobIDMismatch', which is set to 'No' by default. When a job fails to push documents to GK because of a job-ID mismatch, we're now able to push the documents manually by setting this parameter to 'Yes' (after double-checking that the mismatch of the job IDs is caused by a failed GK processing job).
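
As a rough illustration of how such an override could gate the push, here is a minimal sketch. It is not the code in cdrpub.py: the function name `may_push` and its arguments are hypothetical, and only the parameter name 'IgnoreGKJobIDMismatch' and its 'Yes'/'No' values come from this ticket.

```python
def may_push(last_cdr_job_id, last_gk_job_id, ignore_gk_job_id_mismatch="No"):
    """Decide whether a push job may send documents to Gatekeeper.

    Hypothetical helper: last_cdr_job_id is the last successful push job
    recorded in the CDR, last_gk_job_id is the last job Gatekeeper reports.
    """
    if last_cdr_job_id == last_gk_job_id:
        # Both sides agree, so the push can proceed normally.
        return True
    # The IDs differ: proceed only if the operator explicitly set the
    # IgnoreGKJobIDMismatch job parameter to 'Yes'.
    return ignore_gk_job_id_mismatch == "Yes"
```

With the default of 'No' a mismatch still blocks the push, so the override has to be a deliberate manual step for each job.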

For the second issue the function getLastCgJobId() no longer includes successful jobs for which no documents have been sent to Gatekeeper.
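
The idea behind that filter can be sketched as follows. This is an assumption-laden illustration, not the actual getLastCgJobId() code: the function name, the tuple layout, and the document counts are made up, and only the job numbers 11324 and 11326 are taken from the comment above.

```python
def last_job_id_with_documents(jobs):
    """Return the highest ID of a successful push job that sent documents.

    jobs is a hypothetical iterable of (job_id, status, doc_count) tuples.
    Returns None when no job qualifies.
    """
    candidates = [
        job_id
        for job_id, status, doc_count in jobs
        # Skip failed jobs and successful jobs that pushed nothing,
        # because Gatekeeper never recorded those.
        if status == "Success" and doc_count > 0
    ]
    return max(candidates) if candidates else None


# Illustrative data: Job11326 finished successfully but pushed no documents.
jobs = [(11324, "Success", 120), (11326, "Success", 0)]
last_id = last_job_id_with_documents(jobs)
```

In this example `last_id` is 11324, which matches the job Gatekeeper reported in the Christmas case.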

The following documents have been updated:

  • cdrpub.py

  • CDR178.xml

Comment entered 2014-01-08 14:27:34 by Englisch, Volker (NIH/NCI) [C]

This has been successfully tested on DEV. I'm not sure if we need to test this again on QA other than making sure I didn't break publishing.

Comment entered 2014-01-08 14:37:00 by Englisch, Volker (NIH/NCI) [C]

The following files have been updated on DEV:

  • R12255: CDR178.xml

  • R12254: cdrpub.py

Comment entered 2014-01-29 13:09:23 by Englisch, Volker (NIH/NCI) [C]

The publishing document has been updated on QA and the nightly publishing job ran OK last night.

Verified on QA.

Comment entered 2014-02-10 15:34:08 by Englisch, Volker (NIH/NCI) [C]

I don't want to "test" this change on PROD as it would require publishing documents that need to fail processing on Gatekeeper.
I would much rather reopen this ticket if we find out that the modifications didn't solve the problem. For now I'm closing this ticket.
