Issue Number | 3675 |
---|---|
Summary | Failed Hot-fix Blocks Publishing |
Created | 2013-11-07 11:39:39 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2014-01-08 14:23:49 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.114724 |
A job submitted to Gatekeeper receives a status of 'Verifying' in the
CDR. The status is typically updated to 'Success' once all
documents (or at least one) have finished the data load on GK. If the batch
of documents doesn't complete during the time allotted for checking with
GK, the job status will be marked as 'Stalled'.
We submitted a hot-fix publishing job with a single summary that failed
on GK, which set the job status to 'Failure'. The next job
from the CDR then compares the ID of the last successful job in the CDR
with the last job-ID that Gatekeeper has recorded. When these two
job-IDs differ, publishing stops.
The problem is that after a failed job on Gatekeeper, the last job-ID on GK doesn't match the last successful job-ID in the CDR.
We decided to implement a flag that allows a push job to continue pushing documents to GK even when the job-IDs don't match, provided it has been established that it's safe to do so.
We ran into another (uncommon) situation that needs to be addressed
for future updates.
If a publishing job creates no documents to be pushed to Gatekeeper (as
happened as part of the publishing job over Christmas), the push job will
finish successfully without pushing a single document. The following
publishing job will then fail because Gatekeeper reports the previous
successful push job (Job11324) while the last successful push job on the
CDR side is reported as Job11326.
In addition, the push job 11326 is not listed as a successful job on the
Publishing Activities list.
In light of these problems I will increase the priority of this bug as it's causing problems with publishing.
I made changes to the code to fix the two problems we recently ran into:
A publishing job that failed due to a failure on the Gatekeeper would block all following publishing jobs and
A publishing job that didn't send incremental updates to GK would block all following publishing jobs.
Fixing the code took about 45 minutes, testing the result lasted many hours. :-(
For the first issue I created a new job parameter 'IgnoreGKJobIDMismatch' which is set to 'No' by default. Once a job has failed to push documents to GK, we're now able to push the documents manually (after double-checking that the job-ID mismatch is caused by a failed GK processing job) by setting this parameter to 'Yes'.
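The effect of the new parameter can be sketched roughly as follows. This is only an illustration of the behavior described above, not the code in cdrpub.py: the function name check_gk_job_id, the JobIdMismatch exception, the return strings and the dictionary of job parameters are all assumptions made for the example. Only the parameter name 'IgnoreGKJobIDMismatch' and its 'No' default come from this ticket.

```python
class JobIdMismatch(Exception):
    """Raised when the CDR and Gatekeeper disagree on the last job ID."""


def check_gk_job_id(last_cdr_job_id, last_gk_job_id, parameters):
    """Decide whether a push job may proceed (hypothetical helper).

    parameters is assumed to be a dict of job parameter names to
    string values; the real job may store its parameters differently.
    """
    # The parameter defaults to 'No', so a mismatch blocks the push
    # unless an operator has explicitly set it to 'Yes'.
    ignore = parameters.get("IgnoreGKJobIDMismatch", "No") == "Yes"

    if last_cdr_job_id == last_gk_job_id:
        return "match"
    if ignore:
        # The operator has confirmed the mismatch comes from a job
        # that failed processing on GK, so the push may continue.
        return "mismatch ignored"
    raise JobIdMismatch(
        "CDR last job %s != GK last job %s" % (last_cdr_job_id, last_gk_job_id)
    )
```

With the default setting a mismatch still stops the push, so the override has to be a deliberate manual step for a single job.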
For the second issue, the function getLastCgJobId() no longer includes successful jobs for which no documents have been sent to Gatekeeper.
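A minimal sketch of that filtering idea, using the job numbers from the Christmas incident. The real getLastCgJobId() presumably queries the CDR database; here the jobs are plain tuples, and the function name last_pushed_job_id, the tuple layout and the document count for Job11324 are assumptions made for illustration.

```python
def last_pushed_job_id(jobs):
    """Return the newest successful job that actually pushed documents.

    jobs is assumed to be an iterable of (job_id, status, docs_pushed)
    tuples; returns None if no job qualifies.
    """
    candidates = [
        job_id
        for job_id, status, docs_pushed in jobs
        # A job that succeeded without sending anything never reached
        # Gatekeeper, so GK cannot know its ID.
        if status == "Success" and docs_pushed > 0
    ]
    return max(candidates) if candidates else None


# Hypothetical sample data: the count of 120 documents is invented.
jobs = [
    (11324, "Success", 120),  # last job Gatekeeper knows about
    (11326, "Success", 0),    # finished OK but pushed no documents
]
last_job = last_pushed_job_id(jobs)
```

Under these assumptions last_job is 11324, which matches what Gatekeeper reports, so the next publishing job would no longer be blocked.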
The following documents have been updated:
cdrpub.py
CDR178.xml
This has been successfully tested on DEV. I'm not sure if we need to test this again on QA other than making sure I didn't break publishing.
The following files have been updated on DEV:
R12255: CDR178.xml
R12254: cdrpub.py
The publishing document has been updated on QA and the nightly publishing job ran OK last night.
Verified on QA.
I don't want to "test" this change on PROD as it would require
publishing documents that need to fail processing on Gatekeeper.
I would much rather reopen this ticket if we find out that the
modifications didn't solve the problem. For now I'm closing this
ticket.