Issue Number | 3951 |
---|---|
Summary | Change publishing error threshold for summaries and protocol documents |
Created | 2015-08-13 09:01:17 |
Issue Type | Task |
Submitted By | henryec |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2016-04-11 13:50:26 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.167432 |
Currently the publishing system allows a job to continue even after errors have occurred, up to a configured limit: 500 errors for the weekly publishing job and 250 errors for the nightly job.
Lakshmi recommends that the system be updated to allow 0 errors in Summary and Protocol documents during publishing. If there are one or more errors, we need to republish to make the correction.
We need to work through the approach together before development, as there is a communications aspect to the issue (for example, notifying distribution partners).
Lakshmi recommends that the system be updated to allow 0 errors in Summary and Protocol documents during publishing. If there are one or more errors, we need to republish to make the correction.
With this approach I recommend finishing publishing the document type before we fail the job, in order to get the total count of errors. If there are two unrelated types of errors, we want to find both of them as part of a single run.
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT, since a failed publishing job will almost certainly have to be restarted after hours.
I suggest that we apply Lakshmi's suggested change only to publishing job types which do not allow the operator to specify individual documents (that is, don't abort hotfix jobs when one summary or clinical trial document fails).
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT ...
Our publishing system already allows that. If you mean something else, we'll need a separate ticket for it.
In any case, since this is a higher-profile issue (request comes from Lakshmi) and there are a number of recommendations in the comments, we should probably discuss the options with Erika to nail down the requirements details.
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT ...
Yes, I'm thinking of work performed by the scheduled Jobmaster Nightly, Jobmaster Weekly, and Jobmaster 911 jobs. Starting a CDR publishing job, i.e., Interim-Export, and sending it to Gatekeeper is only part of the battle. Packing up the data and sending it to the FTP server is another portion of the publishing process that currently cannot be restarted without the involvement of CBIIT.
New ticket created: https://tracker.nci.nih.gov/browse/OCECDR-4002.
Estimate is 5
Here's what I propose.
Add two new SubsetParameter blocks to the nightly and weekly export jobs:
{code:xml}
<SubsetParameter>
 <ParmName>MaxSummaryErrors</ParmName>
 <ParmValue>0</ParmValue>
</SubsetParameter>
<SubsetParameter>
 <ParmName>MaxProtocolErrors</ParmName>
 <ParmValue>0</ParmValue>
</SubsetParameter>
{code}
Add new attributes:
{code:python}
# In the Publish constructor.
self.__summaryErrorCount = 0
self.__protocolErrorCount = 0
{code}
In the Publish.__checkProblems() method, increment the summary/protocol error count as appropriate.
At the bottom of the Publish.publish() method, check to see whether either of the new thresholds has been specified and exceeded, and if so mark the job as failed. A rough sketch of how these pieces could fit together follows.
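Here is a minimal standalone sketch of the proposed logic, using the MaxSummaryErrors/MaxProtocolErrors parameter names from the XML above; the class, attribute, and method names below are hypothetical stand-ins, not the actual cdrpub Publish internals:
{code:python}
# Hypothetical sketch only; the real Publish class in cdrpub differs.
class PublishSketch:
    def __init__(self, parms):
        # Thresholds taken from the MaxSummaryErrors/MaxProtocolErrors
        # subset parameters; None means "no per-doctype limit".
        self.max_summary_errors = parms.get("MaxSummaryErrors")
        self.max_protocol_errors = parms.get("MaxProtocolErrors")
        self.summary_errors = 0
        self.protocol_errors = 0
        self.failed = False

    def check_problems(self, doctype, had_errors):
        # Counterpart of Publish.__checkProblems(): tally failures by type.
        if had_errors:
            if doctype == "Summary":
                self.summary_errors += 1
            elif doctype == "CTGovProtocol":
                self.protocol_errors += 1

    def finish(self):
        # Counterpart of the check at the bottom of Publish.publish():
        # fail the job if a specified threshold has been exceeded.
        if self.max_summary_errors is not None:
            if self.summary_errors > self.max_summary_errors:
                self.failed = True
        if self.max_protocol_errors is not None:
            if self.protocol_errors > self.max_protocol_errors:
                self.failed = True

job = PublishSketch({"MaxSummaryErrors": 0, "MaxProtocolErrors": 0})
job.check_problems("Summary", True)
job.finish()
print(job.failed)  # True: a single summary failure exceeds a threshold of 0
{code}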
I would have preferred to have used a more generic representation of the thresholds, something like:
{code:xml}
<PerDocTypeErrorThreshold>
 <DocType>Summary</DocType>
 <MaxErrors>0</MaxErrors>
</PerDocTypeErrorThreshold>
{code}
Haven't heard back from Lakshmi, so I'm pushing this down into iteration 2.
Following up on the discussions of my proposal above with Blair and Volker, I have decided to go with the cleaner solution:
Add the following optional block to the PublishingSystem schema (at the end of the SystemSubset block, immediately following the SubsetOptions block):
{code:xml}
<PerDoctypeErrorThresholds>
 <PerDoctypeErrorThreshold>
  <Doctype>Summary</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
 <PerDoctypeErrorThreshold>
  <Doctype>CTGovProtocol</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
</PerDoctypeErrorThresholds>
{code}
Add two new attributes:
{code:python}
# In the Publish constructor.
self.__perDoctypeErrors = {}
self.__perDoctypeMaxErrors = {}
{code}
In the code which parses options from the publishing subset node, populate the __perDoctypeMaxErrors attribute as appropriate from the publishing control document's configuration of the current publishing job. A sketch of that parsing step appears below.
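For illustration, here is a hedged sketch of how that parsing might look using the standard library's ElementTree; the real cdrpub code may use a different XML API and different helper names:
{code:python}
import xml.etree.ElementTree as ET

# Fragment matching the PerDoctypeErrorThresholds block shown above.
CONTROL_FRAGMENT = """\
<PerDoctypeErrorThresholds>
 <PerDoctypeErrorThreshold>
  <Doctype>Summary</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
 <PerDoctypeErrorThreshold>
  <Doctype>CTGovProtocol</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
</PerDoctypeErrorThresholds>"""

def parse_thresholds(node):
    """Map each document type name to its maximum allowed error count."""
    thresholds = {}
    for child in node.findall("PerDoctypeErrorThreshold"):
        doctype = child.findtext("Doctype")
        thresholds[doctype] = int(child.findtext("MaxErrors"))
    return thresholds

# This dictionary would play the role of __perDoctypeMaxErrors.
per_doctype_max_errors = parse_thresholds(ET.fromstring(CONTROL_FRAGMENT))
print(per_doctype_max_errors)  # {'Summary': 0, 'CTGovProtocol': 0}
{code}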
In the checkProblems() method (looks like a JIRA bug is preventing me from putting the underscores in front of the method name here, even though it works everywhere else in this comment), populate the __perDoctypeErrors attribute as appropriate.
At the bottom of the Publish.publish() method, check to see whether any per-document-type error thresholds have been exceeded, and if so mark the job as failed. This would come immediately before self.__updateFirstPub(), so that the code which sets the first_pub column will know not to do so for this job if we are marking the job as having failed. A small sketch of this final check follows.
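A small sketch of that final check, assuming __perDoctypeErrors and __perDoctypeMaxErrors are plain dictionaries keyed by document type name; everything else here is a placeholder for illustration, not the actual cdrpub code:
{code:python}
def thresholds_exceeded(per_doctype_errors, per_doctype_max_errors):
    """Return the doctypes whose error counts exceed their configured limits."""
    return [doctype
            for doctype, limit in per_doctype_max_errors.items()
            if per_doctype_errors.get(doctype, 0) > limit]

# Hypothetical usage just before self.__updateFirstPub() runs.
errors = {"Summary": 2}                       # stand-in for __perDoctypeErrors
limits = {"Summary": 0, "CTGovProtocol": 0}   # stand-in for __perDoctypeMaxErrors
failed_types = thresholds_exceeded(errors, limits)
if failed_types:
    # Mark the job as failed so __updateFirstPub() will skip setting
    # first_pub for the documents in this job.
    print("Job failed; thresholds exceeded for:", ", ".join(failed_types))
{code}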
Answers to questions from the prior comment:
Is this approach reasonable? We believe so.
Do we need to do this for the push jobs? No, the push jobs have their own mechanism for dealing with such failures.
Is Publish.__updateFirstPub() doing the right thing if the job is marked as failed? As far as I can tell, yes.
Should we use subset options (which can't be overridden on a per-job basis) instead of subset parameters (which can be overridden for a specific job)? This new approach means there will not be a way to override the thresholds from the UI for requesting publishing jobs. We agreed that this is the best approach. The threshold can be modified by storing a new version of the publishing control document. It's not as easy to do it this way, but that makes it harder to override the thresholds inappropriately, which is a Good Thing.
Any other questions I should be asking? :-) This question is still valid.
Did I capture our discussions accurately? Further suggestions? I'll work on this in my sandbox, but not do any commits until we're into iteration 2.
I believe this is accurate.
This has been implemented and installed on DEV. I have done some preliminary testing (with Volker's assistance), and we'll have more testing when he runs a weekly publishing job on DEV for another ticket in the release.
I have fixed the problems with the original approach. I also backed out the "Ask" choice for the PublishIfWarnings option after discussion with the team (I kept the option, which is set to "No" for hotfix-remove jobs).
~volker: Can you do another weekly publishing test on DEV sometime soon?
Weekly publishing test failed as expected.
I can confirm that this ticket is working as implemented on PROD. The publishing job fails once a single summary or protocol failure has been detected.
Closing ticket.