Issue Number | 3951 |
---|---|
Summary | Change publishing error threshold for summaries and protocol documents |
Created | 2015-08-13 09:01:17 |
Issue Type | Task |
Submitted By | henryec |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2016-04-11 13:50:26 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.167432 |
Currently the publishing system allows a job to continue even after errors have occurred, up to a configured limit: 500 errors for the weekly publishing job and 250 errors for the nightly job.
Lakshmi recommends that the system be updated to allow 0 errors in Summary and Protocol documents during publishing. If there are one or more errors, we need to republish to make the correction.
We need to work through the approach together before development, as there is a communications aspect to the issue (for example, notifying distribution partners).
Lakshmi recommends that the system be updated to allow 0 errors in Summary and Protocol documents during publishing. If there are one or more errors, we need to republish to make the correction.
With this approach I recommend finishing publishing the document type before we fail the job, in order to get the total count of errors. If there are two unrelated types of errors, we want to find both of them as part of a single run.
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT, since a failed publishing job will almost certainly have to be restarted after hours.
I suggest that we apply Lakshmi's suggested change only to publishing job types which do not allow the operator to specify individual documents (that is, don't abort hotfix jobs when one summary or clinical trial document fails).
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT ...
Our publishing system already allows that. If you mean something else, we'll need a separate ticket for it.
In any case, since this is a higher-profile issue (request comes from Lakshmi) and there are a number of recommendations in the comments, we should probably discuss the options with Erika to nail down the requirements details.
We will also need to find a way to start publishing jobs ourselves without the involvement of CBIIT ...
Yes, I'm thinking of work performed by the scheduled Jobmaster Nightly, Jobmaster Weekly, and Jobmaster 911 jobs. Starting a CDR publishing job, i.e., Interim-Export, and sending it to Gatekeeper is only part of the battle. Packing up the data and sending it to the FTP server is another portion of the publishing process that currently cannot be restarted without the involvement of CBIIT.
New ticket created: https://tracker.nci.nih.gov/browse/OCECDR-4002.
Estimate is 5
Here's what I propose.
Add two new SubsetParameter blocks to the nightly and weekly export jobs:
{code:xml}
<SubsetParameter>
 <ParmName>MaxSummaryErrors</ParmName>
 <ParmValue>0</ParmValue>
</SubsetParameter>
<SubsetParameter>
 <ParmName>MaxProtocolErrors</ParmName>
 <ParmValue>0</ParmValue>
</SubsetParameter>
{code}
Add new attributes:
{code:python}
# In the Publish constructor.
self.__summaryErrorCount = 0
self.__protocolErrorCount = 0
{code}
In the Publish.__checkProblems() method, increment the summary/protocol error count as appropriate.
At the bottom of the Publish.publish() method, check to see whether either of the new thresholds has been specified and exceeded, and if so mark the job as failed. A rough sketch of how these pieces could fit together follows.
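Here is a minimal standalone sketch of the proposed logic, using the MaxSummaryErrors/MaxProtocolErrors parameter names from the XML above; the class, attribute, and method names below are hypothetical stand-ins, not the actual cdrpub Publish internals:
{code:python}
# Hypothetical sketch only; the real Publish class in cdrpub differs.
class PublishSketch:
    def __init__(self, parms):
        # Thresholds taken from the MaxSummaryErrors/MaxProtocolErrors
        # subset parameters; None means "no per-doctype limit".
        self.max_summary_errors = parms.get("MaxSummaryErrors")
        self.max_protocol_errors = parms.get("MaxProtocolErrors")
        self.summary_errors = 0
        self.protocol_errors = 0
        self.failed = False

    def check_problems(self, doctype, had_errors):
        # Counterpart of Publish.__checkProblems(): tally failures by type.
        if had_errors:
            if doctype == "Summary":
                self.summary_errors += 1
            elif doctype == "CTGovProtocol":
                self.protocol_errors += 1

    def finish(self):
        # Counterpart of the check at the bottom of Publish.publish():
        # fail the job if a specified threshold has been exceeded.
        if self.max_summary_errors is not None:
            if self.summary_errors > self.max_summary_errors:
                self.failed = True
        if self.max_protocol_errors is not None:
            if self.protocol_errors > self.max_protocol_errors:
                self.failed = True

job = PublishSketch({"MaxSummaryErrors": 0, "MaxProtocolErrors": 0})
job.check_problems("Summary", True)
job.finish()
print(job.failed)  # True: a single summary failure exceeds a threshold of 0
{code}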
I would have preferred to have used a more generic representation of the thresholds, something like:
{code:xml}
<PerDocTypeErrorThreshold>
 <DocType>Summary</DocType>
 <MaxErrors>0</MaxErrors>
</PerDocTypeErrorThreshold>
{code}
Haven't heard back from Lakshmi, so I'm pushing this down into iteration 2.
Following up on the discussions of my proposal above with Blair and Volker, I have decided to go with the cleaner solution:
Add the following optional block to the PublishingSystem schema (at the end of the SystemSubset block, immediately following the SubsetOptions block):
{code:xml}
<PerDoctypeErrorThresholds>
 <PerDoctypeErrorThreshold>
  <Doctype>Summary</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
 <PerDoctypeErrorThreshold>
  <Doctype>CTGovProtocol</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
</PerDoctypeErrorThresholds>
{code}
Add two new attributes:
{code:python}
# In the Publish constructor.
self.__perDoctypeErrors = {}
self.__perDoctypeMaxErrors = {}
{code}
In the code which parses options from the publishing subset node, populate the __perDoctypeMaxErrors attribute as appropriate from the publishing control document's configuration of the current publishing job. A sketch of that parsing step appears below.
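For illustration, here is a hedged sketch of how that parsing might look using the standard library's ElementTree; the real cdrpub code may use a different XML API and different helper names:
{code:python}
import xml.etree.ElementTree as ET

# Fragment matching the PerDoctypeErrorThresholds block shown above.
CONTROL_FRAGMENT = """\
<PerDoctypeErrorThresholds>
 <PerDoctypeErrorThreshold>
  <Doctype>Summary</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
 <PerDoctypeErrorThreshold>
  <Doctype>CTGovProtocol</Doctype>
  <MaxErrors>0</MaxErrors>
 </PerDoctypeErrorThreshold>
</PerDoctypeErrorThresholds>"""

def parse_thresholds(node):
    """Map each document type name to its maximum allowed error count."""
    thresholds = {}
    for child in node.findall("PerDoctypeErrorThreshold"):
        doctype = child.findtext("Doctype")
        thresholds[doctype] = int(child.findtext("MaxErrors"))
    return thresholds

# This dictionary would play the role of __perDoctypeMaxErrors.
per_doctype_max_errors = parse_thresholds(ET.fromstring(CONTROL_FRAGMENT))
print(per_doctype_max_errors)  # {'Summary': 0, 'CTGovProtocol': 0}
{code}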
In the checkProblems() method (looks like a JIRA bug is preventing me from putting the underscores in front of the method name here, even though it works everywhere else in this comment), populate the __perDoctypeErrors attribute as appropriate.
At the bottom of the Publish.publish() method, check to see whether any per-document-type error thresholds have been exceeded, and if so mark the job as failed. This would come immediately before self.__updateFirstPub(), so that the code which sets the first_pub column will know not to do so for this job if we are marking the job as having failed. A small sketch of this final check follows.
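A small sketch of that final check, assuming __perDoctypeErrors and __perDoctypeMaxErrors are plain dictionaries keyed by document type name; everything else here is a placeholder for illustration, not the actual cdrpub code:
{code:python}
def thresholds_exceeded(per_doctype_errors, per_doctype_max_errors):
    """Return the doctypes whose error counts exceed their configured limits."""
    return [doctype
            for doctype, limit in per_doctype_max_errors.items()
            if per_doctype_errors.get(doctype, 0) > limit]

# Hypothetical usage just before self.__updateFirstPub() runs.
errors = {"Summary": 2}                       # stand-in for __perDoctypeErrors
limits = {"Summary": 0, "CTGovProtocol": 0}   # stand-in for __perDoctypeMaxErrors
failed_types = thresholds_exceeded(errors, limits)
if failed_types:
    # Mark the job as failed so __updateFirstPub() will skip setting
    # first_pub for the documents in this job.
    print("Job failed; thresholds exceeded for:", ", ".join(failed_types))
{code}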
Answers to questions from the prior comment:
Is this approach reasonable? We believe so.
Do we need to do this for the push jobs? No, the push jobs have their own mechanism for dealing with such failures.
Is Publish.__updateFirstPub() doing the right thing if the job is marked as failed? As far as I can tell, yes.
Should we use subset options (which can't be overridden on a per-job basis) instead of subset parameters (which can be overridden for a specific job)? This new approach means there will not be a way to override the thresholds from the UI for requesting publishing jobs. We agreed that this is the best approach. The threshold can be modified by storing a new version of the publishing control document. It's not as easy to do it this way, but that makes it harder to override the thresholds inappropriately, which is a Good Thing.
Any other questions I should be asking? :-) This question is still valid.
Did I capture our discussions accurately? Further suggestions? I'll work on this in my sandbox, but not do any commits until we're into iteration 2.
I believe this is accurate.
This has been implemented and installed on DEV. I have done some preliminary testing (with Volker's assistance), and we'll have more testing when he runs a weekly publishing job on DEV for another ticket in the release.
I have fixed the problems with the original approach. I also backed out the "Ask" choice for the PublishIfWarnings option after discussion with the team (I kept the option, which is set to "No" for hotfix-remove jobs).
~volker: Can you do another weekly publishing test on DEV sometime soon?
Weekly publishing test failed as expected.
I can confirm that this ticket is working as implemented on PROD. The publishing job fails once a single summary or protocol failure has been detected.
Closing ticket.