Issue Number | 3948 |
---|---|
Summary | CTGovProtocol Publishing Error |
Created | 2015-08-05 18:22:25 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2015-09-17 13:45:56 |
Resolution | Cannot Reproduce |
Path | /home/bkline/backups/jira/ocecdr/issue.166853 |
On Friday, July 31st, and Monday, August 3rd, we saw CTGovProtocol filter
failures. The failure on Monday reports:
"XSLT error: code: 2 msgtype:error code:2 module:Sablotron
URI:cdrutil:/denormalizeTerm/CDR0000040544 line:14 node:attribute
'encoding' msg:XML parser error 5: unclosed token"
and we saw the same message on Friday for 217 documents:
"XSLT error: code: 2 msgtype:error code:2 module:Sablotron
URI:cdrutil:/denormalizeTerm/CDR0000040539 line:1 node:attribute
'encoding' msg:XML parser error 5: unclosed token"
The documents have been checked out and reprocessed without any problems, and it is currently unclear why we're seeing these messages.
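A side note on what the message itself is telling us: Sablotron's XML parsing is expat-based, and expat error 5 is its "unclosed token" error, raised when the input ends (or is cut off) in the middle of a token such as the XML declaration whose 'encoding' attribute the messages point at. The minimal Python sketch below (the truncated string is made up for illustration) just demonstrates that mapping; it says nothing about where such a truncated document would have come from.

```python
from xml.parsers.expat import ParserCreate, ExpatError, errors

# A document chopped off inside the XML declaration's encoding attribute,
# similar to what the Sablotron messages above seem to describe.
truncated = '<?xml version="1.0" encoding'

parser = ParserCreate()
try:
    parser.Parse(truncated, True)   # True = this is the final (and only) chunk
except ExpatError as exc:
    # Prints: XML parser error 5: unclosed token
    print(f"XML parser error {exc.code}: {errors.messages[exc.code]}")
```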
I've checked through the code invoked by cdrutil:/denormalizeTerm and I'm down to the following theories about how the errors may have occurred:
There is a bug in the term denormalization logic.
This is possible but seems unlikely to me because we've processed the
very same records before and after the errors in very similar, perhaps
even identical, circumstances without seeing the errors.
There is a bug in the multi-threaded cache synchronization.
This seems more likely to me. It is conceivable that such a bug would
not show up unless two threads attempted to process term
denormalizations at the same time, probably using the same records. With
ten threads running, such an event could happen more often than in the
old days when we published with two or four threads.
If this is the case, we won't be able to reliably reproduce the error,
but we can prevent it by turning off term caching, which is less useful
with only 3,000 protocols to process than it was when there were
20-30,000. Or we can instrument the server to get a better understanding
of where the bug might be. My recent walk-through of the code didn't
find any problems.
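To make the second theory concrete: the sketch below is purely illustrative Python, not the CDR server's actual code, and the names (TermCache, build_term) are invented. It shows the general shape of a check-then-fill cache race, where an entry becomes visible to other threads before it is completely built, alongside a lock-based variant that only ever publishes finished values. A race of this shape in the real cache could hand a thread a truncated or half-written term document, which is exactly the kind of input that produces an "unclosed token" parse error.

```python
import threading

class TermCache:
    """Hypothetical cache of denormalized Term XML, keyed by CDR document ID."""

    def __init__(self):
        self._racy = {}
        self._safe = {}
        self._lock = threading.Lock()

    def get_racy(self, doc_id, build_term):
        # Unsafe pattern: a placeholder entry becomes visible to other threads
        # before the value is built, so a concurrent caller can get back None
        # (or, in a C++ cache, a half-written buffer) instead of finished XML.
        if doc_id not in self._racy:
            self._racy[doc_id] = None               # published before it is ready
            self._racy[doc_id] = build_term(doc_id)
        return self._racy[doc_id]

    def get_locked(self, doc_id, build_term):
        # Safer pattern: build the value first, then store and read only the
        # finished result while holding the lock.
        with self._lock:
            cached = self._safe.get(doc_id)
        if cached is not None:
            return cached
        xml = build_term(doc_id)                    # potentially slow; done unlocked
        with self._lock:
            return self._safe.setdefault(doc_id, xml)
```

A race of that shape is timing-dependent by nature, which would fit both the clean reprocessing of the same records afterwards and the fact that the errors showed up only now that publishing runs with ten threads.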
There were transient network or database errors on the two days in question, affecting the two records for which the errors were reported.
While 217 documents failed on July 31, all 217 failures appear to trace back to a single failure in a single record.
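That conclusion can be double-checked mechanically, since each failure message embeds the ID of the document being denormalized in its cdrutil URI. A rough sketch, assuming the 217 messages have been collected one per line into a file (the filename failures.txt is made up):

```python
import re
from collections import Counter

# Each Sablotron message contains "URI:cdrutil:/denormalizeTerm/CDR...";
# group the failures by that CDR ID to see how many distinct records failed.
pattern = re.compile(r"cdrutil:/denormalizeTerm/(CDR\d+)")

with open("failures.txt", encoding="utf-8") as fp:
    ids = Counter(cdr_id for line in fp for cdr_id in pattern.findall(line))

for cdr_id, count in ids.most_common():
    print(cdr_id, count)   # a single key with count 217 points at one Term record
```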
There was a memory buffer overflow or an incorrectly initialized
pointer that caused part of the cache to be corrupted.
This could have occurred in the term denormalization software but could
also have occurred outside it.
None of these will be easy to reproduce, and it may well be impossible to do so. None of the underlying causes will be easy to find.
Volker has suggested that we should wait and see if it happens again. Given that huge resources could be spent on this with a small chance of reward, that seems reasonable to me. If the errors do recur, the next (and cheapest) step is probably to turn off caching and see if that makes the problem go away.
For the past 6 weeks I've been monitoring the publishing jobs on all four tiers. We haven't had a single recurrence of the original problem. I would put this ticket on hold (or close it) and adjust the publishing document, as discussed, to stop publishing if a protocol or summary document fails on PROD.
The problem hasn't happened since the initial outbreak.