Issue Number | 3948 |
---|---|
Summary | CTGovProtocol Publishing Error |
Created | 2015-08-05 18:22:25 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2015-09-17 13:45:56 |
Resolution | Cannot Reproduce |
Path | /home/bkline/backups/jira/ocecdr/issue.166853 |
On Friday, July 31st, and Monday, August 3rd, we saw CTGovProtocol filter
failures. The failure on Monday reports:
"XSLT error: code: 2 msgtype:error code:2 module:Sablotron
URI:cdrutil:/denormalizeTerm/CDR0000040544 line:14 node:attribute
'encoding' msg:XML parser error 5: unclosed token"
and we saw the same message on Friday for 217 documents:
"XSLT error: code: 2 msgtype:error code:2 module:Sablotron
URI:cdrutil:/denormalizeTerm/CDR0000040539 line:1 node:attribute
'encoding' msg:XML parser error 5: unclosed token"
The documents have been checked out and reprocessed without any problems, and it is currently unclear why we're seeing these messages.
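A side note on what the message itself is telling us: Sablotron's XML parsing is expat-based, and expat error 5 is its "unclosed token" error, raised when the input ends (or is cut off) in the middle of a token such as the XML declaration whose 'encoding' attribute the messages point at. The minimal Python sketch below (the truncated string is made up for illustration) just demonstrates that mapping; it says nothing about where such a truncated document would have come from.

```python
from xml.parsers.expat import ParserCreate, ExpatError, errors

# A document chopped off inside the XML declaration's encoding attribute,
# similar to what the Sablotron messages above seem to describe.
truncated = '<?xml version="1.0" encoding'

parser = ParserCreate()
try:
    parser.Parse(truncated, True)   # True = this is the final (and only) chunk
except ExpatError as exc:
    # Prints: XML parser error 5: unclosed token
    print(f"XML parser error {exc.code}: {errors.messages[exc.code]}")
```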
I've checked through the code invoked by cdrutil:/denormalizeTerm and I'm down to the following theories about how the errors may have occurred:
There is a bug in the term denormalization logic.
This is possible but seems unlikely to me because we've processed the
very same records before and after the errors in very similar, perhaps
even identical, circumstances without seeing the errors.
There is a bug in the multi-threaded cache synchronization.
This seems more likely to me. It is conceivable that such a bug would
not show up unless two threads attempted to process term
denormalizations at the same time, probably using the same records. With
ten threads running, such an event could happen more often than in the
old days when we published with two or four threads.
If this is the case, we won't be able to reliably reproduce the error,
but we can prevent it by turning off term caching, which is less useful
with only 3,000 protocols to process than it was when there were
20-30,000. Or we can instrument the server to get a better understanding
of where the bug might be. My recent walk-through of the code didn't
find any problems.
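To make the second theory concrete: the sketch below is purely illustrative Python, not the CDR server's actual code, and the names (TermCache, build_term) are invented. It shows the general shape of a check-then-fill cache race, where an entry becomes visible to other threads before it is completely built, alongside a lock-based variant that only ever publishes finished values. A race of this shape in the real cache could hand a thread a truncated or half-written term document, which is exactly the kind of input that produces an "unclosed token" parse error.

```python
import threading

class TermCache:
    """Hypothetical cache of denormalized Term XML, keyed by CDR document ID."""

    def __init__(self):
        self._racy = {}
        self._safe = {}
        self._lock = threading.Lock()

    def get_racy(self, doc_id, build_term):
        # Unsafe pattern: a placeholder entry becomes visible to other threads
        # before the value is built, so a concurrent caller can get back None
        # (or, in a C++ cache, a half-written buffer) instead of finished XML.
        if doc_id not in self._racy:
            self._racy[doc_id] = None               # published before it is ready
            self._racy[doc_id] = build_term(doc_id)
        return self._racy[doc_id]

    def get_locked(self, doc_id, build_term):
        # Safer pattern: build the value first, then store and read only the
        # finished result while holding the lock.
        with self._lock:
            cached = self._safe.get(doc_id)
        if cached is not None:
            return cached
        xml = build_term(doc_id)                    # potentially slow; done unlocked
        with self._lock:
            return self._safe.setdefault(doc_id, xml)
```

A race of that shape is timing-dependent by nature, which would fit both the clean reprocessing of the same records afterwards and the fact that the errors showed up only now that publishing runs with ten threads.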
There were transient network or database errors on the two days in question, affecting the two records for which the errors were reported.
While 217 documents failed on July 31, all 217 failures appear to trace back to a single failure in a single record.
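That conclusion can be double-checked mechanically, since each failure message embeds the ID of the document being denormalized in its cdrutil URI. A rough sketch, assuming the 217 messages have been collected one per line into a file (the filename failures.txt is made up):

```python
import re
from collections import Counter

# Each Sablotron message contains "URI:cdrutil:/denormalizeTerm/CDR...";
# group the failures by that CDR ID to see how many distinct records failed.
pattern = re.compile(r"cdrutil:/denormalizeTerm/(CDR\d+)")

with open("failures.txt", encoding="utf-8") as fp:
    ids = Counter(cdr_id for line in fp for cdr_id in pattern.findall(line))

for cdr_id, count in ids.most_common():
    print(cdr_id, count)   # a single key with count 217 points at one Term record
```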
There was a memory buffer overflow or an incorrectly initialized
pointer that caused part of the cache to be corrupted.
This could have occurred in the term denormalization software but could
also have occurred outside it.
None of these will be easy to reproduce, and it may well be impossible to do so. None of the underlying causes will be easy to find.
Volker has suggested that we should wait and see if it happens again. Given that huge resources could be spent on this with a small chance of reward, that seems reasonable to me. If the errors do recur, the next (and cheapest) step is probably to turn off caching and see if that makes the problem go away.
For the past 6 weeks I've been monitoring the publishing jobs on all four tiers. We haven't had a single recurrence of the original problem. I would put this ticket on hold (or close it) and adjust the publishing document, as discussed, to stop publishing if a protocol or summary document fails on PROD.
The problem hasn't happened since the initial outbreak.