Issue Number | 3751 |
---|---|
Summary | cdr._addRepDocComment() replaces more than it should |
Created | 2014-04-02 09:09:45 |
Issue Type | Bug |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2014-10-09 13:41:05 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.121850 |
The _addRepDocComment() function doesn't restrict its changes to the
CdrDocCtl block. See, for example, CDR0000257549, which had the element
{code:xml}
<DocComment><xsl:value-of select =
"$newComment"/></DocComment>
{code:xml}
replaced with <DocComment>Syncing filters with svn 2014-04-01 (RMK)</DocComment>
in the body of the actual CDR document.
I will fix the document, but you can see the damage by comparing version 2 of the document with version 1 in the CDR.
I looked at this and see what went wrong. I've made some notes on fixing it. The fix doesn't look too difficult but I have several higher priority tasks at the moment, so I won't get to this for a while.
I created a version of _addRepDocComment() that fixes the problem, still using regular expressions, but using expressions that won't be fooled by the additional CdrDocCtl element found in the filter that broke the old version.
It all worked fine and the performance was fine on the documents that would normally be passed to it, but the performance was truly abominable for a document that did not have a CdrDocCtl wrapper. That's an illegal case, but I'd like to have a reasonable check for it that doesn't take forever when the case obtains.
I might want to switch to a version that uses an XML parser, or do a limited pattern match that only looks at the first one or two hundred characters to find out whether there is a CdrDocCtl present.
I noted that a non-greedy regex is also MUCH slower than a greedy one - which is not something I expected.
So I've decided to put this aside for now and come back to it when I get back from vacation.
...the performance was truly abominable for a document that did not have a CdrDocCtl wrapper....
Why not just wrap the code in a test like this:
if "<CdrDocCtl" in doc_xml: ...
I'm thinking about something like that but I'd want to test it first and also think about the odd possibility of a document without a CdrDocCtl wrapper but with a reference to it in the body of the document.
It would probably only mean that in the one or two filter documents where this can occur, it could take a full half second to process the doc (because they're very small) instead of the milliseconds it would otherwise take - and the error would still be caught. But I was tired and wanted to think about this with a clearer head and without the time pressure of committing a change just before I left.
I have rewritten the function using lxml. It required the use of a capability that we hadn't previously used in lxml, namely using a parser that leaves CDATA alone. But, fortunately, lxml has that capability.
I made a slight change in behavior. I now accept unicode or utf-8, but always return utf-8. Before the caller had to supply utf-8.
I tested with a half dozen documents using a standalone test driver. Some had existing DocComments, some had two of them (one in the CDATA section), some had none. One had no DocCtl wrapper - raising the expected exception.
Marking the issue resolved as fixed.
This is on production.
Elapsed: 0:00:00.001445