Issue Number | 3744 |
---|---|
Summary | Truncated Display of Term with Ampersand Character |
Created | 2014-03-14 11:08:39 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2014-04-01 10:35:14 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.119784 |
When a term is entered in a CTGovProtocol document, the PreferredTerm
name gets displayed in the XMetaL document after the document gets saved
(via the fast denormalization). For a term that includes an ampersand
character, the Term name gets truncated after the ampersand.
An example for such a term is "FD&C blue no. 2" which gets displayed
after the save as "FD&".
Everything is still working correctly.
This had been discovered while working on ticket OCECDR-3717. Please refer to the comments for that issue.
I'm adding the last comment from Alan on the ticket OCECDR-3717 where this issue had been identified:
Assuming that I haven't done a double conversion, i.e., calling entConvert() twice on the same string, I would guess that someone noticed the problem in the past and wrote code to work around it. The workaround now causes a double conversion.
I'd be tempted to change software so that double conversions can't be done. For example, in entConvert(), we could check what follows the '&' and not convert it if it is followed by "amp;" We'd have to do the same thing in the fast denormalization code, or wherever it's happening now.
There is a potential problem however in that it is conceivable that there are times when a double conversion is actually what we want.
At this point I'd like us to think out the whole chain of conversions from the point that a CIAT user enters "FD&C" to the point when the term is published to cancer.gov. Where should the entity conversions really occur? Is it possible to reduce those to a single place? What will break if we change software so that that's the only place where conversions occur?
Bob made a number of fixes in the CdrServer to perform entity conversions on title strings before they are sent out as part of DocTitle text content.
I think the idea is exactly right but I've found a number of places where entity conversion may already have been done - leading to possible double conversions. I've also found at least one place where a plain text (i.e., unconverted) title from the database is sent out as part of a different element other than a DocTitle, and two places where other (non-title) database strings are sent out without conversion.
So the problem is a little deeper than any of us realized.
I'm going to work this afternoon to do the following:
Try to ensure that all outputs of non-XML text from the database go through entity conversion at the point of construction of XML containing them.
Try to ensure that no entity conversion is performed at the point where data is extracted from the database.
My reasoning here is that all data processed in the CdrServer will be handled in a standardized and predictable way. If it's saved in a variable, it's plain (unconverted) text. If it's inserted into XML, it's converted at the point of insertion. If the same variable gets inserted into multiple strings, we'll convert it multiple times rather than violate this rule - trading off a theoretical inefficiency for a little more robustness.
I'm not 100% confident that I can find and fix everything. The problem may be that data extracted from the database gets converted at the point of extraction and used far downstream to produce something that assumes the conversion has already been done. It's possible I will "fix" the inappropriate early conversion without finding every downstream use of it. But I'm going to try.
I figure it's better for us to refactor our handling of entity conversion in this way in order to get an inherently more robust design than to leave things in the current precarious state that can more easily be broken by future changes.
I'll make the fixes today, check-in all of the code, put the new CdrServer on port 2019 on Dev, and document what I've done.
I've made more than 20 code changes already. Some of the problems included things like putting a title into a validation error message but not converting entities, extracting comment text from the database without converting it, and relying on normalizing code that wasn't (to my eyes) 100% complete. I also found one where the recent fix to a DocTitle string could have caused a double conversion because of a prior conversion in a CdrDoc constructor (I fixed the constructor and left the correct, new DocTitle code in place.)
I "fixed" a bunch of places that were not broken, but in which the entity conversion was done when data was extracted instead of when it was output. I changed both ends. In a couple of cases it would actually have made the software microscopically more efficient because data extracted from the database is not always used (e.g. if it is null).
There are lots more places that I could have changed but didn't because there was reason to believe that a reserved character could never appear. An easy case is when inserting a number or CDR ID into XML. Others involved element or attribute names, or some other names that would be serious practical errors, even if not strict semantic errors, if reserved characters were used. In general, I am not converting those strings in hopes of holding down the total number of changes that need to be reviewed.
I'm going to keep plugging away at this. Before I leave tonight, I'll install a new CdrServer whether or not I've finished examining every place that my pattern matching searches identified as a possible place needing checking. I'll also attach a copy of my extensive notes in JIRA. I think it will be desirable for Bob to look at all of the changes I made. He won't have to go through the hundreds of lines I checked but didn't change. If I missed something, well, the problem is likely not new and we've lived with it so far. But I would like another pair of eyes to look at the changes I made to make sure I haven't introduced new bugs. I think the patterns are repetitive and the review will go very quickly.
The attached listing shows source code patterns which could conceivably indicate entity conversion issues in the CdrServer. See the legend at the top and the interspersed comments.
I've gone through two thirds of the lines of code identified by the
regular expressions I used to identify potential problems.
I examined more than 500 lines of code, examining surrounding lines for
many of them. I made 41 changes, some of which involved two lines of
code for each change (where I moved the entity conversion from the place
where a string was created to where it was output.)
The basic types of changes were:
- Observe the convention of making conversions at the point where XML
is actually constructed.
- Modify places that are probably safe with all the data that will
ever pass through them, but could in theory fail.
- Modify places that are probably not safe and would fail with data
that would be rare but not completely unlikely in a database of over
700,000 documents. I wouldn't be at all surprised if some of these
places have failed in the past but no one ever reported the
failures.
There were only a very few of those.
I rebuilt and installed the CdrServer on Dev. It now passes the test
that Volker found it failing on.
Most of the changes made seem to me non-controversial but I haven't
checked any of them in. I think it would be a good idea for Bob to look
at least briefly at the 41 changes that I've made, and 8 more that I did
not change but marked as questionable, before I check anything in.
It's too late for me to finish reviewing the 262 pattern lines that
remain tonight. I can look at them on Thursday.
I have attached a full record of the analysis, q.v.
I have reviewed the changes. Most look correct. I'm not sure I would have taken the trouble to move the conversion from one place to another inside a tight loop for a variable which had no existence outside the scope of that loop, but it doesn't hurt, and it's more purely in line with the principle we've adopted that we work with the value itself every place except when the value is being serialized for transport within XML markup. The real concern was with discrepancies in representation for objects whose lifetime was effectively unbounded, with the biggest offender being different results from the different Doc constructors, some of which stored the value for the title string, and others of which stored the escaped version of the title's string.
One set of changes is wrong, and needs to be backed out. In CdrGlossaryMap.cpp, you have added calls to cdr::entConvert() for integers. As it turns out, C++ lets you silently pass an integer where a string object is expected, and it will be silently converted. The result won't be what you expect. 🙂 Please remove both changes for that file.
Looks like you picked up some places in CdrReport.cpp for adding entConvert() and I picked up others. Those will get merged when you update your sandbox and do a checkin.
I'm not sure I understand your decision to avoid the base principle in the changes to CdrLink.cpp. It looks as if you're applying cdr::entConvert() as you're collecting the list of error strings (for example, line 581 in cdrDelLinks()), instead of waiting until the errors are being packed up for export as part of the XML response. A similar approach appears to be take for other collections of error messages in this source file. For example, on line 755, tagWrap is being called for each of the members of the CdrLink's errCtl error message list. Is entConvert() not needed here?
I think we may also need cdr::entConvert() on lines 116 and 252 of CdrValidateDoc.cpp (converting the error messages to XML).
I'll review and fix all of the above, then do the rest of the files..
I went through all of the CDR code looking for places that call ValidationControl.addError(), or call CdrLink.addLinkErr(), which invokes it. There are 60 of them.
About 55 of those places look completely safe to me, i.e., they produce error strings that are string constants, not extracts from XML. Whether we do or don't perform entity conversion will make no difference in the output.
The ones that are not safe are in CdrXsd, that can put data extracted from a document into an error message, and in CdrFilter.filterDoc(), that registers an error handling callback and gets a pointer to "userData" that contains - what? I don't know.
I think CdrXsd is safe to wrap. It passes text strings to addError and does not, itself, call addError() or wrap errors in XML tags. filterDoc however wraps each message with <msg></msg> tags but does not call entConvert(). I don't know if Sablotron puts serial XML or plain text in the messages. I spent about 10 minutes investigating but didn't find the answer and decided not to pursue it further unless and until someone thinks it's really worth doing.
It seems to me that we have the following choices:
Do nothing. Nobody has complained in 12 years.
entConvert() CdrXsd messages before calling addError()
entConvert() in addError, running the risk of double conversion in messages that come from Sablotron.
For now, I'm tempted to use option 1, but I invite other opinions.
I've attached an updated version of the conversion analysis. I recorded 64 changes to code. The vast majority of those were to fix hypothetical errors that could conceivably occur but probably never will. Of the ones for which the danger is more than hypothetical, the consequences of error are probably very small, not leading to any corruption of the database. But I fixed them anyway.
I have checked all of the modified code into subversion, rebuilt the CdrServer, and installed it on Dev.
It seems to me that we have the following choices:...
Whatever you choose (and especially if you choose #1), let's add a comment to the code capturing the research into the problem you've done so far.
I did a little more research to refresh my memory on the handling of xsl:message content coming back from Sablotron. Here's what Michael Key [1] has to say about the xsl:message element:
Content
A template body. There is no requirement that this should only generate text nodes; it can produce any XML fragment. What happens to any markup, however, is not defined in the standard.
Sablotron gives us back the text content of our xsl:message elements, without any escaping, but discarding any markup present. To work around this limitation we have used the technique of doing our own escaping of markup inserted into the messages inside the filter itself. It would take a careful examination of every filter which uses the xsl:message element (and there are 20 such filters right now) to verify that we don't exposes ourselves to the risk that an <xsl:value-of select="..."/> inside one of those xsl:message elements could let an ampersand (or some other character that needs escaping) slip through. It's unlikely that there are any such places, or if there are, they haven't caused any failures we know of in the past dozen years. At any rate, we know we don't want to double-escape the markup we know we're putting inside the message elements, so leaving that code alone is the right thing to do. Wouldn't hurt to comment the expectations with a summary of what I've written here.
[1] XSLT Programmer's Reference, 2nd ed.
Since OCECDR-3717 has been verified and closed we can close this related issue as well.
File Name | Posted | User |
---|---|---|
convert.txt | 2014-03-27 22:34:07 | |
convert.txt | 2014-03-26 00:19:39 |
Elapsed: 0:00:00.001479