Issue Number | 3717 |
---|---|
Summary | [Term] Term with ampersand character in it produces errors in protocols |
Created | 2014-02-10 10:25:00 |
Issue Type | Bug |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2014-02-28 13:37:12 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.118293 |
-Possible next release fix.
This is an old one Volker and I exchanged emails about.We had to replace the & with AND in the term for the protocols to be published but that also made the term not match any term in the NCI Thesaurus. The CDR ID of the term is CDR703241.
Email exchanges are below:
------Original Message
From: Englisch, Volker (NIH/NCI) [C] volker@mail.nih.gov
Sent: Tuesday, September 03, 2013 9:29 PM
To: Osei-Poku, William
Cc: Beckwith, Margaret (NIH/NCI) [E]; Juthe, Robin (NIH/NCI) [E]
Subject: RE: CBIIT-PROD: Status and Error Report for Nightly
Publishing
William,
You were right. It wasn't easy to identify what the actual problem
was. From what I can gather, all of the protocols that fail to publish
are including an InterventionNameLink CDR703241 "FD&C Blue No. 2"
and it seems that the system is getting confused with the '&' in the
name. As a test I removed the '&' on DEV and then I was able to
filter the protocols.
Would it be possible to replace the '&' with an 'and' and then add a
ticket to correct this bug?
Thanks,
Volker
___________________________________
Volker Englisch
NCI OCE - Communications Technology Branch
Contractor: Sapient Government Services
NCI: 240-276-6583
------Original Message
From: Osei-Poku, William William.Osei-Poku@icfi.com
Sent: Tuesday, August 27, 2013 1:03 PM
To: Englisch, Volker (NIH/NCI) [C]
Subject: FW: CBIIT-PROD: Status and Error Report for Nightly
Publishing
Hi Volker,
There are a couple of XSLT errors for protocol documents in the status
and error report but it is hard to tell what is wrong with the records
since they are valid and publish preview works well for them and some of
them haven't been updated in a while. Can you let me know what is wrong
them?
Thanks,
William
I tried to test this on DEV by changing the ampersand character back and forth from '&' to something else but wasn't able to save the modified document containing the ampersand in XMetaL. The reason being that the '&' character didn't get converted to a character entity because the Term DocTitle filter created ASCII output (leaving the '&' as is). I modified the title to use the output method of HTML so that this conversion takes place and I'm now able to save the modified document.
R12378: DocTitle for Term
I've added Bob and Alan as watchers. If you read my last comment
you'd notice that the DocTitle filter for Term documents didn't handle
the ampersand character properly.
From what I can gather the same problem exists in the server code in the
function
:denormalizeTerm cdrutil
.
In the program
.cpp cdrCache
is a line (line number 312) that extracts the text content of the PreferredName from the document:
->name = cdr::dom::getTextContent(node).toUtf8(); pTerm
I think this statement needs to take into account that the ampersand
character is listed as an entity.
Does that make sense?
It's not that statement itself which is the culprit, though you're on the right track. The line you quoted is assigning the name member of the Term object with its preferred name string value, which has the ampersand represented as itself (as it should be). The problem is at line 467 (and similar places) where we're rolling XML by hand, but not escaping the ampersand (or other delimiters) when wrapping the text content of the PreferredName element.
Alan:
Solution is to use cdr::entConvert(term->name).
Alan, I re-assigned this task to you for the server changes.
Please assign it back to me once you're done.
Fixing the server turned out to be more than the 5 minute job I was
hoping for.
It is essential that the entConvert() routine be called exactly once on
the text contents of the constructed XML. It can't be called on the XML
as a whole because that would convert tag delimiters to character
entities, and it can't be called twice on the same string because that
would produce errors like: "FD&C".
There are two ways to do this.
Perform the conversion when I output terms into XML from the Term
objects I've created, or
Perform the conversion when I store the name itself in the Term
object - a more efficient method since Term objects are created once
and used multiple times.
Both methods require careful study of the code to be sure that all data
is converted, and all is converted only once.
I've implemented both methods to see what was involved, but finally
decided to go with the efficient method - which also had the merit of
changing only a single statement in the code. The change was:
< pTerm->name = cdr::dom::getTextContent(node).toUtf8();
> pTerm->name = cdr::entConvert(cdr::dom::getTextContent(node)).toUtf8();
I haven't tested this on the theory that:
1. Volker already knows how to test it, and
2. Even if it's wrong it should only break the Term caching on Dev.
It shouldn't make Dev unusable for anything else.
Volker: Please test it at your earliest convenience.
I did NOT convert the attributes or the TermTypeName elements, the only
other strings extracted from the stored XML documents. Only two
attributes are allowed in Term PreferredName elements, cdr:ref and
pdqKey. Both have a strictly limited character set that does not allow
any of the characters that get converted to entities. The TermTypeNames
also have controlled values which do not use characters needing
encoding.
Whenever problems come up in the CdrCache software Bob reminds me that I
ignored the sage advice of the wise Mike Rubenstein about caching. But
I was filled with the unwarranted self-confidence of youth and I went
ahead and wrote the code.
It's not the only foolish thing in my life that I wish I could take back
and do over, and probably not the worst, but I have been humbled by the
experience.
Re-assigning to Volker for further testing.
Can you say Can of Worms?
Alan fixed the problem that was originally reported:
The publishing of a CTGovProtocol fails if the protocol links to an
intervention (a Terminology document) for which the PreferredName
contains the ampersand.
I had identified a problem with the DocTitle filter not properly
handling the ampersand.
This was followed by the discovery of the bug in the server code that
Alan fixed.
As a follow-up I see now that the PreferredTerm in the CTGovProtocol document isn't displayed properly anymore. The title
&C blue no. 2 FD
is now displayed as
&amp FD
Everything is still working correctly, as it appears, because internally only the cdr:ref CDR-ID is being used. I guess the fast denormalization code may be responsible for not handling the ampersand and cutting off the remainder of the terminology title.
William, please go ahead and test this on DEV. You should now be able to create terms with ampersand, link to those terms and publish the CTGovProtocol.
The term document appears to be fine but in the protocol document, the text is cut off. I updated the following term "3&D ultrasound-guided radiation therapy" by adding the & sign. In the protocol document, what shows up is this "3&"
CDR0000647553 - term document.
CDR0000646976 - protocol document.
The term document appears to be fine but in the protocol document, the text is cut off.
Yes, that's what I already mentioned in my comment from 1:33pm.
We will look at this but it doesn't seem to affect publishing the
documents since it appears to be a display issue only.
Assuming that I haven't done a double conversion, i.e., calling entConvert() twice on the same string, I would guess that someone noticed the problem in the past and wrote code to work around it. The workaround now causes a double conversion.
I'd be tempted to change software so that double conversions can't be done. For example, in entConvert(), we could check what follows the '&' and not convert it if it is followed by "amp;" We'd have to do the same thing in the fast denormalization code, or wherever it's happening now.
There is a potential problem however in that it is conceivable that there are times when a double conversion is actually what we want.
At this point I'd like us to think out the whole chain of conversions from the point that a CIAT user enters "FD&C" to the point when the term is published to cancer.gov. Where should the entity conversions really occur? Is it possible to reduce those to a single place? What will break if we change software so that that's the only place where conversions occur?
The part of the PreferredName not being displayed correctly if the term name contains an ampersand has been entered in a new ticket: OCECDR-3744.
Adding a reminder to myself that Alan's modified server code will have to be included in the release for this issue to be fixed on PROD.
Bob has restored the original filter that created text output:
R12490: DocTitle for Term
I was able to see the complete term name in the protocol documents I tested with so I can confirm that the truncation problem has been resolved.
Verified on DEV.
Verified on QA.
VERIFIED ON PROD.
Is it OK to close this ticket?
Yes it is. I am going to close it right away.
Elapsed: 0:00:00.000511