PDQ Issues

Issue Number	3717
Summary	[Term] Term with ampersand character in it produces errors in protocols
Created	2014-02-10 10:25:00
Issue Type	Bug
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2014-02-28 13:37:12
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.118293

Description

-Possible next release fix.

This is an old one Volker and I exchanged emails about.We had to replace the & with AND in the term for the protocols to be published but that also made the term not match any term in the NCI Thesaurus. The CDR ID of the term is CDR703241.

Email exchanges are below:

---~~Original Message~~---
From: Englisch, Volker (NIH/NCI) [C] volker@mail.nih.gov
Sent: Tuesday, September 03, 2013 9:29 PM
To: Osei-Poku, William
Cc: Beckwith, Margaret (NIH/NCI) [E]; Juthe, Robin (NIH/NCI) [E]
Subject: RE: CBIIT-PROD: Status and Error Report for Nightly Publishing

William,

You were right. It wasn't easy to identify what the actual problem was. From what I can gather, all of the protocols that fail to publish are including an InterventionNameLink CDR703241 "FD&C Blue No. 2" and it seems that the system is getting confused with the '&' in the name. As a test I removed the '&' on DEV and then I was able to filter the protocols.
Would it be possible to replace the '&' with an 'and' and then add a ticket to correct this bug?

Thanks,

Volker
___________________________________
Volker Englisch
NCI OCE - Communications Technology Branch
Contractor: Sapient Government Services
NCI: 240-276-6583

---~~Original Message~~---
From: Osei-Poku, William William.Osei-Poku@icfi.com
Sent: Tuesday, August 27, 2013 1:03 PM
To: Englisch, Volker (NIH/NCI) [C]
Subject: FW: CBIIT-PROD: Status and Error Report for Nightly Publishing

Hi Volker,
There are a couple of XSLT errors for protocol documents in the status and error report but it is hard to tell what is wrong with the records since they are valid and publish preview works well for them and some of them haven't been updated in a while. Can you let me know what is wrong them?

Thanks,
William

Comment entered 2014-02-25 13:49:40 by Englisch, Volker (NIH/NCI) [C]

I tried to test this on DEV by changing the ampersand character back and forth from '&' to something else but wasn't able to save the modified document containing the ampersand in XMetaL. The reason being that the '&' character didn't get converted to a character entity because the Term DocTitle filter created ASCII output (leaving the '&' as is). I modified the title to use the output method of HTML so that this conversion takes place and I'm now able to save the modified document.

R12378: DocTitle for Term

Comment entered 2014-02-26 10:54:42 by Englisch, Volker (NIH/NCI) [C]

I've added Bob and Alan as watchers. If you read my last comment you'd notice that the DocTitle filter for Term documents didn't handle the ampersand character properly.
From what I can gather the same problem exists in the server code in the function

cdrutil:denormalizeTerm

In the program

cdrCache.cpp

is a line (line number 312) that extracts the text content of the PreferredName from the document:

pTerm->name = cdr::dom::getTextContent(node).toUtf8();

I think this statement needs to take into account that the ampersand character is listed as an entity.
Does that make sense?

Comment entered 2014-02-26 11:27:19 by Kline, Bob (NIH/NCI) [C]

It's not that statement itself which is the culprit, though you're on the right track. The line you quoted is assigning the name member of the Term object with its preferred name string value, which has the ampersand represented as itself (as it should be). The problem is at line 467 (and similar places) where we're rolling XML by hand, but not escaping the ampersand (or other delimiters) when wrapping the text content of the PreferredName element.

Alan:

Solution is to use cdr::entConvert(term->name).

Comment entered 2014-02-27 12:05:41 by Englisch, Volker (NIH/NCI) [C]

Alan, I re-assigned this task to you for the server changes.
Please assign it back to me once you're done.

Comment entered 2014-02-27 23:29:26 by alan

Fixing the server turned out to be more than the 5 minute job I was
hoping for.

It is essential that the entConvert() routine be called exactly once on
the text contents of the constructed XML.  It can't be called on the XML
as a whole because that would convert tag delimiters to character
entities, and it can't be called twice on the same string because that
would produce errors like: "FD&amp;amp;C".

There are two ways to do this.

    Perform the conversion when I output terms into XML from the Term
    objects I've created, or

    Perform the conversion when I store the name itself in the Term
    object - a more efficient method since Term objects are created once
    and used multiple times.

Both methods require careful study of the code to be sure that all data
is converted, and all is converted only once.

I've implemented both methods to see what was involved, but finally
decided to go with the efficient method - which also had the merit of
changing only a single statement in the code.  The change was:

< pTerm->name = cdr::dom::getTextContent(node).toUtf8();
> pTerm->name = cdr::entConvert(cdr::dom::getTextContent(node)).toUtf8();

I haven't tested this on the theory that:

    1. Volker already knows how to test it, and

    2. Even if it's wrong it should only break the Term caching on Dev.
       It shouldn't make Dev unusable for anything else.

Volker: Please test it at your earliest convenience.

I did NOT convert the attributes or the TermTypeName elements, the only
other strings extracted from the stored XML documents.  Only two
attributes are allowed in Term PreferredName elements, cdr:ref and
pdqKey.  Both have a strictly limited character set that does not allow
any of the characters that get converted to entities.  The TermTypeNames
also have controlled values which do not use characters needing
encoding.


Whenever problems come up in the CdrCache software Bob reminds me that I
ignored the sage advice of the wise Mike Rubenstein about caching.  But
I was filled with the unwarranted self-confidence of youth and I went
ahead and wrote the code.

It's not the only foolish thing in my life that I wish I could take back
and do over, and probably not the worst, but I have been humbled by the
experience.

Comment entered 2014-02-27 23:31:28 by alan

Re-assigning to Volker for further testing.

Comment entered 2014-02-28 13:33:02 by Englisch, Volker (NIH/NCI) [C]

Can you say Can of Worms?

Alan fixed the problem that was originally reported:
The publishing of a CTGovProtocol fails if the protocol links to an intervention (a Terminology document) for which the PreferredName contains the ampersand.

I had identified a problem with the DocTitle filter not properly handling the ampersand.
This was followed by the discovery of the bug in the server code that Alan fixed.

As a follow-up I see now that the PreferredTerm in the CTGovProtocol document isn't displayed properly anymore. The title

FD&amp;C blue no. 2

is now displayed as

FD&amp;amp

Everything is still working correctly, as it appears, because internally only the cdr:ref CDR-ID is being used. I guess the fast denormalization code may be responsible for not handling the ampersand and cutting off the remainder of the terminology title.

Comment entered 2014-02-28 13:37:12 by Englisch, Volker (NIH/NCI) [C]

William, please go ahead and test this on DEV. You should now be able to create terms with ampersand, link to those terms and publish the CTGovProtocol.

Comment entered 2014-02-28 14:49:11 by Osei-Poku, William (NIH/NCI) [C]

The term document appears to be fine but in the protocol document, the text is cut off. I updated the following term "3&D ultrasound-guided radiation therapy" by adding the & sign. In the protocol document, what shows up is this "3&amp"

CDR0000647553 - term document.
CDR0000646976 - protocol document.

Comment entered 2014-02-28 14:55:57 by Englisch, Volker (NIH/NCI) [C]

The term document appears to be fine but in the protocol document, the text is cut off.

Yes, that's what I already mentioned in my comment from 1:33pm.
We will look at this but it doesn't seem to affect publishing the documents since it appears to be a display issue only.

Comment entered 2014-02-28 17:33:29 by alan

Assuming that I haven't done a double conversion, i.e., calling entConvert() twice on the same string, I would guess that someone noticed the problem in the past and wrote code to work around it. The workaround now causes a double conversion.

I'd be tempted to change software so that double conversions can't be done. For example, in entConvert(), we could check what follows the '&' and not convert it if it is followed by "amp;" We'd have to do the same thing in the fast denormalization code, or wherever it's happening now.

There is a potential problem however in that it is conceivable that there are times when a double conversion is actually what we want.

At this point I'd like us to think out the whole chain of conversions from the point that a CIAT user enters "FD&C" to the point when the term is published to cancer.gov. Where should the entity conversions really occur? Is it possible to reduce those to a single place? What will break if we change software so that that's the only place where conversions occur?

Comment entered 2014-03-14 11:11:26 by Englisch, Volker (NIH/NCI) [C]

The part of the PreferredName not being displayed correctly if the term name contains an ampersand has been entered in a new ticket: OCECDR-3744.

Comment entered 2014-03-14 11:22:28 by Englisch, Volker (NIH/NCI) [C]

Adding a reminder to myself that Alan's modified server code will have to be included in the release for this issue to be fixed on PROD.

Comment entered 2014-04-01 19:55:29 by Englisch, Volker (NIH/NCI) [C]

Bob has restored the original filter that created text output:

R12490: DocTitle for Term

Comment entered 2014-04-04 17:29:33 by Osei-Poku, William (NIH/NCI) [C]

I was able to see the complete term name in the protocol documents I tested with so I can confirm that the truncation problem has been resolved.

Verified on DEV.

Comment entered 2014-04-15 15:55:18 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA.

Comment entered 2014-05-07 09:15:30 by Osei-Poku, William (NIH/NCI) [C]

VERIFIED ON PROD.

Comment entered 2014-05-07 10:08:26 by Englisch, Volker (NIH/NCI) [C]

Is it OK to close this ticket?

Comment entered 2014-05-09 10:20:38 by Osei-Poku, William (NIH/NCI) [C]

Yes it is. I am going to close it right away.

Elapsed: 0:00:00.001382

CDR Tickets