CDR Tickets

Issue Number 2443
Summary Unicode character validation in XML documents
Created 2008-01-09 16:50:24
Issue Type Improvement
Submitted By priced
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2008-03-25 16:35:30
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.106771
Description

BZISSUE::3823
BZDATETIME::2008-01-09 16:50:24
BZCREATOR::Sheri Khanna
BZASSIGNEE::Bob Kline
BZQACONTACT::Sheri Khanna

We need to implement a check to ensure that CDR documents do not contain invalid Unicode characters. We have received several error messages about documents with odd characters in the PRS Protocol Upload Notification emails. This is because we have been cutting and pasting text more often. What is the best way to implement this check?

Comment entered 2008-01-23 13:07:21 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-01-23 13:07:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::1

Just to clarify: the troublesome characters are valid Unicode characters, but they're in the Private Use Area (U+E000 through U+F8FF). Characters in that range are only meaningful when used between collaborating entities which have agreed on common assignments for them, which is not the case for the CDR data.

I recommend that we implement this in the validation module of the CDR server.

Alan:

Please take a look at the following patch and tell me if you see any problems:

diff -u -r1.23 CdrValidateDoc.cpp
--- CdrValidateDoc.cpp  4 Mar 2005 02:58:56 -0000       1.23
+++ CdrValidateDoc.cpp  23 Jan 2008 18:01:43 -0000
@@ -275,6 +275,18 @@
     // Will append to the error list from the document object
     cdr::StringList& errList = docObj.getErrList();

+    // Check for private-use characters, which we don't allow (request #3823).
+    cdr::String serializedDoc = docObj.getXml();
+    for (size_t i = 0; i < serializedDoc.size(); ++i) {
+        unsigned int c = (unsigned int)serializedDoc[i];
+        if (c >= 0xE000 && c <= 0xF8FF) {
+            wchar_t err[80];
+            swprintf(err, L"private use character U+%04X at position %u",
+                     c, i + 1);
+            errList.push_back(err);
+        }
+    }
+
     // Get a parse tree for the XML
     if (docObj.parseAvailable()) {
         cdr::dom::Element docXml = docObj.getDocumentElement();
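
For reference, the check in the patch can be exercised outside the CDR server with a small standalone sketch. This is only an illustration, not the code that was installed: `isPrivateUse` and `findPrivateUseChars` are hypothetical names, `std::wstring` stands in for `cdr::String`, and the sketch passes the buffer size to `swprintf`, as the standard C library form requires (the patch appears to rely on a compiler-specific form without that argument).

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// True for code points in the Private Use Area named in the ticket
// (U+E000 through U+F8FF).
inline bool isPrivateUse(unsigned int c) {
    return c >= 0xE000 && c <= 0xF8FF;
}

// Hypothetical stand-in for the loop in the patch: returns one message per
// private-use character, reporting 1-based positions as the patch does.
std::vector<std::wstring> findPrivateUseChars(const std::wstring& xml) {
    std::vector<std::wstring> errors;
    for (std::size_t i = 0; i < xml.size(); ++i) {
        unsigned int c = static_cast<unsigned int>(xml[i]);
        if (isPrivateUse(c)) {
            wchar_t err[80];
            // Standard swprintf takes the buffer length as its second argument.
            int n = std::swprintf(err, sizeof err / sizeof err[0],
                                  L"private use character U+%04X at position %lu",
                                  c, static_cast<unsigned long>(i + 1));
            assert(n > 0);  // this message always fits in 80 wide characters
            (void)n;
            errors.push_back(err);
        }
    }
    return errors;
}
```

Calling `findPrivateUseChars(L"ab\uF023c")` yields a single message for position 3. Like the patch, the sketch checks only the range U+E000 through U+F8FF; Unicode also defines supplementary private-use planes (U+F0000 through U+FFFFD and U+100000 through U+10FFFD), which neither covers.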

Comment entered 2008-01-29 08:48:53 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-01-29 08:48:53
BZCOMMENTOR::Bob Kline
BZCOMMENT::2

Validation enhancement installed on Mahler; ready for user testing.

Comment entered 2008-01-29 15:48:21 by priced

BZDATETIME::2008-01-29 15:48:21
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::3

Validation enhancement verified. I'm just not sure how users are going to find the characters, though. It would be impossible from the validation error message.

Example:

"Private use character U+F023 at position 7576"

In Mahler, CDR523340 and CDR521909 are example documents.

Comment entered 2008-01-29 16:05:48 by alan

BZDATETIME::2008-01-29 16:05:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

(In reply to comment #3)
> ... I'm just not sure how users are going to find
> the characters, though. It would be impossible from the validation error
> message.
>
> Example:
>
> "Private use character U+F023 at position 7576"

I'm going to add a note about this problem to OCECDR-2354
"Validation error messages in the CDR".

The proposals I made there don't really address this problem.
That doesn't mean we shouldn't implement them, but we should try
to see if there's a way to improve error reporting on this too.

Comment entered 2008-01-31 16:27:26 by priced

BZDATETIME::2008-01-31 16:27:26
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::5

Per our status meeting today, we decided to promote this change, but with an additional feature. We need to implement a web page that the user can easily access (perhaps from the InScopeProtocol toolbar), where they can enter the CDR ID of the document and the web page will help them find where the problem character is.

Comment entered 2008-03-12 18:25:47 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-03-12 18:25:47
BZCOMMENTOR::Bob Kline
BZCOMMENT::6

(In reply to comment #5)
> Per our status meeting today, we decided to promote this change, but with an
> additional feature. We need to implement a web page that the user can easily
> access (perhaps from the InScopeProtocol toolbar), where they can plug the CDR
> ID of the document in and the web page will help them find where the
> problem character is.
>

Implemented as a JavaScript macro, which can be invoked using the popup context menu for any document type (since this is a problem which is not necessarily restricted to InScopeProtocol documents). Ready for testing on Mahler.
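
The macro itself is not included in this ticket. As a rough illustration of the kind of lookup such a tool needs, the sketch below (in C++ rather than JavaScript, to match the validation patch) turns the 1-based position from the validation message into a line, a column, and a snippet of surrounding text. `CharLocation`, `locate`, and the `radius` parameter are hypothetical names, and the sketch assumes the position refers to the same serialized text that is being searched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Where a reported character sits in the text (hypothetical structure).
struct CharLocation {
    std::size_t line;      // 1-based line number
    std::size_t column;    // 1-based column within that line
    std::wstring context;  // the character plus up to `radius` neighbors per side
};

// Converts a 1-based position, as reported by the validation message,
// into a line/column pair and a short context string.
CharLocation locate(const std::wstring& text, std::size_t position,
                    std::size_t radius = 20) {
    assert(position >= 1 && position <= text.size());
    const std::size_t index = position - 1;
    CharLocation loc{1, 1, L""};
    // Count newlines before the character to find its line and column.
    for (std::size_t i = 0; i < index; ++i) {
        if (text[i] == L'\n') {
            ++loc.line;
            loc.column = 1;
        } else {
            ++loc.column;
        }
    }
    // Clamp the context window to the bounds of the text.
    const std::size_t start = index > radius ? index - radius : 0;
    const std::size_t end = std::min(text.size(), index + radius + 1);
    loc.context = text.substr(start, end - start);
    return loc;
}
```

With the text `L"<a>\n<b>x\uF023y</b>\n</a>"` and position 9, `locate` reports line 2, column 5, which is easier to act on than a bare offset.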

Comment entered 2008-03-18 16:51:05 by priced

BZDATETIME::2008-03-18 16:51:05
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::7

(In reply to comment #6)
> (In reply to comment #5)
> Implemented as a JavaScript macro, which can be invoked using the popup
> context menu for any document type (since this is a problem which is not
> necessarily restricted to InScopeProtocol documents).
> Ready for testing on Mahler.

Verified in Mahler. Please promote.

Comment entered 2008-03-25 15:16:08 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-03-25 15:16:08
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

Promoted to Bach and Franck; please check (and close if OK).

Comment entered 2008-03-25 16:35:30 by priced

BZDATETIME::2008-03-25 16:35:30
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::9

Verified in Bach.
