Issue Number | 2443 |
---|---|
Summary | Unicode character validation in XML documents |
Created | 2008-01-09 16:50:24 |
Issue Type | Improvement |
Submitted By | priced |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2008-03-25 16:35:30 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.106771 |
BZISSUE::3823
BZDATETIME::2008-01-09 16:50:24
BZCREATOR::Sheri Khanna
BZASSIGNEE::Bob Kline
BZQACONTACT::Sheri Khanna
We need to implement a check to ensure that CDR documents do not have invalid Unicode characters. We have gotten several error messages about documents with odd characters in the PRS Protocol Upload Notification emails. This is because we have been cutting and pasting text more often. What is the best way to implement this check?
BZDATETIME::2008-01-23 13:07:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::1
Just to clarify: the troublesome characters are valid Unicode characters, but they're in the Private Use Area (U+E000 through U+F8FF). Such characters are only meaningful when used between collaborating entities which have agreed on common assignments for them, which is not the case for the CDR data.
I recommend that we implement this in the validation module of the CDR server.
Alan:
Please take a look at the following patch and tell me if you see any problems:
diff -u -r1.23 CdrValidateDoc.cpp
--- CdrValidateDoc.cpp 4 Mar 2005 02:58:56 -0000 1.23
+++ CdrValidateDoc.cpp 23 Jan 2008 18:01:43 -0000
@@ -275,6 +275,18 @@
// Will append to the error list from the document object
cdr::StringList& errList = docObj.getErrList();
+ // Check for private-use characters, which we don't allow (request #3823).
+ cdr::String serializedDoc = docObj.getXml();
+ for (size_t i = 0; i < serializedDoc.size(); ++i) {
+ unsigned int c = (unsigned int)serializedDoc[i];
+ if (c >= 0xE000 && c <= 0xF8FF) {
+ wchar_t err[80];
+ swprintf(err, L"private use character U+%04X at position %u",
+ c, i + 1);
+ errList.push_back(err);
+ }
+ }
+
// Get a parse tree for the XML
if (docObj.parseAvailable()) {
cdr::dom::Element docXml = docObj.getDocumentElement();
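For anyone who wants to try the range check outside the CDR server, here is a minimal standalone sketch of the same idea. It is not the code in CdrValidateDoc.cpp: `PrivateUseHit`, `isPrivateUse`, `findPrivateUse`, and `describe` are hypothetical names, and it assumes the document text is already available as a sequence of Unicode code points (`std::u32string`) rather than a `cdr::String`.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record for one offending character (not a CDR type).
struct PrivateUseHit {
    std::size_t position;    // 1-based offset, as in the patch's message
    unsigned int codePoint;  // the private-use code point found there
};

// True for code points in the Private Use Area, U+E000 through U+F8FF.
inline bool isPrivateUse(unsigned int c) {
    return c >= 0xE000 && c <= 0xF8FF;
}

// Scan the text and collect every private-use character with its position.
std::vector<PrivateUseHit> findPrivateUse(const std::u32string& text) {
    std::vector<PrivateUseHit> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned int c = static_cast<unsigned int>(text[i]);
        if (isPrivateUse(c))
            hits.push_back({i + 1, c});
    }
    return hits;
}

// Build a message in the same shape as the one the patch emits.
std::string describe(const PrivateUseHit& hit) {
    char buf[80];
    std::snprintf(buf, sizeof buf,
                  "private use character U+%04X at position %zu",
                  hit.codePoint, hit.position);
    return buf;
}
```

Calling `findPrivateUse` on a document's text and passing each result to `describe` yields one message per offending character, which a caller could append to an error list.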
BZDATETIME::2008-01-29 08:48:53
BZCOMMENTOR::Bob Kline
BZCOMMENT::2
Validation enhancement installed on Mahler; ready for user testing.
BZDATETIME::2008-01-29 15:48:21
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::3
Validation enhancement verified. I'm just not sure how users are going to find the characters, though. It would be impossible from the validation error message.
Example:
"Private use character UTF023 at position 7576"
In Mahler, CDR523340 and CDR521909 are example documents.
BZDATETIME::2008-01-29 16:05:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
(In reply to comment #3)
> ... I'm just not sure how users are going to find
> the characters, though. It would be impossible from the validation error
> message.
>
> Example:
>
> "Private use character UTF023 at position 7576"
I'm going to add a note about this problem to OCECDR-2354
"Validation error messages in the CDR".
The proposals I made there don't really address this problem.
That doesn't mean we shouldn't implement them, but we should try
to see if there's a way to improve error reporting on this too.
BZDATETIME::2008-01-31 16:27:26
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::5
Per our status meeting today, we decided to promote this change, but with an additional feature. We need to implement a web page that the user can easily access (perhaps from the InScopeProtocol toolbar), where they can plug the CDR ID of the document in and the web page will help them find where the problem character is.
BZDATETIME::2008-03-12 18:25:47
BZCOMMENTOR::Bob Kline
BZCOMMENT::6
(In reply to comment #5)
> Per our status meeting today, we decided to promote this change, but with an
> additional feature. We need to implement a web page that the user can easily
> access (perhaps from the InScopeProtocol toolbar), where they can plug the CDR
> ID of the document in and the web page will help them find where the
> problem character is.
>
Implemented as a JavaScript macro, which can be invoked using the popup context menu for any document type (since this is a problem which is not necessarily restricted to InScopeProtocol documents). Ready for testing on Mahler.
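The macro itself is JavaScript and is not reproduced here. As a rough illustration of what locating the character involves, the following C++ sketch reports a line, a column, and a snippet of surrounding text for each private-use character. `CharLocation`, `locatePrivateUse`, and the `radius` parameter are hypothetical, and the sketch assumes the text is held as code points in a `std::u32string`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of where an offending character sits.
struct CharLocation {
    std::size_t line;        // 1-based line number
    std::size_t column;      // 1-based column, counted in code points
    std::u32string context;  // nearby text, offending character included
};

// Find each private-use character (U+E000 through U+F8FF) and record its
// line, column, and up to `radius` characters of text on either side.
std::vector<CharLocation> locatePrivateUse(const std::u32string& text,
                                           std::size_t radius = 20) {
    std::vector<CharLocation> found;
    std::size_t line = 1;
    std::size_t column = 1;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t c = text[i];
        if (c >= 0xE000 && c <= 0xF8FF) {
            std::size_t start = i > radius ? i - radius : 0;
            // substr clamps the count at the end of the string.
            found.push_back(
                {line, column, text.substr(start, (i - start) + radius + 1)});
        }
        if (c == U'\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
    }
    return found;
}
```

Showing the surrounding text is the part that addresses the concern in comment #3, since a bare character offset gives the user nothing to search for in the editor.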
BZDATETIME::2008-03-18 16:51:05
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::7
(In reply to comment #6)
> (In reply to comment #5)
> Implemented as a JavaScript macro, which can be invoked using the popup
> context menu for any document type (since this is a problem which is not
> necessarily restricted to InScopeProtocol documents).
> Ready for testing on Mahler.
Verified in Mahler. Please promote.
BZDATETIME::2008-03-25 15:16:08
BZCOMMENTOR::Bob Kline
BZCOMMENT::8
Promoted to Bach and Franck; please check (and close if OK).
BZDATETIME::2008-03-25 16:35:30
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::9
Verified in Bach.