PDQ Issues

Issue Number	3319
Summary	[Internal] Fix handling of Unicode in filterset values
Created	2011-03-10 14:30:55
Issue Type	Improvement
Submitted By	Kline, Bob (NIH/NCI) [C]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2011-10-06 15:29:09
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.107647

Description

BZISSUE::5012
BZDATETIME::2011-03-10 14:30:55
BZCREATOR::Bob Kline
BZASSIGNEE::Volker Englisch
BZQACONTACT::Bob Kline

Please clean up the filter set handling functions in cdr.py so they handle Unicode characters correctly in the filter set names, descriptions, and notes.

Comment entered 2011-03-10 14:35:22 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-03-10 14:35:22
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

I'm assuming that this issue is related to comment #2 in OCECDR-3317 right?

Comment entered 2011-08-02 18:51:42 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-08-02 18:51:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

I remember that you told me how you wanted me to implement this but I'm afraid I forgot what you said and can't find what I wrote down (although I'm certain I will find it as soon as you told me again).

Do you remember?

Comment entered 2011-08-03 11:54:10 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-08-03 11:54:10
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

I don't remember, but here's what I see now. It looks like some of our code follows the principles we've adopted for processing strings and some (including some of my own code) violates those principles. Those principles boil down to the following: process with unicode strings, serialize with utf-8. Sometimes we intentionally ignore the second part of the rule when we know we're working exclusively in the Python world, in which we may have reason to use repr or pickle for serialization instead, but that exception doesn't apply here. The function getFilterSet() in cdr.py follows the general principle: we ask the server for the information about the filter set, and what we get is serialized as utf-8 (as is the case for all communications between the CDR client and server). We parse the response and extract the pieces of information, which the parser gives us as unicode strings, which (in the case of the set name and description) get stored directly in the FilterSet object. Same is true for the getFilterSets() function, which returns a sequence of IdAndName objects in which the name member is a unicode string.

It doesn't look like the functions to store or replace a filter set were as careful, as they appear to assume that the name and description members of the FilterSet object passed in are already encoded as utf-8. Let's document those functions (and the FilterSet class) to say that the string members of the class should be unicode, and let's have the packFilterSet() function encode the unicode strings as part of the serialization process for sending the information to the CDR server. We can do the encoding conditionally, in order to preserve backward compatibility with code that has already done the encoding. The quick and dirty way to do this would be:

if type(s) is unicode:
s = s.encode('utf-8')

A more robust approach would verify that the non-unicode string is really encoded as utf-8:

def toUnicode(s):
if type(s) is unicode:
return s
if type(s) is not str:
return unicode(s)
for e in ('ascii', 'utf-8', 'iso-8859-1'):
try:
return unicode(s, e)
except:
pass
raise Exception("unknown encoding for %s" % repr(s))

def packFilterSet(filterSet):
elems = [u"""\
<FilterSetName>%s</FilterSetName>
<FilterSetDescription>%s</FilterSetDescription>
""" % (cgi.escape(toUnicode(filterSet.name)),
cgi.escape(toUnicode(filterSet.name))]
if filterSet.notes is not None:
elems.append(u"""\
<FilterSetNotes>%s</FilterSetNotes>
""" % cgi.escape(toUnicode(filterSet.notes)))
for member in filterSet.members:
if type(member.id) is type(9):
elems.append("<FilterSet SetId='%d'/>" % member.id)
else:
elems.append("<Filter DocId='%s'/>" % member.id)
return u"".join(elems).encode('utf-8')

It also appears that the CGI code which uses these functions isn't always careful about unicode handling, as it incorrectly assumes that the string members of the FilterSet object are byte strings, not unicode. Instead of building up ASCII strings for the HTML returned to the browser, the filter-set CGI scripts should be building up unicode HTML, which the cdrcgi.sendPage() function will convert correctly at serialization time.

Hope this is helpful. If not, we can sit down together tomorrow to go through the code.

Comment entered 2011-09-12 17:48:49 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-09-12 17:48:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

(In reply to comment #0)
> Please clean up the filter set handling functions in cdr.py so they handle
> Unicode characters correctly in the filter set names, descriptions, and notes.

I have somehow managed to modify the functions to accept Unicode characters and not crash when these characters are entered as part of the filter set name, description, or notes. It is doubtful, however, if these changes can be considered a "clean up". If possible, I would like to go over these changes once again because I still have questions about some of the changes.
Also, when deleting a filter set now, a confirmation box is displayed with the name of the filter set where the Unicode characters are escaped.

Comment entered 2011-09-27 16:26:13 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-09-27 16:26:13
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

Bob helped me to look over the code and fix a couple of things that didn't come out as expected. I've now versioned those changes and copied them to MAHLER and FRANCK.
cdr.py - R10208
Modified: toUnicode(), packFilterSet(), addFilterSet(), repFilterSet(),
getFilterSet()
EditFilterSets.py - R10209
EditFilterSet.py - R10209

Comment entered 2011-10-06 15:29:09 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-10-06 15:29:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

As discussed at today's CDR status meeting, the following programs have been copied to BACH:
cdr.py - R10208
EditFilterSets.py - R10209
EditFilterSet.py - R10209

Closing issue.

Elapsed: 0:00:00.001415

CDR Tickets