Issue Number | 3319 |
---|---|
Summary | [Internal] Fix handling of Unicode in filterset values |
Created | 2011-03-10 14:30:55 |
Issue Type | Improvement |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2011-10-06 15:29:09 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107647 |
BZISSUE::5012
BZDATETIME::2011-03-10 14:30:55
BZCREATOR::Bob Kline
BZASSIGNEE::Volker Englisch
BZQACONTACT::Bob Kline
Please clean up the filter set handling functions in cdr.py so they handle Unicode characters correctly in the filter set names, descriptions, and notes.
BZDATETIME::2011-03-10 14:35:22
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1
I'm assuming that this issue is related to comment #2 in OCECDR-3317 right?
BZDATETIME::2011-08-02 18:51:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2
I remember that you told me how you wanted me to implement this but I'm afraid I forgot what you said and can't find what I wrote down (although I'm certain I will find it as soon as you told me again).
Do you remember?
BZDATETIME::2011-08-03 11:54:10
BZCOMMENTOR::Bob Kline
BZCOMMENT::3
I don't remember, but here's what I see now. It looks like some of our code follows the principles we've adopted for processing strings and some (including some of my own code) violates those principles. Those principles boil down to the following: process with unicode strings, serialize with utf-8. Sometimes we intentionally ignore the second part of the rule when we know we're working exclusively in the Python world, in which we may have reason to use repr or pickle for serialization instead, but that exception doesn't apply here. The function getFilterSet() in cdr.py follows the general principle: we ask the server for the information about the filter set, and what we get is serialized as utf-8 (as is the case for all communications between the CDR client and server). We parse the response and extract the pieces of information, which the parser gives us as unicode strings, which (in the case of the set name and description) get stored directly in the FilterSet object. Same is true for the getFilterSets() function, which returns a sequence of IdAndName objects in which the name member is a unicode string.
It doesn't look like the functions to store or replace a filter set were as careful, as they appear to assume that the name and description members of the FilterSet object passed in are already encoded as utf-8. Let's document those functions (and the FilterSet class) to say that the string members of the class should be unicode, and let's have the packFilterSet() function encode the unicode strings as part of the serialization process for sending the information to the CDR server. We can do the encoding conditionally, in order to preserve backward compatibility with code that has already done the encoding. The quick and dirty way to do this would be:
if type(s) is unicode:
s = s.encode('utf-8')
A more robust approach would verify that the non-unicode string is really encoded as utf-8:
def toUnicode(s):
if type(s) is unicode:
return s
if type(s) is not str:
return unicode(s)
for e in ('ascii', 'utf-8', 'iso-8859-1'):
try:
return unicode(s, e)
except:
pass
raise Exception("unknown encoding for %s" % repr(s))
def packFilterSet(filterSet):
elems = [u"""\
<FilterSetName>%s</FilterSetName>
<FilterSetDescription>%s</FilterSetDescription>
""" % (cgi.escape(toUnicode(filterSet.name)),
cgi.escape(toUnicode(filterSet.name))]
if filterSet.notes is not None:
elems.append(u"""\
<FilterSetNotes>%s</FilterSetNotes>
""" % cgi.escape(toUnicode(filterSet.notes)))
for member in filterSet.members:
if type(member.id) is type(9):
elems.append("<FilterSet SetId='%d'/>" % member.id)
else:
elems.append("<Filter DocId='%s'/>" % member.id)
return u"".join(elems).encode('utf-8')
It also appears that the CGI code which uses these functions isn't always careful about unicode handling, as it incorrectly assumes that the string members of the FilterSet object are byte strings, not unicode. Instead of building up ASCII strings for the HTML returned to the browser, the filter-set CGI scripts should be building up unicode HTML, which the cdrcgi.sendPage() function will convert correctly at serialization time.
Hope this is helpful. If not, we can sit down together tomorrow to go through the code.
BZDATETIME::2011-09-12 17:48:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4
(In reply to comment #0)
> Please clean up the filter set handling functions in cdr.py so they
handle
> Unicode characters correctly in the filter set names, descriptions,
and notes.
I have somehow managed to modify the functions to accept Unicode
characters and not crash when these characters are entered as part of
the filter set name, description, or notes. It is doubtful, however, if
these changes can be considered a "clean up". If possible, I would like
to go over these changes once again because I still have questions about
some of the changes.
Also, when deleting a filter set now, a confirmation box is displayed
with the name of the filter set where the Unicode characters are
escaped.
BZDATETIME::2011-09-27 16:26:13
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
Bob helped me to look over the code and fix a couple of things that
didn't come out as expected. I've now versioned those changes and copied
them to MAHLER and FRANCK.
cdr.py - R10208
Modified: toUnicode(), packFilterSet(), addFilterSet(),
repFilterSet(),
getFilterSet()
EditFilterSets.py - R10209
EditFilterSet.py - R10209
BZDATETIME::2011-10-06 15:29:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6
As discussed at today's CDR status meeting, the following programs
have been copied to BACH:
cdr.py - R10208
EditFilterSets.py - R10209
EditFilterSet.py - R10209
Closing issue.
Elapsed: 0:00:00.000365