CDR Tickets

Issue Number 1889
Summary Address external_map length failures
Created 2006-04-20 16:53:15
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2006-11-02 13:53:25
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.106217
Description

BZISSUE::2057
BZDATETIME::2006-04-20 16:53:15
BZCREATOR::Bob Kline
BZASSIGNEE::Alan Meyer
BZQACONTACT::Lakshmi Grama

Currently the filtering module throws an exception when it tries to record a mapping value longer than the external_map table's value column can hold. Please modify the software to detect values which are too long and either truncate and store or discard with a warning. Discuss with Lakshmi which approach she prefers. I have bumped up the column size from 256 to 356 on Bach and in tables.sql.

Comment entered 2006-05-09 22:22:52 by alan

BZDATETIME::2006-05-09 22:22:52
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

Looking to see where the external_map table is used, I found
three places where it can be updated:

CdrFilter.cpp
CdrExternalMap.cpp
LoadExternalIds.py

There are also about 10 places where the table is read, of which
three look like they are potentially impacted by value
truncations. The others look mainly like reports upon which no
logic hinges, i.e., a program is not comparing a value to what it
found in the table and making a decision based upon the answer,
or lookups of columns other than the long length value column.

I've considered various options for what to do about these. My
recommendations are as follows:

1. Convert the two CdrServer routines to both call one place to
update the external_map table.

2. Change that one place to do the following:

If a value does not fit in the table then

Truncate the value to column width.

Replace the last three characters of the truncated data
with "...".

Write a warning to a log file.

3. Modify the output of the CdrAddExternalMapping to return the
warning in its response element. This would be in addition
to the warning written to the log file.

4. Modify LoadExternalIds.py (assuming we still use that
program) to login to the CdrServer and call
cdr.addExternalMapping() instead of modifying the table
directly.

I like this better than discarding over long fields because the
visibly (because of "...") truncated values give us more
information about what happened in a place that will show up in
any reports we ever write that read the external_map table.

I'm not sure which, if any, of the lookup programs need to be
modified. I'll look at them again if we proceed with this plan.
If we do none of them, we will probably still be no worse off
with them than we are now.

Comment entered 2006-05-16 22:29:27 by alan

BZDATETIME::2006-05-16 22:29:27
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

Based on our discussion of this issue at last Thursday's status
meeting, I am implementing the new plan we agreed upon instead of
the one described in the previous comment. I will:

1. Add a new column to the external map table.

ALTER TABLE external_map
ADD mappable CHAR NOT NULL DEFAULT 'Y'

This will add it to every row in the table, setting all of
them to the value 'Y'.

2. Modify the update routines to support this.

3. Modify the user interface to the mapping tables to enable
users to mark any value as mappable or not.

4. Modify the lookup software to not map any values found in the
table which are marked as mappable='N'.

I'm not sure yet which or how many places are affected.
There are about 10 places in our code where we perform
external_map table lookups, but there may only be a few of
these that require modification.

In our meeting we also said that I would modify the External Map
Failures report, but I'm not sure any changes are required in the
report itself. If a match is found on a non-mappable value, it
will not be treated as an error and need not appear on the
report. If we modify the software that collects the data for the
map failures report we won't need to modify the report itself.

I'll have to research that to be sure.

Comment entered 2006-05-26 00:02:11 by alan

BZDATETIME::2006-05-26 00:02:11
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

I've learned a lot since entering the last Bugzilla comment.
Here is my current thinking about how to handle the external map
issues.

Large numbers of non-mappable values seem to conform to a small
number of patterns. For example, there are 1282 values in the
extern_map table on bach that begin with:

"For additional information regarding investigative %"

However, the complete map value is always unique. The "Import
CTGovProtocol" filter appends city|state|zip|country onto each
one, making it unique.

Here's what I'm thinking we should do with all of these, using
the new "mappable" column in the external_map table:

1. Execute a batch SQL query to set mappable='N' for all values
matching the above pattern (and with no mapped doc id), and
for any other patterns that are known to be non-mappable
values.

This will avoid having to individually mark 1,282+ values as
non-mappable.

We might create a program to enable users to do batch
processing on patterns on a regular basis. It would work
something like this:

A user enters a pattern.

The program updates all external_map rows where the value
matches the pattern AND there is no mapped document ID.

I'd propose to do that later, if it turns out that we need
it. It might be that the changes in 3 below, or the optional
changes in 4 below, will be enough.

2. Modify the External Map Failures report to exclude these.

I have experimentally added a checkbox to the report that
asks whether to include non-mappable values or not, with a
default value of not.

Checking the box would be rare, but it allows the user to see
what's going on behind the scenes.

3. Modify the External Map Editor to support marking values
non-mappable.

a. Add a new option to the selection screen.

Currently a user can enter any combination of the
following query parameters:

Map usage.
Pattern for values.
Document ID.

I would add a fourth query parameter that would allow a
user to ask for all values that are:

Not currently mapped.
Marked as mappable.

Naturally, if a user selected this, then he would not be
allowed to enter a document ID.

b. Add an additional check box to mark a value as
non-mappable.

In order to mark a currently mapped value as non-mappable,
a user would have to check two boxes:

Delete mapping.
Make non-mappable.

If the value is not mapped, he need only check the second
box.

4. Leave the CTGov import code alone.

The program would continue to add new unmapped values to the
external_map table. It would not try to do any pattern
matching to avoid such mappings.

We could change our minds on this if the burden of marking
specific value patterns as non-mappable turned out to be a
significant chore.

Does all this make sense?

Does anyone have comments, questions, or corrections?

Comment entered 2006-05-31 00:20:32 by alan

BZDATETIME::2006-05-31 00:20:32
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I think I've finally got all this working. There was nothing
terribly hard to do, but there was a lot of existing code to
understand, and a lot of detailed changes to get right in the
external mapping editor form.

To test this, there are two administrative screens to be checked.

1. Reports / General Reports / External Map Failures Report.

This now has a new checkbox. If checked, then values that
have been marked as unmappable will not be included in the
report. This should eliminate a lot of noisy clutter.

If the box is checked, all the clutter comes back - which is
useful for testing to be sure that the mappable /
non-mappable flag on each value is really doing what it's
supposed to be doing, or to check that no mappable values got
inadvertently made non-mappable.

2. Update Mapping Table.

I made a number of changes to this form.

a. Input field explanations.

To begin with, I added some gloss to each of the input
fields to explain what they do. I did that because I've
added two input checkboxes and I thought the form might be
getting hard to understand. Figuring out the right gloss
text was also an exercise for testing of my own
understanding of the form.

b. New checkboxes.

The two new checkboxes are:

Only include unmapped values.

Check this box if you don't want to look at values
that are already mapped, i.e., you're just planning
to fix unmapped documents.

Also include unmappable values.

As with the External Map Failures Report, I've left
all the unmappable values off this form by default.

If you wish to see the unmappable values, for example
to make one or more of them mappable again, check
this box.

c. Added new "Mappable?" checkbox.

I added this checkbox to each row of the report. It shows
the value of the new "mappable" column in the external_map
table.

Unchecking this box will make a value non-mappable (set
the value of mappable='N'). Checking it makes the value
mappable again (mappable='Y').

Remember that, if you need to make a non-mappable value
mappable again, the "Also include unmappable values"
checkbox at the top of the screen must also be checked.

d. Changed the text of the "Delete mapping?" checkbox.

I changed it to "Del map?" to get a bit more screen real
estate for each row.

e. Changed condition for "Del map?" appearance.

If a user checks the new "Only include unmapped values"
checkbox, he'll only get values that are unmapped. It
makes no sense to offer a "Del map?" checkbox on that
screen. So I only include the "Del map?" if mapped values
are included.

f. Validated use of "Mappable?"

If a user makes a value non-mappable (unchecks
"Mappable?"), but there is a CDRID, I refuse to make the
change and instead produce an error message. My theory is
that if a CDRID exists and mapping is disallowed, then the
user may have made a mistake, i.e., inadvertently added a
CDRID to the wrong row, or inadvertently unchecked a
"Mappable?" box. So rather than assume one or the other
is right, I just refuse to make any changes to such a row.

This validation applies to either type of change -
adding a CDRID to an unmappable value, or unmapping a
value that already has a CDRID.

3. Made some values non-mappable.

For testing, I made all of the values beginning with "For
additional information regarding investigative ..."
non-mappable. There are various variations on this,
including some beginning with a quotation mark, that are
still mappable.

If you check the appropriate boxes on the report or update
form, the non-mappable values should appear or disappear.

Finally, I've added a fair number of comments to the source code
for the program so that I'll understand it better if it needs to
be maintained again in the future. I also passed it through
html-tidy and cleaned up some html validation warnings.

I think we're ready for testing on Mahler.

Comment entered 2006-07-12 11:08:24 by priced

BZDATETIME::2006-07-12 11:08:24
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::5

A few modification requests, if possible:

1. It is possible to delete mapping that has a CDR ID. Could the same principles that you used for validating 'mappable' be applied to 'del mapping'? Or, if we could get a warning message that we are about to delete mapping that has a CDR ID, that would work too.

2. It allows a mappable value with a CDRID to also be bogus. Perhaps, bogus being checked does not really work when a value is marked as mappable with a CDR ID? Should the program only allow one or the other?

3. When we receive error messages, the message refers to to rows. (Example, 'Can't make row 10656 non-mappable if it's already mapped').
Is there any way for the message to refer to the variant string rather
than the row, since we do not see row numbers in the interface?

Can't make row 10656 non-mappable if it's already mapped

Comment entered 2006-07-20 17:00:28 by alan

BZDATETIME::2006-07-20 17:00:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

(In reply to comment #5)
> A few modification requests, if possible:
>
> 1. It is possible to delete mapping that has a CDR ID. Could the same
> principles that you used for validating 'mappable' be applied to 'del
> mapping'?

Done.

> 2. It allows a mappable value with a CDRID to also be bogus. Perhaps, bogus
> being checked does not really work when a value is marked as mappable with a
> CDR ID? Should the program only allow one or the other?

I don't really know the answer to your question so I decided not to make
any change here. I'd like to consult with Bob before we make this change.

> 3. When we receive error messages, the message refers to to rows. (Example,
> 'Can't make row 10656 non-mappable if it's already mapped').
> Is there any way for the message to refer to the variant string rather
> than the row, since we do not see row numbers in the interface?

Done. I also added the string value display to a couple of
other places that looked to me like they might benefit from it.

I also made a couple of other changes to defend against erroneous
inputs that could cause the program to crash.

All changes are on Mahler.

Comment entered 2006-07-24 14:53:34 by priced

BZDATETIME::2006-07-24 14:53:34
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::7

When saving changes to a document that I changed to 'del mapping', I am receiving a Python error message (attached).

Comment entered 2006-07-24 14:53:34 by priced

Attachment CDR Error.htm has been added with description: Mapping Error

Comment entered 2006-07-25 16:55:41 by alan

BZDATETIME::2006-07-25 16:55:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8

I think I've fixed the bug reported in comment #7.

Please test again.

Thanks for your patience.

Comment entered 2006-07-27 17:32:59 by priced

BZDATETIME::2006-07-27 17:32:59
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::9

Alan, I receive this message when trying to delete a mapped doc, but the document DOES actually gets deleted. Also, I get this message even if there is no CDRID and bogus (and mappable) are checked. It seems that if you get this error message, the document should be deleted.

'Can't make "XXXX" non-mappable if it's already mapped. Must also erase CDRID.'

Maybe we should receive an error message telling us why it can't be deleted and also another message confirming that it will be deleted (if it can be).

Does this make sense?

I've attached a sample error message that I am receiving.

Comment entered 2006-07-27 17:32:59 by priced

Attachment CDRMapping_AL.htm has been added with description: Mapping Table message

Comment entered 2006-07-27 17:45:32 by priced

BZDATETIME::2006-07-27 17:45:32
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::10

Alan, I receive this message when trying to delete a mapped doc, but the document DOES actually gets deleted. Also, I get this message even if there is no CDRID and bogus (and mappable) are checked. It seems that if you get this error message, the document should be deleted.

'Can't make "XXXX" non-mappable if it's already mapped. Must also erase CDRID.'

Maybe we should receive an error message telling us why it can't be deleted and also another message confirming that it will be deleted (if it can be).

Does this make sense?

I've attached a sample error message that I am receiving.

Comment entered 2006-07-27 17:45:32 by priced

Attachment CDRMapping_AL.htm has been added with description: Mapping Table message

Comment entered 2006-07-27 20:38:22 by alan

BZDATETIME::2006-07-27 20:38:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I'm not sure I understand the problem in the last comment.

I just deleted a number of mappings, and made some things
non-mappable. I seem to be getting what I think are the
right error messages at the right times. It's possible that
I'm not doing the same thing you're doing, or perhaps the
program is doing what I think is right, but it's not really
right.

Let's see if we can step through it on the phone on Tuesday.
I'll give you a call or email on Tuesday morning.

Comment entered 2006-08-01 15:34:41 by alan

BZDATETIME::2006-08-01 15:34:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

I've modified the program so that, if you delete the map
value altogether (by checking the "Del map?" checkbox),
it completely ignores the bogus and mappable boxes. They
can be checked or unchecked. It doesn't matter. No
validation will be done on them since the delete mapping
overrides any bogus or mappable settings.

Once again, it's ready for testing.

Comment entered 2006-08-02 17:27:40 by priced

BZDATETIME::2006-08-02 17:27:40
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::13

(In reply to comment #12)
> I've modified the program so that, if you delete the map
> value altogether (by checking the "Del map?" checkbox),
> it completely ignores the bogus and mappable boxes.

Alan, this works much better now. We no longer receive error messages that do not really pertain to deleting. But, what I had hoped for was something asking the user to confirm (or warn the user) that they were deleting the document before it is actually deleted. Is this possible? If not, we can promote what we have.

Comment entered 2006-08-03 10:53:09 by alan

BZDATETIME::2006-08-03 10:53:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

> Alan, this works much better now. We no longer receive error messages that do
> not really pertain to deleting. But, what I had hoped for was something asking
> the user to confirm (or warn the user) that they were deleting the document
> before it is actually deleted. Is this possible? If not, we can promote what
> we have.

Let's talk about it at the status meeting. The program will
already stop you if you try to delete a value that is mapped
to a CDR ID. I had thought that would do the trick, requiring
the user to explicitly erase the CDR ID as well as check the
"Del map?" box. Do you also need it to warn you if you delete
an unmapped value? Or is it that you want a confirmation of
any delete?

Comment entered 2006-08-03 11:26:29 by priced

BZDATETIME::2006-08-03 11:26:29
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::15

(In reply to comment #14)
> Do you also need it to warn you if you delete
> an unmapped value? Or is it that you want a confirmation of
> any delete?

I was hoping for a confirmation of any delete.

Comment entered 2006-08-03 16:12:14 by priced

BZDATETIME::2006-08-03 16:12:14
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::16

Per the status meeting, please go ahead and promote as is. Adding in another message to confirm the delete is not necessary.

Comment entered 2006-08-03 19:14:39 by alan

BZDATETIME::2006-08-03 19:14:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

Promoted to Bach.

Comment entered 2006-08-04 15:42:37 by priced

BZDATETIME::2006-08-04 15:42:37
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::18

In Bach, when hitting the 'get values' button, we are getting a python error message:

A problem occurred in a Python script.

d:\cdr\Log\tmpqlse8c.html contains the description of this error.

Comment entered 2006-08-07 17:53:58 by alan

BZDATETIME::2006-08-07 17:53:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::19

When I copied the program over there was something I forgot to
do.

I think it works now.

I also made 1700 values non-mappable. They were all of the form:
"For additional information regarding investigative sites for
this trial...", including various misspellings.

Comment entered 2006-08-09 16:57:36 by priced

BZDATETIME::2006-08-09 16:57:36
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::20

It works now. I really do want to close this issue, but before I do I want to let you know that it is taking 5 to 6 minutes for the Investigators, Officials and Facilities tables to open. The timing doesn't really change much according to what options (checkboxes) I am using. I realize these particular tables are very large so there is probably not too much that can be done about this, is there?

Comment entered 2006-08-10 11:46:15 by alan

BZDATETIME::2006-08-10 11:46:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::21

(In reply to comment #20)
> It works now. I really do want to close this issue, but before I do I want to
> let you know that it is taking 5 to 6 minutes for the Investigators, Officials
> and Facilities tables to open. The timing doesn't really change much according
> to what options (checkboxes) I am using. I realize these particular tables are
> very large so there is probably not too much that can be done about this, is
> there?

It turns out that there are some things I can do. I found one
trick that dramatically speeds up the form construction time.
Now I'm looking at the time spent in processing the form
variables in the browser and server.

I'll post a new version before the end of the day.

Comment entered 2006-08-10 20:53:19 by alan

BZDATETIME::2006-08-10 20:53:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22

I've made a number of modifications to the program and achieved
some dramatic efficiencies.

The server time is now negligible. I've gone from more than 5
minutes (how much more I don't know because the old program timed
out after that) to less than 3 seconds in the actual Python code.

This was done by:

1. Changing the form building strategy from appending to a
string to appending to a list and building a string at the
end. This is a common Python optimization that takes
minutes off the construction of multi-megabyte strings made
from small bits and pieces.

2. Changing six lines with the following construct:

from: X = fields and fields.getvalue('whatever') or None

to: X = fields.getvalue('whatever') or None

This saved almost 3 minutes in processing the investigators
form!!

Bob and Volker will scratch their heads at this and ask, How
is that possible? Well I don't know how, never expected it,
and wouldn't have believed it myself. But I've got data to
prove that it happened and can reproduce it easily.

Firefox is still very slow in processing the browser portion of
the work. It takes minutes to download and render the form, and
more minutes to pack up all the form variables and post them back
to the server. The browser becomes totally unresponsive while
doing all that and the user's workstation is likely to slow way
down.

I've reduced that time significantly by changing the mapped
string values from read-only "input" fields to plain, ordinary
text. That reduced upload times from 6 minutes to 2. However
it's still a multi-minute operation and the download time, while
improved, is not as much improved by this technique.

I think I might be able to improve things further by a major
change in the report format, but that will take significant work
and testing, and may not be necessary. Internet Explorer handles
the form very efficiently. If users use IE for this task,
they'll find that updating mapping values is now a very
manageable task, even for very large sets of mappings.

The revised form is on Mahler only. I've left in my timing
profiles in case anyone wants to see these surprising
optimizations in action.

Comment entered 2006-08-11 17:04:16 by priced

BZDATETIME::2006-08-11 17:04:16
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::23

(In reply to comment #22)
> The revised form is on Mahler only. I've left in my timing
> profiles in case anyone wants to see these surprising
> optimizations in action.

Alan, should I be able to see the new optimizations when I try to open
the tables in the Admin screen in Mahler? I don't see the changes.

Comment entered 2006-08-11 17:52:48 by alan

BZDATETIME::2006-08-11 17:52:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::24

Yes, when I tried to look at all CTGov Investigators it was much
faster. Are you not seeing that?

Hmmmm.

I just checked and the right code is in place. I see in the log
file that you ran the program a number of times today, the last
one at 4:57 pm, and got a 3 second response time in the server.

Are you using Firefox or IE? Firefox is very slow handling the
huge display but IE was fast - at least my copy of it was on my
2.4 GHz Pentium 4 with 512 MB RAM.

Maybe it's the network delay getting from NCI to CIAT? It is a 6
MB HTML transfer. Is it slow when you get a similar sized file?

I don't know if you have remote terminal server access to Mahler.
If you do, you might try that. With RTS, the 6 MB isn't
transferred to CIAT, just the information on screen is
transferred.

Comment entered 2006-08-15 17:01:01 by priced

BZDATETIME::2006-08-15 17:01:01
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::25

(In reply to comment #24)
> Are you using Firefox or IE?

I'm using IE.

> Maybe it's the network delay getting from NCI to CIAT? It is a 6
> MB HTML transfer. Is it slow when you get a similar sized file?

I'm not sure. Maybe I can do some testing.

> I don't know if you have remote terminal server access to Mahler.
> If you do, you might try that. With RTS, the 6 MB isn't
> transferred to CIAT, just the information on screen is
> transferred.

This wouldn't really help users, though. (I wouldn't want to have to use remote terminal server regularly).

Comment entered 2006-08-15 17:28:06 by alan

BZDATETIME::2006-08-15 17:28:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26

I understand that we don't want to rely on remote terminal
server. I'm just asking that as a way of finding out whether the
problem is in the program or in the transmission of data.

I just called up the program on Mahler and requested all CT.Gov
Investigators.

Using the browser on my own computer, it took about 20 seconds to
complete and fully appear in my browser window. I then clicked
Get Values again, and it took another 20 seconds. Of that 20
seconds, about 3 seconds was spent in the server (compared to at
least 5 minutes before the optimization).

Doing the same thing using remote terminal services resulted in
about 14 seconds for each.

The displayed HTML was almost 6 MB. If you had the entire use of
the T1 (1.544 million bits per second), with no other users
connected and no other bottlenecks, I estimate it would take
about 36 seconds to download the html, and maybe 40 or more
seconds to do a Get Values - which uploads all the form fields
before downloading it all again. Since there may be other
bottlenecks and contention in the link, I imagine that about a
one minute wait is realistic. That's pretty slow, but still
much better than it was.

What I'd like you to try is to time the performance using CT.Gov
Investigators and see what you get both with your desktop and
with remote terminal server. With RTS, I'm guessing you'll see a
result around 20 seconds or so. The difference between that and
what you see with your desktop is most likely entirely due to
communication time. I don't think we can do much about that
unless we cut down the amount of information on the form.

If it's taking more than a minute, especially if it's much more
than a minute, but RTS is still reasonable, then maybe something
is screwed up in the communications link. We'd want to get
network folks involved to see if they can track that down if that
is the problem.

Comment entered 2006-08-15 17:50:43 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-08-15 17:50:43
BZCOMMENTOR::Bob Kline
BZCOMMENT::27

(In reply to comment #26)
> I don't think we can do much about that unless we cut down the
> amount of information on the form.

Is this a good candidate for Ajax, perhaps?

Comment entered 2006-08-15 19:01:34 by alan

BZDATETIME::2006-08-15 19:01:34
BZCOMMENTOR::Alan Meyer
BZCOMMENT::28

(In reply to comment #27)
> (In reply to comment #26)
> > I don't think we can do much about that unless we cut down the
> > amount of information on the form.
>
> Is this a good candidate for Ajax, perhaps?

That's an interesting idea. We could greatly reduce the amount of
time taken to post changes. With more work, we could greatly
reduce the time taken for a repaint of the screen where only a
few things have changed. However the initial screen paint and
any paint with different input values would not be faster and I
think the programming would be fairly complex.

I'm thinking that if we want to try out AJAX, there are some
other applications that are better candidates.

Comment entered 2006-08-15 19:08:31 by alan

BZDATETIME::2006-08-15 19:08:31
BZCOMMENTOR::Alan Meyer
BZCOMMENT::29

Trying to track down what caused our slowdown in the Python
construction: "fields and fields.getvalue(...)", I modified a
copy of cgi.py, as suggested by Bob, and installed it on
Mahler in \Inetpub\wwwroot\cgi-bin\cdr.

The modifications were to write to the cdr debug.log when any of
the following functions were called:

FieldStorage.init
MiniFieldStorage.repr
FieldStorage.repr
FieldStorage.iter

Here is the output, including messages logged from the
EditExternMap.py program.

!5068 Tue Aug 15 18:39:15 2006: Program initiation time=1155681555.682000
!5068 Tue Aug 15 18:39:15 2006: FieldStorage_init_
!5068 Tue Aug 15 18:39:16 2006: Got field storage time=1155681556.417000
!5068 Tue Aug 15 18:39:16 2006: Got session time=1155681556.417000
!5068 Tue Aug 15 18:39:16 2006: Got request time=1155681556.432000
!5068 Tue Aug 15 18:39:45 2006: fields-and time=1155681585.026000
!5068 Tue Aug 15 18:40:13 2006: fields-and time=1155681613.588000
!5068 Tue Aug 15 18:40:42 2006: fields-and time=1155681642.056000
!5068 Tue Aug 15 18:41:10 2006: fields-and time=1155681670.697000
!5068 Tue Aug 15 18:41:39 2006: fields-and time=1155681699.431000
!5068 Tue Aug 15 18:42:08 2006: fields-and time=1155681728.024000
!5068 Tue Aug 15 18:42:08 2006: Got 6 more fields time=1155681728.024000
!5068 Tue Aug 15 18:42:08 2006: End of program initiation time=1155681728.024000

The log shows that FieldStorage.init() was called - proving
that my modified version of cgi.py was invoked. But the internal
repr and iter functions were not called.

The program took about 29 seconds for each call to "fields and ...".

I conclude that our speculation that the repr (or iter)
functions might be involved is disproven.

I have removed the modified cgi.py and returned the
EditExternMap.py program to its faster state.

We could, of course, make the test just once and store the result
in a boolean value so that the logic is unchanged but the bottleneck
is bypassed. But we still wouldn't know what causes the bottleneck.

Comment entered 2006-08-15 19:23:12 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-08-15 19:23:12
BZCOMMENTOR::Bob Kline
BZCOMMENT::30

Python is invoking len(), which in turn is constructing a list of all of the dictionary keys and calculating the length.

Comment entered 2006-08-15 22:37:57 by alan

BZDATETIME::2006-08-15 22:37:57
BZCOMMENTOR::Alan Meyer
BZCOMMENT::31

(In reply to comment #30)
> Python is invoking len(), which in turn is constructing a list of all of
> the dictionary keys and calculating the length.

This still mystifies me. The following program generates 50,000
quite long strings and puts them in a dictionary with long strings
as keys and long strings as values. Then it tries two ways to get
the length - by calling len, and by calling len on an extraction of
the keys.

The program runs in under one second on my machine, including the
generation of the big dictionary - the slowest part.

a = {}
print "start"
for i in range (50000):
a[strℹ*100] = strℹ * 100
print "done init"
print len(a)
print "done len"
print len(a.keys())
print "done len(a.keys())"

Comment entered 2006-08-16 08:21:19 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-08-16 08:21:19
BZCOMMENTOR::Bob Kline
BZCOMMENT::32

The problem is that the CGI values aren't in a dictionary, they're in a list (because there can be more than one value for a given name). See what happens when you do it the way the FieldStorage class does it:

#!/usr/bin/python

import random

class Item:
def init(self, name, value):
self.name = name
self.value = value

class Foo:
def init(self):
self.list = []
print "start"
for i in range (50000):
self.list.append(Item(str(random.random()), str(random.random())))
print "done init"
def len(self):
return len(self.keys())
def keys(self):
keys = []
for item in self.list:
if item.name not in keys: keys.append(item.name)
return keys

a = Foo()
print len(a)
print "done len"
print len(a.keys())
print "done len(a.keys())"

The fact is, we don't really need the "if fields" test at all, and if we did, then "if fields.list" would be orders of magnitude more efficient (though the presence of the list attribute, while mentioned in the documentation, is mentioned in passing, not in a way that makes me think I could rely on its name being available in future versions). The reason we don't need the test at all is that cgi.FieldStorage() is a constructor invocation, not a simple function call, so the object must exist if an exception wasn't thrown.

Comment entered 2006-08-16 12:03:14 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-08-16 12:03:14
BZCOMMENTOR::Bob Kline
BZCOMMENT::33

Try this version:

#!/usr/bin/python

import random, time

class Item:
def init(self, name, value):
self.name = name
self.value = value

class Foo:
def init(self):
self.list = []
print "start"
for i in range (50000):
self.list.append(Item(str(random.random()), str(random.random())))
print "done init"
def len(self):
return len(self.keys())
def keys(self):
return {}.fromkeys([i.name for i in self.list]).keys()

a = Foo()
t = time.time()
print len(a)
print "done len (%f seconds)" % (time.time() - t)
t = time.time()
print len(a.keys())
print "done len(a.keys()) (%f seconds)" % (time.time() - t)

I have given Guido this optimization (he's the current maintainer for cgi.py), and I'll patch it in to our copies.

Comment entered 2006-08-16 12:13:34 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-08-16 12:13:34
BZCOMMENTOR::Bob Kline
BZCOMMENT::34

Here's an even more dramatic optimization of the test of the FieldStorage object:

#!/usr/bin/python

import random, time

class Item:
    def __init__(self, name, value):
        self.name = name
        self.value = value

class Foo:
    def __init__(self):
        self.list = []
        print "start"
        for i in range (50000):
            self.list.append(Item(str(random.random()), str(random.random())))
        print "done init"
    def __len__(self):
        return len({}.fromkeys([i.name for i in self.list]))
    def keys(self):
        return {}.fromkeys([i.name for i in self.list]).keys()
    def __nonzero__(self):
        return self.list and True or False

a = Foo()
t = time.time()
print len(a)
print "done len (%f seconds)" % (time.time() - t)
t = time.time()
print len(a.keys())
print "done len(a.keys()) (%f seconds)" % (time.time() - t)
t = time.time()
b = a and "yes" or "no"
print "done test of a (%s) (%f seconds)" % (b, time.time() - t)
Comment entered 2006-08-17 21:57:10 by alan

BZDATETIME::2006-08-17 21:57:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::35

I modified the program to add two new lines to the input controls
at the top of the screen.

One line asks for "Alphabetical start". If you put a string
here, e.g., "Boston", and leave Pattern and Document ID blank,
then the display will start showing values beginning with
"Boston...".

The next line asks for the maximum number of values to retrieve,
with a default of 250. This maximum works in all modes -
pattern, document ID, or alphabetical listing. It limits the
number of rows displayed to that maximum.

Try this out and see how you like it. If you have a favorite
number, I can set that as the default maximum rows.

Right now, if you see a screen full of alphabetically listed
headings and want to see more starting from the last one, you
have to type or cut and past part of the last heading into the
"Alphabetical start" box.

If you like, I can instead automatically put the last heading on
the display in the Alphabetical list, so you can automatically go
forward from there when clicking "Save Changes" or "Get Values".

I've done so much work on this issue after marking it RESOLVED
FIXED, that I'm going to re-open it. Maybe that will be the
charm that finishes it up for us. 🙂

Comment entered 2006-08-22 11:50:24 by priced

BZDATETIME::2006-08-22 11:50:24
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::36

The changes you implemented appear to work fine in Mahler. I don't think we need to enhance the alphabetical display any further.

Let's leave the default at displaying 250 docs at a time, also.

I did notice that one default is not working correctly. If you do not check
'Only include unmapped values' or 'Also include unmappable values', both
mappable and unmappable values are being displayed. Not checking either box should result in only mappable values being displayed, correct? After this default is fixed, we can promote to Bach and close this issue.

Comment entered 2006-08-23 15:07:59 by alan

BZDATETIME::2006-08-23 15:07:59
BZCOMMENTOR::Alan Meyer
BZCOMMENT::37

(In reply to comment #36)
> ...
> I did notice that one default is not working correctly. If you
> do not check 'Only include unmapped values' or 'Also include
> unmappable values', both mappable and unmappable values are
> being displayed. Not checking either box should result in only
> mappable values being displayed, correct? After this default
> is fixed, we can promote to Bach and close this issue.

I'm not seeing this. Try this:

Select a map usage: CT.gov Facilities
...
Alphabetical start: Florida
Max values to retrieve: 250

Do it once with neither box checked and once with "Also include
unmappable values" checked.

When I do that, I get all the unmappable values beginning with
"For information call ... " only when the box is checked.

Do you have an example where it's behaving incorrectly?

Thanks.

Comment entered 2006-08-23 15:28:03 by priced

BZDATETIME::2006-08-23 15:28:03
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::38

When I select CTGov facilities with the alpha start of Florida and no boxes checked, I see both mapped and unmapped values. Please see example.

Comment entered 2006-08-23 15:28:03 by priced

Attachment MappingTableExample_Florida.htm has been added with description: Mapping Table example

Comment entered 2006-08-23 15:51:09 by alan

BZDATETIME::2006-08-23 15:51:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::39

(In reply to comment #38)
> ...
> When I select CTGov facilities with the alpha start of Florida and no boxes
> checked, I see both mapped and unmapped values. Please see example.

The terminology is confusing you I think.

It's right for you to see both mapped and unmapped values.
If you check the unmappable (emphasis on "ble"), you'll also
see values that are not mappable. They're not only unmapped,
they're also not mappable at all.

Here's a matrix of what you see:

No boxes checked:
All mappable values.
Some are mapped.
Some are not mapped.

Top box (Only include unmapped values) checked:
All unmapped values, but only if they are mappable.

Bottom box (Also include unmappable values) checked:
All values:
whether they're mapped or not,
whether they're mappable or not.

Both boxes checked:
All unmapped values, whether they're mappable or not.

I know the form is starting to look like a Starbucks menu,
but I'm thinking that sometimes you'll want cream, sometimes
sugar, and sometimes with or without caffeine 🙂

Comment entered 2006-08-25 10:30:39 by priced

BZDATETIME::2006-08-25 10:30:39
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::40

Thanks for clarifying! Verified in Mahler. Please promote.

Comment entered 2006-08-29 11:50:19 by alan

BZDATETIME::2006-08-29 11:50:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::41

(In reply to comment #40)
> Thanks for clarifying! Verified in Mahler. Please promote.

Promoted to Bach and Franck.

Comment entered 2006-08-29 14:44:18 by priced

BZDATETIME::2006-08-29 14:44:18
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::42

Verified in Bach. Closing issue.

Comment entered 2006-08-29 22:37:06 by alan

BZDATETIME::2006-08-29 22:37:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::43

In working on the ExternalMap.py administrative subsystem
program, I lost track of the fact that the changes I made are not
just for mapping table update program. I have a new version of
the CDR server that uses the new "mappable" column in the host
processing that supports the cdrutil:extern_map function used in
the "Import CTGovProtocol" XSLT filter.

I have not installed this new server version anywhere. I need to
meet with Bob to walk through what I've done, make sure it is
correct behavior for the CTGovImport program, and set up a test
plan.

Perhaps we can do that on Thursday.

In the meantime, although the new program to update the external
mappings is on Bach, the import program won't actually pay
attention to the new "mappable" column. We'll have to test on
Mahler first and then implement this on Bach.

I'm reopening this bug as the right place to track this issue in
Bugzilla.

Comment entered 2006-09-05 10:56:05 by alan

BZDATETIME::2006-09-05 10:56:05
BZCOMMENTOR::Alan Meyer
BZCOMMENT::44

I installed the modified server code for this last Thursday.
Bob and I walked through it and agreed it should have no effect
except in the case of other errors.

I'm marking this resolved.

Comment entered 2006-09-21 16:59:23 by alan

BZDATETIME::2006-09-21 16:59:23
BZCOMMENTOR::Alan Meyer
BZCOMMENT::45

I think we can mark this resolved if we wish. The modified
server code is not installed yet on Bach or Franck but will
be promoted in the next general server release. It has been
running on Mahler for quite some time, and is insignificant
since it only catches errors that should be caught by other
code anyway.

Comment entered 2006-10-17 20:50:05 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-10-17 20:50:05
BZCOMMENTOR::Bob Kline
BZCOMMENT::46

New server promoted to Bach; please check (and close if OK).

Comment entered 2006-11-02 13:53:25 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2006-11-02 13:53:25
BZCOMMENTOR::Bob Kline
BZCOMMENT::47

Closed at status meeting per LG.

Attachments
File Name Posted User
CDR Error.htm 2006-07-24 14:53:34
CDRMapping_AL.htm 2006-07-27 17:45:32
CDRMapping_AL.htm 2006-07-27 17:32:59
MappingTableExample_Florida.htm 2006-08-23 15:28:03

Elapsed: 0:00:00.001678