CDR Tickets

Issue Number 3271
Summary [CiteMS] import utility import error
Created 2010-11-29 13:31:55
Issue Type Bug
Submitted By alan
Assigned To alan
Status Closed
Resolved 2011-01-12 14:42:08
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107599
Description

BZISSUE::4961
BZDATETIME::2010-11-29 13:31:55
BZCREATOR::Alan Meyer
BZASSIGNEE::Alan Meyer
BZQACONTACT::Cynthia Boggess

Minaxi ran into a problem importing articles for the summary
topic "Genetics of Breast and Ovarian Cancer". One of the
imported citations was read incorrectly and three more following
it were corrupted.

Minaxi reports that this problem is blocking further imports at
this time, at least until the broken records are removed.

I am attaching Minaxi's email with details of the error.

I've assigned Cynthia as the QA contact for this issue because
Minaxi is away on vacation for the coming month. Cynthia or
William should re-assign that if someone else is the right person
to do the QA testing.

I'm listing the priority as P2, "critical". Since I'm going on
vacation on Wednesday I'm going to try to resolve the problem
before I leave.

I obtained a backup of the production database made by the
infrastructure team the night before the error. I plan to
restore that into the test database and use it, together with the
original import file, to try to find the problem.

Comment entered 2010-11-29 13:31:55 by alan

Attachment PMID_not_found_error_Email.txt has been added with description: Email from Minaxi describing the error.

Comment entered 2010-11-29 13:36:22 by alan

BZDATETIME::2010-11-29 13:36:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

Here's a plain text version of Minaxi's email that's easier
to read.

Comment entered 2010-11-29 13:36:22 by alan

Attachment MinaxiEmail.txt has been added with description: Email from Minaxi describing the error - plain text.

Comment entered 2010-11-29 13:41:01 by alan

BZDATETIME::2010-11-29 13:41:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

This is the PubMed / Medline format import file.

The error occurred in CiteMS article ref_id = 222654.
This corresponds to PMID = 19920272 in the import file.

Comment entered 2010-11-29 13:41:01 by alan

Attachment breast_gen_dec10.txt has been added with description: Import file of PubMed citations

Comment entered 2010-11-30 20:27:07 by alan

BZDATETIME::2010-11-30 20:27:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

It turns out that the cause of this failure is the same old
problem, the mt_ImportDef table lists a field "EIN" as singly
occurring but the record with PMID=19920272 has two of them.

We can fix this the same old way, by changing an "S" to an "M" in
the mt_ImportDef table.

But I also found the real underlying problem. The programmers
made what looks to me like a remarkable design decision about how
to find record boundaries. The logic they use is:

If any singly occurring field is seen more than once then:

We've crossed a record boundary.

Save what's in the current record buffer (regardless of
whether it's complete or valid.)

Clear the current record buffer.

Start creating the next record using the field we are now
on.

This happens even on a field like EIN, which is not actually
imported by the import program.

This is extraordinary in many ways, including:

1. Any change of a field from single to multiple occurrences by
NLM is guaranteed to cause a failure.

In fact, even if NLM doesn't change the data definitions but,
somehow, we get an invalid record in an input file, it is
very likely to cause serious database corruption.

2. Many fields, in fact virtually all of the fields, are known
to not occur on record boundaries.

I need verification of this, but I think that a record must
begin with a PMID, end with an SO, and there must be a blank
line between records.

Instead of depending on these intentional markers of record
boundaries, the program depends on a relatively incidental
characteristic of fields - whether they are singly or
multiply occurring - which has no logical connection with
record boundaries. If the first field in a record were made
multiply occurring, for example if records began with an
author field, the logic wouldn't work at all no matter what
mt_ImportDef said.

3. There is no validation that all required fields have been
received.

The program saves a less than complete record with no
recognition of the fact that it didn't see, for example, a
required SO field. The program will save a record with no
actual citation at all. I'm not sure about this, but the
only field I saw that looks to be absolutely required is the
PMID.

4. The problem propagates itself.

If a singly occurring field occurs more than once in a
record, then any singly occurring field after that will
trigger more invalid record boundary decisions when the
program processes the next record to have that field. A
whole train of records will be misinterpreted.

There will be a series of records saved consisting of the
last part of one record combined with the first part of the
next record. No warnings will be produced for any of these
mangled records (and there's no way to fix them.)

5. Even when an error is recognized, e.g., "No PMID found"
in one of the records, the program continues importing
records, spreading havoc in the database.

I am going on vacation tomorrow and don't want to modify the
import program before I leave. I need to do more research to
determine what the best modification should be, and we'll need
all the people present - programmer and import utility users -
for thorough testing. I therefore plan to do the following:

1. Change the mt_ImportDef table in the production and test
databases to make "EIN" multiply occurring.

2. Attempt to back out the invalid data entered into the
production database.

I won't make any other changes until I come back.

Comment entered 2010-11-30 20:29:16 by alan

BZDATETIME::2010-11-30 20:29:16
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

Forgot to assign this. Changing status from NEW to ASSIGNED.

I won't call it "RESOLVED-FIXED" until I either fix the logic of
the program, or we decide to live with it as is (in which case
WONTFIX might be a more accurate status.)

Comment entered 2010-11-30 20:49:49 by alan

BZDATETIME::2010-11-30 20:49:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

Notes to myself:

The erroneous record boundary logic is on line 242 of
basImport.bas.

Two tests are performed on that line. The first test is the
one that causes the error. The second test appears to be
useless. I could not find any circumstances that would trigger
it. It's probably a hold over from an earlier version of the
system.

It appears that there is no other use of the single/multiple
occurrence distinction than recognition of record boundaries.
For example, the program does not use it to validate that a
field that should only occur once has actually only occurred
once (of course if it did, this problem would have been
detected before the database could be corrupted.)

A thorough cleaning of the problem might involve fixing the
record boundary logic and then either actually using the single
/ multiple occurrence distinction for validation, or cleaning
out all of the places where this information is initialized,
set, tested, and reset.

If we don't do a thorough cleaning, we'll be left with more
detritus in the system - code that runs but doesn't do anything
useful. But there's already a great deal of that and cleaning
this useless code might not be the highest cleaning priority.

Comment entered 2010-11-30 22:15:13 by alan

BZDATETIME::2010-11-30 22:15:13
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

> ... I therefore plan to do the following:
>
> 1. Change the mt_ImportDef table in the production and test
> databases to make "EIN" multiply occurring.

Done.

> 2. Attempt to back out the invalid data entered into the
> production database.

Done. I deleted four records from each of three tables. They
were the four identified in Minaxi's email. The technique I used
was the same one Brent Carter used a year ago, following
suggestions from Lockheed Martin and documented in Bugzilla Issue
#4733 comment #10.

To test, I loaded the backup file from 11-23-2010 into the test
system and did the following:

Extracted all of the records from "breast_gen_dec10.txt" (the
PubMed input file that Minaxi used) starting from PMID =
19920272 (the record with two EIN fields) and continuing to
the end.

Imported that file into the test database using the same
parameters Minaxi used - December review cycle, Cancer
Genetics, etc.

Examined the results.

A total of 16 records were processed.

7 went in as new records. I compared the titles in the
database to the titles in the input file and all were
correct. Only four of these went into the production
database in Minaxi's import. Here are the IDs:

ref_id accession_number

                      • --------------------
                        222541 19920272
                        222542 19920254
                        222543 19912305
                        222544 19863427
                        222545 19659662
                        222546 19448619
                        222547 19346409

9 were processed as "Duplicate - Only Summary(s) Imported".
The PubMed IDs for them were as follows. I don't know how
many of these went into the production database during
Minaxi's import, and don't know how to test whether they are
correct:

19920271
19911270
19894111
19859804
19760033
19754664
19728083
19701705
19629691

I don't know enough to to test the duplicates, but I assume they
were fine since the ones that were corrupted in the original load
were fine.

For testing, I suggest that Cynthia check the test database to
see if anything looks wrong, then the production database. If
all is okay, we should be good to continue importing data into
production.

If not, then the CIAT and NCI staff will need to decide what to
do. I don't expect to have access to the Internet or even
telephone from Dec. 1-17.

Comment entered 2010-12-01 14:12:47 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2010-12-01 14:12:47
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::7

I have confirmed that the 4 citations were deleted in production.
I reviewed the citations from the breast cancer genetics text file that Alan imported in the test database. Everything seemed correct so I tried importing the remaining citations in the file. The import was successful so I tried a few other files which were also successful.
I then tried importing the breast cancer genetics text file into production. Everything imported correctly. So I continued to import the remaining cancer genetics citations with no problems.
I am going to go ahead and finish importing all files for the Dec 2010 review cycle. I'll let you know if I run into any additional problems. But so far everything seems to be importing correctly without any error messages.

Comment entered 2010-12-02 16:11:36 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2010-12-02 16:11:36
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::8

I have completed importing all of the citations for the Dec2010 review cycle with no other errors.

Comment entered 2010-12-20 13:45:58 by alan

BZDATETIME::2010-12-20 13:45:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

I've searched the PubMed documentation at the NLM website to try
to determine a better way to recognize record boundaries but I
haven't found any clear cut instructions. That's unfortunate
because it's always much better to rely on published promises
from the data publisher than to rely on perceived regularities in
data where the publisher does not promise not to change things.

Looking at the actual records in an input file, I see that every
one starts with a PMID and ends with an SO, but I couldn't find
any documents from NLM that guarantee that this will be the case.
I did find a statement that the PMID "is present on all records"
(http://www.nlm.nih.gov/bsd/mms/medlineelements.html#pmid) but
couldn't find anything that guaranteed that

The PMID will always be the first field.

An SO will always be present (though one would think it has
to be.)

The SO will always be the last field in the record.

Perhaps Cynthia or Minaxi knows of some documentation from NLM
that addresses these issues.

In the absence of clear direction from NLM I'm inclined to do the
following:

Leave the input utility test in place for identifying any
field that has occurred more than onceĀ and checking to see if
it's a field that is not known to be allowed to occur more
than once. (Here again, I couldn't find any NLM publication
saying whether any particular field can occur multiple
times.)

Test the field ID in such cases.

If it's a PMID, treat the field as the beginning of a new
record, as is now done, but maybe add some validation:

If an SO field is not found, stop processing and display
an error message. I can't imagine that we want
"citations" in the database that don't actually contain
citations.

If it's not a PMID then:

Stop all processing of the input file. Do not save the
current record. Do not read any more records.

Display an error message saying what happened. Include:

The field tag of the field that was unexpected.

The value of the last PMID that was encountered
before the error.

The reasons for this approach are:

It leaves all the existing software in place.

If a new record is encountered that doesn't have a PMID,
then:

The system will detect that it's a new record because of
the occurrence of an unexpected repeat of a field.

Processing will stop before corrupting the database.

Information will be displayed that enables a user to
easily find the problem.

The system will still halt if a multiply occurring field is
encountered that is expected to be singly occurring. However,
the program will tell the user what the field was and enable us
to immediately recognize the nature of the problem, find the
record that generated the problem, and figure out whether the
problem is caused by a record without a PMID or a record with two
occurrences of what was expected to be a singly occurring field.

Does this seem like a good idea? It will only work if EVERY
single record begins with a PMID and contains an SO. If we don't
think that's the case, then we need a different strategy.

Comment entered 2010-12-20 15:43:45 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2010-12-20 15:43:45
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::10

This sounds like a great idea. Minaxi and I have spent too many hours trying to identify the specific citation record and field that caused an import error.
I think it is safe to assume that pumbed records will continue to start with pmid and stop with the source field. We are not going to find NLM documentation stating that this is a gaurantee. These two fields have not changed order in 12+ years. But they have added a number of new fields in between. I doubt the pmid field will ever change but they may add elements to the source field to allow for changes in biblographic data or indexing of other non-journal publications.

Comment entered 2010-12-21 17:15:26 by alan

BZDATETIME::2010-12-21 17:15:26
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I have completed and tested the changes described in comment #9.
I tested as follows:

1. Loaded the production database from 11-23-2010 into test.
This is the one that has EIN defined as singly occurring and
has not yet had the December breast cancer genetics records
loaded.

2. Imported the breast cancer genetics records, including the
one with the two EIN fields.

The program successfully loaded all of the records up to the
one with two fields and then stopped. It reported the PMID
of the "bad" record and said it got two EIN fields in that
record. It didn't load any more.

I checked the database to be sure the records up to the bad
one were loaded and none were loaded after. Everything
appeared to be correct.

3. Re-loaded the 11-23-2010 data into test, restoring the
pre-import database.

4. Edited one of the records to delete the SO field.

5. Imported again.

Again, the program successfully loaded everything up to the
bad record, reported the PMID of the bad record and that it
didn't have an SO, and stopped at that point.

I noticed some other behaviors in the program that weren't ideal,
but I haven't tried to fix any of them. Some things I saw
included:

If more than a few records are imported in one batch, the
report of records loaded displayed in a separate window after
an import is truncated in the middle of a line.

I saw a report of duplicate records displayed in the separate
window after an import that said there were 2 duplicates
(which was right), but the results shown in the upper left of
the main window didn't report any.

I don't think I caused any of those problems.

One problem I exacerbated a bit is that the report of how many
records there were in the file actually reports the number that
were read. Since I stop reading records after an error, the
report on the screen is not really the number in the file, it's
the number read from the file. However this problem only occurs
when there's a serious error anyway and I don't think it should
confuse anyone. If we want to fix that I would propose that we
just change the message to say the number read rather than the
number in the file.

I am attaching the new executable file that can be dropped in to
replace the existing one. To get the new file:

Make a backup of your existing Cips_CMS.exe.

Download the attachment from Bugzilla.

Replace the existing Cips_CMS.exe with the new one.

I once again restored the 11-23-2010 database to test in case
Cynthia or William wants to test this. Importing the December
breast cancer genetics file will demonstrate the new
functionality.

If everything is okay, William should probably be sure that
Minaxi finds this Bugzilla comment in her email and gets the new
import utility executable.

When testing is done, I'll load the latest production database
into test, correcting the EIN field description in the database
and giving us the latest data for test purposes.

Comment entered 2010-12-21 17:15:26 by alan

Attachment Cips_CMS.exe has been added with description: New CipsCMS.exe import utility.

Comment entered 2010-12-21 17:24:47 by alan

BZDATETIME::2010-12-21 17:24:47
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

I've attached a screen image of what appeared when two EIN fields were encountered in one record. We've changed the production database so that EIN is known to be multiply occurring, so this error should never occur for EIN fields again, but might if there are any other fields that were always singly occurring in the past but show up in a future record with multiple occurrences.

The image is a JPEG file. You may have to enlarge it a bit to read the error message clearly.

Comment entered 2010-12-21 17:24:47 by alan

Attachment ImportErrorTwoFields.jpg has been added with description: Screen image of the error message generated by the program changes

Comment entered 2010-12-21 21:21:44 by alan

BZDATETIME::2010-12-21 21:21:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

I've committed the new code to version control.

I'm marking the issue RESOLVED-FIXED.

Comment entered 2010-12-22 08:32:06 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2010-12-22 08:32:06
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::14

I'll be sure to discuss this with Minaxi when she gets back. Thanks Alan I think this new error reporting feature will save us alot of time trouble shooting import errors. Minaxi will be importing Jan 2011 citations soon after she returns from India so this will be her opportunity to try out the new feature.

Comment entered 2011-01-10 09:51:30 by alan

BZDATETIME::2011-01-10 09:51:30
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

I'd like to export the latest production database into the test database to use it for another test of the data cleanups for the NOT list issue. See OCECDR-3154 Comment #69.

The current test database is the one from just before the citation import that Minaxi did that revealed the errors in the field definitions and damaged the data. I presume that all of the users are now using the fixed import utility from Dec. 21 and no one needs the old database for testing. I'm therefore going to replace it with the latest version from production.

I've still got a backup copy of that old database, and the file of import citations, if anyone needs to go back to it for further testing. So I'll go ahead and replace that. We'll use the backup if more testing is required.

Comment entered 2011-01-10 10:04:42 by priced

BZDATETIME::2011-01-10 10:04:42
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::16

I have imported Adult search files without any problem, using the new version of the CiteMs EXE file. So far no problem. I doubt if we will need the old database for any testing.

Comment entered 2011-01-10 10:27:58 by alan

BZDATETIME::2011-01-10 10:27:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

(In reply to comment #16)
> I have imported Adult search files without any problem, using the new version
> of the CiteMs EXE file. So far no problem. I doubt if we will need the old
> database for any testing.

Very good.

The backup and restore are done. The test database is current with production as of this moment.

Comment entered 2011-01-12 14:42:08 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2011-01-12 14:42:08
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::18

I am closing this bug.

Attachments
File Name Posted User
breast_gen_dec10.txt 2010-11-29 13:41:01
Cips_CMS.exe 2010-12-21 17:15:26
ImportErrorTwoFields.jpg 2010-12-21 17:24:47
MinaxiEmail.txt 2010-11-29 13:36:22
PMID_not_found_error_Email.txt 2010-11-29 13:31:55

Elapsed: 0:00:00.000617