PDQ Issues

Issue Number	3405
Summary	CiteMS Importing Error
Created	2011-08-16 11:11:38
Issue Type	Bug
Submitted By	Boggess, Cynthia (NIH/NCI) [C]
Assigned To	alan
Status	Closed
Resolved	2011-08-22 17:08:17
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.107733

Description

BZISSUE::5098
BZDATETIME::2011-08-16 11:11:38
BZCREATOR::Cynthia Boggess
BZASSIGNEE::Alan Meyer
BZQACONTACT::Cynthia Boggess

Error "wrong PMID" received when importing citation using import utility. Citation that gave error is pmid=21822267. See text file.

Comment entered 2011-08-16 11:11:38 by Boggess, Cynthia (NIH/NCI) [C]

Attachment wrong_id_error_citation.txt has been added with description: citation record that generated error upon import

Comment entered 2011-08-16 11:19:15 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2011-08-16 11:19:15
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::1

Minaxi thinks that the problem may be due to a new tag she just noticed in PubMed.

"LID - 10.1038/ng.893 [doi]"

Comment entered 2011-08-16 11:29:50 by alan

BZDATETIME::2011-08-16 11:29:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

I'll check it out.

Was this record downloaded by itself or as part of a batch of records? I ask because, in the past, I found that an error processing one record can go undetected until a subsequent record is started. The software gets confused and reports the error in the wrong place.

If it's part of a batch, could you also attach the whole batch to this Bugzilla issue?

Since the issue has been marked as "critical" I've upped the "importance" priority to P3 and I'll suspend other work to work on this.

Comment entered 2011-08-16 11:44:45 by priced

BZDATETIME::2011-08-16 11:44:45
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::3

This record was downloaded by itself.

Comment entered 2011-08-16 13:24:33 by alan

BZDATETIME::2011-08-16 13:24:33
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I tried to import the record into the test database and it went in okay. No error occurred.

I'm going to take last night's backup of the production database, restore it on test, and try again. Maybe there's a conflict with something in the database. If that doesn't work (or rather does work), I'll either take a current backup today, or I'll test on the production database itself.

I also did a case insensitive search for the word "wrong" in the source code for the import utility to try to find out what specific part of the program was rejecting the record. The word only appears twice in the entire system - one in a comment that I added and the other in an error message that I added that looks like this:

"Encountered two occurrences of tag ... without intervening PMID ... table has a wrong definition for this field"

Was "wrong PMID" the exact error message that was displayed?

Comment entered 2011-08-16 14:01:15 by priced

BZDATETIME::2011-08-16 14:01:15
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::5

The error was "No PMID found"

Comment entered 2011-08-16 15:25:36 by alan

BZDATETIME::2011-08-16 15:25:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

I tested again using last night's database. It worked fine!

What I did was the following:

Called up the import system and logged on to the test database (updated from last night) as "cms_developer".

Clicked File / Add New Batch (import)

Search Source: CMS2
Board: Cancer Genetics
Summary Topic: Genetics of Breast and Ovarian Cancer
Review Cycle ID: August 2011
Select Source File: wrong_id_error_citation.txt

The file was the one I downloaded from the Bugzilla attachment. I checked the database after the import and the record was there.

My next step will be to try it in production itself. However, before I do that, let's check to make sure we're all using the same import utility. The one I used is the same as the one uploaded in OCECDR-3271. See:

http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050

Is it possible that Minaxi is using a different version?

Comment entered 2011-08-16 15:49:34 by priced

BZDATETIME::2011-08-16 15:49:34
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::7

Hi Alan,

I have a new batch of 15 citations from Sharon to be added to CiteMS.

Please let me know if I can import them.

Minaxi

Comment entered 2011-08-16 15:53:09 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-08-16 15:53:09
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::8

(In reply to comment #6)

> My next step will be to try it in production itself. However, before I do
> that, let's check to make sure we're all using the same import utility. The
> one I used is the same as the one uploaded in OCECDR-3271. See:
>
> http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
>
> Is it possible that Minaxi is using a different version?

It is possible because Minaxi's laptop has recently been re-imaged and the import utility was re-installed. However, I think it is unlikely because she has the version which has both production and the test server options. I believe that is the latest version of the import utility we have.

Comment entered 2011-08-16 15:58:15 by alan

BZDATETIME::2011-08-16 15:58:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

(In reply to comment #8)
> (In reply to comment #6)
>
> > My next step will be to try it in production itself. However, before I do
> > that, let's check to make sure we're all using the same import utility. The
> > one I used is the same as the one uploaded in OCECDR-3271. See:
> >
> > http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
> >
> > Is it possible that Minaxi is using a different version?
>
> It is possible because Minaxi's laptop has recently been re-imaged and the
> import utility was re-installed. However, I think it is unlikely because she
> has the version which has both production and the test server options. I
> believe that is the latest version of the import utility we have.

Let's check just to be sure. If I remember correctly, I implemented the production/test dichotomy before the last round of import bug fixes.

Here's how Minaxi or someone helping at Z-Tech can check:

Download the import utility from Bugzilla:
http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050

Open a command window:
Start button, "command", Enter.

In the command window, go to the directory that has your current executable.

Locate the directory containing the downloaded executable.

Compare them using:

FC /B Cips_CMS.exe \wherever_the_downloaded_file_is\Cips_CMS.exe

It should result in this:

FC: no differences encountered

If it doesn't, we need to fix that right away.
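
Where FC is not available, the same byte-for-byte check can be sketched in Python. This is an illustration only and not part of the import utility; the function names are hypothetical.

```python
import filecmp
import hashlib
from pathlib import Path


def same_bytes(path_a, path_b):
    """Return True when the two files have identical contents, like FC /B."""
    # shallow=False forces a real content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


def sha256_of(path):
    """Return a hex digest that can be compared across machines."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

The digest is handy when the two copies live on different machines: each person computes it locally and the strings are compared instead of the files.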

Comment entered 2011-08-16 16:12:00 by priced

BZDATETIME::2011-08-16 16:12:00
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::10

(In reply to comment #9)
>
> FC: no differences encountered
>
> If it doesn't, we need to fix that right away.

William just ran these commands on my computer and got:
FC: no differences encountered

Comment entered 2011-08-16 16:20:20 by alan

BZDATETIME::2011-08-16 16:20:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

(In reply to comment #10)
> (In reply to comment #9)
> >
> > FC: no differences encountered
> >
> > If it doesn't, we need to fix that right away.
>
> William just ran these commands on my computer and got
> FC: no differences encountered

Okay.

I'll wait until later tonight when everyone is off the system. Then I'll
attempt to load the record in the production database. If it fails I'll
attempt to debug the import and fix whatever caused the problem. The LID
field might be a clue but the fact that it worked in the test system may
mean there's something else involved.

If it works, then we're home free and can write it up for the Journal of
Unreproducible Results.

In the meantime, I'd prefer that no more imports be done today. It's
possible that the production database is in a strange state and needs
fixing.

To assist with the debugging, it would be useful if you could send me the following files (by attaching them to this Bugzilla issue):

Any import files you uploaded today BEFORE you tried the one that failed. It looks to me like there is exactly one record.

The file of fifteen records you'd like to upload now, with instructions regarding the Board, Topic, and review cycle.

I will NOT upload the 15 into production but may try them in test.

I'll report back tonight on results.

Thanks.

Comment entered 2011-08-16 16:26:54 by priced

BZDATETIME::2011-08-16 16:26:54
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::12

I did not upload any file today BEFORE I tried the one that failed.

The file with 15 records to be imported to Pediatric board/Late Effects summary topic for July 2011 review cycle is attached.

Comment entered 2011-08-16 16:26:54 by priced

Attachment late-effects_Aug11.txt has been added with description: 15 citations to be added to CiteMS

Comment entered 2011-08-16 16:47:09 by alan

BZDATETIME::2011-08-16 16:47:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

(In reply to comment #12)
> Created attachment 2142 [details]
> 15 citations to be added to CiteMS
>
> I did not upload any file today BEFORE I tried the one that failed.

My mistake. I see now that the last record imported was August 12.

> The file with 15 records to be imported to Pediatric board/Late Effects summary
> topic for July 2011 review cycle is attached

Thanks.

I have to say that this is a real puzzler. We've now confirmed that when I load a copy of the identical record using a copy of the identical software into a copy of the identical database, it worked.

I'm thinking that Minaxi walked through a rare confluence of cosmic rays today that jinxed her import. But I'll see what I can do tonight.

Is it okay to use the cms_developer userid, or should I use Minaxi's? (I can get the login and password from the database.)

Comment entered 2011-08-16 21:24:02 by alan

BZDATETIME::2011-08-16 21:24:02
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

Adding users to the CC list for this issue.

Comment entered 2011-08-16 21:25:58 by alan

BZDATETIME::2011-08-16 21:25:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

I imported the record into the production system using the same
parameters described in comment 6.

The import worked for me - sort of. The record was imported
successfully and I got the little alert box with the legend:

Import Results:
PMID 21822267 Result: Record Imported
[ OK ]

However, Minaxi's intuition about the "LID" field causing
problems turned out to be correct. Although the import worked,
the LID field was concatenated to the title during the import.
So the title now looks like this:

Germline mutations in RAD51D confer susceptibility to
ovarian cancer. LID - 10.1038/ng.893 [doi]

I hadn't noticed that when I imported into the test system, but
it happened there too.

I searched the database to find out how many records have this
problem and when it started. It turns out that there are 62
records with the substring "LID -" in their titles. The first
one was imported on April 13, 2009. The next one was almost a
year later, on March 8, 2010. The majority are in the last few
months.

I've attached a listing of CMS ID, PMID, Date imported, and
Citation for all 62 records.

I think that I fixed the problem for future records. We have a
control table named "mt_importdef" that lists all of the fields
in a PubMed record. I added an entry for "LID -", said that the
data in that field should not be imported, and allowed for
multiple occurrences. I doubt that there are multiple
occurrences but previous experience has made me believe that it
doesn't hurt to allow for them.

To test, I made a copy of the text file containing the record in
question, added a '0' to the end of the PMID (218222670), and
imported it again into the test database. This time it came in
with the correct title, no LID field appended to it.
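
The effect of the missing control-table entry can be illustrated with a small Python sketch. It assumes, without confirmation from the import utility's source, that a line whose tag is not listed is treated as a continuation of the previous field. The dictionaries stand in for mt_importdef, and parse_record is a hypothetical name.

```python
# Hypothetical stand-ins for the mt_importdef control table:
# tag -> True to import the field, False to recognize the tag but skip its data.
OLD_DEFS = {"PMID": True, "TI": True}
NEW_DEFS = {"PMID": True, "TI": True, "LID": False}

# Only the fields quoted in this issue; a real record has many more.
SAMPLE = (
    "PMID- 21822267\n"
    "TI  - Germline mutations in RAD51D confer susceptibility to\n"
    "      ovarian cancer.\n"
    "LID - 10.1038/ng.893 [doi]\n"
)


def parse_record(text, import_defs):
    """Parse one MEDLINE-style record into {tag: [values]}."""
    fields = {}
    current = None  # tag whose latest value receives continuation lines
    for line in text.splitlines():
        if not line.strip():
            continue
        tag = line[:4].strip()
        if line[4:6] == "- " and tag in import_defs:
            # Known tag: start a new value, or skip it if marked "do not import".
            current = tag if import_defs[tag] else None
            if current is not None:
                # A list per tag allows multiple occurrences of the same field.
                fields.setdefault(current, []).append(line[6:].strip())
        elif current is not None:
            # Assumption: any other line is glued onto the previous field.
            fields[current][-1] += " " + line.strip()
    return fields
```

With OLD_DEFS the title comes back with "LID - 10.1038/ng.893 [doi]" appended; with NEW_DEFS the LID line is recognized and dropped, so the title is clean.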

Being fairly confident that this is a good and safe fix, I went
ahead and also made the change to the production database.
However, before importing more records I'd like to ask that you
first import them into the test system and have a look at them.
Look at the titles. Look at the other fields. Make sure nothing
that you can see is broken. Look especially at any record with
an LID field.

If everything looks okay, then go ahead and import into
production.

I can programmatically fix the 62 titles with the LID suffix if
that's important. If so, let's create a separate issue for it
and assign it a not too high priority. I'm not sure it's worth
taking time away from other things for it but I'll let others be
the judge of that.
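
A programmatic cleanup of the affected titles could look roughly like the following Python sketch. It is not the fix that was applied; strip_lid_suffix is a hypothetical helper, and it assumes that no legitimate title contains the text " LID - ".

```python
import re

# Matches the whitespace before "LID - " and everything after it.
LID_SUFFIX = re.compile(r"\s+LID - .*$")


def strip_lid_suffix(title):
    """Return the title with any appended 'LID - ...' text removed."""
    return LID_SUFFIX.sub("", title)
```

Titles without the suffix pass through unchanged, so the function could be run over every record safely under the stated assumption.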

I still don't know why the import failed for Minaxi but worked
for me. I thought maybe I was lucky and she wasn't, but it
appears to actually be the other way around. If this hadn't
happened we would probably have imported many more records with
LID fields appended to titles before anyone noticed that they had
problems.

I'm also adding other CDR team members to this issue so they'll
be aware of what was done.

Comment entered 2011-08-16 21:25:58 by alan

Attachment ImportTitlesWithLID.txt has been added with description: Citations with "LID -" appended to the titles

Comment entered 2011-08-16 21:27:12 by alan

BZDATETIME::2011-08-16 21:27:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

I'm marking the issue as resolved-fixed.

Comment entered 2011-08-16 21:38:41 by alan

BZDATETIME::2011-08-16 21:38:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

(In reply to comment #16)

> I'm marking the issue as resolved-fixed.

Maybe that was premature. We still need to find out if Minaxi can import successfully. If not, we'll re-open the issue and I'll work on it some more.

Comment entered 2011-08-16 22:01:06 by alan

BZDATETIME::2011-08-16 22:01:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18

For the record, in case I ever need to recall what was done, here
is the SQL executed in test and production databases:

INSERT mt_importdef
(literaltag, TagLink, DataSource, TagType, CounterLink)
VALUES
('LID -', NULL, 'CMS2', 'M', NULL)

Comment entered 2011-08-17 08:58:50 by priced

BZDATETIME::2011-08-17 08:58:50
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::19

(In reply to comment #17)
> (In reply to comment #16)
>
> > I'm marking the issue as resolved-fixed.
>
> Maybe that was premature. We still need to find out if Minaxi can import
> successfully. If not, we'll re-open the issue and I'll work on it some more.

I am not able to import the citation into the test database.
While talking to Cynthia, I discovered that you are testing the file sent by Cynthia, which was created using an older version of IE than the one I am using. I use IE version 9. I am attaching the file I created so that you can test importing it. I can also test the file created by Cynthia and let you know.

Comment entered 2011-08-17 09:01:50 by priced

BZDATETIME::2011-08-17 09:01:50
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::20

Comment entered 2011-08-17 09:01:50 by priced

Attachment breast_ova_gen_Aug11.txt has been added with description: citation record created using IE 9

Comment entered 2011-08-17 09:15:36 by priced

BZDATETIME::2011-08-17 09:15:36
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::21

I tried importing the file created by Cynthia to test db and it worked.

Comment entered 2011-08-17 13:21:20 by alan

BZDATETIME::2011-08-17 13:21:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22

(In reply to comment #20)
> Created attachment 2144 [details]
> citation record created using IE 9

It turns out that IE9 is putting Unicode Byte Order Marks at the beginning of the file. There are three bytes just in front of the "PMID". These bytes tell an application that what follows is Unicode text encoded as UTF-8. You can't see or edit them in most text editors because the editors treat them as instructions to the software rather than as data.
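
The three bytes in question are the UTF-8 byte order mark, EF BB BF. A minimal Python sketch shows how to detect them and why a parser that expects the file to begin with "PMID" would miss the tag. That this is exactly how the import utility fails is an assumption, and both function names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True when the data begins with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def first_tag(data: bytes) -> bytes:
    """Return the first four bytes, where a parser would expect b'PMID'."""
    return data[:4]
```

For a file saved with the mark, first_tag returns the three BOM bytes followed by "P", not "PMID".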

There are a number of ways to deal with this. There may be an option in IE9 to turn them off on saving. There may be a way to delete them outside of IE9. Or as a permanent fix, I can probably modify the import utility to recognize and delete them.

I'll look at all the options tomorrow. In the meantime, would it be practical to re-download the files with Firefox, or with IE8 on another system?

If there are a lot of saved files, already edited, can you hold off until tomorrow?

I'll work on a permanent solution when I'm in tomorrow.

In the meantime, I'm re-opening the issue.

It looks like the fix for the "LID" tag was a freebie on this one 🙂

Comment entered 2011-08-17 13:29:20 by alan

BZDATETIME::2011-08-17 13:29:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::23

I fired up an old XP machine and found a temporary workaround that might still work on your more modern machine.

Call up the file to be imported in Notepad.

Click File / Save As

In the dialog box I see a drop down menu for Encoding.

Change that from UTF-8 to ANSI and save the file.

On my system, the Byte Order Marks disappeared in the saved file.

Comment entered 2011-08-17 13:32:45 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-08-17 13:32:45
BZCOMMENTOR::Bob Kline
BZCOMMENT::24

(In reply to comment #22)

> It turns out that IE9 is putting Unicode Byte Order Marks ....

Still more incentive to come up with a system which (a) stores the XML version of the citation document and (b) stores the document in the system directly from NLM (rather than storing it on the local machine and uploading it to CiteMS).

Comment entered 2011-08-17 13:36:59 by alan

BZDATETIME::2011-08-17 13:36:59
BZCOMMENTOR::Alan Meyer
BZCOMMENT::25

(In reply to comment #23)

> Change that from UTF-8 to ANSI and save the file.

Although this works for removing the offending marks, in theory it could change the encoding and display of any non-ASCII characters.

In practice, I believe that this is a non-issue for the Medline Print Format that we're using. I believe that has already been converted to pure ASCII by the PubMed printer formatter before it comes to us.
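
The belief that the files are pure ASCII can be checked before converting. The Python sketch below is a hypothetical helper, not something the import utility does: it drops a leading BOM and refuses to convert if any non-ASCII character remains, so nothing is silently altered.

```python
def to_ascii_if_safe(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM and return the text as plain ASCII bytes.

    Raises ValueError if the text contains non-ASCII characters, since
    re-saving such a file as ANSI could change how they are stored.
    """
    text = data.decode("utf-8-sig")  # "utf-8-sig" skips a leading BOM if present
    if not text.isascii():
        raise ValueError("file contains non-ASCII characters")
    return text.encode("ascii")
```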

Comment entered 2011-08-18 17:36:16 by alan

BZDATETIME::2011-08-18 17:36:16
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26

The attached executable replaces the existing Import Utility program.

The only change is that it looks for Unicode "Byte Order Marks" anywhere in the file and deletes them before processing input data. This fixes the problem caused by IE9, and possibly other browsers in the future, that insert these marks at the beginning of saved screens that contain UTF-8 encoded characters.

The change uses a little overkill in removing these marks from anywhere in the file since they are only supposed to occur at the beginning of a file, but it should not hurt to remove them from anywhere and it protects against any inadvertent additions of BOMs caused by user editing or combining of files.
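
The idea of removing the marks from anywhere in the input can be sketched in Python as follows. This illustrates the approach only; it is not the code in Cips_CMS.exe, and remove_boms is a hypothetical name.

```python
import codecs


def remove_boms(data: bytes) -> bytes:
    """Delete every UTF-8 byte order mark, wherever it occurs in the data."""
    # In valid UTF-8 the bytes EF BB BF can only encode U+FEFF, so removing
    # them does not damage neighboring characters.
    return data.replace(codecs.BOM_UTF8, b"")
```

Removing the mark everywhere also covers the case where two saved files, each with its own BOM, are combined into one import file.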

I recommend the following installation procedure:

1. Rename your existing Cips_CMS.exe.

The convention I use for this is to rename it using the date of last use, for example: Cips_CMS_2011-08-17.exe

I suggest renaming this so that if I did something wrong in building the new executable it will be easy to re-instate the old one by simply renaming it back to its original name while waiting for me to fix things.

2. Download the new executable from this attachment and put it in the same directory where the old file was.

That should be all there is to it.

To make testing easier, I restored last night's backup of the production database to the test database. That will eliminate any test records that you may have entered into the test database and allow you to test again.

Comment entered 2011-08-18 17:36:16 by alan

Attachment Cips_CMS.exe has been added with description: Executable import utility patched for Unicode Byte Order Mark inserted by IE9

Comment entered 2011-08-18 17:37:18 by alan

BZDATETIME::2011-08-18 17:37:18
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27

In an optimistic spirit I am, once again, marking this Resolved-fixed.

Comment entered 2011-08-18 18:35:11 by priced

BZDATETIME::2011-08-18 18:35:11
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::28

(In reply to comment #27)
> In an optimistic spirit I am, once again, marking this Resolved-fixed.

I copied the new exe file and imported citations to test database first and then to production db. I could import without any error.

Thanks for getting me out of the influence of the "rare confluence of cosmic rays"--Minaxi

Comment entered 2011-08-18 19:08:46 by alan

BZDATETIME::2011-08-18 19:08:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::29

Another note on the changes:

The new executable should be used by everyone who uses the import utility, not just IE9 users. Nobody else needs it right now, but you might upgrade in the future to IE9 and have forgotten about this problem. In addition some future update of Firefox or one of the other browsers might do the same thing that Microsoft did with IE9.

The change works fine with files produced by the older browsers as well as the new one.

Comment entered 2011-08-18 19:10:40 by alan

BZDATETIME::2011-08-18 19:10:40
BZCOMMENTOR::Alan Meyer
BZCOMMENT::30

(In reply to comment #28)
...
> Thanks for getting me out of the influence of the "rare confluence of cosmic
> rays"--Minaxi

Those cosmic rays can be dangerous. I often have to put on my special aluminum foil hat in order to reduce the number of new bugs I introduce into my software.

Comment entered 2011-08-22 17:08:17 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2011-08-22 17:08:17
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::31

Ok I am going to close this bug. Everything seems to be working fine with the new version. No other reported problems with importing citations.

Attachments

File Name	Posted	User
breast_ova_gen_Aug11.txt	2011-08-17 09:01:50
Cips_CMS.exe	2011-08-18 17:36:16
ImportTitlesWithLID.txt	2011-08-16 21:25:58
late-effects_Aug11.txt	2011-08-16 16:26:54
wrong_id_error_citation.txt	2011-08-16 11:11:38

