Issue Number | 3405 |
---|---|
Summary | CiteMS Importing Error |
Created | 2011-08-16 11:11:38 |
Issue Type | Bug |
Submitted By | Boggess, Cynthia (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2011-08-22 17:08:17 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107733 |
BZISSUE::5098
BZDATETIME::2011-08-16 11:11:38
BZCREATOR::Cynthia Boggess
BZASSIGNEE::Alan Meyer
BZQACONTACT::Cynthia Boggess
Error "wrong PMID" received when importing citation using import utility. Citation that gave error is pmid=21822267. See text file.
Attachment wrong_id_error_citation.txt has been added with description: citation record that generated error upon import
BZDATETIME::2011-08-16 11:19:15
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::1
Minaxi thinks that the problem may be due to a new tag she just noticed in PubMed.
"LID - 10.1038/ng.893 [doi]"
BZDATETIME::2011-08-16 11:29:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
I'll check it out.
Was this record downloaded by itself or as part of a batch of records? I ask because, in the past, I found that an error processing one record can go undetected until a subsequent record is started. The software gets confused and reports the error in the wrong place.
If it's part of a batch, could you also attach the whole batch to this Bugzilla issue?
Since the issue has been marked as "critical" I've upped the "importance" priority to P3 and I'll suspend other work to work on this.
BZDATETIME::2011-08-16 11:44:45
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::3
This record was downloaded by itself.
BZDATETIME::2011-08-16 13:24:33
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
I tried to import the record into the test database and it went in okay. No error occurred.
I'm going to take last night's backup of the production database, restore it on test, and try again. Maybe there's a conflict with something in the database. If that doesn't work (or rather does work), I'll either take a current backup today, or I'll test on the production database itself.
I also did a case-insensitive search for the word "wrong" in the source code for the import utility to try to find out what specific part of the program was rejecting the record. The word only appears twice in the entire system - once in a comment that I added and the other in an error message that I added that looks like this:
"Encountered two occurrences of tag ... without intervening PMID ... table has a wrong definition for this field"
Was "wrong PMID" the exact error message that was displayed?
BZDATETIME::2011-08-16 14:01:15
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::5
The error was "No PMID found"
BZDATETIME::2011-08-16 15:25:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6
I tested again using last night's database. It worked fine!
What I did was the following:
Called up the import system and logged on to the test database (updated from last night) as "cms_developer".
Clicked File / Add New Batch (import)
Search Source: CMS2
Board: Cancer Genetics
Summary Topic: Genetics of Breast and Ovarian Cancer
Review Cycle ID: August 2011
Select Source File: wrong_id_error_citation.txt
The file was the one I downloaded from the Bugzilla attachment. I checked the database after the import and the record was there.
My next step will be to try it in production itself. However, before I do that, let's check to make sure we're all using the same import utility. The one I used is the same as the one uploaded in OCECDR-3271. See:
http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
Is it possible that Minaxi is using a different version?
BZDATETIME::2011-08-16 15:49:34
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::7
Hi Alan,
I have a new batch of 15 citations from Sharon to be added to CiteMS.
Please let me know if I can import them.
Minaxi
BZDATETIME::2011-08-16 15:53:09
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::8
(In reply to comment #6)
> My next step will be to try it in production itself. However, before I do
> that, let's check to make sure we're all using the same import utility. The
> one I used is the same as the one uploaded in OCECDR-3271. See:
>
> http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
>
> Is it possible that Minaxi is using a different version?
It is possible because Minaxi's laptop has recently been re-imaged and the import utility was re-installed. However, I think it is unlikely because she has the version which has both production and the test server options. I believe that is the latest version of the import utility we have.
BZDATETIME::2011-08-16 15:58:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9
(In reply to comment #8)
> (In reply to comment #6)
>
> > My next step will be to try it in production itself. However, before I do
> > that, let's check to make sure we're all using the same import utility. The
> > one I used is the same as the one uploaded in OCECDR-3271. See:
> >
> > http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
> >
> > Is it possible that Minaxi is using a different version?
>
> It is possible because Minaxi's laptop has recently been re-imaged and the
> import utility was re-installed. However, I think it is unlikely because she
> has the version which has both production and the test server options. I
> believe that is the latest version of the import utility we have.
Let's check just to be sure. If I remember correctly, I implemented the production/test dichotomy before the last round of import bug fixes.
Here's how Minaxi or someone helping at Z-Tech can check:
Download the import utility from Bugzilla:
http://verdi.nci.nih.gov/tracker/attachment.cgi?id=2050
Open a command window:
Start button, "command", Enter.
In the command window, go to the directory that has your current executable.
Locate the directory containing the downloaded executable.
Compare them using:
FC /B Cips_CMS.exe \wherever_the_downloaded_file_is\Cips_CMS.exe
It should result in this:
FC: no differences encountered
If it doesn't, we need to fix that right away.
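For anyone without access to `FC`, the same byte-for-byte check can be sketched in Python. This is only an illustration of the comparison described above: the function names are made up for this example, and the paths passed in would be wherever the two copies of Cips_CMS.exe actually live.

```python
import filecmp
import hashlib


def same_bytes(path_a, path_b):
    """Return True when the two files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

`same_bytes` answers the same question as "FC: no differences encountered". The digest is handy when the two copies are on different machines and only a short string can be exchanged.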
BZDATETIME::2011-08-16 16:12:00
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::10
(In reply to comment #9)
>
> FC: no differences encountered
>
> If it doesn't, we need to fix that right away.
William just ran these commands on my computer and got
FC: no differences encountered
BZDATETIME::2011-08-16 16:20:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11
(In reply to comment #10)
> (In reply to comment #9)
> >
> > FC: no differences encountered
> >
> > If it doesn't, we need to fix that right away.
>
> William just ran these commands on my computer and got
> FC: no differences encountered
Okay.
I'll wait until later tonight when everyone is off the system. Then I'll attempt to load the record in the production database. If it fails I'll attempt to debug the import and fix whatever caused the problem. The LID field might be a clue but the fact that it worked in the test system may mean there's something else involved.
If it works, then we're home free and can write it up for the Journal of Unreproducible Results.
In the meantime, I'd prefer that no more imports be done today. It's possible that the production database is in a strange state and needs fixing.
To assist with the debugging, it would be useful if you could send me the following files (by attaching them to this Bugzilla issue):
Any import files you uploaded today BEFORE you tried the one that failed. It looks to me like there is exactly one record.
The file of fifteen records you'd like to upload now, with instructions regarding the Board, Topic, and review cycle.
I will NOT upload the 15 into production but may try them in test.
I'll report back tonight on results.
Thanks.
BZDATETIME::2011-08-16 16:26:54
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::12
I did not upload any file today BEFORE I tried the one that failed.
The file with 15 records to be imported to Pediatric board/Late Effects summary topic for July 2011 review cycle is attached.
Attachment late-effects_Aug11.txt has been added with description: 15 citations to be added to CiteMS
BZDATETIME::2011-08-16 16:47:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13
(In reply to comment #12)
> Created attachment 2142 [details]
> 15 citations to be added to CiteMS
>
> I did not upload any file today BEFORE I tried the one that failed.
My mistake. I see now that the last record imported was August 12.
> The file with 15 records to be imported to Pediatric board/Late Effects summary
> topic for July 2011 review cycle is attached
Thanks.
I have to say that this is a real puzzler. We've now confirmed that when I load a copy of the identical record using a copy of the identical software into a copy of the identical database, it worked.
I'm thinking that Minaxi walked through a rare confluence of cosmic rays today that jinxed her import. But I'll see what I can do tonight.
Is it okay to use the cms_developer userid, or should I use Minaxi's? (I can get the login and password from the database.)
BZDATETIME::2011-08-16 21:24:02
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14
Adding users to the CC list for this issue.
BZDATETIME::2011-08-16 21:25:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15
I imported the record into the production system using the same
parameters described in comment 6.
The import worked for me - sort of. The record was imported
successfully and I got the little alert box with the legend:
Import Results:
PMID 21822267 Result: Record Imported
[ OK ]
However, Minaxi's intuition about the "LID" field causing
problems turned out to be correct. Although the import worked,
the LID field was concatenated to the title during the import.
So the title now looks like this:
Germline mutations in RAD51D confer susceptibility to
ovarian cancer. LID - 10.1038/ng.893 [doi]
I hadn't noticed that when I imported into the test system, but
it happened there too.
I searched the database to find out how many records have this
problem and when it started. It turns out that there are 62
records with the substring "LID -" in their titles. The first
one was imported on April 13, 2009. The next one was almost a
year later, on March 8, 2010. The majority are in the last few
months.
I've attached a listing of CMS ID, PMID, Date imported, and
Citation for all 62 records.
I think that I fixed the problem for future records. We have a
control table named "mt_importdef" that lists all of the fields
in a PubMed record. I added an entry for "LID -", said that the
data in that field should not be imported, and allowed for
multiple occurrences. I doubt that there are multiple
occurrences but previous experience has made me believe that it
doesn't hurt to allow for them.
To test, I made a copy of the text file containing the record in
question, added a '0' to the end of the PMID (218222670), and
imported it again into the test database. This time it came in
with the correct title, no LID field appended to it.
Being fairly confident that this is a good and safe fix, I went
ahead and also made the change to the production database.
However, before importing more records I'd like to ask that you
first import them into the test system and have a look at them.
Look at the titles. Look at the other fields. Make sure nothing
that you can see is broken. Look especially at any record with
an LID field.
If everything looks okay, then go ahead and import into
production.
I can programmatically fix the 62 titles with the LID suffix if
that's important. If so, let's create a separate issue for it
and assign it a not too high priority. I'm not sure it's worth
taking time away from other things for it but I'll let others be
the judge of that.
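If the cleanup of the 62 titles is wanted, the string handling could look roughly like the Python sketch below. It assumes the unwanted text is always one or more trailing fragments of the form "LID - <value> [doi]" (or "[pii]") with no spaces inside the value. That pattern and the function name are assumptions made for illustration, and the attached listing should be checked against it before anything is run on the database.

```python
import re

# Assumed shape of the unwanted suffix: one or more
# "LID - <value> [doi]" or "LID - <value> [pii]" fragments
# at the very end of the title.
LID_SUFFIX = re.compile(r"(?:\s+LID - \S+ \[(?:doi|pii)\])+\s*$")


def strip_lid_suffix(title):
    """Return the title with any trailing LID fragment removed."""
    return LID_SUFFIX.sub("", title)


fixed = strip_lid_suffix(
    "Germline mutations in RAD51D confer susceptibility to "
    "ovarian cancer. LID - 10.1038/ng.893 [doi]"
)
print(fixed)
```

Anchoring the pattern at the end of the title keeps it from touching a title that merely mentions "LID -" somewhere in the middle.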
I still don't know why the import failed for Minaxi but worked
for me. I thought maybe I was lucky and she wasn't, but it
appears to actually be the other way around. If this hadn't
happened we would probably have imported many more records with
LID fields appended to titles before anyone noticed that they had
problems.
I'm also adding other CDR team members to this issue so they'll
be aware of what was done.
Attachment ImportTitlesWithLID.txt has been added with description: Citations with "LID -" appended to the titles
BZDATETIME::2011-08-16 21:27:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16
I'm marking the issue as resolved-fixed.
BZDATETIME::2011-08-16 21:38:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17
(In reply to comment #16)
> I'm marking the issue as resolved-fixed.
Maybe that was premature. We still need to find out if Minaxi can import successfully. If not, we'll re-open the issue and I'll work on it some more.
BZDATETIME::2011-08-16 22:01:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18
For the record, in case I ever need to recall what was done, here
is the SQL executed in test and production databases:
INSERT mt_importdef
(literaltag, TagLink, DataSource, TagType, CounterLink)
VALUES
('LID -', NULL, 'CMS2', 'M', NULL)
BZDATETIME::2011-08-17 08:58:50
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::19
(In reply to comment #17)
> (In reply to comment #16)
>
> > I'm marking the issue as resolved-fixed.
>
> Maybe that was premature. We still need to find out if Minaxi can import
> successfully. If not, we'll re-open the issue and I'll work on it some more.
I am not able to import the citation into the test database.
While talking to Cynthia, I discovered that you are testing the file
sent by Cynthia, which was created using an older version of IE than the
one I am using. I use IE version 9. I am attaching the file I created so
that you can test importing it. I can also test the file created by
Cynthia and let you know.
BZDATETIME::2011-08-17 09:01:50
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::20
Attachment breast_ova_gen_Aug11.txt has been added with description: citation record created using IE 9
BZDATETIME::2011-08-17 09:15:36
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::21
I tried importing the file created by Cynthia to test db and it worked.
BZDATETIME::2011-08-17 13:21:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22
(In reply to comment #20)
> Created attachment 2144 [details]
> citation record created using IE 9
It turns out that IE9 is putting Unicode Byte Order Marks at the beginning of the file. There are three bytes just in front of the "PMID". These bytes tell an application that what follows is Unicode and tell it how to interpret numbers that are encoded in more than 8 bits. You can't see or edit them in most text editors because most editors treat them as instructions to the software rather than as data.
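A quick way to confirm whether a saved file has this problem is to look at its first three bytes. The Python sketch below assumes the three bytes are the UTF-8 byte order mark (EF BB BF), which matches the UTF-8 encoding mentioned elsewhere in this issue. The function name is made up for the example.

```python
import codecs


def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf".
        return handle.read(3) == codecs.BOM_UTF8
```

Reading in binary mode matters here, since opening the file as text may silently decode or hide the mark.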
There are a number of ways to deal with this. There may be an option in IE9 to turn them off on saving. There may be a way to delete them outside of IE9. Or as a permanent fix, I can probably modify the import utility to recognize and delete them.
I'll look at all the options tomorrow. In the meantime, would it be practical to re-download the files with Firefox, or with IE8 on another system?
If there are a lot of saved files, already edited, can you hold off until tomorrow?
I'll work on a permanent solution when I'm in tomorrow.
In the meantime, I'm re-opening the issue.
It looks like the fix for the "LID" tag was a freebie on this one 🙂
BZDATETIME::2011-08-17 13:29:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::23
I fired up an old XP machine and found a temporary workaround that might still work on your more modern machine.
Call up the file to be imported in Notepad.
Click File / Save As
In the dialog box I see a drop down menu for Encoding.
Change that from UTF-8 to ANSI and save the file.
On my system, the Byte Order Marks disappeared in the saved file.
BZDATETIME::2011-08-17 13:32:45
BZCOMMENTOR::Bob Kline
BZCOMMENT::24
(In reply to comment #22)
> It turns out that IE9 is putting Unicode Byte Order Marks ....
Still more incentive to come up with a system which (a) stores the XML version of the citation document and (b) stores the document in the system directly from NLM (rather than storing it on the local machine and uploading it to CiteMS).
BZDATETIME::2011-08-17 13:36:59
BZCOMMENTOR::Alan Meyer
BZCOMMENT::25
(In reply to comment #23)
> Change that from UTF-8 to ANSI and save the file.
Although this works for removing the offending marks, in theory it could change the encoding and display of any non-ASCII characters.
In practice, I believe that this is a non-issue for the Medline Print Format that we're using. I believe that has already been converted to pure ASCII by the PubMed printer formatter before it comes to us.
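That belief can be checked per file rather than taken on trust. The Python sketch below, with function names invented for this example, reports whether anything other than a leading UTF-8 byte order mark falls outside 7-bit ASCII. If nothing does, re-saving as ANSI should not alter any character.

```python
import codecs


def non_ascii_offsets(data):
    """Return the offsets of all bytes outside the 7-bit ASCII range."""
    return [offset for offset, byte in enumerate(data) if byte > 0x7F]


def safe_to_resave_as_ansi(data):
    """True when, apart from a leading UTF-8 BOM, the data is pure ASCII."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return not non_ascii_offsets(data)
```

Both functions take the raw bytes of the file, for example the result of `open(path, "rb").read()`.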
BZDATETIME::2011-08-18 17:36:16
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26
The attached executable replaces the existing Import Utility program.
The only change is that it looks for Unicode "Byte Order Marks" anywhere in the file and deletes them before processing input data. This fixes the problem caused by IE9, and possibly other browsers in the future, that insert these marks at the beginning of saved screens that contain UTF-8 encoded characters.
The change uses a little overkill in removing these marks from anywhere in the file since they are only supposed to occur at the beginning of a file, but it should not hurt to remove them from anywhere and it protects against any inadvertent additions of BOMs caused by user editing or combining of files.
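The idea of the change can be illustrated in a few lines of Python. This is not the code in Cips_CMS.exe, only a sketch of the same technique, and it assumes the marks are UTF-8 byte order marks (EF BB BF). The function name is made up for the example.

```python
import codecs


def strip_boms(data):
    """Remove every UTF-8 byte order mark from the data, wherever it occurs."""
    return data.replace(codecs.BOM_UTF8, b"")


# Two downloaded files pasted together, each with its own leading BOM.
combined = (codecs.BOM_UTF8 + b"PMID- 21822267\n"
            + codecs.BOM_UTF8 + b"PMID- 218222670\n")
cleaned = strip_boms(combined)
```

In valid UTF-8 the byte sequence EF BB BF can only be the encoding of the character U+FEFF, so removing it everywhere should not damage any other character.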
I recommend the following installation procedure:
1. Rename your existing Cips_CMS.exe.
The convention I use for this is to rename it using the date of last use, for example: Cips_CMS_2011-08-17.exe
I suggest renaming this so that if I did something wrong in building the new executable it will be easy to re-instate the old one by simply renaming it back to its original name while waiting for me to fix things.
2. Download the new executable from this attachment and put it in the same directory where the old file was.
That should be all there is to it.
To make testing easier, I restored last night's backup of the production database to the test database. That will eliminate any test records that you may have entered into the test database and allow you to test again.
Attachment Cips_CMS.exe has been added with description: Executable import utility patched for Unicode Byte Order Mark inserted by IE9
BZDATETIME::2011-08-18 17:37:18
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27
In an optimistic spirit I am, once again, marking this Resolved-fixed.
BZDATETIME::2011-08-18 18:35:11
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::28
(In reply to comment #27)
> In an optimistic spirit I am, once again, marking this Resolved-fixed.
I copied the new exe file and imported citations to test database first and then to production db. I could import without any error.
Thanks for getting me out of the influence of the "rare confluence of cosmic rays"--Minaxi
BZDATETIME::2011-08-18 19:08:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::29
Another note on the changes:
The new executable should be used by everyone who uses the import utility, not just IE9 users. Nobody else needs it right now, but you might upgrade in the future to IE9 and have forgotten about this problem. In addition some future update of Firefox or one of the other browsers might do the same thing that Microsoft did with IE9.
The change works fine with files produced by the older browsers as well as the new one.
BZDATETIME::2011-08-18 19:10:40
BZCOMMENTOR::Alan Meyer
BZCOMMENT::30
(In reply to comment #28)
...
> Thanks for getting me out of the influence of the "rare confluence of cosmic
> rays"--Minaxi
Those cosmic rays can be dangerous. I often have to put on my special aluminum foil hat in order to reduce the number of new bugs I introduce into my software.
BZDATETIME::2011-08-22 17:08:17
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::31
Ok I am going to close this bug. Everything seems to be working fine with the new version. No other reported problems with importing citations.
File Name | Posted | User |
---|---|---|
breast_ova_gen_Aug11.txt | 2011-08-17 09:01:50 | |
Cips_CMS.exe | 2011-08-18 17:36:16 | |
ImportTitlesWithLID.txt | 2011-08-16 21:25:58 | |
late-effects_Aug11.txt | 2011-08-16 16:26:54 | |
wrong_id_error_citation.txt | 2011-08-16 11:11:38 | |