Issue Number | 4561 |
---|---|
Summary | [Citations] Consider changing import/storage mechanism for citations from PubMed |
Created | 2019-01-05 18:22:34 |
Issue Type | Improvement |
Submitted By | Juthe, Robin (NIH/NCI) [E] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2019-08-01 07:17:30 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.238319 |
We run into problems importing citations two or three times(?) a year when NLM unexpectedly changes something in their schema.
Bob made the following suggestion in OCECDR-4556:
"Looking to the future, I recommend that consideration be given to changing the approach used for importing and storing citation documents from NLM, by coming up with a specification of the information actually needed from those documents, adapting the schema accordingly, and using an XSL/T filter to transform what NLM exports into what you want to store. Information you don't need would be simply ignored. This approach would reduce the number of times you'd have to do anything in response to NLM's changes to their structure, and for those cases where they modify something you need to have imported, the change can be handled in the XSL/T filter without the need for having CBIIT install modifications to the software."
Let's discuss what this would entail and the anticipated LOE in a future status meeting.
Subtasks would include:
Identify what needs to be kept (in what structure)
Modify the schema
Globally change the existing Citation documents
Install the filter to extract what we need from what NLM gives us
Modify the import/update script to use the new filter
Modify existing reports
Modify XMetaL CSS/templates/etc.
Test
Deploy
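As a rough illustration of the "keep only what we need" idea behind the proposed filter, here is a minimal sketch. It is not the XSL/T filter itself: it uses Python's standard-library `xml.etree.ElementTree` as a stand-in, and the `KEEP` paths and the sample document are hypothetical, chosen only to show the mechanism (the real keep/drop list is the subject of this ticket).

```python
import xml.etree.ElementTree as ET

# Hypothetical keep list: element paths, relative to PubmedArticle, to retain.
KEEP = {
    "MedlineCitation",
    "MedlineCitation/PMID",
    "MedlineCitation/Article",
    "MedlineCitation/Article/ArticleTitle",
}

def prune(element, path=""):
    """Remove, in place, every descendant whose path is not in KEEP."""
    for child in list(element):  # copy the child list, since we mutate the parent
        child_path = f"{path}/{child.tag}" if path else child.tag
        if child_path in KEEP:
            prune(child, child_path)
        else:
            element.remove(child)
    return element

# Made-up sample; the structure only loosely imitates an NLM export.
SAMPLE = """<PubmedArticle>
  <MedlineCitation Status="MEDLINE">
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>Example title</ArticleTitle>
      <ELocationID>10.1000/example</ELocationID>
    </Article>
    <MeshHeadingList><MeshHeading/></MeshHeadingList>
  </MedlineCitation>
  <PubmedData><History/></PubmedData>
</PubmedArticle>"""

root = prune(ET.fromstring(SAMPLE))
kept_tags = [el.tag for el in root.iter()]
print(kept_tags)
```

Because anything not on the keep list is ignored, a new or renamed element on NLM's side would only matter if it appeared on that list. Attributes of kept elements pass through untouched in this sketch.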
I have attached a spreadsheet showing all of the elements we currently have under the /Citation/PubmedArticle portion of the Citation documents (we currently have a little under 50,000 Citation documents on PROD, and about 3/4 of them are from NLM). I haven't included the parts of their schema which reflect citations we never import (such as MedlineCitation/Article/Book), though in one or two places I make reference to these in the Notes column. The first column contains a proposed disposition for the element or attribute represented by the path, and the third column shows the number of occurrences for each in all the documents we have. Let's take this as a starting point to be tweaked and refined. Look for things I'm suggesting we can drop which you would prefer we retain, as well as things I've said we could keep which you really don't need any more. Bear in mind that we can always go back to PubMed for the original, and that the price we pay for including things we don't really need is that we increase the frequency with which we are likely to incur breakage by changes to the structure at NLM. Also, note that I am in some cases proposing that we eliminate some of the more stringent checks on the data on our end for things which should really be verified at the source by NLM (for example, letting the type of MedlineCitation/@Status be simply xsd:string instead of trying to keep track of changes in NLM's valid values list).
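For anyone wanting to reproduce the kind of per-path occurrence counts described for the spreadsheet, a sketch along the following lines might serve. It is an assumption about how such counts could be gathered, not the script actually used; `count_paths` and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_paths(element, counts, path=""):
    """Tally one occurrence for each element path and each attribute path."""
    here = f"{path}/{element.tag}"
    counts[here] += 1
    for name in element.attrib:
        counts[f"{here}/@{name}"] += 1
    for child in element:
        count_paths(child, counts, here)
    return counts

# Two made-up documents standing in for the real Citation documents.
DOCS = [
    '<Citation><PubmedArticle><MedlineCitation Status="MEDLINE">'
    '<PMID>1</PMID></MedlineCitation></PubmedArticle></Citation>',
    '<Citation><PubmedArticle><MedlineCitation Status="In-Process">'
    '<PMID>2</PMID><Article/></MedlineCitation></PubmedArticle></Citation>',
]

totals = Counter()
for text in DOCS:
    count_paths(ET.fromstring(text), totals)

# One line per path, with its occurrence count across all documents.
for path, n in sorted(totals.items()):
    print(n, path)
```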
Assigned to you, ~juther, for the review of the proposed dispositions in the attached spreadsheet.
Just a note that all of my remaining Joule tickets (including this one) are awaiting feedback, so I won't be able to make much more progress on the release until that feedback rolls in. (I'm doing some tentative work on OCECDR-4573, based on some guesses about what the answer to the outstanding questions might be, but I'm reluctant to go too far without confirmation.)
I have the scaffolding in place for much of this ticket, awaiting final decisions on the keep/drop list. I have a test global change job running right now. It will take a long time to finish, as we would expect, given the large number of Citation documents in the system, but there are already quite a few converted documents to take a look at.
https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2019-07-05_15-19-37
The Diff links won't be of much use, because the interface shows the new MedlineCitation element all on a single line, so I would instead suggest looking at the New links.
This is to give you something to look at while you're considering the options. It shouldn't be difficult to modify the in-progress filter to reflect changes to the proposed drop/keep list.
I presume we'll need to keep all the old parts of the schema to support looking at old versions of the documents in XMetaL, just marking elements or attributes as optional as appropriate.
The spreadsheet has a mistake in it. It's the Year element we need to keep, not the Day element.
I've reviewed the spreadsheet and attached a version with my comments. William, could you please ask Cynthia and/or Minaxi to take a look at this as well? Thanks.
Sure. I sent them the spreadsheet earlier this week while waiting for you to finish. It should be ready tomorrow after they have looked at the one you just attached.
I have attached a spreadsheet with a few comments from Minaxi and Cynthia. Thank you!
Does William's spreadsheet represent both Robin's and CIAT's decisions? Or do they need to be merged?
It should represent both decisions. We used Robin's spreadsheet and created a column to include CIAT comments.
The changes to the filter have been made to reflect the new drop/keep decisions. The global change job is running on DEV in test mode. It will take a while before it completes (there are over 39,000 documents to process), but while it's running you can get a look at the ones which have been done. Look carefully to verify that each of the new keep/drop decisions has been honored. Again, the diff output will be less useful than just looking at the new documents (in fact, this is even more true now that we're retaining the abstract).
https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2019-07-22_11-55-00
Test global change took a little over 8 hours.
Is the import tool ready to test as well, or should we just look at the global change test results?
If you're happy with the global change test results, go ahead and test the import tool.
I reviewed some of the citations from the global change test results and they appear to reflect the keep/drop decisions correctly. However, I imported one document (CDR0000794642) on DEV and noticed that the following elements are included, even though it looks like we had marked them to be excluded:
ELocationID
MeshHeadingList
History
We also said to keep the PublicationStatus element if it is needed for the pre-Medline citation report so I assume it is included because it is needed.
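A spot check like the one above could in principle be automated. The sketch below assumes the three element names just listed are on the drop list and scans a document for any that survived; `find_unexpected` and both sample documents are hypothetical and are not part of the CDR code.

```python
import xml.etree.ElementTree as ET

# Element names reported above as marked for exclusion.
DROPPED = {"ELocationID", "MeshHeadingList", "History"}

def find_unexpected(xml_text, dropped=DROPPED):
    """Return the sorted names of dropped elements still present in the document."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag in dropped})

# Made-up documents: one clean, one still carrying dropped elements.
CONVERTED = (
    "<Citation><PubmedArticle><MedlineCitation><PMID>1</PMID>"
    "</MedlineCitation></PubmedArticle></Citation>"
)
STALE = (
    "<Citation><PubmedArticle><MedlineCitation><PMID>2</PMID>"
    "<MeshHeadingList/></MedlineCitation>"
    "<PubmedData><History/></PubmedData></PubmedArticle></Citation>"
)

print(find_unexpected(CONVERTED))
print(find_unexpected(STALE))
```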
Ah, my mistake. I had been testing the interface earlier, but under a different script name, because I didn't want to disrupt the use of the older script until we had settled on the keep/drop decisions. Please try again. PublicationStatus isn't needed.
We've imported a few and they all look good now. Thanks!
Please run the global change in Live mode on DEV.
The job is running now.
What is the status of the job? I checked a few citations this morning and it doesn't look like they have been processed yet.
Looks like it aborted partway through (possibly by CBIIT maintenance). I'll rewrite the script so that it can be run a second time on the same tier and start it again. How was your flight?
It wasn't CBIIT maintenance (not directly, anyway), it was a bug in the code which captures information about a failed save so that the save attempt can be tried again as part of efforts to diagnose the failure. (The save failure itself may have been caused by CBIIT maintenance, but because of the bug in the code to capture information about the failure, we probably won't be able to determine that.) The failure occurred after 14 and 1/2 hours, at which point roughly 2/3 of the citations had been processed. I am running a modified version of the script, which skips past the citations which were already processed successfully. I'll let you know when it's done.
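The "skip what was already processed" approach can be sketched as follows. This is only an illustration under stated assumptions: `run_job`, `flaky`, and the integer document IDs are invented, the set of finished IDs is held in memory rather than persisted, and the sketch records a failure and moves on, which is a simplification and not a description of how the actual global change script behaves.

```python
def run_job(doc_ids, transform, done=None):
    """Process doc_ids, skipping any already in done.

    Returns (done, failures): the updated set of finished IDs and a dict
    mapping each failed ID to a description of the error.
    """
    done = set() if done is None else set(done)
    failures = {}
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # converted on an earlier run
        try:
            transform(doc_id)
        except Exception as exc:
            failures[doc_id] = repr(exc)  # keep details for later diagnosis
            continue
        done.add(doc_id)
    return done, failures

# Simulated transform: document 3 fails on its first attempt only.
attempts = []

def flaky(doc_id):
    attempts.append(doc_id)
    if doc_id == 3 and attempts.count(3) == 1:
        raise RuntimeError("simulated save failure")

done, failed = run_job([1, 2, 3, 4], flaky)
print(sorted(done), failed)

# Second run on the same "tier": only the unfinished document is retried.
done2, failed2 = run_job([1, 2, 3, 4], flaky, done)
print(sorted(done2), failed2)
```

In a real job the set of finished IDs would need to survive a crash, for example by being written to a file or derived from the documents themselves.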
OK. Thanks! The flight was great. We had some delays taking off at Dulles, which made the layover in Brussels even shorter. Thanks for asking.
The global change on DEV finished successfully this time, and it does look as if the original problem arose from some CBIIT server anomaly, as the document which failed the first time had no problems when I ran the job a second time.
Verified on DEV. Thank you!
Verified on QA. Thanks!
Verified on PROD.
File Name | Posted | User |
---|---|---|
pubmed-paths_RJ_CIAT.xlsx | 2019-07-11 11:33:37 | Osei-Poku, William (NIH/NCI) [C] |
pubmed-paths_RJ.xlsx | 2019-07-10 16:11:47 | Juthe, Robin (NIH/NCI) [E] |
pubmed-paths.xlsx | 2019-06-19 15:51:40 | Kline, Bob (NIH/NCI) [C] |