CDR Tickets

Issue Number 4561
Summary [Citations] Consider changing import/storage mechanism for citations from PubMed
Created 2019-01-05 18:22:34
Issue Type Improvement
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2019-08-01 07:17:30
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.238319
Description

We run into problems importing citations 2-3X(?) each year when NLM suddenly makes a change to something in their schema that we weren't expecting.

Bob made the following suggestion in OCECDR-4556:

"Looking to the future, I recommend that consideration be given to changing the approach used for importing and storing citation documents from NLM, by coming up with a specification of the information actually needed from those documents, adapting the schema accordingly, and using an XSL/T filter to transform what NLM exports into what you want to store. Information you don't need would be simply ignored. This approach would reduce the number of times you'd have to do anything in response to NLM's changes to their structure, and for those cases where they modify something you need to have imported, the change can be handled in the XSL/T filter without the need for having CBIIT install modifications to the software."

Let's discuss what this would entail and the anticipated LOE in a future status meeting.
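
The filtering idea in the suggestion above can be sketched in Python as a whitelist copy. This is a stand-in written with the standard library, not XSL/T and not the CDR's actual filter; the KEEP_PATHS set and the function names are hypothetical, and the real keep list would come from the agreed specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical keep list for illustration only.
KEEP_PATHS = {
    "MedlineCitation",
    "MedlineCitation/PMID",
    "MedlineCitation/Article",
    "MedlineCitation/Article/ArticleTitle",
}


def filter_element(source, path, keep=KEEP_PATHS):
    """Copy `source` if its path is whitelisted, recursing into children.

    Attributes of kept elements are copied unchanged. Text following a
    child element (its tail) is not copied, so mixed content would need
    extra handling in a real filter.
    """
    if path not in keep:
        return None  # anything not asked for is silently ignored
    target = ET.Element(source.tag, dict(source.attrib))
    target.text = source.text
    for child in source:
        kept = filter_element(child, f"{path}/{child.tag}", keep)
        if kept is not None:
            target.append(kept)
    return target


def filter_citation(xml_text, keep=KEEP_PATHS):
    """Return the filtered document as a string, or None if the root is dropped."""
    root = ET.fromstring(xml_text)
    kept = filter_element(root, root.tag, keep)
    return None if kept is None else ET.tostring(kept, encoding="unicode")
```

With this shape, an element NLM adds later is dropped without any change on our side, and a change to something we do keep is handled by editing the keep list.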

Comment entered 2019-01-05 18:28:38 by Kline, Bob (NIH/NCI) [C]

Subtasks would include:

  1. Identify what needs to be kept (in what structure)

  2. Modify the schema

  3. Globally change the existing Citation documents

  4. Install the filter to extract what we need from what NLM gives us

  5. Modify the import/update script to use the new filter

  6. Modify existing reports

  7. Modify XMetaL CSS/templates/etc.

  8. Test

  9. Deploy

Comment entered 2019-06-19 15:53:48 by Kline, Bob (NIH/NCI) [C]

I have attached a spreadsheet showing all of the elements we currently have under the /Citation/PubmedArticle portion of the Citation documents (we currently have a little under 50,000 Citation documents on PROD, and about 3/4 of them are from NLM). I haven't included the parts of their schema which reflect citations we never import (such as MedlineCitation/Article/Book), though in one or two places I make reference to these in the Notes column. The first column contains a proposed disposition for the element or attribute represented by the path, and the third column shows the number of occurrences for each in all the documents we have. Let's take this as a starting point to be tweaked and refined. Look for things I'm suggesting we can drop which you would prefer we retain, as well as things I've said we could keep which you really don't need any more. Bear in mind that we can always go back to PubMed for the original, and that the price we pay for including things we don't really need is that we increase the frequency with which we are likely to incur breakage by changes to the structure at NLM. Also, note that I am in some cases proposing that we eliminate some of the more stringent checks on the data on our end for things which should really be verified at the source by NLM (for example, letting the type of MedlineCitation/@Status be simply xsd:string instead of trying to keep track of changes in NLM's valid values list).
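
For reference, a per-path occurrence count like the one in the spreadsheet's third column could be produced along these lines. This is a minimal sketch, not the script actually used; count_paths is a hypothetical name, and it assumes the documents are available as XML strings.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_paths(xml_documents):
    """Count occurrences of every element and attribute path.

    Attribute paths are written as parent/@name, matching the notation
    used for MedlineCitation/@Status.
    """
    counts = Counter()

    def walk(node, path):
        counts[path] += 1
        for name in node.attrib:
            counts[f"{path}/@{name}"] += 1
        for child in node:
            walk(child, f"{path}/{child.tag}")

    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        walk(root, "/" + root.tag)
    return counts
```

Sorting the result by path gives one row per element or attribute, which is convenient for attaching a keep/drop disposition to each.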

Comment entered 2019-06-19 15:55:04 by Kline, Bob (NIH/NCI) [C]

Assigned to you for the review of the proposed dispositions in the attached spreadsheet.

Comment entered 2019-07-03 12:05:20 by Kline, Bob (NIH/NCI) [C]

Just a note that all of my remaining Joule tickets (including this one) are awaiting feedback, so I won't be able to make much more progress on the release until that feedback rolls in. (I'm doing some tentative work on OCECDR-4573, based on some guesses about what the answer to the outstanding questions might be, but I'm reluctant to go too far without confirmation.)

Comment entered 2019-07-05 15:29:04 by Kline, Bob (NIH/NCI) [C]

I have the scaffolding in place for much of this ticket, awaiting final decisions on the keep/drop list. I have a test global change job running right now. It will take a long time to finish, as we would expect, given the large number of Citation documents in the system, but there are already quite a few converted documents to take a look at.

https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2019-07-05_15-19-37

The Diff links won't be of much use, because the interface shows the new MedlineCitation element all on a single line, so I would instead suggest looking at the New links.

This is to give you something to look at while you're considering the options. It shouldn't be difficult to modify the in-progress filter to reflect changes to the proposed drop/keep list.

I presume we'll need to keep all the old parts of the schema to support looking at old versions of the documents in XMetaL, just marking elements or attributes as optional as appropriate.

Comment entered 2019-07-08 08:36:40 by Kline, Bob (NIH/NCI) [C]

The spreadsheet has a mistake in it. It's the Year element we need to keep, not the Day element.

Comment entered 2019-07-10 16:11:19 by Juthe, Robin (NIH/NCI) [E]

I've reviewed the spreadsheet and attached a version with my comments. William, could you please ask Cynthia and/or Minaxi to take a look at this as well? Thanks.

Comment entered 2019-07-10 21:40:17 by Osei-Poku, William (NIH/NCI) [C]

Sure. I sent them the spreadsheet earlier this week while waiting for you to finish. It should be ready tomorrow after they have looked at the one you just attached.

Comment entered 2019-07-11 11:34:09 by Osei-Poku, William (NIH/NCI) [C]

I have attached a spreadsheet (pubmed-paths_RJ_CIAT.xlsx) with a few comments from Minaxi and Cynthia. Thank you!

Comment entered 2019-07-22 08:52:56 by Kline, Bob (NIH/NCI) [C]

Does William's spreadsheet represent both Robin's and CIAT's decisions? Or do they need to be merged?

Comment entered 2019-07-22 08:59:11 by Osei-Poku, William (NIH/NCI) [C]

It should represent both decisions. We used Robin's spreadsheet and created a column to include CIAT comments.

Comment entered 2019-07-22 12:04:23 by Kline, Bob (NIH/NCI) [C]

The changes to the filter have been made to reflect the new drop/keep decisions. The global change job is running on DEV in test mode. It will take a while before it completes (there are over 39,000 documents to process), but while it's running you can get a look at the ones which have been done. Look carefully to verify that each of the new keep/drop decisions has been honored. Again, the diff output will be less useful than just looking at the new documents (in fact, this is even more true now that we're retaining the abstract).

https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2019-07-22_11-55-00

Comment entered 2019-07-23 08:35:05 by Kline, Bob (NIH/NCI) [C]

Test global change took a little over 8 hours.

Comment entered 2019-07-23 17:38:07 by Osei-Poku, William (NIH/NCI) [C]

Is the import tool ready to test as well, or should we just look at the global change test results?

Comment entered 2019-07-24 10:36:14 by Kline, Bob (NIH/NCI) [C]

If you're happy with the global change test results, go ahead and test the import tool.

Comment entered 2019-07-24 18:51:21 by Osei-Poku, William (NIH/NCI) [C]

I reviewed some of the citations from the global change test results and they appear to reflect the keep/drop decisions correctly. However, I imported one document (CDR0000794642) on DEV and noticed that the following elements are included even though we had marked them to be excluded:

ELocationID

MeshHeadingList

History

We also said to keep the PublicationStatus element if it is needed for the pre-Medline citation report so I assume it is included because it is needed.
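
A spot check like this one could be automated with a few lines of Python. This is only a sketch under assumptions: the DROPPED set lists just the three elements named above rather than the full drop list, and find_dropped_elements is a hypothetical helper, not part of the CDR.

```python
import xml.etree.ElementTree as ET

# Only the three elements named in this comment; the real drop list is longer.
DROPPED = {"ELocationID", "MeshHeadingList", "History"}


def find_dropped_elements(xml_text, dropped=DROPPED):
    """Return the sorted names of any dropped elements still in the document."""
    root = ET.fromstring(xml_text)
    return sorted({node.tag for node in root.iter() if node.tag in dropped})
```

An empty list means none of the listed elements survived the import; a non-empty list names the ones that did.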

Comment entered 2019-07-24 19:32:07 by Kline, Bob (NIH/NCI) [C]

Ah, my mistake. I had been testing the interface earlier, but under a different script name, because I didn't want to disrupt the use of the older script until we had settled on the keep/drop decisions. Please try again. PublicationStatus isn't needed.

Comment entered 2019-07-25 12:17:53 by Osei-Poku, William (NIH/NCI) [C]

We've imported a few and they all look good now. Thanks!

Please run the global change in Live mode on DEV.

Comment entered 2019-07-25 14:11:26 by Kline, Bob (NIH/NCI) [C]

The job is running now.

Comment entered 2019-07-29 07:59:22 by Osei-Poku, William (NIH/NCI) [C]

What is the status of the job? I checked a few citations this morning and it doesn't look like they have been processed yet.

Comment entered 2019-07-29 08:19:36 by Kline, Bob (NIH/NCI) [C]

Looks like it aborted partway through (possibly by CBIIT maintenance). I'll rewrite the script so that it can be run a second time on the same tier and start it again. How was your flight?

Comment entered 2019-07-29 08:37:16 by Kline, Bob (NIH/NCI) [C]

It wasn't CBIIT maintenance (not directly, anyway), it was a bug in the code which captures information about a failed save so that the save attempt can be tried again as part of efforts to diagnose the failure. (The save failure itself may have been caused by CBIIT maintenance, but because of the bug in the code to capture information about the failure, we probably won't be able to determine that.) The failure occurred after 14 and 1/2 hours, at which point roughly 2/3 of the citations had been processed. I am running a modified version of the script, which skips past the citations which were already processed successfully. I'll let you know when it's done.
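
The skip-what-is-already-done approach described above might look roughly like the following. It is a sketch rather than the modified script itself: run_resumable, the checkpoint file, and the process callback are all hypothetical, and unlike the original job it records a failed document and moves on instead of aborting.

```python
from pathlib import Path


def run_resumable(doc_ids, process, checkpoint_path):
    """Process each document once, even across restarts.

    IDs of successfully processed documents are appended to a checkpoint
    file, one per line, so a second run on the same tier skips them.
    Returns a dict mapping failed IDs to a description of the error.
    """
    checkpoint = Path(checkpoint_path)
    done = set(checkpoint.read_text().split()) if checkpoint.exists() else set()
    failures = {}
    with checkpoint.open("a") as log:
        for doc_id in doc_ids:
            if str(doc_id) in done:
                continue  # already converted by an earlier run
            try:
                process(doc_id)
            except Exception as exc:
                # Keep enough detail to retry and diagnose the save later.
                failures[doc_id] = repr(exc)
                continue
            log.write(f"{doc_id}\n")
            log.flush()  # make progress durable before the next document
    return failures
```

Recording progress after every successful save means an interruption costs at most the document in flight, which matters for a job that runs for many hours.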

Comment entered 2019-07-29 08:46:05 by Osei-Poku, William (NIH/NCI) [C]

OK. Thanks! The flight was great. We had some delays taking off at Dulles, which made the layover in Brussels even shorter. Thanks for asking.

Comment entered 2019-07-30 04:56:15 by Kline, Bob (NIH/NCI) [C]

The global change on DEV finished successfully this time, and it does look as if the original problem arose from some CBIIT server anomaly, as the document which failed the first time had no problems when I ran the job a second time.

Comment entered 2019-07-30 12:51:05 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thank you!

Comment entered 2019-08-05 12:25:12 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2019-08-27 12:29:33 by Juthe, Robin (NIH/NCI) [E]

Verified on PROD.

Attachments
File Name Posted User
pubmed-paths_RJ_CIAT.xlsx 2019-07-11 11:33:37 Osei-Poku, William (NIH/NCI) [C]
pubmed-paths_RJ.xlsx 2019-07-10 16:11:47 Juthe, Robin (NIH/NCI) [E]
pubmed-paths.xlsx 2019-06-19 15:51:40 Kline, Bob (NIH/NCI) [C]
