CDR Tickets

Issue Number 4556
Summary [Citation] Schema validation error
Created 2018-12-12 15:12:18
Issue Type Improvement
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2019-01-05 14:59:12
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.237465
Description

We are getting the following error message when importing some citations on PROD.

PMID 20820814

      • IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
        Element 'ReferenceList': This element is not expected. Expected is ( ObjectList ).

CDR0000796058

Comment entered 2019-01-03 14:40:37 by Juthe, Robin (NIH/NCI) [E]

Increased the priority on this as discussed in our status meeting. William estimated that about 80% of citations are now importing with errors, so they have stopped importing citations until the problem is fixed.

Comment entered 2019-01-04 16:44:51 by Kline, Bob (NIH/NCI) [C]

This is not going to be as straightforward as we might have originally thought. NLM has introduced a new element named Citation, which clashes with the top-level element of the CDR document into which we're trying to import their document. As you know, a DTD cannot have two different elements with the same name and conflicting definitions. NLM doesn't have this problem in their DTD, because they don't have the wrapper into which we insert their document. Here is the new structure they're introducing:

...
   <PubmedData>
     <History/>
     <PublicationStatus/>
     <ArticleIdList/>
     <ObjectList/>
     <ReferenceList> <!-- this block is new -->
       <Title/> <!-- optional -->
       <Reference> <!-- zero or more occurrences allowed -->
         <Citation/> <!-- REQUIRED; THIS IS THE PROBLEM ELEMENT -->
         <ArticleIdList/> <!-- optional -->
       </Reference>
       <ReferenceList/> <!-- zero or more allowed; note possible unlimited recursion -->
     </ReferenceList>
   </PubmedData>
   ...

I can think of a number of possible solutions.

  1. we strip the new ReferenceList blocks

  2. we import the new ReferenceList blocks but drop the Citation child element

  3. we filter the incoming NLM document and rename Citation to something that will not conflict with our own Citation element (and which we hope will not conflict with any other element NLM chooses to introduce in the future).

Which approach do you prefer? If the last, what name would you like us to use instead of Citation? All of these options bump the ticket out of the realm of things we can fix without CBIIT's assistance to get the changed import code installed on the upper tiers.
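
For illustration only, option 1 could be sketched roughly as follows. This is not the actual CDR import code: the function name strip_reference_lists and the sample document are hypothetical, and the sketch assumes the NLM record is available as an XML string and that the standard-library ElementTree parser is acceptable.

```python
import xml.etree.ElementTree as ET


def strip_reference_lists(xml_text):
    """Return the document with every ReferenceList element removed.

    Hypothetical helper for illustration; not the real import script.
    """
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so take a snapshot of all
    # elements and remove ReferenceList children from each parent.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "ReferenceList":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Made-up sample mirroring the structure shown above, including a
# nested ReferenceList.
sample = (
    "<PubmedData>"
    "<ArticleIdList/>"
    "<ReferenceList>"
    "<Reference><Citation>Some cited work</Citation></Reference>"
    "<ReferenceList/>"
    "</ReferenceList>"
    "</PubmedData>"
)
cleaned = strip_reference_lists(sample)
```

Removing the outermost ReferenceList also discards anything nested inside it, so the recursion allowed by the new structure needs no special handling.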

Comment entered 2019-01-04 18:37:23 by Osei-Poku, William (NIH/NCI) [C]

I vote for #1. I don't think we need this information. It looks like a list of references cited in the journal article.

Comment entered 2019-01-04 19:50:51 by Juthe, Robin (NIH/NCI) [E]

Thanks for the analysis, Bob. I tend to agree with William. I don't think we need it. It could also make the documents much larger if it's actually a list of every reference.

Comment entered 2019-01-05 07:55:13 by Kline, Bob (NIH/NCI) [C]

Give it a try on DEV.

Comment entered 2019-01-05 12:15:21 by Osei-Poku, William (NIH/NCI) [C]

I was able to import about three successfully. However, two others are coming up with errors:
PMID 25099306
PMID 26530227

Comment entered 2019-01-05 12:47:50 by Kline, Bob (NIH/NCI) [C]

Interesting. They've added a new attribute which wasn't mentioned at all in the documentation of what's been changed. I re-installed the schema (on DEV) with the stealth attribute included.

Looking to the future, I recommend that consideration be given to changing the approach used for importing and storing citation documents from NLM, by coming up with a specification of the information actually needed from those documents, adapting the schema accordingly, and using an XSL/T filter to transform what NLM exports into what you want to store. Information you don't need would be simply ignored. This approach would reduce the number of times you'd have to do anything in response to NLM's changes to their structure, and for those cases where they modify something you need to have imported, the change can be handled in the XSL/T filter without the need for having CBIIT install modifications to the software.
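
As a rough sketch of that idea, written in Python as a stand-in for the XSL/T filter proposed above, a whitelist-based copy might look like the following. The WANTED set is purely hypothetical (the real list would come from the specification of what the users actually need), keep_wanted is an invented name, and the sample record is made up apart from the PMID taken from this ticket.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist; the real one would come from the users'
# specification of the information they need from NLM.
WANTED = {"PubmedArticle", "MedlineCitation", "PMID", "Article", "ArticleTitle"}


def keep_wanted(element):
    """Copy element, keeping only child elements whose tags are in WANTED."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        if child.tag in WANTED:
            kept = keep_wanted(child)
            kept.tail = child.tail
            copy.append(kept)
        # Anything not on the whitelist is silently ignored, so new
        # NLM elements cannot break the import.
    return copy


incoming = ET.fromstring(
    "<PubmedArticle>"
    "<MedlineCitation><PMID>20820814</PMID>"
    "<Article><ArticleTitle>Example title</ArticleTitle></Article>"
    "</MedlineCitation>"
    "<PubmedData><ReferenceList><Reference><Citation>x</Citation></Reference>"
    "</ReferenceList></PubmedData>"
    "</PubmedArticle>"
)
filtered = ET.tostring(keep_wanted(incoming), encoding="unicode")
```

This sketch copies attributes unchanged; a real filter would presumably whitelist attributes as well, since an undocumented attribute is what caused the second round of failures here.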

Comment entered 2019-01-05 13:39:33 by Osei-Poku, William (NIH/NCI) [C]

I have been able to successfully update the two that failed so it looks like we are good to go to QA for more testing.

Comment entered 2019-01-05 13:44:34 by Kline, Bob (NIH/NCI) [C]

Installed schema and import software changes on QA.

Comment entered 2019-01-05 14:25:40 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thank you!

Comment entered 2019-01-05 14:59:12 by Kline, Bob (NIH/NCI) [C]

Ticket NCI-RITM0155660 has been submitted to CBIIT for patching the import script on STAGE and PROD. The schema (which does not depend on the patch to the script) has already been updated on all four tiers.

Comment entered 2019-01-05 18:19:57 by Juthe, Robin (NIH/NCI) [E]

Thanks so much, Bob. And thank you William for testing this! Just to be sure I understand, we'll need to wait for CBIIT to install the import script patch before we can import the problem citations on PROD, right?

Your proposal for changing the import mechanism makes sense to me, Bob. These import problems really throw a wrench in our processes so having something more flexible and in our control sounds good. I'll enter a ticket for us to consider that in the future.

Comment entered 2019-01-05 18:24:29 by Kline, Bob (NIH/NCI) [C]

Yes, we need CBIIT to install that patch before imports will work on PROD. I told them there's no need in this case to coordinate the timing of the patch with the users: the script it fixes is already broken, so applying the patch shouldn't be able to introduce any additional downtime. I asked them to do it ASAP.

Comment entered 2019-01-05 18:25:44 by Juthe, Robin (NIH/NCI) [E]

Sounds good. Thank you!

Comment entered 2019-01-07 11:04:51 by Kline, Bob (NIH/NCI) [C]

STAGE is patched. Please verify and let me know so I can have CBIIT proceed with PROD.

Comment entered 2019-01-07 11:09:34 by Juthe, Robin (NIH/NCI) [E]

Verified on STAGE. I was able to import PMID 26530227, PMID 25099306, and PMID 20820814 successfully. Thanks.

Comment entered 2019-01-07 11:27:11 by Osei-Poku, William (NIH/NCI) [C]

Verified on STAGE as well.
Successful import of

CDR0000793594 (PMID: 30602793)
CDR0000793596 (PMID: 30605229)
CDR0000793597 (PMID: 30580963)

Comment entered 2019-01-07 11:34:19 by Kline, Bob (NIH/NCI) [C]

Patch is on PROD. Please verify.

Comment entered 2019-01-07 12:00:10 by Kline, Bob (NIH/NCI) [C]

Can I confirm that this is done successfully on PROD and have CBIIT close this ticket?

Comment entered 2019-01-07 12:14:49 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Attachments
File Name Posted User
PubMed Import Error - PMID25099306.png 2019-01-05 12:14:13 Osei-Poku, William (NIH/NCI) [C]
PubMed Import Error - PMID26530227.png 2019-01-05 12:14:13 Osei-Poku, William (NIH/NCI) [C]
