Issue Number | 4556 |
---|---|
Summary | [Citation] Schema validation error |
Created | 2018-12-12 15:12:18 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2019-01-05 14:59:12 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.237465 |
We are getting the following error message when importing some citations on PROD.
PMID 20820814
IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
Element 'ReferenceList': This element is not expected. Expected is ( ObjectList ).
CDR0000796058
Increased the priority on this as discussed in our status meeting. William estimated that about 80% of citations are now importing with errors, so they have stopped importing citations until the problem is fixed.
This is not going to be as straightforward as we might have originally thought. NLM has introduced a new element named Citation, which clashes with the top-level element of the CDR document into which we're trying to import their document. As you know, a DTD cannot have two different elements with the same name and conflicting definitions. NLM doesn't have this problem in their DTD, because they don't have the wrapper into which we insert their document.
Here is the new structure they're introducing:
...
<PubmedData>
  <History/>
  <PublicationStatus/>
  <ArticleIdList/>
  <ObjectList/>
  <ReferenceList>       <!-- this block is new -->
    <Title/>            <!-- optional -->
    <Reference>         <!-- zero or more occurrences allowed -->
      <Citation/>       <!-- REQUIRED; THIS IS THE PROBLEM ELEMENT -->
      <ArticleIdList/>  <!-- optional -->
    </Reference>
    <ReferenceList/>    <!-- zero or more allowed; note possible unlimited recursion -->
  </ReferenceList>
</PubmedData>
</ ...
I can think of a number of possible solutions.
1. we strip the new ReferenceList blocks
2. we import the new ReferenceList blocks but drop the Citation child element
3. we filter the incoming NLM document and rename Citation to something that will not conflict with our own Citation element (and which we hope will not conflict with any other element NLM chooses to introduce in the future).
Which approach do you prefer? If the last, what name would you like us to use instead of Citation? All of these options bump the ticket out of the realm of things we can fix without CBIIT's assistance to get the changed import code installed on the upper tiers.
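For illustration only, the first option (stripping the new ReferenceList blocks before the NLM record is wrapped in the CDR Citation document) might look roughly like the sketch below. This is not the actual import script: the function name `strip_reference_lists` and the sample record are hypothetical, and the sample contains only enough of NLM's structure to show the effect.

```python
import xml.etree.ElementTree as ET


def strip_reference_lists(elem):
    """Remove every ReferenceList child found under elem, in place.

    Returns the number of blocks removed. A removed block takes its nested
    Reference, Citation, and inner ReferenceList elements with it, so there
    is no need to descend into it.
    """
    removed = 0
    # Iterate over a copy of the child list so removal is safe.
    for child in list(elem):
        if child.tag == "ReferenceList":
            elem.remove(child)
            removed += 1
        else:
            removed += strip_reference_lists(child)
    return removed


# Hypothetical, heavily abbreviated record for demonstration purposes.
SAMPLE = """<PubmedArticle>
  <MedlineCitation><PMID>20820814</PMID></MedlineCitation>
  <PubmedData>
    <ArticleIdList/>
    <ReferenceList>
      <Reference><Citation>Example reference text</Citation></Reference>
      <ReferenceList/>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>"""

root = ET.fromstring(SAMPLE)
removed_count = strip_reference_lists(root)
cleaned = ET.tostring(root, encoding="unicode")
```

Because the whole block is dropped, NLM's Citation element never reaches the CDR document, so the name clash with the CDR's own top-level Citation element does not arise.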
I vote for #1. I don't think we need this information. It looks like a list of references cited in the journal article.
Thanks for the analysis, Bob. I tend to agree with William. I don't think we need it. It could also make the documents much larger if it's actually a list of every reference.
Give it a try on DEV.
I was able to import about three successfully. However, two others are coming up with errors:
PMID25099306
PMID26530227
Interesting. They've added a new attribute which wasn't mentioned at all in the documentation of what's been changed. I re-installed the schema (on DEV) with the stealth attribute included.
Looking to the future, I recommend that consideration be given to changing the approach used for importing and storing citation documents from NLM, by coming up with a specification of the information actually needed from those documents, adapting the schema accordingly, and using an XSL/T filter to transform what NLM exports into what you want to store. Information you don't need would be simply ignored. This approach would reduce the number of times you'd have to do anything in response to NLM's changes to their structure, and for those cases where they modify something you need to have imported, the change can be handled in the XSL/T filter without the need for having CBIIT install modifications to the software.
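To make the recommendation above concrete, here is a minimal sketch of the "keep only what we need" idea. It is written in Python rather than XSL/T purely for illustration, and it is not existing CDR code. The whitelist of elements and attributes is hypothetical; the real one would come from the specification of what is actually needed from NLM's documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attribute names to keep.
# Anything not listed here is silently ignored.
WANTED = {
    "PubmedArticle": set(),
    "MedlineCitation": {"Status"},
    "PMID": {"Version"},
    "Article": set(),
    "ArticleTitle": set(),
}


def keep_wanted(elem, wanted=WANTED):
    """Return a filtered copy of elem, or None if elem is not whitelisted."""
    if elem.tag not in wanted:
        return None
    allowed = wanted[elem.tag]
    attrs = {name: value for name, value in elem.attrib.items() if name in allowed}
    copy = ET.Element(elem.tag, attrs)
    copy.text = elem.text
    for child in elem:
        kept = keep_wanted(child, wanted)
        if kept is not None:
            kept.tail = child.tail
            copy.append(kept)
    return copy


# Hypothetical, abbreviated record; "NewAttr" stands in for an attribute
# that NLM adds without announcing it.
SAMPLE = """<PubmedArticle>
  <MedlineCitation Status="MEDLINE" NewAttr="x">
    <PMID Version="1">25099306</PMID>
    <Article><ArticleTitle>Example title</ArticleTitle></Article>
  </MedlineCitation>
  <PubmedData>
    <ReferenceList><Reference><Citation>Example</Citation></Reference></ReferenceList>
  </PubmedData>
</PubmedArticle>"""

filtered = keep_wanted(ET.fromstring(SAMPLE))
```

With a filter of this kind, a new block such as ReferenceList or an undocumented attribute would be dropped before validation instead of causing a schema error. One limitation of this simple version is that text following a dropped child inside mixed content is lost along with it.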
I have been able to successfully update the two that failed so it looks like we are good to go to QA for more testing.
Installed schema and import software changes on QA.
Verified on QA. Thank you!
Ticket NCI-RITM0155660 has been submitted to CBIIT for patching the import script on STAGE and PROD. The schema (which does not depend on the patch to the script) has already been updated on all four tiers.
Thanks so much, Bob. And thank you William for testing this! Just to be sure I understand, we'll need to wait for CBIIT to install the import script patch before we can import the problem citations on PROD, right?
Your proposal for changing the import mechanism makes sense to me, Bob. These import problems really throw a wrench in our processes so having something more flexible and in our control sounds good. I'll enter a ticket for us to consider that in the future.
Yes, we need CBIIT to install that patch before imports will work on PROD. I told them there's no need in this case to coordinate the timing of this patch with the users, as the script it fixes is already broken, so applying the patch shouldn't be able to introduce any additional downtime, and asked them to do it ASAP.
Sounds good. Thank you!
STAGE is patched. Please verify and let me know so I can have CBIIT proceed with PROD.
Verified on STAGE. I was able to import PMID26530227, PMID25099306, and PMID 20820814 successfully. Thanks.
Verified on STAGE as well.
Successful import of
CDR0000793594 (PMID: 30602793)
CDR0000793596 (PMID: 30605229)
CDR0000793597 (PMID: 30580963)
Patch is on PROD. Please verify.
~oseipokuw: Can I confirm that this is done successfully on PROD and have CBIIT close this ticket?
Verified on PROD. Thanks!
File Name | Posted | User |
---|---|---|
PubMed Import Error - PMID25099306.png | 2019-01-05 12:14:13 | Osei-Poku, William (NIH/NCI) [C] |
PubMed Import Error - PMID26530227.png | 2019-01-05 12:14:13 | Osei-Poku, William (NIH/NCI) [C] |