Issue Number | 3664 |
---|---|
Summary | [Citations] Possible changes to Citations Schema |
Created | 2013-09-12 13:14:54 |
Issue Type | Bug |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2013-09-26 07:40:05 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.113138 |
It looks like it's that time of year when NLM makes a lot of changes to their data elements, which in turn affect our imports. We've had three citation import failures recently, and it looks like this pattern will continue until we make changes on our end. I am creating this issue for us to discuss how to go about making possible changes to the citations schema.
It looks like the 2013 DTD and XML changes can be found at the following link:
http://www.nlm.nih.gov/bsd/licensee/announce/2012.html#d08_10
Please take a look.
Could you post the PubMed IDs of the three citations which failed to import?
Here are the PMIDs. The CDR IDs and two of the error messages can be found below also. I don't have the error message for the third one.
22841674
23210953
23090888
Citation added as CDR0000752468 (with validation errors)
IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
Invalid value: 'UNASSIGNED' in attribute NlmCategory of element
AbstractText
752389
IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
No match found in content model for type Article with child
elements of Article element
(Journal,ArticleTitle,Pagination,ELocationID,Abstract,Affiliation,AuthorList,Language,Language,PublicationTypeList);
stopped at element Journal
751463
I don't have the exact error message for this one.
I just looked at the first document, and I don't see the error (in fact, the only version in the version history table is marked as valid). Is this not on production? Or did you alter the data?
Yes. I had to alter the data in order to get the documents validated.
Is that a good idea? Hacking NLM's data introduces incorrect information. Also, it's making it more difficult to track down what needs to be modified in the schema. For example, it appears that the Language child element under Article has been made multiply-occurring, but the change notes to which you provided the link don't make any mention of this, so we're having to identify the changes by knowing what causes documents to be invalid. Do you have any recollection at all of what was wrong for the third document?
Can you import from QA or DEV to see the errors?
Do you have a record of all of the Citation documents which have been hacked to get around the validation checks, so we can restore NLM's original data?
The three I provided are the only records.
NLM has changed the definition of the CommentsCorrections element in a way that is incompatible with 3,866 of our Citation documents (again, absolutely no mention of this change in the change documentation linked above). I will attempt to come up with a convoluted schema definition which works with both definitions (it will be ugly), unless you tell me you would prefer that we create a global change to make all of those documents conform to NLM's new definition.
I've done as much as I can. NLM doesn't make it easy. At some point we may want to consider modifying the approach we're currently using, which breaks our validation every time NLM sneezes, with an import which uses a filter to take what we need and put it into a stable structure. Would take some work, but over the long haul it might be worth it.
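To make the filter idea concrete, here is a minimal sketch, assuming the import only needs a handful of fields. The function name `extract_citation`, the output keys, and the sample record are all hypothetical and are not taken from the CDR code or from a real NLM record. The point is only that elements and attribute values we never read (a new child of Article, an unexpected NlmCategory value, a repeated Language) cannot break the import.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, loosely shaped like NLM citation XML.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <Journal><Title>Example Journal</Title></Journal>
      <ArticleTitle>Example title</ArticleTitle>
      <Abstract>
        <AbstractText Label="BACKGROUND" NlmCategory="UNASSIGNED">First part.</AbstractText>
        <AbstractText>Second <i>part</i>.</AbstractText>
      </Abstract>
      <Language>eng</Language>
      <Language>fre</Language>
      <SomeFutureElement>ignored</SomeFutureElement>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def text_of(node):
    """Return the flattened text of a node, or '' if the node is missing."""
    return "".join(node.itertext()).strip() if node is not None else ""


def extract_citation(xml_text):
    """Copy only the fields we care about into a stable structure.

    Anything else in the source document is ignored, so upstream
    additions do not affect the result.
    """
    root = ET.fromstring(xml_text)
    article = root.find(".//Article")
    if article is None:
        raise ValueError("no Article element found")
    return {
        "pmid": text_of(root.find(".//PMID")),
        "title": text_of(article.find("ArticleTitle")),
        "journal": text_of(article.find("Journal/Title")),
        # Language may occur more than once, so always collect a list.
        "languages": [text_of(n) for n in article.findall("Language")],
        # Inline markup inside AbstractText is flattened to plain text.
        "abstract": [
            {"label": n.get("Label", ""), "text": text_of(n)}
            for n in article.findall("Abstract/AbstractText")
        ],
    }


if __name__ == "__main__":
    print(extract_citation(SAMPLE))
```

The tradeoff is that we would no longer be storing NLM's record verbatim, so anything the filter drops would have to be fetched again from NLM if we needed it later.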
There is one part of their DTDs which is left dangling, so I had to guess what the definition might be for that part of the schema. I filed a request for information on where to find the missing entity definition, but got back a canned response which said I might get back a response within four business days, unless they're busy, in which case it might take longer.
The modified schema has been installed on DEV. I imported the three citations, without validation errors, as:
CDR747717
CDR747718
CDR747719
I've attached the SVN diff, in case you want Volker to adjust the CSS.
Ready for user review.
I have reviewed these in DEV and they all look good. I wasn't able to download the attached diff though. I will try it on another machine later today. We can proceed without any CSS changes.
I have tracked down the definition of %text; and have incorporated it into the Citation schema on DEV. This introduces some new markup elements for the AbstractText element. Please verify that the schema and DTD still work correctly on DEV.
Verified on DEV. Everything appears to be working correctly.
I believe this is on production. Please verify and close the issue if everything looks OK. If we missed something, we can open a new ticket.
Verified on Prod.
File Name | Posted | User |
---|---|---|
CitationSchema.diff | 2013-09-25 15:33:23 |