CDR Tickets

Issue Number 3664
Summary [Citations] Possible changes to Citations Schema
Created 2013-09-12 13:14:54
Issue Type Bug
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2013-09-26 07:40:05
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.113138
Description

It looks like it's that time of year when NLM makes a lot of changes to their data elements, which in turn affect our imports. We've had three citation import failures recently and it looks like this pattern will continue until we make changes on our end. I am creating this issue for us to discuss how to go about making possible changes to the citations schema.

Comment entered 2013-09-23 09:43:26 by Osei-Poku, William (NIH/NCI) [C]

It looks like the 2013 DTD and XML changes can be found at the following link:

http://www.nlm.nih.gov/bsd/licensee/announce/2012.html#d08_10

Please take a look.

Comment entered 2013-09-24 15:12:57 by Kline, Bob (NIH/NCI) [C]

Could you post the PubMed IDs of the three citations which failed to import?

Comment entered 2013-09-25 07:09:45 by Osei-Poku, William (NIH/NCI) [C]

Here are the PMIDs. The CDR IDs and two of the error messages can be found below also. I don't have the error message for the third one.

22841674
23210953
23090888

Citation added as CDR0000752468 (with validation errors)

      • IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
        Invalid value: 'UNASSIGNED' in attribute NlmCategory of element AbstractText

752389

      • IMPORTED WITH ERRORS *** PUBLISHABLE VERSION NOT CREATED
        No match found in content model for type Article with child elements of Article element (Journal,ArticleTitle,Pagination,ELocationID,Abstract,Affiliation,AuthorList,Language,Language,PublicationTypeList); stopped at element Journal

751463

I don't have the exact error message for this one.

Comment entered 2013-09-25 11:17:38 by Kline, Bob (NIH/NCI) [C]

I just looked at the first document, and I don't see the error (in fact, the only version in the version history table is marked as valid). Is this not on production? Or did you alter the data?

Comment entered 2013-09-25 11:38:24 by Osei-Poku, William (NIH/NCI) [C]

Yes. I had to alter the data in order to get the documents validated.

Comment entered 2013-09-25 11:48:36 by Kline, Bob (NIH/NCI) [C]

Is that a good idea? Hacking NLM's data introduces incorrect information. Also, it's making it more difficult to track down what needs to be modified in the schema. For example, it appears that the Language child element under Article has been made multiply-occurring, but the change notes to which you provided the link don't make any mention of this, so we're having to identify the changes by knowing what causes documents to be invalid. Do you have any recollection at all of what was wrong for the third document?

Comment entered 2013-09-25 11:55:31 by Osei-Poku, William (NIH/NCI) [C]

Can you import from QA or DEV to see the errors?

Comment entered 2013-09-25 12:18:31 by Kline, Bob (NIH/NCI) [C]

Do you have a record of all of the Citation documents which have been hacked to get around the validation checks, so we can restore NLM's original data?

Comment entered 2013-09-25 12:21:58 by Osei-Poku, William (NIH/NCI) [C]

The three I provided are the only records.

Comment entered 2013-09-25 14:13:33 by Kline, Bob (NIH/NCI) [C]

NLM has changed the definition of the CommentsCorrections element in a way that is incompatible with 3,866 of our Citation documents (again, absolutely no mention of this change in the change documentation linked above). I will attempt to come up with a convoluted schema definition which works with both definitions (it will be ugly), unless you tell me you would prefer that we create a global change to make all of those documents conform to NLM's new definition.

Comment entered 2013-09-25 15:33:23 by Kline, Bob (NIH/NCI) [C]

I've done as much as I can. NLM doesn't make it easy. At some point we may want to consider replacing the approach we're currently using, which breaks our validation every time NLM sneezes, with an import which uses a filter to take what we need and put it into a stable structure. It would take some work, but over the long haul it might be worth it.
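To make the filter idea concrete, here is a minimal sketch in Python of what such an import step might look like. It is only an illustration under stated assumptions: the function name extract_citation, the shape of the output dictionary, the sample record, and the placeholder PMID are all hypothetical, and nothing here reflects how the CDR import is actually implemented. The point it shows is that the import reads only the elements we care about, so elements or attribute values NLM adds later are ignored instead of failing validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, invented for illustration only.
# The PMID is a placeholder, not a real citation.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <Journal><Title>Example Journal</Title></Journal>
      <ArticleTitle>An example title</ArticleTitle>
      <Abstract>
        <AbstractText Label="BACKGROUND" NlmCategory="UNASSIGNED">First part.</AbstractText>
        <AbstractText>Second part.</AbstractText>
      </Abstract>
      <Language>eng</Language>
      <Language>fre</Language>
      <SomeFutureElement>not needed by us</SomeFutureElement>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def extract_citation(xml_text):
    """Copy only the fields we need into a stable, locally defined structure.

    Hypothetical helper: the field names in the returned dictionary are
    assumptions chosen for this sketch.
    """
    root = ET.fromstring(xml_text)
    article = root.find(".//Article")
    if article is None:
        raise ValueError("no Article element found")

    def text_of(parent, path):
        # Flatten any inline markup inside the element to plain text.
        node = parent.find(path)
        return "".join(node.itertext()).strip() if node is not None else ""

    return {
        "pmid": text_of(root, ".//PMID"),
        "title": text_of(article, "ArticleTitle"),
        "journal": text_of(article, "Journal/Title"),
        # Language may occur more than once, so collect every occurrence.
        "languages": [n.text.strip() for n in article.findall("Language") if n.text],
        # Attributes such as NlmCategory are deliberately not copied.
        "abstract": [
            "".join(n.itertext()).strip()
            for n in article.findall("Abstract/AbstractText")
        ],
    }


if __name__ == "__main__":
    print(extract_citation(SAMPLE))
```

With a step like this, our own schema would only need to describe the stable output structure, so an unexpected attribute value or a repeated Language element in NLM's data would not by itself make a document invalid. The trade-off is that we would no longer store NLM's record verbatim.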

There is one part of their DTDs which is left dangling, so I had to guess what the definition might be for that part of the schema. I filed a request for information on where to find the missing entity definition, but got back a canned response which said I might get a reply within four business days, unless they're busy, in which case it might take longer.

The modified schema has been installed on DEV. I imported the three citations, without validation errors, as:

CDR747717
CDR747718
CDR747719

I've attached the SVN diff, in case you want Volker to adjust the CSS.

Ready for user review.

Comment entered 2013-09-26 07:39:48 by Osei-Poku, William (NIH/NCI) [C]

I have reviewed these in DEV and they all look good. I wasn't able to download the attached diff though. I will try it on another machine later today. We can proceed without any CSS changes.

Comment entered 2013-10-05 09:58:05 by Kline, Bob (NIH/NCI) [C]

I have tracked down the definition of %text; and have incorporated it into the Citation schema on DEV. This introduces some new markup elements for the AbstractText element. Please verify that the schema and DTD still work correctly on DEV.

Comment entered 2013-10-07 12:30:00 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Everything appears to be working correctly.

Comment entered 2013-12-06 15:43:24 by Kline, Bob (NIH/NCI) [C]

I believe this is on production. Please verify and close the issue if everything looks OK. If we miss something we can open a new ticket.

Comment entered 2013-12-11 14:39:57 by Osei-Poku, William (NIH/NCI) [C]

Verified on Prod.

Attachments
File Name Posted User
CitationSchema.diff 2013-09-25 15:33:23
