CDR Tickets

Issue Number 4434
Summary [Citations] NLM Changes to Math Formulas (Deadline: June 1)
Created 2018-03-08 16:14:37
Issue Type Improvement
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2018-05-02 16:19:42
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.222395
Description

Per Bob's email below, NLM will be making some changes June 1 to the way math formulas are presented in article titles. We expect this change to affect very few (if any) citations, so we've decided to go with option #2 in his email below.

----------
From: Bob Kline <bkline@rksystems.com>
Sent: Friday, March 2, 2018 11:52 AM
To: Beckwith, Margaret (NIH/NCI) [E] <mbeckwit@icic.nci.nih.gov>; Juthe, Robin (NIH/NCI) [E] <robin.juthe@nih.gov>; Englisch, Volker (NIH/NCI) [C] <volker@mail.nih.gov>; Osei-Poku, William <William.Osei-Poku@icf.com>
Cc: Dugan, Amy (NIH/NCI) [C] <amy.dugan@nih.gov>; Sun, Victoria (NIH/NCI) [C] <victoria.sun@nih.gov>; Bennett, Cameron (NIH/NCI) [C] <cameron.bennett@nih.gov>
Subject: Fwd: [Utilities-announce] PubMed E-Utilities DTD Mid-year Update June 2018

Hi, all. Just got this notice from NLM about a fairly disruptive change in the Citations DTD. They're adding an <mml:math> element block to a bunch of elements, including the ArticleTitle element. Part of the disruption is that they are – for the first time – adding element names (and possibly attributes) with namespaces. More problematic would be the handling of things like titles. I haven't really delved into the new DTD files, but off the top of my head we have two basic options:
1. undertake an extensive project to modify our schemas, QC reports, filters, XMetaL DTDs, etc. to accommodate this change; or
2. modify the citation import script to replace all of the <mml:math> elements in the incoming documents with the text "[formula]".
I recommend the second option, which is basically what NLM used to do, if I understand correctly. Let's discuss at our next status meeting. This change will have an impact not just on the CDR but also the EBMS, I would think.

Thanks,
Bob

                      ---------- Forwarded message ----------
                      From: <utilities-announce@ncbi.nlm.nih.gov>
                      Date: Fri, Mar 2, 2018 at 11:22 AM
                      Subject: [Utilities-announce] PubMed E-Utilities DTD Mid-year Update June 2018
                      To: NLM/NCBI List utilities-announce <utilities-announce@ncbi.nlm.nih.gov>

Dear NCBI PubMed E-Utilities Users,

On June 1, 2018, we will begin displaying formulas in citation titles, abstracts, and keywords in PubMed. Today, formulas are replaced with [Formula: see text]. With this enhancement to PubMed, you will see formulas in the PubMed summary and abstract displays when these data are available. We will also be including the MathML 3.0 element tags in PubMed XML.

To support the addition of MathML tagging in our XML, we have created a new, forthcoming DTD which will be in use as of June 1, 2018. You can download the forthcoming DTD for June 2018 now. Existing content will be valid against the new DTD. You can also download sample XML files with MathML 3.0 tags.

Thank you,
PubMed Development Team

Comment entered 2018-04-04 15:55:26 by Kline, Bob (NIH/NCI) [C]
--- CiteSearch.py    2018-04-02 19:03:13.141709800 -0400
+++ CiteSearch-ocecdr-4434.py   2018-04-04 15:50:32.789876700 -0400
@@ -5,6 +5,7 @@
 # BZIssue::5174
 # JIRA::OCECDR-3456
 # JIRA::OCECDR-4201
+# JIRA::OCECDR-4434
 #----------------------------------------------------------------------
 import cgi, cdr, cdrcgi, cdrdb, requests, sys, lxml.etree as etree

@@ -103,6 +104,15 @@
         cdrcgi.bail("unable to parse document from NLM")
     for node in tree.findall("PubmedArticle"):
         etree.strip_elements(node, "CommentsCorrectionsList")
+        namespace = "http://www.w3.org/1998/Math/MathML"
+        mml_math = "{{{}}}math".format(namespace)
+        namespaces = dict(mml=namespace)
+        for child in node.xpath("//mml:math", namespaces=namespaces):
+            if child.tail is None:
+                child.tail = "[formula]"
+            else:
+                child.tail = "[formula]" + child.tail
+        etree.strip_elements(node, mml_math, with_tail=False)
         return node

 #----------------------------------------------------------------------
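
For readers who want to try the substitution outside the CDR, here is a minimal standalone sketch of the same idea. It is not the patched CiteSearch.py code: it swaps lxml for the standard library's xml.etree.ElementTree so it runs without third-party packages, and the function name replace_math and the sample document are hypothetical. It only looks beneath the element passed in, whereas an XPath beginning with "//" (as in the patch) is evaluated from the document root.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
MATH_TAG = "{%s}math" % MATHML_NS
PLACEHOLDER = "[formula]"


def replace_math(root, placeholder=PLACEHOLDER):
    """Replace each MathML <math> element under root with placeholder text.

    Hypothetical helper for illustration; the surrounding text (including
    the tail of each removed element) is preserved.
    """
    for parent in list(root.iter()):
        previous = None  # last surviving sibling seen before the current child
        for child in list(parent):
            if child.tag != MATH_TAG:
                previous = child
                continue
            # The placeholder plus the element's tail takes the element's place.
            text = placeholder + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root


# Made-up sample standing in for a PubMed record with MathML in the title.
SAMPLE = (
    '<PubmedArticle xmlns:mml="http://www.w3.org/1998/Math/MathML">'
    "<ArticleTitle>Decay of <mml:math><mml:mi>x</mml:mi></mml:math>"
    " in <i>vivo</i> and <mml:math><mml:mn>2</mml:mn></mml:math></ArticleTitle>"
    "</PubmedArticle>"
)

article = replace_math(ET.fromstring(SAMPLE))
title = article.find("ArticleTitle")
print("".join(title.itertext()))
# Decay of [formula] in vivo and [formula]
```

ElementTree elements have no parent pointers, so the sketch walks each parent's children and appends the placeholder to the preceding sibling's tail (or to the parent's text), which is the stdlib counterpart of setting child.tail and calling strip_elements(..., with_tail=False) in lxml.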
Comment entered 2018-05-02 16:19:42 by Kline, Bob (NIH/NCI) [C]

Patch has been applied on DEV. Users will not be able to test against a document imported from NLM with MathML blocks, because there aren't supposed to be any before June 1. All I'm looking for is verification that citation importing isn't broken. This should be taken care of by the testing for OCECDR-4463, which will be applied together with this patch before June 1. If you want to do more testing, I won't discourage you. :-)

https://github.com/NCIOCPL/cdr-admin/commit/57adc311

Comment entered 2018-05-18 15:38:29 by Kline, Bob (NIH/NCI) [C]

This can be closed, I think, as it's on PROD. You won't be able to test it.

Comment entered 2018-05-29 12:55:53 by Juthe, Robin (NIH/NCI) [E]

Agreed. We are importing citations without a problem on PROD.
