Issue Number | 4434 |
---|---|
Summary | [Citations] NLM Changes to Math Formulas (Deadline: June 1) |
Created | 2018-03-08 16:14:37 |
Issue Type | Improvement |
Submitted By | Juthe, Robin (NIH/NCI) [E] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2018-05-02 16:19:42 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.222395 |
Per Bob's email below, NLM will change the way math formulas are presented in article titles on June 1. We expect this change to affect very few (if any) citations, so we've decided to go with option #2 from that email.
----------
From: Bob Kline <bkline@rksystems.com>
Sent: Friday, March 2, 2018 11:52 AM
To: Beckwith, Margaret (NIH/NCI) [E] <mbeckwit@icic.nci.nih.gov>;
Juthe, Robin (NIH/NCI) [E] <robin.juthe@nih.gov>; Englisch, Volker
(NIH/NCI) [C] <volker@mail.nih.gov>; Osei-Poku, William
<William.Osei-Poku@icf.com>
Cc: Dugan, Amy (NIH/NCI) [C] <amy.dugan@nih.gov>; Sun, Victoria
(NIH/NCI) [C] <victoria.sun@nih.gov>; Bennett, Cameron (NIH/NCI)
[C] <cameron.bennett@nih.gov>
Subject: Fwd: [Utilities-announce] PubMed E-Utilities DTD Mid-year
Update June 2018
Hi, all. Just got this notice from NLM for a fairly disruptive change
in the Citations DTD. They're adding an <mml:math> element block to a
bunch of elements, including the ArticleTitle element. Part of the
disruption is that they are – for the first time – adding element names
(and possibly attributes) with namespaces. More problematic would be
handling of things like titles. I haven't really delved into the new DTD
files, but off the top of my head we have two basic options:
1. undertake an extensive project to modify our schemas, QC reports,
filters, XMetaL DTDs, etc. to accommodate this change; or
2. modify the citation import script to replace all of the
<mml:math> elements in the incoming documents with the text
"[formula]".
I recommend the second option, which is basically what NLM used to do,
if I understand correctly. Let's discuss at our next status meeting.
This change will have an impact not just on the CDR but also the EBMS, I
would think.
Thanks,
Bob
---------- Forwarded message ----------
From: <utilities-announce@ncbi.nlm.nih.gov>
Date: Fri, Mar 2, 2018 at 11:22 AM
Subject: [Utilities-announce] PubMed E-Utilities DTD Mid-year Update
June 2018
To: NLM/NCBI List utilities-announce
<utilities-announce@ncbi.nlm.nih.gov>
Dear NCBI PubMed E-Utilities Users,
On June 1, 2018, we will begin displaying formulas in citation titles, abstracts, and keywords in PubMed. Today, formulas are replaced with [Formula: see text]. With this enhancement to PubMed, you will see formulas in the PubMed summary and abstract displays when these data are available. We will also be including the MathML 3.0 element tags in PubMed XML.
To support the addition of MathML tagging in our XML, we have created a new, forthcoming DTD which will be in use as of June 1, 2018. You can download the forthcoming DTD for June 2018 now. Existing content will be valid against the new DTD. You can also download sample XML files with MathML 3.0 tags.
Thank you,
PubMed Development Team
--- CiteSearch.py 2018-04-02 19:03:13.141709800 -0400
+++ CiteSearch-ocecdr-4434.py 2018-04-04 15:50:32.789876700 -0400
@@ -5,6 +5,7 @@
 # BZIssue::5174
 # JIRA::OCECDR-3456
 # JIRA::OCECDR-4201
+# JIRA::OCECDR-4434
 #----------------------------------------------------------------------
 import cgi, cdr, cdrcgi, cdrdb, requests, sys, lxml.etree as etree
@@ -103,6 +104,15 @@
         cdrcgi.bail("unable to parse document from NLM")
     for node in tree.findall("PubmedArticle"):
         etree.strip_elements(node, "CommentsCorrectionsList")
+        namespace = "http://www.w3.org/1998/Math/MathML"
+        mml_math = "{{{}}}math".format(namespace)
+        namespaces = dict(mml=namespace)
+        for child in node.xpath("//mml:math", namespaces=namespaces):
+            if child.tail is None:
+                child.tail = "[formula]"
+            else:
+                child.tail = "[formula]" + child.tail
+        etree.strip_elements(node, mml_math, with_tail=False)
         return node
 #----------------------------------------------------------------------
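For anyone who wants to exercise the replacement logic before NLM starts sending real MathML, here is a minimal standalone sketch of the same idea. It is not the patch itself: it assumes Python 3 and uses the standard library's xml.etree.ElementTree instead of lxml, and the names replace_math, MML_MATH, and PLACEHOLDER, as well as the sample title, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# MathML namespace URI, as used in the patch above.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
MML_MATH = "{%s}math" % MATHML_NS   # qualified tag name for mml:math
PLACEHOLDER = "[formula]"           # replacement text chosen in option 2


def replace_math(root, placeholder=PLACEHOLDER):
    """Replace each mml:math element under root with placeholder text.

    Hypothetical helper: edits the tree in place and keeps the text
    that followed each removed element (its tail).
    """
    for parent in list(root.iter()):
        previous = None  # last sibling kept before the current child
        for child in list(parent):
            if child.tag != MML_MATH:
                previous = child
                continue
            text = placeholder + (child.tail or "")
            if previous is None:
                # No earlier sibling: the text belongs to the parent.
                parent.text = (parent.text or "") + text
            else:
                # Otherwise it follows the previous sibling.
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root


# Invented sample title; not taken from a real PubMed record.
sample = (
    '<ArticleTitle xmlns:mml="http://www.w3.org/1998/Math/MathML">'
    'Bounds on <mml:math><mml:mi>x</mml:mi></mml:math> in tumors'
    '</ArticleTitle>'
)
title = replace_math(ET.fromstring(sample))
print("".join(title.itertext()))  # Bounds on [formula] in tumors
```

One difference worth noting: the "//mml:math" query in the patch is evaluated by lxml from the document root rather than from node, whereas this sketch only walks the subtree it is handed.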
Patch has been applied on DEV. Users will not be able to test against a document imported from NLM with MathML blocks, because there aren't supposed to be any before June 1. All I'm looking for is verification that citation importing isn't broken. This should be taken care of by the testing for OCECDR-4463, which will be applied together with this patch before June 1. If you want to do more testing, I won't discourage you. :-)
This can be closed, I think, as it's on PROD. You won't be able to test it.
Agreed. We are importing citations without a problem on PROD.