PDQ Issues

Issue Number	684
Summary	PubMed stores HTML markup in the ArticleTitle element
Created	2023-01-07 08:34:07
Issue Type	Improvement
Submitted By	Kline, Bob (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2023-01-09 12:13:25
Resolution	Fixed
Path	/home/bkline/backups/jira/oceebms/issue.335897

Description

While I was working on a pair of tickets for a problem caused by changes in the database configuration, I noticed that some of NLM's PubMed article documents have HTML markup stored as part of the title. For example:

EBMS 3 is inconsistent in this regard, but it appears to have the problem, too.

So it's not necessarily something which has to be addressed as part of the Drupal 9 upgrade, but I wanted to capture the issue in a ticket for possible remediation at some point in the future.

It's a bit dirty to store formatting markup in an XML document intended for storing semantic information, and it puts us in a situation with the potential for security risks.

Our options for dealing with the problem would include ...

Do what we're doing now, which is to escape the title. This is the least expensive and safest choice (but leaves us with some ugliness and sorting anomalies).
Like the first option, but strip the markup from the normalized version of the title we use for searching and sorting.
Strip the markup from both copies of the title.
Store it as we get it (as we do now) and give it to the browser without escaping. There's some risk in this, as it's possible we could get malicious markup. The probability of this happening is low, but we'd be at the mercy of the assumption that NLM will always do the right thing, ensuring that such markup never makes it to us.
Same as the previous option, but also strip out the markup from the normalized value used for searching and sorting.
One of the previous two options, with the addition of some additional processing each time a document is imported to verify that all markup is safe, at the cost of performance and added complexity in our software.
Like the previous option, but perform the check every time we display the title instead of when we store it, also with the same costs as the previous option, but those costs would be higher.

There may be other variations on these options, but seven is enough to start with, I think. At this point I'm inclined to go with #5 (with #3 as my second choice), as I believe the risk is low enough. NLM does have a DTD, after all, which should—in theory, at least—prevent malicious markup getting through.

Comment entered 2023-01-07 08:35:40 by Kline, Bob (NIH/NCI) [C]

Added watchers.

Comment entered 2023-01-09 07:25:49 by Kline, Bob (NIH/NCI) [C]

It looks like NLM is itself not consistent about whether it escapes the inline markup it uses. Also, I see that we're losing some of the text inside that markup, so we can't defer addressing this issue until after Everglades. I have a solution implemented and tested in a private docker instance of the site, and I'm deploying it on ebms.rksystems.com, updating the articles, and testing there.

Comment entered 2023-01-09 12:13:25 by Kline, Bob (NIH/NCI) [C]

Well, that was a kick in the pants! I had just about finished a very long summary of the status for this ticket, when my GFE laptop decided to crash, so I'm starting all over again. Let's hope I can remember everything I wrote the first time. 😛

The inline markup problem also affects the abstracts. I have modified the import software to preserve the values as HTML instead of plain text, and I have modified all the templates which render article titles and/or abstracts to avoid escaping the markup. That doesn't mean you'll never see escaped markup in a title or abstract, but when you do, it's because that's what NLM gave us. I just got a response to the ticket I opened with NLM, and they tell me that the escaped markup is caused by flawed submissions by the publishers, which eventually get noticed and corrected. Part of my ticket asked why we're not getting notification of such changes when the corrections are made. That discussion is still in progress.

I have installed the modifications on https://ebms.rksystems.com, and have reloaded the affected articles from the original XML. Note that the most common markup element ({}<i>...</i>{}) does not have any visible effect in the titles, since we already display the entire title in italics, following standard bibliographic procedures, but you'll see the effect of that element in the abstracts which use it. The most important inline markup, and the real incentive to avoid solutions which just strip the markup tags out of the text, are the <sup> and <sub> elements, as those can change the meaning of the text much more significantly than bold or italics.

Comment entered 2023-01-09 15:54:44 by Kline, Bob (NIH/NCI) [C]

Part of my ticket asked why we're not getting notification of such changes when the corrections are made. That discussion is still in progress.

I created ticket OCEEBMS-687 for that problem.

Comment entered 2023-04-05 15:03:53 by Boggess, Cynthia (NIH/NCI) [C]

As an FYI, only 2 citations since we went live with EBMS4 have shown up with html tags in the title. This includes two review cycles (March and April) imported after going live. EBMS ID: 903062 & EBMS ID: 901284. Screenshot included.

Attachments

File Name	Posted	User
abstract-inline-markup.png	2023-01-09 12:14:05	Kline, Bob (NIH/NCI) [C]
image-2023-01-07-08-05-41-619.png	2023-01-07 08:05:42	Kline, Bob (NIH/NCI) [C]
image-2023-01-07-08-08-01-471.png	2023-01-07 08:08:02	Kline, Bob (NIH/NCI) [C]
image-2023-04-05-15-02-07-970.png	2023-04-05 15:02:08	Boggess, Cynthia (NIH/NCI) [C]
title-inline-markup.png	2023-01-09 12:14:31	Kline, Bob (NIH/NCI) [C]

Elapsed: 0:00:00.000961

EBMS Tickets