EBMS Tickets

Issue Number 433
Summary Not able to import PMID 28448241 in the EBMS
Created 2017-05-01 14:23:00
Issue Type Bug
Submitted By trivedim
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2017-05-30 12:14:16
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.207450
Description

While importing citations using the core journals tag, I could import 29 citations without a problem. But while importing PMID 28448241, I got the message 0 article imported; 1 article with errors.
Minaxi

Comment entered 2017-05-01 14:25:57 by Kline, Bob (NIH/NCI) [C]

Was this a one-time failure? Or did you try more than once for that article, getting the same failure message?

Comment entered 2017-05-01 16:08:10 by trivedim

I tried 3 times, and got the same message.
Minaxi

Comment entered 2017-05-01 16:18:53 by Kline, Bob (NIH/NCI) [C]

OK, thanks. Let me see if I can figure out what's going on.

Comment entered 2017-05-01 18:40:46 by Kline, Bob (NIH/NCI) [C]

It looks as if NLM has started doing something pretty bizarre with the Affiliation element for each Author block. Instead of just listing each author's affiliation inside his or her Author block, NLM is now listing the affiliations for all the authors over and over again, repeating the same information inside each Author block. This particular article has 177 authors (which in itself defies logic: how can 177 different people contribute meaningfully to the authorship of a single article?). The affiliation statement for the authors runs to over 8,000 characters. That's getting repeated 177 times, which means that the affiliation blocks alone have over a million and a half characters, which is insane (remember, this is just a single article). That's breaking the PHP regular expression engine. We'll want to talk with NLM to find out what they were thinking when they introduced this wildly redundant approach to conveying the affiliation information. In the meantime, we'll try and see if there's some way to work around this silliness.
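A rough check of the arithmetic above, as a sketch only. The ticket says the affiliation statement runs to "over 8,000" characters without giving an exact length, so the 8,500 figure below is an assumed value chosen for illustration, and the function name is hypothetical.

```python
def repeated_affiliation_chars(author_count, affiliation_chars):
    """Total characters used when the same affiliation text is repeated once per author."""
    return author_count * affiliation_chars


# 177 authors comes from the ticket; 8,500 is an assumed length ("over 8,000").
print(repeated_affiliation_chars(177, 8500))
```

At exactly 8,000 characters per block the total would be 1,416,000, so a total above a million and a half implies a statement of roughly 8,500 characters or more.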

Comment entered 2017-05-02 09:25:57 by Kline, Bob (NIH/NCI) [C]

I have reported the problem to NLM, asking if it is necessary to inflate the XML documents for the articles with the redundant information. I have two workarounds which will enable us to import the bloated documents. The first bumps up the PHP limits on regular expressions; it is a temporary solution, which will break again with a bigger document, and I have tested it on QA. The second is a more robust solution which replaces regular expressions with string scanning, tested on DEV. I will ask CBIIT to install the short-term workaround on the production server. That should allow us to import the 177-"author" behemoth.
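The idea behind the second workaround (plain string scanning in place of regular expressions) can be sketched as follows. This is an illustration only: the actual EBMS fix is PHP code that does not appear in this ticket, the example is written in Python, and the function name and the simplifying assumptions (no self-nested or self-closing elements) are hypothetical.

```python
def element_bodies(xml, tag):
    """Return the text between each <tag ...> and </tag>, found by plain scanning.

    Assumes the element is never nested inside itself and is not self-closing.
    No regular expressions are used, so no regex engine limits apply.
    """
    open_prefix = "<" + tag
    close_tag = "</" + tag + ">"
    bodies = []
    pos = 0
    while True:
        start = xml.find(open_prefix, pos)
        if start == -1:
            break
        after_name = start + len(open_prefix)
        # Skip longer names sharing the prefix, e.g. <AuthorList> when looking for <Author>.
        if after_name >= len(xml) or xml[after_name] not in "> \t\r\n":
            pos = after_name
            continue
        body_start = xml.find(">", after_name)
        if body_start == -1:
            break
        body_start += 1
        end = xml.find(close_tag, body_start)
        if end == -1:
            break
        bodies.append(xml[body_start:end])
        pos = end + len(close_tag)
    return bodies


# Made-up sample with one oversized Affiliation block (8,500 characters is an assumed size).
sample = (
    '<AuthorList><Author ValidYN="Y"><LastName>Doe</LastName>'
    "<Affiliation>" + "X" * 8500 + "</Affiliation></Author>"
    "<Author><LastName>Roe</LastName></Author></AuthorList>"
)
print(len(element_bodies(sample, "Author")))
print(element_bodies(sample, "LastName"))
```

Because each lookup is a simple forward search, the cost grows linearly with document size, which is why this kind of approach keeps working on larger documents where a raised regex limit would eventually be exceeded again.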

Comment entered 2017-05-03 07:58:24 by Kline, Bob (NIH/NCI) [C]

NLM responded, acknowledging that the current behavior is a problem. They said they'll work with the publishers to get it fixed (implying, I assume, that they don't have complete control themselves over the data in their database). In the meantime, I'm still waiting for CBIIT to install our short-term workaround. Will keep you posted.

Comment entered 2017-05-03 10:14:57 by trivedim

I would like to report another error encountered yesterday while importing a batch file. While importing the "prostate cancer nutrition and dietary supplement" file to the IACT Board, I got the attached error. Upon investigation, I found that the first 5 citations had been imported from the file. I removed the 6th citation, PMID 28415774, from the file and imported the rest without any problem. I tried importing PMID 28415774 again this morning, both as a text file and by PMID, and still get the same error.
Minaxi

Comment entered 2017-05-03 11:54:18 by Kline, Bob (NIH/NCI) [C]

I'm investigating this second problem, which does not appear to be related to the original problem for this ticket. If I confirm that they're unrelated, I'll ask you to open a second ticket for the second failure. Will keep you posted.

Comment entered 2017-05-03 14:27:42 by Kline, Bob (NIH/NCI) [C]

This is a completely separate problem, as I suspected. Please open a new ticket.

Comment entered 2017-05-03 14:33:12 by trivedim

Done 🙂
Minaxi

Comment entered 2017-05-30 12:14:16 by Kline, Bob (NIH/NCI) [C]

Permanent workaround installed on DEV.

Comment entered 2017-06-21 14:14:49 by Juthe, Robin (NIH/NCI) [E]

Bob, how do you recommend testing this one on QA? We know we can import the citation in question (28448241) since you have already done so. Should we find another citation with a lot of authors?

Comment entered 2017-06-21 14:24:13 by Kline, Bob (NIH/NCI) [C]

That's probably the best you can do. It's possible that NLM fixed the problem on their end. It's possible that the publisher(s) fixed it for them. It's possible that other articles with lots of authors weren't handled the same bizarre way as the one which caused the problem. Short answer: there's not much you can do in the way of rock-solid definitive testing. I think it's very unlikely we'll run into this problem again.

Comment entered 2017-06-21 14:29:10 by Juthe, Robin (NIH/NCI) [E]

Minaxi/Cynthia, have you encountered any other citations recently with over 100 authors? (the one in question above had 177 authors)

Comment entered 2017-06-26 17:35:21 by trivedim

No, I have not encountered any other citation with too many authors. I am sure this citation was an exception.
Minaxi

Comment entered 2017-06-28 17:12:58 by Juthe, Robin (NIH/NCI) [E]

I'm considering this verified on QA. I haven't come up with a definitive test, but the citation in question can be imported and it's unlikely we'll run into this again.

Comment entered 2017-08-28 15:07:35 by Juthe, Robin (NIH/NCI) [E]

I imported PMID: 28448241 on PROD. I'm considering this verified on PROD.

Attachments
File Name Posted User
ebms import error May2_2017.docx 2017-05-03 10:13:03
prostate_may17_problem cit.txt 2017-05-03 10:13:54
