Issue Number | 433 |
---|---|
Summary | Not able to import PMID 28448241 in the EBMS |
Created | 2017-05-01 14:23:00 |
Issue Type | Bug |
Submitted By | trivedim |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2017-05-30 12:14:16 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/oceebms/issue.207450 |
While importing citations using the core journals tag, I could import 29
citations without a problem. But while importing PMID 28448241, I got the
message "0 article imported; 1 article with errors."
Minaxi
Was this a one-time failure? Or did you try more than once for that article, getting the same failure message?
I tried 3 times, and got the same message.
Minaxi
OK, thanks. Let me see if I can figure out what's going on.
It looks as if NLM has started doing something pretty bizarre with the Affiliation element for each Author block. Instead of just listing each author's affiliation inside his or her Author block, NLM is now listing the affiliations for all the authors over and over again, repeating the same information inside each Author block. So this particular article has 177 authors (which in itself defies logic: how can 177 different people contribute meaningfully to the authorship of a single article?). The affiliation statement for the authors runs to over 8,000 characters. That's getting repeated 177 times, which means that just the affiliations blocks have over a million and a half characters, which is insane (remember, this is just a single article). That's breaking the PHP regular expression engine. We'll want to talk with NLM to find out what they were thinking when they introduced this wildly redundant approach to conveying the affiliation information. In the meantime, we'll try and see if there's some way to work around this silliness.
I have reported the problem to NLM, asking if it is necessary to inflate the XML documents for the articles with the redundant information. I have two workarounds which will enable us to import the bloated documents, one of which bumps up the PHP limits on regular expressions (a temporary solution, which will break again with a bigger document) and which I have tested on QA, and a more robust solution which replaces regular expressions with string scanning, tested on DEV. I will ask CBIIT to install the short-term workaround on the production server. That should allow us to import the 177-"author" behemoth.
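To illustrate the general idea behind the more robust workaround (plain string scanning instead of regular expressions), here is a minimal sketch in Python. It is not the EBMS code, which is PHP and is not shown in this ticket. The function name `extract_elements`, the sample document, and the element names used in it are assumptions made for illustration only.

```python
def extract_elements(xml: str, tag: str) -> list[str]:
    """Collect the text between each <tag ...> and </tag> by string scanning.

    Hypothetical helper: it uses only str.find, so no regex engine limits
    apply. It assumes elements of this name are not nested inside each other
    and that attribute values contain no '>' character.
    """
    open_prefix = f"<{tag}"
    close_tag = f"</{tag}>"
    results = []
    pos = 0
    while True:
        start = xml.find(open_prefix, pos)
        if start == -1:
            break
        after = start + len(open_prefix)
        # Skip longer tag names that merely share the prefix.
        if after >= len(xml) or xml[after] not in ">/ \t\r\n":
            pos = after
            continue
        tag_end = xml.find(">", after)
        if tag_end == -1:
            break
        if xml[tag_end - 1] == "/":
            # Self-closing element: nothing to collect.
            pos = tag_end + 1
            continue
        body_start = tag_end + 1
        end = xml.find(close_tag, body_start)
        if end == -1:
            break
        results.append(xml[body_start:end])
        pos = end + len(close_tag)
    return results


# Made-up sample: the same affiliation text repeated in every Author block.
sample = (
    "<Author><LastName>Doe</LastName><AffiliationInfo>"
    "<Affiliation>Site A; Site B</Affiliation>"
    "</AffiliationInfo></Author>"
) * 3

affiliations = extract_elements(sample, "Affiliation")
# dict.fromkeys drops the repeats while keeping first-seen order.
unique = list(dict.fromkeys(affiliations))
```

Each `str.find` call is a simple forward search and the scan position only ever moves forward, so the approach does not depend on regex backtracking or recursion limits, whatever the size of the document.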
NLM responded, acknowledging that the current behavior is a problem. They said they'll work with the publishers to get it fixed (implying, I assume, that they don't have complete control themselves over the data in their database). In the meantime, I'm still waiting for CBIIT to install our short-term workaround. Will keep you posted.
I would like to report another error encountered yesterday while
importing a batch file. While importing the "prostate cancer nutrition and
dietary supplement" file to the IACT Board, I got the attached error. Upon
investigation, I found that the first 5 citations from the file had been
imported. I removed the 6th citation, PMID 28415774, from the file and
imported the rest without any problem. I tried importing PMID 28415774 again
this morning, both as a text file and by PMID, and still got the same error.
Minaxi
I'm investigating this second problem, which does not appear to be related to the original problem for this ticket. If I confirm that they're unrelated, I'll ask you to open a second ticket for the second failure. Will keep you posted.
This is a completely separate problem, as I suspected. Please open a new ticket.
Done 🙂
Minaxi
Permanent workaround installed on DEV.
Bob, how do you recommend testing this one on QA? We know we can import the citation in question (28448241) since you have already done so. Should we find another citation with a lot of authors?
That's probably the best you can do. It's possible that NLM fixed the problem on their end. It's possible that the publisher(s) fixed it for them. It's possible that other articles with lots of authors weren't handled the same bizarre way as the one which caused the problem. Short answer: there's not much you can do in the way of rock-solid definitive testing. I think it's very unlikely we'll run into this problem again.
Minaxi/Cynthia, have you encountered any other citations recently with over 100 authors? (the one in question above had 177 authors)
No, I have not encountered any other citation with too many authors.
I am sure this citation was an exception.
Minaxi
I'm considering this verified on QA. I haven't come up with a definitive test, but the citation in question can be imported and it's unlikely we'll run into this again.
I imported PMID: 28448241 on PROD. I'm considering this verified on PROD.
File Name | Posted | User |
---|---|---|
ebms import error May2_2017.docx | 2017-05-03 10:13:03 | |
prostate_may17_problem cit.txt | 2017-05-03 10:13:54 | |