EBMS Tickets

Issue Number 434
Summary Error while importing PMID 28415774
Created 2017-05-03 14:32:19
Issue Type Bug
Submitted By trivedim
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2017-05-30 12:14:48
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.207553
Description

I would like to report another error encountered yesterday while importing batch file. While importing "prostate cancer nutrition and dietary supplement" file to IACT Board, I got the attached error. Upon investigation, I found first 5 citations imported from the file. I removed 6th citation, PMID 28415774 from the file and imported the rest without any problem. I tried import PMID 28415774 it again today morning as a text file as well as PMID and still get the same error.
Minaxi

Comment entered 2017-05-03 15:10:44 by Kline, Bob (NIH/NCI) [C]

I did some digging and found out that the database character set used by the EBMS ("utf8") is not what the DBMS (MySQL) documentation claims it to be ("UTF-8 UNICODE"). In actual fact, that character set only supports a subset of Unicode characters. PMID 28415774 contains a character ("mathematical sans-serif bold italic small chi" (U+1D7C0)) which is not supported by that broken character set. Since the EBMS was originally implemented, CBIIT has upgraded to a version of MySQL which supports Unicode properly. In order to use that support (and be able to import this article), we will need to convert our table to use the fixed character set ("utf8mb4"). For more information see

https://mathiasbynens.be/notes/mysql-utf8mb4
http://stackoverflow.com/questions/202205/how-to-make-mysql-handle-utf-8-properly
http://stackoverflow.com/questions/279170/utf-8-all-the-way-through

Comment entered 2017-05-03 15:32:14 by Kline, Bob (NIH/NCI) [C]

Testing the following DDL on DEV.

ALTER TABLE ebms_article CHANGE abstract abstract TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE ebms_article CHANGE source_data source_data LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Comment entered 2017-05-04 09:48:27 by Kline, Bob (NIH/NCI) [C]

I haven't been able to get the DDL approach to work on MySQL. So I came up with a different solution, which scrubs the UTF-8 string to replace all four-byte encodings with an &#xxxxxx; entity. Tested successfully on DEV. We'll want to make sure this makes it into the next EBMS release.

Comment entered 2017-05-30 12:14:48 by Kline, Bob (NIH/NCI) [C]

Workaround installed on DEV.

Comment entered 2017-06-21 14:18:16 by Juthe, Robin (NIH/NCI) [E]

I was able to import the citation in question (28415774) above on QA, so this appears to be working well.

Comment entered 2017-06-28 17:11:34 by Juthe, Robin (NIH/NCI) [E]

Verified on QA.

Comment entered 2017-08-28 15:03:43 by Juthe, Robin (NIH/NCI) [E]

, you should now be able to import PMID: 28415774 on PROD if this is still a citation you're interested in reviewing.

Attachments
File Name Posted User
ebms import error May2_2017.docx 2017-05-03 14:31:59
prostate_may17_problem cit.txt 2017-05-03 14:31:30

Elapsed: 0:00:00.000672