Issue Number | 434 |
---|---|
Summary | Error while importing PMID 28415774 |
Created | 2017-05-03 14:32:19 |
Issue Type | Bug |
Submitted By | trivedim |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2017-05-30 12:14:48 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/oceebms/issue.207553 |
I would like to report another error encountered yesterday while
importing batch file. While importing "prostate cancer nutrition and
dietary supplement" file to IACT Board, I got the attached error. Upon
investigation, I found first 5 citations imported from the file. I
removed 6th citation, PMID 28415774 from the file and imported the rest
without any problem. I tried import PMID 28415774 it again today morning
as a text file as well as PMID and still get the same error.
Minaxi
I did some digging and found out that the database character set used by the EBMS ("utf8") is not what the DBMS (MySQL) documentation claims it to be ("UTF-8 UNICODE"). In actual fact, that character set only supports a subset of Unicode characters. PMID 28415774 contains a character ("mathematical sans-serif bold italic small chi" (U+1D7C0)) which is not supported by that broken character set. Since the EBMS was originally implemented, CBIIT has upgraded to a version of MySQL which supports Unicode properly. In order to use that support (and be able to import this article), we will need to convert our table to use the fixed character set ("utf8mb4"). For more information see
https://mathiasbynens.be/notes/mysql-utf8mb4
http://stackoverflow.com/questions/202205/how-to-make-mysql-handle-utf-8-properly
http://stackoverflow.com/questions/279170/utf-8-all-the-way-through
Testing the following DDL on DEV.
ALTER TABLE ebms_article CHANGE abstract abstract TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE ebms_article CHANGE source_data source_data LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
I haven't been able to get the DDL approach to work on MySQL. So I came up with a different solution, which scrubs the UTF-8 string to replace all four-byte encodings with an &#xxxxxx; entity. Tested successfully on DEV. We'll want to make sure this makes it into the next EBMS release.
Workaround installed on DEV.
I was able to import the citation in question (28415774) above on QA, so this appears to be working well.
Verified on QA.
~trivedim, you should now be able to import PMID: 28415774 on PROD if this is still a citation you're interested in reviewing.
File Name | Posted | User |
---|---|---|
ebms import error May2_2017.docx | 2017-05-03 14:31:59 | |
prostate_may17_problem cit.txt | 2017-05-03 14:31:30 |
Elapsed: 0:00:00.000672