EBMS Tickets

Issue Number 74
Summary [Printing] Non-ascii characters not printing properly in author names in response sheets
Created 2013-09-17 16:36:22
Issue Type Bug
Submitted By alan
Assigned To alan
Status Closed
Resolved 2015-12-15 19:36:35
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.113397
Description

OCEDRUEBMS entered 2013-08-13 by Alan Meyer (in Open status)

Bonnie reported on 8/9/2013 that a Pubmed author with an umlaut in his name was not printed properly in the PDF format response sheet. See for example EBMS ID 284577, Pubmed ID 22864112.

I will attach this example to the issue.

Comment entered 2013-10-24 23:33:07 by alan

It turns out that this issue exposes another issue in the EBMS.

Two different algorithms were used in constructing brief citations.  One
algorithm was implemented in Python and used during conversion and the other
was, and is still, used in the EBMS import function.

The Python code stored data in Unicode as utf-8.  The content of the brief
citation was constructed as:

    First author surname; brief journal title; year; Pubmed ID

The PHP code in the import program stored data in ascii using our utf-8 to
ascii conversion.  The content was constructed as:

    Brief journal title, date; volume(issue);pages
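
The two layouts above can be sketched side by side. This is only an
illustration in Python of the formats as described in this comment, not the
actual conversion or import code; the function names and the sample values
(other than the Pubmed ID from this ticket) are made up.

```python
def brief_cite_conversion_style(surname, journal, year, pmid):
    """Layout attributed above to the Python conversion code (hypothetical helper)."""
    return "; ".join([surname, journal, str(year), str(pmid)])


def brief_cite_import_style(journal, date, volume, issue, pages):
    """Layout attributed above to the PHP import code (hypothetical helper)."""
    return f"{journal}, {date}; {volume}({issue});{pages}"


# Placeholder values for illustration only; "J Example" is not a real journal.
print(brief_cite_conversion_style("Höckel", "J Example", 2012, 22864112))
print(brief_cite_import_style("J Example", "2012 Jan", "1", "2", "34-56"))
```

The same article produces two unrelated strings, which is why a stored brief
citation cannot be relied on to have one shape.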

One reason for this discrepancy is that, at a certain stage of our initial
software development, we were doing whatever needed to be done to get things
rolling without sufficient coordination with each other or with users.  But
that's water under the bridge.

It appears that my response sheet printing program is the only user of the
stored brief citation.  Other planned uses were abandoned when it turned out
that users wanted different format brief citations for different uses.  So we
can abandon the stored brief citations if we wish and only have to fix one
program.

Here are the issues as I currently see them:

 1. Should we continue to use the stored brief citation for printing?

    We could toss the field altogether and construct brief citations on the
    fly from XML as needed.  We do it elsewhere.

    If we keep them, we might need to convert all to one standard.  This would
    presumably require (our first?) global change, and might be a good reason
    to just abandon them and get the brief cites from the XML.

 2. What do we want the brief cite to look like on a response sheet?

 3. Should they use non-English characters or just use plain U.S. ASCII?

    Unicode isn't supported by the off-the-shelf FPDF.  There are some people
    who have addressed this problem but I'm not sure how well supported their
    solutions are.  For example I found this in a 2004 posting on the net
    regarding one of the solutions:

        "Note: I wrote UFPDF as an experiment, not as a finished product. If
        you have problems using it, don't bug me for support ... I don't have
        much time to maintain this."

    We can probably do very well with brief citations by translating our
    response sheet output into ISO 8859-1, which handles English, French,
    German, Spanish, Italian, Swedish, and a number of other languages.  This
    probably handles almost all Pubmed data for brief citations, where we may
    have few or no Greek or Cyrillic chars, superscripts, etc.
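
As a rough sketch of the ISO 8859-1 idea in item 3, here is what the
translation step might look like in Python. It assumes the text is held as
Unicode and that the PDF library will accept Latin-1 encoded bytes; the
function name is hypothetical and this is not the EBMS printing code.

```python
def to_latin1(text):
    """Encode text as ISO 8859-1, putting "?" in place of anything outside it."""
    # errors="replace" makes the codec emit "?" instead of raising an exception.
    return text.encode("latin-1", errors="replace")


print(to_latin1("Höckel"))   # the o-umlaut is in Latin-1, so it survives as one byte
print(to_latin1("Dvořák"))   # r-caron is not in Latin-1, so it becomes "?"
```

West European accents come through intact, and only the characters outside
Latin-1 are degraded.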

We should discuss all this before I write any software.

Comment entered 2013-10-24 23:34:22 by alan

Bonnie Ferguson was the person who reported this problem to me and I created the issue some time ago. I tried to add her as a watcher but JIRA said she didn't have permission.

Comment entered 2013-12-24 07:42:40 by Kline, Bob (NIH/NCI) [C]

Added Bonnie as a watcher.

Comment entered 2015-08-13 21:55:01 by alan

We're coming up on two years since this issue was opened. Since we're hoping to start doing EBMS work again soon, users might want to review this and decide whether the issue should be taken off hold and worked on. If so, we'll need to think about priorities.

There are different ways to work on it but I'm thinking it is likely to be a medium difficulty issue - not trivial but not real hard, especially if we decide to only support Roman alphabets with West European accent marks. To do more than that could be real hard but might be completely unnecessary.

Comment entered 2015-12-11 00:39:18 by alan

I discussed this with Bonnie today and then did some work on it. I did some experiments and tried a simple approach to solving the problem. It may have worked, but to be sure I need to generate some more test data by manipulating printable packets to contain data with character set problems. I did appear to demonstrate to myself that our EBMS PDF creation software package can handle the Latin-1 character set gracefully.

I also did some studies of the authors in our database and, if I understand what I saw correctly, there may only be 138 authors' names that can't be represented in Latin-1. They appear to mainly be East European names with a sprinkling of names that just look like plain errors, e.g., "O?Reilly" where the question mark is some non-Latin-1 character. Possibly some non-standard characters from the Windows cp1252 set have crept into the data at NLM.

The actual practical problem may be smaller than 138 names because the non-Latin-1 characters sometimes appear in first names, which are not always printed, for example when the printout just has the last name and initials of the author. There might only be 100 or so names that cause actual problems.
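
A check of this kind can be sketched in a few lines of Python. This is not the query that produced the 138 figure; it only illustrates the test "can this name be represented in Latin-1?". The function names and sample names are hypothetical, and the curly apostrophe in the O'Reilly example is a guess at the sort of cp1252 character that might be involved.

```python
def fits_latin1(name):
    """Return True if every character of name exists in ISO 8859-1."""
    try:
        name.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def names_outside_latin1(names):
    """Return the names that cannot be represented in Latin-1."""
    return [name for name in names if not fits_latin1(name)]


# Made-up sample; \u2019 is the right single quotation mark (cp1252 byte 0x92).
sample = ["Höckel", "O\u2019Reilly", "Dvořák", "Smith"]
print(names_outside_latin1(sample))
```

Running the same check on surnames only, rather than full names, would give the smaller count of names that actually cause trouble in printouts showing last name and initials.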

If the Latin-1 strategy works, I can make Eastern European names less ugly by replacing funky unprintable characters with question marks or apostrophes or, if we spend more time on research on each character, with some reasonable mapping into Latin-1.
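
One possible shape for such a mapping, sketched in Python under stated assumptions: a small hand-made table for characters with an obvious stand-in, then stripping accent marks via Unicode decomposition, then a question mark as the last resort. The table contents and function name are hypothetical, and accent stripping is my own suggestion here, not something already decided.

```python
import unicodedata

# Hypothetical hand-picked replacements; a real table would need research.
FALLBACKS = {"\u2018": "'", "\u2019": "'", "\u0141": "L", "\u0142": "l"}


def latin1_friendly(text):
    """Map text to characters that all exist in Latin-1."""
    out = []
    for ch in text:
        try:
            ch.encode("latin-1")
            out.append(ch)  # already representable, keep as is
            continue
        except UnicodeEncodeError:
            pass
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # Decompose and drop combining marks, e.g. r-caron becomes plain r.
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode("latin-1")
            out.append(base or "?")
        except UnicodeEncodeError:
            out.append("?")  # nothing reasonable found
    return "".join(out)


print(latin1_friendly("Dvořák"))
print(latin1_friendly("O\u2019Reilly"))
```

Characters that are already in Latin-1 pass through untouched, so the West European names that print correctly today are not affected.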

I'll try to finish this next time I'm in.

Comment entered 2015-12-15 19:00:27 by alan

I have attached a fixed version of the example response sheet. See HockelFix.pdf.

The fix should work for the common west European languages that constitute the great bulk of our articles with non-English characters. It won't work with the small number that have characters that are not in the Latin-1 character set. At last check, there were 138 such names out of a total of 925,000 names.

Comment entered 2015-12-15 19:36:35 by alan

This is in version control and ready for testing on QA (and DEV).

Comment entered 2016-02-01 10:27:09 by Juthe, Robin (NIH/NCI) [E]

We're still running into problems with the characters. We tested using the same article as mentioned above (Hockel et al., PMID: 22864112) and created a new print job with this article on QA-SG. We ran the print job in test mode to view the documents (without actually printing them) and the author's last name (Hockel) is jumbled in the PDF.

Given the date of Alan's last comment, is it possible that these changes didn't make it to QA-SG? Thanks.

Comment entered 2016-02-01 14:26:01 by Kline, Bob (NIH/NCI) [C]

Hi, Alan. Looks like this didn't make it onto nciws-q552-v. Any reason I can't just sync that server with SVN again, now that you've got the latest changes checked in?

Comment entered 2016-02-01 16:44:54 by Juthe, Robin (NIH/NCI) [E]

This is working now. Alan tested with Bonnie at her workstation this afternoon and I also was able to confirm this on my computer just now, so I assume the changes got moved over after all.

Comment entered 2016-02-01 16:45:05 by Juthe, Robin (NIH/NCI) [E]

Verified on QA.

Comment entered 2016-02-01 17:26:05 by alan

Syncing from svn should be fine; however, I already put the changes on the two QA servers some days ago when I first discovered that I hadn't yet put them into svn.

Comment entered 2016-02-01 17:58:17 by Kline, Bob (NIH/NCI) [C]

I remembered you saying you were going to do that, but when I ran my script to compare my sandbox with the live site on nciws-q552-v, the print files were flagged as different. It's not unlikely that I just hadn't updated my sandbox (I had assumed I had when the users reported that printing on nciws-q552-v wasn't working correctly), but I can't check now because CBIIT isn't letting us log into either d551 or q552. So I'll check later.

Comment entered 2016-02-01 21:05:32 by Kline, Bob (NIH/NCI) [C]

I had not refreshed my sandbox, and when I did, the comparison script verified that the latest changes were indeed on the live QA SG site. So though I don't have an explanation for why the software was misbehaving for the users earlier in the day, it looks like everything is OK now.

Comment entered 2016-02-01 21:30:40 by alan

I think we probably have a time confusion here. Robin made the report earlier in the day but I suspect that the actual test was done some time earlier.

Comment entered 2016-02-01 21:33:33 by Juthe, Robin (NIH/NCI) [E]

We did the first test shortly after 10am this morning on Bonnie's computer...it doesn't sound like that helps but the important thing is that it is working correctly now. 🙂

Comment entered 2016-02-02 10:05:34 by alan

Mysteries abound.

I made no changes to the printing software on any of our servers on Monday. Is it possible that Bonnie ran the print job the week before and then tested by viewing an existing print job?

Oh well, as you say, the important thing is that it's working correctly now - and for the future.

Comment entered 2016-04-07 11:46:43 by Juthe, Robin (NIH/NCI) [E]

We have not had an occasion to test this on PROD; closing issue and will reopen if we have any issues.

Attachments
File Name Posted User
Hockel example.pdf 2013-09-17 16:36:22
HockelFix.pdf 2015-12-15 19:00:27
