PDQ Issues

Issue Number	5025
Summary	Spanish Medical Spellchecker
Created	2021-09-01 12:04:28
Issue Type	New Feature
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2022-03-01 19:50:14
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.297794

Description

The Stedman's spellchecker that we currently use to check the spelling of medical terms in the CDR is available only in English. The vendor does not offer a Spanish version, but the Spanish team has expressed the need for such a tool for the Spanish documents in the CDR. The Spanish team currently uses an online subscription version:

https://doc.4d.com/4Dv16/4D/16.6/Soporte-de-diccionarios-Hunspell.300-4446033.es.html

We asked the owners whether they could provide the terms as a file, and they said they can provide them in CSV format.

We would like to investigate whether the terms can be extracted from the CSV file into a format that can be used in the CDR. We have not received the CSV file yet. I will attach it to this ticket when we receive it.

Comment entered 2021-09-01 17:26:18 by Englisch, Volker (NIH/NCI) [C]

Digging through the XMetaL forum:

There are two types of dictionary files: the dictionary *.LEX files (the three dictionary files from Stedman's, for instance) and the user-maintained *.UWL files. According to the developer, Derek Read, the spell checker was added to XMetaL when Corel owned XMetaL, and the *.UWL file format is protected. So it is unlikely we'll find a way to convert the CSV file to a *.UWL file to be used as a dictionary. The only other option I see at the moment would be to add the Spanish terms one by one via the user interface, but I don't know how such a task could be automated.

I am still looking to find more information on the dictionary *.LEX file format.

Comment entered 2021-09-01 18:29:43 by Englisch, Volker (NIH/NCI) [C]

Here is a hint from the XMetaL developer on how to add a bunch of dictionary words to a UWL (User Word List):

There is no automated way to add “non-approved” words to the UWL.

Adding words that you want to allow is easy but does not apply to your case (things that you consider correct that the spell checker does not). In that case you can simply add all the words to a single file (any XML file), launch the spell checker, click the Add button once, then hold down the Enter key to repeatedly add all the others.

Comment entered 2021-09-01 19:20:48 by Englisch, Volker (NIH/NCI) [C]

Beginning with XMetaL 8.0 it should be possible to use Hunspell/MySpell dictionary files - *.DIC for words and *.AFF for rules - as part of the spell checker. The files used by Hunspell are based on free formats that are also used by applications like LibreOffice or Thunderbird.

I'm in the process of testing the use of these dictionary files with XMetaL. If that is successful, we should be able to create a simple *.DIC dictionary file from the CSV file.
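For illustration, the CSV-to-*.DIC conversion mentioned above could look roughly like the following. This is a minimal sketch, not the script used for this ticket: the CSV file was never received, so the column layout, the header row, and the function name `csv_to_dic` are all assumptions. The only format detail relied on is that a Hunspell *.DIC file starts with an (approximate) entry count followed by one entry per line.

```python
import csv
import io


def csv_to_dic(csv_text, column=0, has_header=True):
    """Convert CSV text into the text of a Hunspell .dic file.

    Hypothetical helper: `column` and `has_header` describe an assumed
    CSV layout, since the vendor file was never available.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row, if any
    words = set()
    for row in reader:
        if len(row) > column:
            word = row[column].strip()
            if word:
                words.add(word)  # the set removes duplicate terms
    ordered = sorted(words)
    # First line: number of entries. Then one entry per line.
    return "\n".join([str(len(ordered))] + ordered) + "\n"


# Made-up sample data, for demonstration only.
sample = "term,source\nquimioterapia,demo\nmetástasis,demo\nquimioterapia,demo\n"
dic_text = csv_to_dic(sample)
print(dic_text)
```

The result would still have to be saved in whatever character encoding the spell checker expects, which this sketch does not address.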

Comment entered 2021-09-02 11:38:25 by Englisch, Volker (NIH/NCI) [C]

I successfully added a Spanish Hunspell dictionary file to XMetaL, manually modified that dictionary file by sprinkling German words into it, and then confirmed that those words, when added to a summary, were recognized as properly spelled.

In other words, we will be able to create our own Hunspell dictionary file and link it to XMetaL.

Here is a page describing the format of the dictionary files:

http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html

Comment entered 2021-09-02 11:43:57 by Englisch, Volker (NIH/NCI) [C]

~oseipokuw, the link you included in the description points to a site that explains Hunspell dictionary files. It appears to me that this site/vendor (?) is using the same format. If they are willing to provide the data in CSV format, maybe they are willing to provide the *.DIC file?

Comment entered 2021-09-02 12:06:30 by Englisch, Volker (NIH/NCI) [C]

One more comment - I found this dictionary on GitHub. Maybe Linda could have a look and see if this could be useful.

https://gist.github.com/victorhck/9ed2f9b95e675e9e71be142816791afd

Comment entered 2021-09-16 11:48:18 by Osei-Poku, William (NIH/NCI) [C]

Linda finds this very helpful and is researching to see if she can find additional resources online. Thanks!

Comment entered 2021-11-04 13:26:12 by Osei-Poku, William (NIH/NCI) [C]

There is no major update for this ticket but Linda and her team will be coming up with a list of resources we may be able to combine to create a dictionary file. We have also talked about possibly including the list of Spanish glossary terms from the CDR. I will provide more information when we are sure of which resources we want included in the dictionary file.

Comment entered 2022-01-13 12:32:16 by Osei-Poku, William (NIH/NCI) [C]

I met with Linda and her team to discuss this ticket. We decided to move forward with what we have so far since they couldn't find additional resources. We hope that if they find additional resources we can add them to the list. They also requested to add all the terms in the current Spanish dictionaries (the Spanish patient dictionary and the Spanish Genetics dictionary) to the dictionary file to be created. We already have an ad hoc query for getting all the published terms.

Comment entered 2022-01-13 13:50:03 by Kline, Bob (NIH/NCI) [C]

Volker will add the extra terms to the dictionary in the common share.

Comment entered 2022-02-22 16:51:43 by Englisch, Volker (NIH/NCI) [C]

~oseipokuw, looking over this ticket again, it seems the task is now to "include all Spanish sources" in our new dictionary file. It's not clear to me, however, what all of the sources are. You were talking about a CSV file from a former vendor, but it's not clear if we ever received such a file.

From what I see we want to add the following sources into a Spanish dictionary file:

GitHub file
Genetics dictionary
Dictionary of cancer terms

Is that it for right now?

Comment entered 2022-02-23 14:32:14 by Osei-Poku, William (NIH/NCI) [C]

Hi ~volker. Yes, these are the sources we have for now. We were not able to get the spreadsheet that was promised initially, so we decided to just go with what we have. We also want to have a way to update the terms from time to time.

Comment entered 2022-02-23 15:06:16 by Englisch, Volker (NIH/NCI) [C]

OK, I will go ahead and create a Spanish dictionary file. However, given your remark that you want to update the terms from time to time I will probably create multiple distinct dictionary files so that we can update the CG glossary terms without having to download the Spanish dictionary file again.

Also, please be aware that we don't have a Spanish affix file. This is the file defining how the stem of a word may be modified and still be recognized as spelled correctly (e.g. jump --> jumping, jumped, jumps, etc.). That means that only the exact words as spelled in the dictionary will be recognized.
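The practical effect of having no affix file can be illustrated with a small simulation. This is a simplified sketch of exact-match lookup, not Hunspell itself: it ignores case handling and everything else a real spell checker does, and the helper names and sample words are made up for the example.

```python
def load_dic_words(dic_text):
    """Return the set of entries from the text of a .dic file."""
    lines = dic_text.splitlines()
    entries = lines[1:]  # the first line holds the entry count
    # Drop any affix flags after "/" (unused when there is no .aff file).
    return {line.split("/", 1)[0].strip() for line in entries if line.strip()}


def is_known(word, words):
    """Exact-match lookup: no stemming, no affix rules."""
    return word in words


# Only the singular is listed, so the plural is flagged as unknown.
words = load_dic_words("1\ntumor\n")
print(is_known("tumor", words))    # True
print(is_known("tumores", words))  # False

# Without affix rules, every inflected form has to be listed explicitly.
words = load_dic_words("2\ntumor\ntumores\n")
print(is_known("tumores", words))  # True
```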

Comment entered 2022-03-01 19:49:52 by Englisch, Volker (NIH/NCI) [C]

I created these three dictionary files:

dic_medicos.dic - the content from the GitHub link
dict_hp.dic - HP dictionary from ad-hoc query interface
dict_pat.dic - Patient dictionary from ad-hoc query interface

The last two files were created from the data on PROD. The files are located in the OCPL/_Cross/CDR/STEDMANS directory and need to be added to the "Main Word List" under "Options" when starting the spell check in XMetaL.

The script to create the local dictionary files is located in our DevTools/Utilities directory:

create_dicts_es.py
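The contents of create_dicts_es.py are not shown in this ticket. As a rough idea of what building one dictionary file per source might involve, here is a hypothetical sketch. It assumes that glossary term names can contain several words and that the spell checker looks up single words, so each term is split into words first. The function name `terms_to_dic` and the sample terms are invented for the example; only the two file names come from the list above.

```python
import re
import unicodedata

# A word is a run of letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def terms_to_dic(terms):
    """Build .dic text (entry count, then one word per line) from term names."""
    words = set()
    for term in terms:
        # Normalize so accented letters are single code points.
        normalized = unicodedata.normalize("NFC", term)
        words.update(WORD.findall(normalized))
    ordered = sorted(words)
    return "\n".join([str(len(ordered))] + ordered) + "\n"


# Made-up sample terms. In practice these would come from the
# ad-hoc queries for the published glossary terms.
sources = {
    "dict_pat.dic": ["cáncer de mama", "célula madre", "cáncer de pulmón"],
    "dict_hp.dic": ["mutación de novo"],
}
dic_files = {name: terms_to_dic(terms) for name, terms in sources.items()}
print(dic_files["dict_pat.dic"])
```

Keeping one file per source means the glossary-based files can be regenerated from fresh query results without touching the downloaded dictionary.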

Comment entered 2022-04-08 10:27:42 by Osei-Poku, William (NIH/NCI) [C]

Verified. The new files are working as expected. Thank you!
