Issue Number | 2568 |
---|---|
Summary | Change Term Type to Lexical Variants |
Created | 2008-07-03 08:08:07 |
Issue Type | Improvement |
Submitted By | Grama, Lakshmi (NIH/NCI) [E] |
Assigned To | alan |
Status | Closed |
Resolved | 2013-07-11 13:53:44 |
Resolution | Won't Fix |
Path | /home/bkline/backups/jira/ocecdr/issue.106896 |
BZISSUE::4174
BZDATETIME::2008-07-03 08:08:07
BZCREATOR::Lakshmi Grama
BZASSIGNEE::Alan Meyer
BZQACONTACT::Sheri Khanna
In the old PDQ system, we needed to have "synonyms" that were actually lexical variants, for example terms with inverted word order. This is causing some problems as we try to implement keyword searching with query expansion using the synonymy on the Clinical trials search form. To help with this, we would like to implement a global change that identifies these terms (most have a comma in them) and then changes the OtherNameType in the record from "Synonym" to "Lexical variant".
Example record
stage I squamous cell carcinoma of the lip and oral cavity (preferred name)
lip and oral cavity squamous cell carcinoma, stage I
oral cavity and lip squamous cell carcinoma, stage I
squamous cell carcinoma of the lip and oral cavity, stage I
Some known issues with how word order variants were entered - there may be
Variants for PreferredName word order
Variants for OtherName word order
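The proposed global change can be sketched roughly as follows. This is only an illustration of the comma heuristic described above, not the actual global change code: the function name `reclassify_other_names` and the representation of a record's other names as (name, type) pairs are assumptions made for the example.

```python
def reclassify_other_names(other_names):
    """Relabel comma-containing synonyms as lexical variants.

    other_names is a list of (name, name_type) pairs. This layout is a
    hypothetical stand-in for the OtherName entries of a Term record.
    """
    result = []
    for name, name_type in other_names:
        # Heuristic from the request: most word order variants contain a comma.
        if name_type == "Synonym" and "," in name:
            name_type = "Lexical variant"
        result.append((name, name_type))
    return result


if __name__ == "__main__":
    names = [
        ("lip and oral cavity squamous cell carcinoma, stage I", "Synonym"),
        ("Thymoma Type A", "Synonym"),
    ]
    for name, name_type in reclassify_other_names(names):
        print(f"{name_type}: {name}")
```

A variant without a comma, such as "Thymoma Type A", is left as a synonym, so the heuristic catches only part of the data.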
BZDATETIME::2008-07-04 00:32:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1
Here are some very preliminary thoughts about the problem. They
aren't well thought out, but perhaps if I think out loud someone
else will be stimulated to jump in with more or better ideas, or
to tell me I'm making too big a problem out of this and we just
want something simple and easy.
o It appears that there is a continuum of relationships
stretching from very clear cases of synonymy to very clear
cases of lexical variation, with shades of gray in between.
For example, are "testicle cancer" and "testicular cancer"
synonyms or lexical variants? How about singulars and
plurals?
o In very clear lexical variants, commas are not always
present, for example:
"type A Thymoma"
"Thymoma Type A"
o There are also lexical variations among other elements in the
Terms. For example in CDR40002 (type A Thymoma) we have the
following three Subtypes:
malignant thymoma, spindle cell
thymoma, malignant, spindle cell
thymoma, spindle cell malignant
Are these also problematic, or can we ignore them because
they aren't marked as synonyms of the preferred term?
Maybe we can figure out exactly what the problems are in
searching and, based on that, decide what elements need to be
modified and how.
o We may want some criteria to determine what constitutes
progress and what is worth doing.
Maybe 50% of the real lexical variants are, as Bob likes to
say, "low hanging fruit". Maybe we can pick them off without
fear of making mistakes and declare victory.
Or maybe that isn't enough. Maybe we need to really clean up
much more of the data to have a positive impact on searching.
o If we need to go beyond the low hanging fruit and climb the
tree, maybe we need tools that assist a human editor to find
possible lexical variants and present them to him for
approval, rather than automatically fixing things. I wouldn't
be surprised if a human could make 100 decisions per hour if
he were presented with all of the data in an easily used
package and could just say Yay or Nay without having to call
up records in XMetal to make the changes.
One tool we could use to assist with identification of
candidates for conversion to lexical variants would be to write
a program that measures the degree of similarity between two
strings. It might, for example:
For each pair of strings to compare:
    Normalize spacing (convert newlines to space, replace
    multiple runs of spaces with one space).
    Convert to all one case
    Remove punctuation - commas, apostrophes, parentheses,
    etc.
    Extract the words.
    Convert plurals to singulars (that is very hard to do
    well, but might be something we could get from a package
    like Stedmans or UMLS).
    Sort the words.
Then compare the two strings.
    Determine whether they are identical and, within limits,
    how many characters are different.
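The steps above might look something like the following in Python. This is a minimal sketch under stated assumptions, not a tested tool: the names `normalize`, `naive_singular`, `edit_distance` and `compare`, and the `max_diff` limit, are all invented for the example. The plural handling is a crude placeholder that just strips a trailing "s"; a real version would need a proper lexical resource, as noted above.

```python
import re
import string


def naive_singular(word):
    # Placeholder only: strips a trailing "s". Real plural handling
    # would need a lexical resource, as the comment above notes.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term):
    # Normalize spacing, convert to one case, remove punctuation.
    text = re.sub(r"\s+", " ", term).strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Extract the words, convert plurals, sort the words.
    words = sorted(naive_singular(w) for w in text.split())
    return " ".join(words)


def edit_distance(a, b):
    # Standard Levenshtein distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def compare(a, b, max_diff=2):
    # max_diff is an arbitrary illustrative limit, not a tuned value.
    na, nb = normalize(a), normalize(b)
    distance = edit_distance(na, nb)
    return {"identical": na == nb,
            "distance": distance,
            "candidate": distance <= max_diff}


if __name__ == "__main__":
    print(compare("type A Thymoma", "Thymoma Type A"))
    print(compare("testicle cancer", "testicular cancer"))
```

Pairs that come out identical after normalization are the clear word order variants. Pairs with a small nonzero distance, like "testicle cancer" and "testicular cancer", are the gray cases that probably need a human decision.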
Obviously, we're not the first organization to deal with lexical
variants. NLM has done extensive research on this problem,
published articles about it, and may have tools they can share
with us. A Google search turns up a lot of their research. I'll
want to read some of their publications.
BZDATETIME::2008-07-15 17:19:13
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
The cancer.gov team has analyzed the issues regarding synonyms
and lexical variants. They have decided that, in the context of
the current software, eliminating lexical variants won't provide
any advantages and they can be left as is.
Cancer.gov has different techniques for searching metadata,
i.e., controlled terminology assigned to protocols, and for
searching plain text in the titles or other elements in the
documents.
For the second task, there may be some advantages in having a
list of one word synonyms, e.g., "childhood", "pediatric",
"children". They have to be one word synonyms because the
underlying search engine knows how to substitute one word
synonyms but has no built-in capability to substitute whole
phrases in a free text search.
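To make the one word restriction concrete, here is a rough sketch of word-by-word query expansion. It is not how the cancer.gov search engine is implemented. The `SYNONYM_GROUPS` table, the `expand_query` function and the AND/OR output syntax are all assumptions made for illustration.

```python
# Hypothetical synonym table; each set holds interchangeable single words.
SYNONYM_GROUPS = [
    {"childhood", "pediatric", "children"},
]


def expand_query(query):
    """Expand each query word into an OR group of its one word synonyms."""
    clauses = []
    for word in query.lower().split():
        group = {word}
        for candidates in SYNONYM_GROUPS:
            if word in candidates:
                group = candidates
                break
        if len(group) > 1:
            clauses.append("(" + " OR ".join(sorted(group)) + ")")
        else:
            clauses.append(word)
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(expand_query("childhood leukemia"))
```

Because the lookup works one word at a time, a multi-word variant such as "lip and oral cavity" can never be substituted as a unit, which is the limitation described above.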
However, Lakshmi speculates that in real searches, adding new
one word synonyms would have limited value. One problem is that
terms are often only synonymous in a specific context, not in all
contexts. So the synonyms could bring in false hits as well as
true ones. Also, for many of the obvious synonyms, like
"childhood", etc. above, it is often the case that all of the
synonyms already appear in the free text of the same documents.
In those cases, searching on any of the terms will already work
correctly without having to "OR" in synonyms of those terms.
We might wish to mark lexical variants in the future, but the
priority of the task can be lowered since it's not required to
solve the immediate problem of enhancing cancer.gov searches for
clinical trials.
BZDATETIME::2008-07-17 14:54:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3
Lowering priority to P8 after discussion at the status meeting.
Decided in weekly status meeting that this isn't worth it.