CDR Tickets

Issue Number 3294
Summary Update term records with the mapping to NCI T codes
Created 2011-01-11 12:57:29
Issue Type Improvement
Submitted By Beckwith, Margaret (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2011-04-12 16:07:59
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107622
Description

BZISSUE::4984
BZDATETIME::2011-01-11 12:57:29
BZCREATOR::Margaret Beckwith
BZASSIGNEE::Bob Kline
BZQACONTACT::William Osei-Poku

We have received a spreadsheet containing a mapping of NCI T codes to PDQ terms and we need to update the term records with the NCI T codes. I am attaching the spreadsheet (which may require help in interpretation from Larry). This is in preparation for having CTRP switch to using NCI T for their trial coding.

Comment entered 2011-01-11 12:57:29 by Beckwith, Margaret (NIH/NCI) [E]

Attachment PDQtoNCItMap.xls has been added with description: NCIT to PDQ map

Comment entered 2011-01-11 14:23:36 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-01-11 14:23:36
BZCOMMENTOR::Bob Kline
BZCOMMENT::1

Would I be right in guessing that:

(a) we need to add the NCI/T concept ID to the
/Term/Definition/DefinitionSource/DefinedTermId element;
(b) we will put 'NCI Thesaurus' in the DefinitionSourceName sibling
of the DefinedTermId element;
(c) we should do this for every occurrence of the /Term/Definition
block in the document (the block can have multiple occurrences);
(d) we should do nothing to a block which already has the concept ID

?

Larry:

It looks like "concept ID" in the description above would come from column Q of the spreadsheet, and it would go into the CDR document whose ID is in column I, right?

Comment entered 2011-01-13 12:16:32 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2011-01-13 12:16:32
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::2

I could be wrong, but I think the information should go in the SourceInformation/VocabularySource block and not the DefinitionSource block:

[SourceInformation]
[VocabularySource]
SourceCode would be NCI Thesaurus
SourceTermType would be ?
SourceTermID would be the NCI T ID

Comment entered 2011-02-03 11:36:33 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-02-03 11:36:33
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

Most of our documents for these terms don't have any Definition blocks, and the DefinitionText child is required if the Definition parent is present.

Comment entered 2011-02-03 17:12:21 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-02-03 17:12:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::4

(In reply to comment #3)
> Most of our documents for these terms don't have any Definition blocks, and the
> DefinitionText child is required if the Definition parent is present.

Please disregard that comment. Don't know what I was thinking.

Since the purpose of this task appears to be to establish a mapping between the concept represented by one of our terminology documents and the concept represented by a concept UI in the NCI thesaurus, without reference to any of the specific names associated with the concept, I propose adding an optional top-level element called NCIThesaurusConcept, whose text content contains the CUI. If it turns out that some of our terminology documents correspond to more than one of the NCI/T concepts, we can have the schema allow multiple occurrences of the new element.
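As a rough illustration of the proposal above, the following sketch appends an NCIThesaurusConcept element to a Term document. It is not the actual global change code: the PreferredName child is a hypothetical placeholder, and appending at the end of the root is only for illustration, since the element's placement in the schema had not been settled at this point.

```python
import xml.etree.ElementTree as ET


def add_concept(term_xml: str, concept_id: str) -> str:
    """Return the Term document with one more NCIThesaurusConcept child."""
    root = ET.fromstring(term_xml)
    # Assumption: the new element is a direct child of the root. The real
    # schema dictates a specific position; appending is illustrative only.
    node = ET.SubElement(root, "NCIThesaurusConcept")
    node.text = concept_id
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified Term document.
sample = "<Term><PreferredName>example term</PreferredName></Term>"
print(add_concept(sample, "C9160"))
```

Calling add_concept again with a different ID would produce a second occurrence, which is the multiple-occurrence case the schema may need to allow.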

Comment entered 2011-02-18 11:41:52 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-02-18 11:41:52
BZCOMMENTOR::Bob Kline
BZCOMMENT::5

(In reply to comment #4)
> ... I propose adding an optional
> top-level element called NCIThesaurusConcept, whose text content contains the
> CUI. If it turns out that some of our terminology documents correspond to more
> than one of the NCI/T concepts, we can have the schema allow multiple
> occurrences of the new element.

I believe this proposal was approved at the Feb. 10 status meeting. Can you confirm this, Margaret?

Comment entered 2011-02-24 12:11:43 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2011-02-24 12:11:43
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::6

Yes, this is what I remember.

Comment entered 2011-02-28 12:43:09 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-02-28 12:43:09
BZCOMMENTOR::Bob Kline
BZCOMMENT::7

I have installed the new element in the Term schema on Franck. As soon as I get sign-off on the schema change (including placement of the new element) I will run a test of the global change for this task.

Comment entered 2011-03-03 12:19:23 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2011-03-03 12:19:23
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::8

It looks okay to me but I do have a couple of questions:
1. Is it multiply occurring? We had decided to make it so that we could put more than one NCI T code in if necessary.
2. For terms that we already have the NCI T code name in the Vocabulary Source/Source Term ID field, will we also store the information to the new element? The reason I ask is just that in the SourceTerm ID field we don't really identify the information as the NCI T Concept ID so if we ever want to use the data we would have to go to two fields with different names.
3. When we do the test mapping could we get a report of all of the concepts that have more than one CDR ID associated with them?

Comment entered 2011-03-07 08:18:16 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-03-07 08:18:16
BZCOMMENTOR::Bob Kline
BZCOMMENT::9

(In reply to comment #8)

> 1. Is it multiply occurring? We had decided to make it so that we could put
> more than one NCI T code in if necessary.

Yes. I had written earlier: "If it turns out that some of our terminology documents correspond to more than one of the NCI/T concepts, we can have the schema allow multiple occurrences of the new element." As the attached report shows, there are many instances of Term documents associated with more than one NCI/T concept. I have modified the schema on Franck to accommodate these documents.

> 2. For terms that we already have the NCI T code name in the Vocabulary
> Source/Source Term ID field, will we also store the information to the new
> element?

We decided to expand this global change job to include this request.

> 3. When we do the test mapping could we get a report of all of the concepts
> that have more than one CDR ID associated with them?

The NCI/T service is working again, though the problems William reported in issue #5004 are more subtle than the ones I was experiencing last week, and I assume they are still unresolved (I haven't heard from Larry yet). The report is attached. It reflects one-to-many matches in both directions: multiple NCI/T concepts associated with a single CDR Term document, as well as more than one CDR Term document linked to a given NCI/T concept.

Comment entered 2011-03-07 08:18:16 by Kline, Bob (NIH/NCI) [C]

Attachment multi-mappings.xls has been added with description: Report of one-to-many matches
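A report like this could be derived from the mapping rows roughly as follows. This is a sketch under assumptions, not the script that produced the attachment: it takes the rows as (CDR ID, concept ID) pairs already read from the spreadsheet, and the sample pairs are IDs mentioned in this ticket.

```python
from collections import defaultdict


def multi_mappings(pairs):
    """Return (docs with several concepts, concepts with several docs)."""
    by_doc = defaultdict(set)
    by_concept = defaultdict(set)
    for cdr_id, concept_id in pairs:
        by_doc[cdr_id].add(concept_id)
        by_concept[concept_id].add(cdr_id)
    # Keep only the keys which map to more than one value.
    many_concepts = {d: sorted(c) for d, c in by_doc.items() if len(c) > 1}
    many_docs = {c: sorted(d) for c, d in by_concept.items() if len(d) > 1}
    return many_concepts, many_docs


pairs = [
    ("CDR0000039920", "C9160"),
    ("CDR0000039920", "C9158"),
    ("CDR0000041350", "C10491"),
    ("CDR0000038701", "C10491"),
]
many_concepts, many_docs = multi_mappings(pairs)
print(many_concepts)
print(many_docs)
```

Using sets means a pair repeated in the spreadsheet does not get reported as a one-to-many match.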

Comment entered 2011-03-07 10:04:18 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-03-07 10:04:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::10

I have run the test-mode global change on Franck. The results can be reviewed here:

http://franck.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-03-07_09-04-09

A couple of things to point out:

1. I had the program report any concept IDs which were already in
the CDR documents and which didn't match the standard pattern
for such IDs:

CDR504594: funky concept ID: '62767'
CDR514473: funky concept ID: '92171'
CDR547120: funky concept ID: ' C1908 '
CDR590640: funky concept ID: 'Match C74014'
CDR641461: funky concept ID: 'c82379'

2. I also noticed that the schema allows Comment elements at two
different places at the top level of the document. Using
proximity to associate two sibling elements is not the best
way to indicate that the two elements belong with each other
(the proper way to do that is to use a wrapper element which
contains both elements). Having this ambiguity in the schema
made it unnecessarily tricky to identify the proper position
at which the new elements were to be inserted. I recommend
that we consider modifying the schema (and the documents, if
Comment elements have been used at both positions) so that
Comment elements which apply to the entire document are all
located in the same place, and those which don't apply to the
entire document are appropriately nested at the position
where they logically belong. Because of this problem, I
strongly suggest that the review of the converted documents
look at the entire documents (not just the diffs) to confirm
that the program found the right location for the new elements.
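The check described in item 1 might look something like the sketch below. The pattern is an assumption inferred from the rejected values listed there (an uppercase "C" followed only by digits, with no surrounding whitespace or extra text); the program's actual pattern is not shown in this ticket.

```python
import re

# Assumed "standard pattern" for an NCI/T concept ID.
CONCEPT_ID = re.compile(r"C\d+")


def is_standard(concept_id: str) -> bool:
    """True only if the whole string matches the assumed pattern."""
    return CONCEPT_ID.fullmatch(concept_id) is not None


# The five values reported in item 1, plus one well-formed ID for contrast.
for value in ["62767", "92171", " C1908 ", "Match C74014", "c82379", "C9160"]:
    label = "ok" if is_standard(value) else "funky concept ID"
    print(f"{value!r}: {label}")
```

Matching the full string rather than searching is what catches the values with stray whitespace or a leading word.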

Comment entered 2011-03-28 15:03:06 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-03-28 15:03:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::11

changed QA contact.
(In reply to comment #10)
> I have run the test-mode global change on Franck. The results can be reviewed
> here:
> A couple of things to point out:
>
> 1. I had the program report on any concept IDs already in the CDR
> documents and which didn't match the standard pattern for such
> IDs:
>
> CDR504594: funky concept ID: '62767'
> CDR514473: funky concept ID: '92171'
> CDR547120: funky concept ID: ' C1908 '
> CDR590640: funky concept ID: 'Match C74014'
> CDR641461: funky concept ID: 'c82379'
>

These have been fixed on Bach.

> 2. I also noticed that the schema allows Comment elements at two
> different places at the top level of the document. Using
> proximity to associate two sibling elements is not the best
> way to indicate that the two elements belong with each ........

OCECDR-3331 created to take care of above.

Comment entered 2011-03-31 12:17:24 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-03-31 12:17:24
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::12

(In reply to comment #10)
> I have run the test-mode global change on Franck. The results can be reviewed
> here:
>
> http://franck.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-03-07_09-04-09
>
Verified. Please run in live mode on Franck.

Comment entered 2011-04-01 12:39:35 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-04-01 12:39:35
BZCOMMENTOR::Bob Kline
BZCOMMENT::13

(In reply to comment #12)

> Verified. Please run in live mode on Franck.

Done; ready for review.

Comment entered 2011-04-05 11:27:48 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-04-05 11:27:48
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::14

(In reply to comment #13)
> (In reply to comment #12)
>
> > Verified. Please run in live mode on Franck.
>
> Done; ready for review.

Please check the following to see if I am looking at them correctly.

I was expecting to see at least C10268 in CDR0000040938 & CDR0000040955, but neither term was updated at all on Franck.

Same issue with CDR0000041350 & CDR0000038701. They seem to share C10491, but neither CDR record was updated.

Also, the CSS needs to be updated to display the data correctly. I will create a new issue for that.

Comment entered 2011-04-05 17:17:21 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-04-05 17:17:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::15

Too bad this wasn't noticed when running in test mode. There was a bug in the code, which I have fixed. I rewrote the script to avoid updating documents which already had been modified for this task and ran it again in live mode. Should finish within the hour.

Comment entered 2011-04-06 14:32:33 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-04-06 14:32:33
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::16

(In reply to comment #15)
> Too bad this wasn't noticed when running in test mode. There was a bug in the
> code, which I have fixed. I rewrote the script to avoid updating documents
> which already had been modified for this task and ran it again in live mode.
> Should finish within the hour.

Verified. Please run in test mode on Bach.

Comment entered 2011-04-06 16:43:18 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-04-06 16:43:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::17

(In reply to comment #16)

> Verified. Please run in test mode on Bach.

Done:

http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-04-06_15-18-54

Comment entered 2011-04-07 10:55:46 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-04-07 10:55:46
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::18

(In reply to comment #17)
> (In reply to comment #16)
>
> > Verified. Please run in test mode on Bach.
>
> Done:
>
> http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-04-06_15-18-54

1. I am still seeing some anomalies in the results. I also compared them with the results from Franck, and there still seem to be some discrepancies.
Examples:

-Bach
CDR0000039920
+ <NCIThesaurusConcept>
+ C9160
+ </NCIThesaurusConcept>

-Franck
CDR0000039920
+ <NCIThesaurusConcept>
+ C9160
+ </NCIThesaurusConcept>
+ <NCIThesaurusConcept>
+ C9158
+ </NCIThesaurusConcept>

The results from Franck appear to be consistent with what is in the spreadsheet.

Other examples include:
CDR0000038485
CDR0000038567

2. Also, it appears some of the terms are missing from the test results from Bach.
Examples:
CDR0000043292
CDR0000038095

Comment entered 2011-04-08 10:03:00 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-04-08 10:03:00
BZCOMMENTOR::Bob Kline
BZCOMMENT::19

Please review this set:

http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-04-08_08-40-56

In this version the software:

  • avoids adding a concept ID already in /Term/NCIThesaurusConcept

  • copies an NCI concept ID appearing in an OtherName block to a
    new /Term/NCIThesaurusConcept element even if the concept ID
    mapping isn't provided by the spreadsheet

This time I believe it's doing it correctly.
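The two rules listed above can be sketched as follows. This is an illustration only, not the global change script: the path to the ID inside an OtherName block and the element names SourceCode and SourceTermID follow the wording of comment #2 and are assumptions about the schema, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Assumed location of an NCI/T ID inside an OtherName block.
SOURCE_PATH = "OtherName/SourceInformation/VocabularySource"


def new_concept_ids(root, mapped_ids):
    """List the concept IDs which still need an NCIThesaurusConcept element."""
    existing = {(n.text or "").strip() for n in root.findall("NCIThesaurusConcept")}
    candidates = list(mapped_ids)  # IDs supplied by the spreadsheet
    for source in root.findall(SOURCE_PATH):
        if (source.findtext("SourceCode") or "").strip() == "NCI Thesaurus":
            candidates.append((source.findtext("SourceTermID") or "").strip())
    wanted = []
    for concept_id in candidates:
        # Skip blanks, IDs already in the document, and repeats.
        if concept_id and concept_id not in existing and concept_id not in wanted:
            wanted.append(concept_id)
    return wanted


# Invented sample: one ID already present, another only in an OtherName block.
doc = ET.fromstring(
    "<Term>"
    "<OtherName><SourceInformation><VocabularySource>"
    "<SourceCode>NCI Thesaurus</SourceCode>"
    "<SourceTermID>C9158</SourceTermID>"
    "</VocabularySource></SourceInformation></OtherName>"
    "<NCIThesaurusConcept>C9160</NCIThesaurusConcept>"
    "</Term>"
)
print(new_concept_ids(doc, ["C9160"]))
```

Because the function compares against what is already in the document, running it a second time after the elements have been added yields nothing further to insert.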

Comment entered 2011-04-08 11:07:01 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-04-08 11:07:01
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::20

(In reply to comment #19)
> Please review this set:
>
> http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2011-04-08_08-40-56
>
> In this version the software:
> * avoids adding a concept ID already in /Term/NCIThesaurusConcept
> * copies an NCI concept ID appearing in an OtherName block to a
> new /Term/NCIThesaurusConcept element even if the concept ID
> mapping isn't provided by the spreadsheet
>
> This time I believe it's doing it correctly.

Verified. Please run in live mode on Bach.

Comment entered 2011-04-08 13:03:03 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-04-08 13:03:03
BZCOMMENTOR::Bob Kline
BZCOMMENT::21

(In reply to comment #20)

> Verified. Please run in live mode on Bach.

Done.

Comment entered 2011-04-12 16:07:59 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2011-04-12 16:07:59
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::22

(In reply to comment #21)
> (In reply to comment #20)
>
> > Verified. Please run in live mode on Bach.
>
> Done.

Verified on Bach. Issue closed. Thank you!

Attachments
File Name Posted User
multi-mappings.xls 2011-03-07 08:18:16
PDQtoNCItMap.xls 2011-01-11 12:57:29
