PDQ Issues

Issue Number	4153
Summary	Remove repeated info. from imported terms
Created	2016-09-15 12:05:08
Issue Type	Improvement
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2016-12-03 07:10:59
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.194137

Description

When a term document is imported from the thesaurus, the OtherName and Definition blocks appear to always be repeated, and users have to manually remove the duplicate and triplicate blocks in order to process imported terms. I am wondering if this can be prevented. Here is one example on DEV that I just imported: CDR0000778957.

In cases where the same Other Term Name is provided more than once, import only one Other Name block. This applies even if the Source Term ID and the Source Term Type are different, or if they are not provided for one of the blocks. The same approach should be applied to the Definition blocks as well; we do not want to see repeated definition text.

Comment entered 2016-12-03 07:10:59 by Kline, Bob (NIH/NCI) [C]

Fixed on DEV. I deleted CDR778957 and re-imported as CDR779282. Ready for user testing.

Comment entered 2016-12-16 11:34:45 by Osei-Poku, William (NIH/NCI) [C]

For at least two terms that we imported yesterday and today, the documents appear to be fine, without any duplicated information. However, the problem persists for some of the imported documents, especially cases of triplicated Definition text. The Other Name is also duplicated, but that may be because of differences in the Source Types. Here are two terms with such problems:

CDR0000779358
CDR0000779355

Comment entered 2016-12-16 14:34:27 by Kline, Bob (NIH/NCI) [C]

Is everything you're describing in this comment for activity on DEV? Or is at least some of it referring to PROD (or some other tier)?

Comment entered 2016-12-16 15:15:44 by Osei-Poku, William (NIH/NCI) [C]

Yes, that was from testing on DEV.

Comment entered 2016-12-17 16:31:02 by Kline, Bob (NIH/NCI) [C]

I just looked at CDR779358, and the OtherName blocks are not duplicates of each other.

<OtherName>
  <OtherTermName>Gene Solution</OtherTermName>
  <OtherNameType>Synonym</OtherNameType>
  <SourceInformation>
    <VocabularySource>
      <SourceCode>NCI Thesaurus</SourceCode>
      <SourceTermType>PT</SourceTermType>
      <SourceTermId>C123897</SourceTermId>
    </VocabularySource>
  </SourceInformation>
  <ReviewStatus>Reviewed</ReviewStatus>
</OtherName>
<OtherName>
  <OtherTermName>Gene Solution</OtherTermName>
  <OtherNameType>Synonym</OtherNameType>
  <SourceInformation>
    <VocabularySource>
      <SourceCode>NCI Thesaurus</SourceCode>
      <SourceTermType>SY</SourceTermType>
    </VocabularySource>
  </SourceInformation>
  <ReviewStatus>Reviewed</ReviewStatus>
</OtherName>

Note that:

The SourceTermType values are different
Only one of the blocks has a SourceTermId element

Perhaps you and I are using different definitions of the word "duplicate." From Merriam-Webster:

Definition of duplicate

consisting of or existing in two corresponding or identical parts or examples <duplicate invoices>
being the same as another <duplicate copies>

Comment entered 2016-12-19 12:21:21 by Osei-Poku, William (NIH/NCI) [C]

> Perhaps you and I are using different definitions of the word "duplicate." From Merriam-Webster:

I think we're on the same page. If you look at my comment above, I did acknowledge that the issue may be due to differences in the "Source Types", and you have additionally confirmed the difference in the SourceTermId. I do see that the blocks are different, and we may have to continue to deal with that, but I am not sure how useful those values are in processing the term documents. I'll ask Mary whether the SourceTermId and SourceTermType data are needed for processing the term documents.

You didn't say anything about the Definition text, though. The definitions seem to contain the same information and probably qualify as true triplicates for this particular document. Or is each definition mapped to one of the Other Name blocks?

Comment entered 2016-12-19 13:06:18 by Kline, Bob (NIH/NCI) [C]

I have been addressing the condition identified in the original request description (eliminating "repeated" OtherName blocks). I can implement more complicated logic, but I'll need more specific requirements for exactly which OtherName blocks I must eliminate. I can, for example, drop every OtherName block whose OtherTermName value has already been seen (ignoring case in comparison) in a previous OtherName block's OtherTermName child, or in the document's PreferredName element. I didn't mention definition text because the description for this issue only identified repeated OtherName blocks as the problem.

Comment entered 2016-12-19 13:21:06 by Osei-Poku, William (NIH/NCI) [C]

I just asked Mary about this and she is fine with not repeating Other Name blocks, even in cases where the source types and source term IDs are different. So, for this particular term, you can display only the first Other Name block and ignore the other two. Would that take care of the repeated definition blocks as well, or do I need to modify the ticket to add the definition blocks?

Comment entered 2016-12-19 13:27:49 by Kline, Bob (NIH/NCI) [C]

You need to modify the ticket to tell me exactly what you need the import software to do.

Comment entered 2016-12-19 13:43:06 by Osei-Poku, William (NIH/NCI) [C]

I have modified the original request. Please let me know if you need additional information.

Comment entered 2016-12-19 14:00:15 by Kline, Bob (NIH/NCI) [C]

Here are examples of the kinds of questions I need answers to in a complete specification of requirements for this sort of request.

1. If the other term name repeats the document's preferred name, do you want the other term block?
2. Is case significant in these comparisons of the name strings?
3. Should leading or trailing space be included or ignored in these comparisons?
4. Should the software ignore differences in internal white space (for example, two spaces versus one)?

Comment entered 2016-12-19 19:00:50 by Osei-Poku, William (NIH/NCI) [C]

1. Yes.
2. No, case is not significant.
3. They should be ignored.
4. Yes.
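
For illustration only, the four answers above (together with the rule in the description to keep just the first block for a given name) might be sketched as follows. This is not the CDR import code. The helper names `normalize` and `drop_repeats`, the `Term` root element, and the trimmed-down sample document are assumptions made for the example; only `OtherName`, `OtherTermName`, `PreferredName`, and the `SourceInformation` children come from this ticket.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Comparison key: ignore case, outer space, and internal space runs."""
    # str.split() with no argument trims the ends and splits on any run
    # of whitespace, so rejoining with one space collapses internal runs.
    return " ".join((text or "").split()).casefold()


def drop_repeats(parent, block_tag, key_tag):
    """Keep the first block_tag child for each normalized key_tag value.

    Returns the number of blocks removed. Blocks with no key text are
    left alone (an assumption; the ticket does not cover that case).
    """
    seen = set()
    removed = 0
    for block in list(parent.findall(block_tag)):
        key = normalize(block.findtext(key_tag))
        if not key:
            continue
        if key in seen:
            # Same name already imported; source type and ID are ignored.
            parent.remove(block)
            removed += 1
        else:
            seen.add(key)
    return removed


# Hypothetical, abbreviated document modeled on the blocks quoted earlier.
SAMPLE = """<Term>
  <PreferredName>Gene Solution</PreferredName>
  <OtherName>
    <OtherTermName>Gene Solution</OtherTermName>
    <SourceInformation><VocabularySource>
      <SourceTermType>PT</SourceTermType>
    </VocabularySource></SourceInformation>
  </OtherName>
  <OtherName>
    <OtherTermName> gene  solution </OtherTermName>
    <SourceInformation><VocabularySource>
      <SourceTermType>SY</SourceTermType>
    </VocabularySource></SourceInformation>
  </OtherName>
  <OtherName>
    <OtherTermName>Some Other Name</OtherTermName>
  </OtherName>
</Term>"""

root = ET.fromstring(SAMPLE)
# PreferredName is deliberately not added to the seen set (answer 1),
# so a block that repeats the preferred name is still kept once.
removed = drop_repeats(root, "OtherName", "OtherTermName")
```

The same helper could be pointed at the Definition blocks by passing their element names instead; those names are not given in this ticket, so they would have to be taken from the Term schema.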

Comment entered 2016-12-20 09:10:14 by Kline, Bob (NIH/NCI) [C]

The logic has been modified as requested. Please test this very thoroughly to ensure that the software is doing exactly what you want it to. C123897 has been re-imported as CDR779386.

Comment entered 2017-01-04 09:07:10 by Osei-Poku, William (NIH/NCI) [C]

The import tool appears to be failing with the following error message (after clicking the import button):

<error>Failure parsing concept: Space required after the Public Identifier, line 1, column 47</error>

Comment entered 2017-01-04 09:13:45 by Osei-Poku, William (NIH/NCI) [C]

It seems the same error message appears on PROD as well. I wonder if it has anything to do with an HTTPS protocol issue.

Comment entered 2017-01-04 10:48:45 by Kline, Bob (NIH/NCI) [C]

Nothing to do with this ticket or with HTTPS. Their service is down. I'll report the problem.

Comment entered 2017-01-04 13:36:21 by Kline, Bob (NIH/NCI) [C]

FNLCR acknowledged that they broke our use of their API by turning off the URL we were using. I was unable to find any announcement of the switch on their announcement listserv. I put in a workaround for the problem on DEV and imported C29378 on that tier.

FNLCR did indicate that they might look into an approach which would eliminate the volatility of their API URLs.

Comment entered 2017-01-10 19:23:55 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA.
