CDR Tickets

Issue Number 4900
Summary [Media] Keyword search report
Created 2020-09-24 09:56:40
Issue Type New Feature
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2021-05-12 16:17:02
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.275466
Description

This is a request for a new report that works like the Standard Wording report. It will primarily be used by the Spanish team to get a good sense of how terms have been translated across several documents to ensure consistency. I am attaching a draft requirements document.

Comment entered 2020-10-08 14:36:40 by Kline, Bob (NIH/NCI) [C]

William will attach a fresh set of requirements.

Comment entered 2021-03-25 10:30:15 by Osei-Poku, William (NIH/NCI) [C]

We decided to proceed with this request, so I will provide a new requirements document soon.

Comment entered 2021-03-31 12:00:57 by Osei-Poku, William (NIH/NCI) [C]

Media Keywords Search Report_revised.docx

Attached new requirements document.

Comment entered 2021-03-31 14:18:54 by Kline, Bob (NIH/NCI) [C]
  1. Could you be more precise for "surrounding text" and "where applicable" in "show the surrounding text of the term or phrase where applicable"; how much of the surrounding text is to be shown? Do we need to reach over to neighboring (but non-searched) elements to find surrounding text? How does the software know when it is "applicable" to show the surrounding text? There is no mention in the requirements of any use of rich text to distinguish the search terms from the surrounding text. Does this mean that we can avoid the extra cost of supporting rich text in the HTML and Excel versions of the report?

  2. The requirements say the report should support wildcards in search terms "similarly to what is done for the standard wording report." I don't see such support in that report. In that report we tokenize the words in the text being searched so that we can avoid false positives picked up by a more naïve substring match (for example, we don't want to report a match on "breast" in the text "two men walking abreast"). If we introduce support for wildcards which can stand for any sequence of intermediate characters, we would need to abandon this tokenizing, word-by-word matching, since some of those intermediate characters matched by the wildcards could be spaces. Is that what you really want?

  3. "Ability to search terms with special characters." This would happen by default. Users must be aware that any diacritics must be entered as part of the search term.

  4. Is exclusion of blocked documents a constant filter? Or is that a user-configurable option?

  5. "The Spanish media documents would have a TranslationOf element in the document. In some cases where there is no Spanish media doc., there could be caption and content description elements that have been translated and marked as ES in the English media doc." It's not clear what the purpose of the second sentence in this excerpt from the requirement is supposed to convey. Does it mean that when we are instructed to report on the English Media documents we are to omit Spanish captions and content descriptions from the elements which we search?
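
The trade-off raised in question 2 can be illustrated with a minimal sketch. This is not the Standard Wording report's actual code; the helper names `tokenize` and `phrase_matches` are hypothetical, and the sketch only assumes that matching is done word by word on tokenized text.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(phrase, text):
    """Return True if the words of `phrase` occur consecutively in `text`."""
    needle = tokenize(phrase)
    haystack = tokenize(text)
    if not needle:
        return False
    width = len(needle)
    return any(
        haystack[i:i + width] == needle
        for i in range(len(haystack) - width + 1)
    )


# A naive substring test reports a false positive.
print("breast" in "two men walking abreast")                # True
# Word-by-word matching on tokens does not.
print(phrase_matches("breast", "two men walking abreast"))  # False
print(phrase_matches("breast cancer", "Stage II breast cancer; drawing"))  # True
```

A wildcard that can stand for any run of characters, spaces included, cannot be evaluated against individual tokens, which is why supporting it would mean giving up this kind of matching.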

Comment entered 2021-04-07 11:49:54 by Kline, Bob (NIH/NCI) [C]

Ability to specify processing status as part of search criteria.

This requirement is somewhat ambiguous. Can you confirm that if the user specifies a processing status value, all documents which have that value will be matched, regardless of what other status values are present (bearing in mind that all status values are captured indefinitely, not necessarily in the order in which they were entered, and that some have identical dates)? For example, see the attached screenshot (image-2021-04-07-11-41-15-661.png).

Also, should the processing status selection be applied when the user specifies a title fragment or document ID?

Comment entered 2021-04-07 11:50:55 by Kline, Bob (NIH/NCI) [C]

Also, please see the questions I posted to the ticket late last month.

Comment entered 2021-04-07 12:49:41 by Osei-Poku, William (NIH/NCI) [C]


This requirement is somewhat ambiguous. Can you confirm that if the user specifies a processing status value, all documents which have that value will be matched, regardless of what other status values are present (bearing in mind that all status values are captured indefinitely, not necessarily in the order in which the values were entered, and some of which have identical dates). 

The active processing status is the topmost processing status block. So, when a user specifies a processing status, look for the status in only the topmost block in cases where there are multiple status blocks. The remaining blocks are for historical purposes only. 


Also, should the processing status selection be applied when the user specifies a title fragment or document ID?

Yes, the processing status selection should be applied.
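
The "topmost block only" rule described above could be sketched as follows. The element names (`ProcessingStatuses`, `ProcessingStatus`, `ProcessingStatusValue`), the sample status values, and the helper names are assumptions made for illustration, not taken from the report's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical Media document fragment with two status blocks.
SAMPLE = """\
<Media>
  <ProcessingStatuses>
    <ProcessingStatus>
      <ProcessingStatusValue>Ready for review</ProcessingStatusValue>
    </ProcessingStatus>
    <ProcessingStatus>
      <ProcessingStatusValue>In progress</ProcessingStatusValue>
    </ProcessingStatus>
  </ProcessingStatuses>
</Media>
"""


def active_status(root):
    """Return the status value of the topmost status block, or None."""
    # find() returns the first matching element in document order,
    # which is the topmost block; older blocks are ignored.
    block = root.find("ProcessingStatuses/ProcessingStatus")
    if block is None:
        return None
    value = block.findtext("ProcessingStatusValue")
    return value.strip() if value else None


def has_active_status(root, wanted):
    """True only if the topmost block carries the requested status."""
    return active_status(root) == wanted


root = ET.fromstring(SAMPLE)
print(has_active_status(root, "Ready for review"))  # True
print(has_active_status(root, "In progress"))       # False (historical only)
```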

Comment entered 2021-04-07 12:59:30 by Osei-Poku, William (NIH/NCI) [C]

1. Could you be more precise for "surrounding text" and "where applicable" in "show the surrounding text of the term or phrase where applicable"; how much of the surrounding text is to be shown? Do we need to reach over to neighboring (but non-searched) elements to find surrounding text? How does the software know when it is "applicable" to show the surrounding text? There is no mention in the requirements of any use of rich text to distinguish the search terms from the surrounding text. Does this mean that we can avoid the extra cost of supporting rich text in the HTML and Excel versions of the report?

There is no need to reach out to surrounding elements that are not searched. The searched elements usually don't contain a lot of text, so if it is easier to show all the text, I think that should be fine. In terms of applicability, I was thinking more of the labels, which I suspect would not have a lot of surrounding text. So, as I have said above, I think it is okay to show all the text within the searched element, with the searched term highlighted within the retrieved text.

2. The requirements say the report should support wildcards in search terms "similarly to what is done for the standard wording report." I don't see such support in that report. In that report we tokenize the words in the text being searched so that we can avoid false positives picked up by a more naïve substring match (for example, we don't want to report a match on "breast" in the text "two men walking abreast"). If we introduce support for wildcards which can stand for any sequence of intermediate characters, we would need to abandon this tokenizing, word-by-word matching, since some of those intermediate characters matched by the wildcards could be spaces. Is that what you really want?

Please ignore this request.

3. "Ability to search terms with special characters." This would happen by default. Users must be aware that any diacritics must be entered as part of the search term.

OK


4. Is exclusion of blocked documents a constant filter? Or is that a user-configurable option?

It should be a constant filter.


5. "The Spanish media documents would have a TranslationOf element in the document. In some cases where there is no Spanish media doc., there could be caption and content description elements that have been translated and marked as ES in the English media doc." It's not clear what the purpose of the second sentence in this excerpt from the requirement is supposed to convey. Does it mean that when we are instructed to report on the English Media documents we are to omit Spanish captions and content descriptions from the elements which we search?

Please use only the TranslationOf element to identify the Spanish media docs.
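
The answer to question 1 (show all of the searched element's text, with the term highlighted) might look roughly like this for the HTML version of the report. This is a sketch under assumptions: the `highlight` helper and the choice of `<strong>` markup are hypothetical, and the Excel version would need its own rich-text handling.

```python
import html
import re


def highlight(text, term):
    """Return `text` as HTML with whole-word matches of `term` in <strong>."""
    if not term:
        return html.escape(text)
    # The lookarounds keep "breast" from matching inside "abreast".
    pattern = re.compile(
        r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
    )
    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        pieces.append(html.escape(text[pos:match.start()]))
        pieces.append("<strong>" + html.escape(match.group(0)) + "</strong>")
        pos = match.end()
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)


print(highlight("Prostate cancer stage I; shows the prostate & bladder.",
                "prostate"))
```

Escaping the text before adding markup keeps characters such as `&` or `<` in a caption from breaking the generated page.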

Comment entered 2021-04-07 13:06:13 by Kline, Bob (NIH/NCI) [C]

OK, so we'll have to parse all of the documents instead of using the query_term table.
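
A minimal sketch of classifying a parsed Media document by language using only the TranslationOf element, as requested above. The sample documents, the `language` attribute on the caption, and the `is_spanish` helper are assumptions for illustration; the real document structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical Spanish Media document: carries a TranslationOf element.
SPANISH_DOC = (
    "<Media><MediaTitle>anatomia del pancreas</MediaTitle>"
    "<TranslationOf>CDR0000000001</TranslationOf></Media>"
)

# Hypothetical English Media document that happens to hold a Spanish
# caption; it has no TranslationOf element, so it still counts as English.
ENGLISH_DOC = (
    "<Media><MediaTitle>pancreas anatomy</MediaTitle>"
    '<MediaCaption language="es">Anatomia del pancreas</MediaCaption>'
    "</Media>"
)


def is_spanish(xml_text):
    """True if the document contains a TranslationOf element anywhere."""
    root = ET.fromstring(xml_text)
    return next(root.iter("TranslationOf"), None) is not None


print(is_spanish(SPANISH_DOC))  # True
print(is_spanish(ENGLISH_DOC))  # False
```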

Comment entered 2021-04-08 12:20:14 by Kline, Bob (NIH/NCI) [C]

Implemented on CDR DEV.

Comment entered 2021-04-29 13:25:15 by Osei-Poku, William (NIH/NCI) [C]

When I enter a CDR ID and a term to search, I get the attached error message: CDR0000795013  Term: Prostate

Comment entered 2021-05-05 10:30:46 by Kline, Bob (NIH/NCI) [C]

Typo fixed.

Comment entered 2021-05-10 13:43:18 by Osei-Poku, William (NIH/NCI) [C]

In some cases, it looks like the report searches mp3 files. Please exclude media files that are mp3 files from the searches.

Comment entered 2021-05-10 14:24:29 by Kline, Bob (NIH/NCI) [C]

I assume you mean that the report is searching the Media documents associated with MP3 files, not the property strings embedded in the MP3 binaries themselves. Why wouldn't it search in those documents? They are Media documents. Unless I missed something in the original requirements, and they are being changed to narrow the scope of the report, please put that request into a ticket for the next release.

Comment entered 2021-05-12 16:16:51 by Osei-Poku, William (NIH/NCI) [C]

That is right. There was no requirement to limit it to only images. I will create another ticket with this requirement for a future release.

Comment entered 2021-05-25 13:46:00 by Osei-Poku, William (NIH/NCI) [C]

When a title is provided (instead of a CDR ID), the search returns no results but when the CDR ID of the media document is provided the correct results are displayed. Please see attached screenshots. 

Comment entered 2021-05-25 14:35:06 by Kline, Bob (NIH/NCI) [C]

What happens when you (a) provide the entire title or (b) add a wildcard following your title fragment?

Comment entered 2021-05-25 15:23:04 by Osei-Poku, William (NIH/NCI) [C]

The full title "pancreas anatomy, adolescent" doesn't work, but wildcards appear to work when placed after the first word of the title, with the rest of the title deleted. Using wildcards is fine, but I am not sure why using the full title doesn't work if you have a term with more than one word.

Comment entered 2021-05-25 15:25:11 by Osei-Poku, William (NIH/NCI) [C]

Here is a screenshot of one example that does not give me anything. 

Comment entered 2021-05-25 16:27:25 by Kline, Bob (NIH/NCI) [C]

Here are the full document titles of the documents you're looking for.

  • pancreas anatomy, adolescent;anatomy;JPEG

  • prostate cancer stage I;staging;JPEG

Comment entered 2021-05-25 16:48:18 by Osei-Poku, William (NIH/NCI) [C]

Yes, those are the doc titles, but what we would search for is the media title:

"pancreas anatomy, adolescent"

"prostate cancer stage I"

Could you modify it to search the media title instead of the doc title, or to search the media title in addition to the doc title?

Comment entered 2021-05-25 18:03:49 by Kline, Bob (NIH/NCI) [C]

Requested modification installed on DEV and QA.
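
The difference between the two titles can be shown with a short sketch. It assumes that `%` is the wildcard character (as in SQL LIKE) and that a pattern without wildcards must match a whole title; the helper names are hypothetical and this is not the installed code.

```python
import re


def like_to_regex(pattern):
    """Convert a pattern using % as a wildcard into a compiled regex."""
    parts = [".*" if ch == "%" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def title_matches(pattern, doc_title, media_title):
    """True if the pattern matches either the doc title or the media title."""
    regex = like_to_regex(pattern)
    return any(regex.fullmatch(title) for title in (doc_title, media_title))


doc_title = "pancreas anatomy, adolescent;anatomy;JPEG"
media_title = "pancreas anatomy, adolescent"
wanted = "pancreas anatomy, adolescent"

# Against the doc title alone, the full media title fails without a wildcard.
print(bool(like_to_regex(wanted).fullmatch(doc_title)))         # False
print(bool(like_to_regex(wanted + "%").fullmatch(doc_title)))   # True
# Checking the media title as well makes the plain title work.
print(title_matches(wanted, doc_title, media_title))            # True
```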

Comment entered 2021-05-25 19:09:51 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thanks!

Comment entered 2021-06-08 16:58:58 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2021-06-21 12:23:19 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Comment entered 2021-06-21 12:29:31 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Attachments
File Name Posted User
image-2021-04-07-11-41-15-661.png 2021-04-07 11:41:16 Kline, Bob (NIH/NCI) [C]
media keyword search report_interface_CDR ID_Results.PNG 2021-05-25 13:45:55 Osei-Poku, William (NIH/NCI) [C]
media keyword search report_interface_CDR ID.PNG 2021-05-25 13:45:55 Osei-Poku, William (NIH/NCI) [C]
media keyword search report_interface.PNG 2021-05-25 13:43:32 Osei-Poku, William (NIH/NCI) [C]
Media Keyword search report.PNG 2021-04-29 13:24:35 Osei-Poku, William (NIH/NCI) [C]
Media keyword search report full title.PNG 2021-05-25 15:25:09 Osei-Poku, William (NIH/NCI) [C]
Media Keyword search report mp3.PNG 2021-05-10 13:43:04 Osei-Poku, William (NIH/NCI) [C]
media keyword search report mp3 interface.PNG 2021-05-10 13:43:04 Osei-Poku, William (NIH/NCI) [C]
Media Keywords Search Report_revised.docx 2021-03-31 12:00:54 Osei-Poku, William (NIH/NCI) [C]
Media Keywords Search Report.docx 2020-09-24 10:26:43 Osei-Poku, William (NIH/NCI) [C]