Issue Number | 4900 |
---|---|
Summary | [Media] Keyword search report |
Created | 2020-09-24 09:56:40 |
Issue Type | New Feature |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2021-05-12 16:17:02 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.275466 |
This is a request for a new report that works like the Standard Wording report. It will primarily be used by the Spanish team to get a good sense of how terms have been translated across several documents to ensure consistency. I am attaching a draft requirements document.
~oseipokuw will attach a fresh set of requirements.
We decided to proceed with this request, so I will provide a new requirements document soon.
Media Keywords Search Report_revised.docx
Attached new requirements document.
Could you be more precise for "surrounding text" and "where applicable" in "show the surrounding text of the term or phrase where applicable"; how much of the surrounding text is to be shown? Do we need to reach over to neighboring (but non-searched) elements to find surrounding text? How does the software know when it is "applicable" to show the surrounding text? There is no mention in the requirements of any use of rich text to distinguish the search terms from the surrounding text. Does this mean that we can avoid the extra cost of supporting rich text in the HTML and Excel versions of the report?
The requirements say the report should support wildcards in search terms "similarly to what is done for the standard wording report." I don't see such support in that report. In that report we tokenize the words in the text being searched so that we can avoid false positives picked up by a more naïve substring match (for example, we don't want to report a match on "breast" in the text "two men walking abreast"). If we introduce support for wildcards which can stand for any sequence of intermediate characters, we would need to abandon this tokenizing, word-by-word matching, since some of those intermediate characters matched by the wildcards could be spaces. Is that what you really want?
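To illustrate the difference between the two matching strategies described above, here is a minimal sketch. It is not the Standard Wording report's actual code; the function names `tokenize` and `contains_phrase` are hypothetical, and the tokenizing rule (runs of word characters, case-insensitive) is an assumption.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (runs of word characters)."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(text, phrase):
    """Return True if the tokens of `phrase` occur consecutively in `text`."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

text = "two men walking abreast"

# Naive substring matching reports a false positive here.
substring_hit = "breast" in text.lower()

# Token-by-token matching does not, because "abreast" is a single token.
token_hit = contains_phrase(text, "breast")
```

A wildcard that can stand for any run of characters, spaces included, cannot be checked one token at a time, which is why the two approaches conflict.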
"Ability to search terms with special characters." This would happen by default. Users must be aware that any diacritics must be entered as part of the search term.
Is exclusion of blocked documents a constant filter? Or is that a user-configurable option?
"The Spanish media documents would have a TranslationOf element in the document. In some cases where there is no Spanish media doc., there could be caption and content description elements that have been translated and marked as ES in the English media doc." It's not clear what the purpose of the second sentence in this excerpt from the requirement is supposed to convey. Does it mean that when we are instructed to report on the English Media documents we are to omit Spanish captions and content descriptions from the elements which we search?
~oseipokuw:
Ability to specify processing status as part of search criteria.
This requirement is somewhat ambiguous. Can you confirm that if the user specifies a processing status value, all documents which have that value will be matched, regardless of what other status values are present (bearing in mind that all status values are captured indefinitely, not necessarily in the order in which the values were entered, and some of which have identical dates). For example:
Also, should the processing status selection be applied when the user specifies a title fragment or document ID?
Also, please see the questions I posted to the ticket late last month.
This requirement is somewhat ambiguous. Can you confirm that if the user specifies a processing status value, all documents which have that value will be matched, regardless of what other status values are present (bearing in mind that all status values are captured indefinitely, not necessarily in the order in which the values were entered, and some of which have identical dates).
The active processing status is the topmost processing status block. So, when a user specifies a processing status, look for the status in only the topmost block in cases where there are multiple status blocks. The remaining blocks are for historical purposes only.
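A minimal sketch of the "topmost block only" rule, assuming the status blocks are stored as `ProcessingStatuses/ProcessingStatus` elements with a `ProcessingStatusValue` child. Those element names, the sample document, and the function names are assumptions for illustration, not taken from the requirements.

```python
import xml.etree.ElementTree as ET

# Made-up sample with two status blocks that share the same date.
SAMPLE = """\
<Media>
  <ProcessingStatuses>
    <ProcessingStatus>
      <ProcessingStatusValue>Approved</ProcessingStatusValue>
      <ProcessingStatusDate>2021-03-01</ProcessingStatusDate>
    </ProcessingStatus>
    <ProcessingStatus>
      <ProcessingStatusValue>Pending review</ProcessingStatusValue>
      <ProcessingStatusDate>2021-03-01</ProcessingStatusDate>
    </ProcessingStatus>
  </ProcessingStatuses>
</Media>
"""

def active_status(root):
    """Return the status value of the first (topmost) block, or None."""
    # find() returns the first matching element in document order.
    block = root.find("ProcessingStatuses/ProcessingStatus")
    if block is None:
        return None
    value = block.findtext("ProcessingStatusValue")
    return value.strip() if value else None

def matches_status(root, wanted):
    """True only if the topmost block carries the requested status."""
    return active_status(root) == wanted

root = ET.fromstring(SAMPLE)
```

With this rule, a document whose older blocks carry the requested status is not matched, and dates play no part in the decision.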
Also, should the processing status selection be applied when the user specifies a title fragment or document ID?
Yes, the processing status selection should be applied.
1. Could you be more precise for "surrounding text" and "where applicable" in "show the surrounding text of the term or phrase where applicable"; how much of the surrounding text is to be shown? Do we need to reach over to neighboring (but non-searched) elements to find surrounding text? How does the software know when it is "applicable" to show the surrounding text? There is no mention in the requirements of any use of rich text to distinguish the search terms from the surrounding text. Does this mean that we can avoid the extra cost of supporting rich text in the HTML and Excel versions of the report?
There is no need to reach out to surrounding elements that are not searched. The searched elements usually don't contain a lot of text, so if it is easier to show all the text, I think that should be fine. In terms of the applicability, I was thinking more in terms of the labels, which I suspect would not have a lot of surrounding text. So, as I have said above, I think it is okay to show all the text within the searched element while highlighting the searched term within the retrieved text.
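One way the HTML version could show the full element text with the searched term highlighted is sketched below. This is only an illustration: the function name `highlight` and the choice of `<strong>` for the markup are assumptions, and the Excel version would need a different rich-text mechanism.

```python
import html
import re

def highlight(text, phrase):
    """Return HTML for `text` with whole-word matches of `phrase` in <strong>."""
    words = re.findall(r"\w+", phrase)
    if not words:
        return html.escape(text)
    # Match the words of the phrase in order, separated by any run of
    # non-word characters, and require word boundaries at both ends.
    pattern = re.compile(
        r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b",
        re.IGNORECASE,
    )
    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        pieces.append(html.escape(text[pos:match.start()]))
        pieces.append("<strong>" + html.escape(match.group()) + "</strong>")
        pos = match.end()
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)
```

The surrounding text is escaped piece by piece so that characters such as `&` in a caption cannot break the report's markup.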
2. The requirements say the report should support wildcards in search terms "similarly to what is done for the standard wording report." I don't see such support in that report. In that report we tokenize the words in the text being searched so that we can avoid false positives picked up by a more naïve substring match (for example, we don't want to report a match on "breast" in the text "two men walking abreast"). If we introduce support for wildcards which can stand for any sequence of intermediate characters, we would need to abandon this tokenizing, word-by-word matching, since some of those intermediate characters matched by the wildcards could be spaces. Is that what you really want?
Please ignore this request.
3. "Ability to search terms with special characters." This would happen by default. Users must be aware that any diacritics must be entered as part of the search term.
OK
4. Is exclusion of blocked documents a constant filter? Or is that a user-configurable option?
It should be a constant filter.
5. "The Spanish media documents would have a TranslationOf element in the document. In some cases where there is no Spanish media doc., there could be caption and content description elements that have been translated and marked as ES in the English media doc." It's not clear what the purpose of the second sentence in this excerpt from the requirement is supposed to convey. Does it mean that when we are instructed to report on the English Media documents we are to omit Spanish captions and content descriptions from the elements which we search?
Please use only the TranslationOf element to identify the Spanish media docs.
OK, so we'll have to parse all of the documents instead of using the query_term table.
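A rough sketch of what identifying the Spanish documents by parsing might look like, assuming each document's XML is already in hand as a string. The function name `is_spanish`, the sample documents, and the placeholder ID are made up for illustration; how the XML is fetched from the repository is not shown.

```python
import xml.etree.ElementTree as ET

def is_spanish(doc_xml):
    """Treat a Media document as Spanish if it has a TranslationOf element."""
    root = ET.fromstring(doc_xml)
    # ".//" searches all descendants, wherever the element sits.
    return root.find(".//TranslationOf") is not None

# Made-up sample documents, not real CDR content.
ENGLISH = "<Media><MediaTitle>prostate cancer stage I</MediaTitle></Media>"
SPANISH = (
    "<Media><MediaTitle>sample</MediaTitle>"
    "<TranslationOf>CDR0000000000</TranslationOf></Media>"
)

docs = {"english": ENGLISH, "spanish": SPANISH}
spanish_keys = [key for key, xml in docs.items() if is_spanish(xml)]
```

Under this rule, an English document that merely contains captions marked as ES is still treated as English, since only the TranslationOf element is consulted.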
Implemented on CDR DEV.
When I enter a CDR ID and a term to search, I get the attached error message: CDR0000795013 Term: Prostate
Typo fixed.
In some cases, it looks like the report searches mp3 files. Please exclude media files that are mp3 files from the searches.
I assume you mean that the report is searching the Media documents associated with MP3 files, not the property strings embedded in the MP3 binaries themselves. Why wouldn't it search in those documents? They are Media documents. Unless I missed something in the original requirements, and they are being changed to narrow the scope of the report, please put that request into a ticket for the next release.
That is right. There was no requirement to limit it to only images. I will create another ticket with this requirement for a future release.
When a title is provided (instead of a CDR ID), the search returns no results but when the CDR ID of the media document is provided the correct results are displayed. Please see attached screenshots.
What happens when you (a) provide the entire title or (b) add a wildcard following your title fragment?
The full title "pancreas anatomy, adolescent" doesn't work, but a wildcard appears to work when it is placed after the first word of the title and the rest of the title is deleted. Using wildcards is fine, but I am not sure why the full title doesn't work when it has more than one word.
Here is a screenshot of one example that does not give me anything.
Here are the full document titles of the documents you're looking for.
pancreas anatomy, adolescent;anatomy;JPEG
prostate cancer stage I;staging;JPEG
Yes, those are the doc titles, but what we would search is the media title:
"pancreas anatomy, adolescent"
"prostate cancer stage I"
Could you modify it to search the media title instead of the doc title, or to search the media title in addition to the doc title?
Requested modification installed on DEV and QA.
Verified on DEV. Thanks!
Verified on QA. Thanks!
Verified on PROD. Thanks!
Verified on PROD. Thanks!
File Name | Posted | User |
---|---|---|
image-2021-04-07-11-41-15-661.png | 2021-04-07 11:41:16 | Kline, Bob (NIH/NCI) [C] |
media keyword search report_interface_CDR ID_Results.PNG | 2021-05-25 13:45:55 | Osei-Poku, William (NIH/NCI) [C] |
media keyword search report_interface_CDR ID.PNG | 2021-05-25 13:45:55 | Osei-Poku, William (NIH/NCI) [C] |
media keyword search report_interface.PNG | 2021-05-25 13:43:32 | Osei-Poku, William (NIH/NCI) [C] |
Media Keyword search report.PNG | 2021-04-29 13:24:35 | Osei-Poku, William (NIH/NCI) [C] |
Media keyword search report full title.PNG | 2021-05-25 15:25:09 | Osei-Poku, William (NIH/NCI) [C] |
Media Keyword search report mp3.PNG | 2021-05-10 13:43:04 | Osei-Poku, William (NIH/NCI) [C] |
media keyword search report mp3 interface.PNG | 2021-05-10 13:43:04 | Osei-Poku, William (NIH/NCI) [C] |
Media Keywords Search Report_revised.docx | 2021-03-31 12:00:54 | Osei-Poku, William (NIH/NCI) [C] |
Media Keywords Search Report.docx | 2020-09-24 10:26:43 | Osei-Poku, William (NIH/NCI) [C] |