PDQ Issues

Issue Number	4633
Summary	[Glossary-Media] Re-use existing media docs
Created	2019-06-19 22:12:03
Issue Type	New Feature
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2020-06-11 16:48:16
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.245711

Description

It is now possible to have terms that are being re-recorded (OCECDR-4507), that is, terms that previously had audio pronunciation media docs which were removed from the terms because the terms were revised. Our preference is to re-use the existing media docs for these terms (saving a new version on top of the existing media doc) instead of creating new media docs.

Comment entered 2019-06-19 22:14:24 by Osei-Poku, William (NIH/NCI) [C]

Comment in OCECDR-4507 from Bob:

"For the last request above (reuse of an existing media document), that happens in a different program, so you'll want to create a different ticket (added in Kepler). Since the audio link is no longer present, you'll need to come up with a way to specify the CDR ID of the existing document you want to use for the sound clip in the spreadsheet fed to that program"

Comment entered 2019-06-19 22:16:54 by Osei-Poku, William (NIH/NCI) [C]

Is it possible to get the CDR ID from a previous version of the document? If that is not possible, then we can probably record the CDR ID in a new element before we delete the link.

Comment entered 2019-06-20 05:18:09 by Kline, Bob (NIH/NCI) [C]

I recommend putting the ID in the spreadsheet where the import is requested.

Comment entered 2019-06-20 09:48:31 by Osei-Poku, William (NIH/NCI) [C]

Sure. The CDR IDs should be present in the spreadsheet that is uploaded with the recorded files to the FTP site, right?

Comment entered 2019-06-20 10:04:41 by Kline, Bob (NIH/NCI) [C]

Right. We'll need to agree on which column the software should look in to find the value.

Comment entered 2020-04-13 13:03:04 by Kline, Bob (NIH/NCI) [C]

I propose the following expected column positions for the audio spreadsheets:

column A - CDR ID for TermName document (required integer)
column B - term name string to be pronounced (required)
column C - English or Spanish (required string)
column D - optional pronunciation string
column E - relative path name for MP3 file (required when import stage has been reached)
column F - notes from Vanessa (optional string)
column G - notes from NCI (optional string)
column H - CDR ID for Media document to be updated (optional integer)

Three questions:

Can we agree on this column positioning?
Should the import software abort if it encounters an ID in column H for a non-Media document?
Should the import software abort if an ID is given in column H for which no CDR document exists?

Comment entered 2020-04-13 16:46:31 by Osei-Poku, William (NIH/NCI) [C]

Column H in the current spreadsheet is labeled "Reuse Media ID". I assume what you're proposing is for the same use. If that is the case, can we retain "Reuse Media ID"?

Comment entered 2020-04-13 17:29:23 by Kline, Bob (NIH/NCI) [C]

So you're suggesting that I find the CDR ID rather than have the spreadsheet provide it? If so, what would be the drawback to having the software always re-use the existing Media document's ID whenever it finds a media link for a given term name string? That way the new column wouldn't be needed at all.

Comment entered 2020-04-14 12:03:35 by Osei-Poku, William (NIH/NCI) [C]

The spreadsheet will still provide the CDR ID of the terms being re-recorded. I think we already agreed that CIAT will use the column to provide the CDR IDs of the terms that are being re-recorded. We had also agreed on what to name the column, in ticket OCECDR-4507:

"Bob will add a column to the spreadsheet with the heading "Reuse Media ID". This will be used to enter the media IDs for any pronunciations that are being re-recorded so that Bob doesn't create new media docs for them."

So my request is to get clarification on that and possibly keep the original column name.

I don't understand your question about the drawback since there wouldn't be a Media Link for the term as stated in the initial request above. The first step in the re-recording process is removing the media link and getting the term on the spreadsheet so if you can clarify what you want the software to do in terms of finding the media link, that would be great.

Comment entered 2020-04-14 14:50:42 by Kline, Bob (NIH/NCI) [C]

Since I hadn't said anything in this ticket about changing the column labels, I thought you were asking if the last column could be repurposed to take a yes/no flag indicating that we should "reuse the media ID" instead of just entering the ID. If that's not the case, and you're sticking with the original plan of stripping the link from the documents as the first step, then let's roll back to my 1:03pm Monday comment and its three questions:

Can we agree on this column positioning?
Should the import software abort if it encounters an ID in column H for a non-Media document?
Should the import software abort if an ID is given in column H for which no CDR document exists?

Comment entered 2020-04-14 18:20:47 by Osei-Poku, William (NIH/NCI) [C]

Yes to all 3 questions.
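
The agreed column positions and the two abort rules can be sketched roughly as follows. This is only an illustration, not the import script itself: the names `AudioRow`, `parse_row`, `check_reuse_id`, and the `lookup_doctype` callable are hypothetical, and stripping whitespace from cell values is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class AudioRow:
    term_doc_id: int                # column A (required)
    name: str                       # column B (required)
    language: str                   # column C (required)
    pronunciation: Optional[str]    # column D
    mp3_path: Optional[str]         # column E (required at import time)
    vendor_notes: Optional[str]     # column F
    nci_notes: Optional[str]        # column G
    reuse_media_id: Optional[int]   # column H ("Reuse Media ID")


def _text(value) -> Optional[str]:
    """Return the cell as a stripped string, or None if it is empty."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def _cdr_id(value, column: str) -> Optional[int]:
    """Accept 12345, 12345.0 or '12345'; reject anything non-integral."""
    text = _text(value)
    if text is None:
        return None
    try:
        number = float(text)
    except ValueError:
        raise ValueError(f"column {column}: {value!r} is not an integer") from None
    if not number.is_integer():
        raise ValueError(f"column {column}: {value!r} is not an integer")
    return int(number)


def parse_row(cells: Sequence, importing: bool = False) -> AudioRow:
    """Map one spreadsheet row (columns A-H) onto an AudioRow."""
    cells = list(cells) + [None] * (8 - len(cells))  # pad short rows
    term_doc_id = _cdr_id(cells[0], "A")
    name = _text(cells[1])
    language = _text(cells[2])
    mp3_path = _text(cells[4])
    if term_doc_id is None:
        raise ValueError("column A: term document CDR ID is required")
    if name is None:
        raise ValueError("column B: term name string is required")
    if language not in ("English", "Spanish"):
        raise ValueError(f"column C: unexpected language {language!r}")
    if importing and mp3_path is None:
        raise ValueError("column E: MP3 path is required at import time")
    return AudioRow(term_doc_id, name, language, _text(cells[3]), mp3_path,
                    _text(cells[5]), _text(cells[6]), _cdr_id(cells[7], "H"))


def check_reuse_id(media_id: int,
                   lookup_doctype: Callable[[int], Optional[str]]) -> None:
    """Abort if the column H ID is missing from the CDR or is not a Media doc.

    lookup_doctype is a hypothetical callable returning the document type
    name for an ID, or None when no such document exists.
    """
    doctype = lookup_doctype(media_id)
    if doctype is None:
        raise ValueError(f"column H: no CDR document with ID {media_id}")
    if doctype != "Media":
        raise ValueError(f"column H: document {media_id} is {doctype}, not Media")
```

Raising an exception for either column H problem matches the "abort" answers above; the caller would decide how to report the failure.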

Comment entered 2020-04-16 08:35:29 by Kline, Bob (NIH/NCI) [C]

Enhancement (not sure why this was entered as a bug report, as I'm certain this was not part of the original requirements given to Alan) installed on DEV.

Comment entered 2020-04-16 09:42:07 by Osei-Poku, William (NIH/NCI) [C]

I have corrected this from a bug report to request for a new feature. Thanks!

Comment entered 2020-05-14 13:06:45 by Osei-Poku, William (NIH/NCI) [C]

On the Audio Review of Glossary Terms page, it looks like the Review Note field has been populated with the CDR IDs of the Media docs. We use this field for a different purpose.

Comment entered 2020-05-14 15:19:52 by Kline, Bob (NIH/NCI) [C]

Aargh! Jira just threw away a very long comment in which I explained what caused this problem and how I fixed it.

TL;DR: it had to do with how the script handled variations in the structure of the spreadsheets. Should be OK now.

Comment entered 2020-05-18 13:20:10 by Osei-Poku, William (NIH/NCI) [C]

We've finished reviewing Week_701.zip and are ready to import, but when I run the Audio Import report, that file does not show up. Only Week_800.zip shows up, and it continues to show up even after I submit it.

Comment entered 2020-05-18 13:47:50 by Kline, Bob (NIH/NCI) [C]

I'm pretty sure the requirements for the import script have always been to deal with the latest set ("week" — though that's something of a misnomer) only. So I believe the script is behaving the way it's supposed to. If you have found requirements docs which say otherwise, please let me know.

Comment entered 2020-05-18 15:26:29 by Osei-Poku, William (NIH/NCI) [C]

~volker Is it possible for you to remove the Week_800.zip file from the FTP or CDR server? Or is it too late now? Thanks!

Comment entered 2020-05-18 15:39:58 by Englisch, Volker (NIH/NCI) [C]

I'm able to remove the file from the FTP server and I would be able to remove it from the CDR server if we're talking DEV or QA tier.

Which tier are you looking at? Is this for DEV?

Comment entered 2020-05-18 15:50:44 by Osei-Poku, William (NIH/NCI) [C]

This is on DEV. Thanks!

Comment entered 2020-05-18 15:56:15 by Kline, Bob (NIH/NCI) [C]

Removed from the CDR DEV server.

Comment entered 2020-05-18 16:21:45 by Osei-Poku, William (NIH/NCI) [C]

It worked. Thank you!

Comment entered 2020-05-19 21:06:28 by Osei-Poku, William (NIH/NCI) [C]

The import was successful. However, several terms were skipped (please see attached). I am not sure why they were skipped.

Comment entered 2020-05-20 08:26:39 by Kline, Bob (NIH/NCI) [C]

It skips over glossary docs which already have at least one media link.

Comment entered 2020-05-20 09:02:18 by Osei-Poku, William (NIH/NCI) [C]

The recording is for the other block within the same doc, the one without a media link, so it shouldn't be skipped.

Comment entered 2020-05-20 09:23:42 by Kline, Bob (NIH/NCI) [C]

The software has always prevented updating glossary docs which have any existing media links. I'll see how tricky it will be to modify that behavior.

Comment entered 2020-05-20 17:53:23 by Kline, Bob (NIH/NCI) [C]

This has taken long enough that I should have had you file another ticket to change that behavior in Maxwell, but I've already done the hard work, so let's just proceed. I have removed the code to block processing of glossary documents which already have any existing media links. What do you want the software to do if the spreadsheet asks to insert a media link for a name string block which already has one?

fail the job?
replace the old link?
add a second link (making the document invalid)?
don't do anything with the link?

Comment entered 2020-05-20 21:32:47 by Osei-Poku, William (NIH/NCI) [C]

That would be an error so the job should fail - #1.
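
The behavior settled on here (other name blocks may already have links, but a second link for the same name string block fails the job) might look something like the sketch below. It is not the actual import code. The element names `TermName`, `TranslatedName`, and `TermNameString`, the `ref` attribute, and the `DuplicateMediaLink` exception are assumptions made for illustration; only `MediaLink` is named in this ticket.

```python
import xml.etree.ElementTree as ET

# Assumed names of the blocks that hold a name string (illustrative only).
NAME_BLOCKS = ("TermName", "TranslatedName")


class DuplicateMediaLink(Exception):
    """Raised when the target name block already has a media link."""


def add_media_link(root, name, media_id):
    """Attach a MediaLink to the block whose name string equals `name`.

    Links on other blocks in the same document are ignored; an existing
    link on the matching block is treated as an error (option 1).
    """
    for block in root.iter():
        if block.tag not in NAME_BLOCKS:
            continue
        if block.findtext("TermNameString") != name:
            continue
        if block.find("MediaLink") is not None:
            raise DuplicateMediaLink(f"{name!r} already has a media link")
        link = ET.SubElement(block, "MediaLink")
        link.set("ref", str(media_id))  # placeholder for the real link format
        return block
    raise LookupError(f"unable to find home for {name!r}")
```

The caller would let `DuplicateMediaLink` propagate so that the whole job fails, rather than catching it and moving on to the next row.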

Thank you!

Comment entered 2020-05-21 07:10:10 by Kline, Bob (NIH/NCI) [C]

OK, I have made the modification. Please test.

Comment entered 2020-05-21 20:21:33 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thank you!

Comment entered 2020-06-09 18:15:57 by Osei-Poku, William (NIH/NCI) [C]

The files appear to have successfully downloaded on QA but we are not able to listen to the audio files on QA. The error message is "There is no item named 'Week_701/44403_en.mp3' in the archive" for each of the terms.

Comment entered 2020-06-09 18:46:11 by Kline, Bob (NIH/NCI) [C]

That's a data problem. Too many levels in the path for the files.

$ unzip -l Week_701.zip
Archive:  Week_701.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2020-06-09 14:21   Week_701/Week_701/
    33843  2020-04-30 18:19   Week_701/Week_701/269467_en.mp3
    22942  2020-04-28 12:28   Week_701/Week_701/269467_es.mp3
    19944  2020-04-30 08:28   Week_701/Week_701/306524_es.mp3
    29776  2020-04-30 08:28   Week_701/Week_701/439317_es.mp3
    38804  2020-04-30 08:28   Week_701/Week_701/439400_es.mp3
    22199  2020-05-21 12:10   Week_701/Week_701/44386_en.mp3
    26587  2020-06-09 11:09   Week_701/Week_701/44386_es.mp3
    28757  2020-04-30 18:16   Week_701/Week_701/44403_en.mp3
    29750  2020-04-30 18:16   Week_701/Week_701/44404_en.mp3
    31080  2020-04-30 18:16   Week_701/Week_701/44449_en.mp3
    28886  2020-04-30 18:16   Week_701/Week_701/44450_en.mp3
    48530  2020-06-08 14:17   Week_701/Week_701/44985_en.mp3
    21989  2020-06-09 11:09   Week_701/Week_701/44985_es.mp3
    38173  2020-04-28 12:28   Week_701/Week_701/455146_es.mp3
    52469  2020-04-30 18:16   Week_701/Week_701/471770_en.mp3
    35776  2020-04-28 12:28   Week_701/Week_701/523437_es.mp3
    30871  2020-04-30 18:16   Week_701/Week_701/642120_en.mp3
    16548  2020-04-28 12:28   Week_701/Week_701/648947_es.mp3
    25819  2020-04-30 18:18   Week_701/Week_701/713798_en.mp3
    17127  2020-04-28 12:28   Week_701/Week_701/713798_es.mp3
    40940  2020-04-30 18:18   Week_701/Week_701/761601_en.mp3
    31313  2020-04-28 12:28   Week_701/Week_701/761601_es.mp3
    52551  2020-04-30 18:15   Week_701/Week_701/793179_en.mp3
    39193  2020-06-09 13:59   Week_701/Week_701/798756_es.mp3
    28184  2020-06-09 13:59   Week_701/Week_701/798765_es.mp3
    17952  2020-06-09 14:21   Week_701/Week_701/Week_701.xlsx
---------                     -------
   810003                     27 files
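
A check along these lines could catch the extra directory level before a zip file is posted. It is a sketch under the assumption, taken from the error message above, that every file should sit exactly one directory deep (for example `Week_701/44403_en.mp3`); the function name `misplaced_members` is hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def misplaced_members(archive, expected_dir):
    """Return the names of files that are not directly under expected_dir."""
    bad = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no audio or spreadsheet
            parts = PurePosixPath(info.filename).parts
            if len(parts) != 2 or parts[0] != expected_dir:
                bad.append(info.filename)
    return bad


# Demonstration with an in-memory archive that repeats the problem above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Week_701/Week_701/44403_en.mp3", b"")
    zf.writestr("Week_701/44404_en.mp3", b"")
bad = misplaced_members(buffer, "Week_701")
print(bad)
```

An empty list means the layout matches what the audio review page expects to find in the archive.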

Comment entered 2020-06-10 09:14:00 by Osei-Poku, William (NIH/NCI) [C]

We've fixed the problem and sent you a new zip file by email. Could you please remove the existing one and reload this new file? Thanks!

Comment entered 2020-06-10 10:18:47 by Kline, Bob (NIH/NCI) [C]

Done.

Comment entered 2020-06-10 10:56:17 by Osei-Poku, William (NIH/NCI) [C]

Unfortunately, we are getting the same error message.

"There is no item named 'Week_701/44403_en.mp3' in the archive"

Comment entered 2020-06-10 12:55:38 by Kline, Bob (NIH/NCI) [C]

I've put it up there again, and I confirmed that the directories have been fixed. Please try it one more time (and we'll hope the database doesn't have the old paths stuck somewhere).

Comment entered 2020-06-10 16:22:16 by Osei-Poku, William (NIH/NCI) [C]

It worked. Thank you!

Comment entered 2020-06-10 16:32:30 by Osei-Poku, William (NIH/NCI) [C]

The import job worked for the most part. However, audio files were not linked to the terms listed below because the program was interrupted by CDR648947, which already had a MediaLink (our error). We corrected this by removing the MediaLink, and then I ran the import job again. That run created duplicates of some of the audio files but still did not link them to the remaining terms. So while the program works, it doesn't look like I can successfully re-run it after an error like this one has been corrected.

713798
713798
761601
761601
798756
798765

Comment entered 2020-06-10 16:52:11 by Kline, Bob (NIH/NCI) [C]

Add a Maxwell ticket to handle that condition.

Comment entered 2020-06-10 17:48:52 by Osei-Poku, William (NIH/NCI) [C]

Okay. Will do.

Comment entered 2020-06-10 17:49:23 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2020-06-11 11:06:24 by Kline, Bob (NIH/NCI) [C]

Week_702.zip is on the SFTP server.

Comment entered 2020-06-11 15:55:17 by Osei-Poku, William (NIH/NCI) [C]

Thanks! It worked. However, we can't figure out why the two terms below failed during import:

CDR798756 unable to find home for 'análisis citogenético ' in CDR798756
CDR798765 unable to find home for 'cateterismo cardíaco ' in CDR798765

Comment entered 2020-06-11 16:48:04 by Kline, Bob (NIH/NCI) [C]

That's because 'análisis citogenético ' (note the extra space in the error message) doesn't match 'análisis citogenético' (same for the other term).
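
For illustration only: in this ticket the mismatch was resolved in the spreadsheet, and nothing here says the script normalizes names. If a more forgiving comparison were ever wanted, a minimal sketch (with hypothetical helper names `normalize_name` and `names_match`) could trim and collapse whitespace before comparing:

```python
import unicodedata


def normalize_name(name):
    """Trim the ends, collapse runs of whitespace, and apply Unicode NFC."""
    return unicodedata.normalize("NFC", " ".join(name.split()))


def names_match(spreadsheet_name, document_name):
    """Compare two name strings while ignoring stray whitespace."""
    return normalize_name(spreadsheet_name) == normalize_name(document_name)
```

With this, `names_match('análisis citogenético ', 'análisis citogenético')` would be true. Whether silently forgiving such differences is desirable is a separate question, since an exact match also catches genuine typing errors in the spreadsheet.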

Comment entered 2020-06-11 16:49:22 by Osei-Poku, William (NIH/NCI) [C]

So, this is from the spreadsheet, right?

Comment entered 2020-06-11 16:52:35 by Osei-Poku, William (NIH/NCI) [C]

I see that it is from the spreadsheet. Thanks!

Comment entered 2020-06-11 16:52:48 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2020-06-11 16:55:01 by Kline, Bob (NIH/NCI) [C]

yep

Comment entered 2020-07-15 09:19:33 by Kline, Bob (NIH/NCI) [C]

Still testing?

Comment entered 2020-07-16 12:44:40 by Osei-Poku, William (NIH/NCI) [C]

We have not been able to test this on PROD yet. I am closing the ticket now and will reopen it if necessary.

Attachments

File Name	Posted	User
Audio Re-use DEV run.JPG	2020-05-19 21:05:17	Osei-Poku, William (NIH/NCI) [C]
GlossaryReviewPage_Dev.PNG	2020-05-14 13:06:41	Osei-Poku, William (NIH/NCI) [C]

