Issue Number | 3373 |
---|---|
Summary | [Glossary-Media] Automate process of creating audio pronunciation media documents |
Created | 2011-06-03 14:28:01 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2014-11-04 14:58:56 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107701 |
BZISSUE::5066
BZDATETIME::2011-06-03 14:28:01
BZCREATOR::William Osei-Poku
BZASSIGNEE::Bob Kline
BZQACONTACT::William Osei-Poku
In one of the CDR meetings we briefly discussed automating the process of creating Media documents for the audio pronunciation files once the initial 14,000 terms have been completed. I am putting in this issue so that we can discuss the details of implementing this solution.
Meanwhile, I am giving it a priority of P6 since there are other higher
priority issues.
BZDATETIME::2012-03-01 11:43:14
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::1
I think I was supposed to put a comment in here about what I had in mind. At a very high (translate vague) level, it seems that we have all of the pieces in place to do this, and what we need to do is figure out how to turn the process over to CIAT. My idea is that:
1. Each month CIAT could run a report of new/revised glossary terms
without pronunciations and produce a spreadsheet that they could send to
Vanessa.
2. After she does the pronunciation audios and sends the spreadsheet
back, we would use Alan's process to review and mark the
spreadsheet.
3. When the terms were all reviewed, then CIAT could invoke a script (or
something along those lines) that would do what Bob is doing now, that
is, create media docs for each of the audios and link them to the
appropriate glossary term record.
BZDATETIME::2013-05-09 16:58:43
BZCOMMENTOR::Robin Juthe
BZCOMMENT::2
Revising the priority based on the discussion in today's CDR meeting.
BZDATETIME::2013-05-09 17:18:52
BZCOMMENTOR::Robin Juthe
BZCOMMENT::3
Made this P5 after discussing the LOE with Bob.
BZDATETIME::2013-05-09 17:54:08
BZCOMMENTOR::Robin Juthe
BZCOMMENT::4
Oops - now I made this P5.
Looking into this enhancement while the clinical trials search redesign project is on hold.
The interface I'm planning will look something like the attached screen shot. The user will upload the zip files from his/her file system, in the order they were received from Vanessa. This order is important, as an audio clip from a later zip file (presumably to correct a mistake) will supersede one from an earlier file. Initially, the form will contain a single file upload field. The "More" button will add an additional file upload field each time it is clicked.
Does anyone see any problems with this plan?
Another approach would be to use the zip files which were reviewed on the server, instead of having the user upload them from her file system. Variations of this approach would be:
1. The system presents a picklist of all of the zip files found in the D:\cdr\Audio_from_CIPSFTP directory. The user selects one of the files as the first set from which audio clips are to be imported. A button is available to add another picklist, in case there are two or more zip files to be processed.
2. The system assumes that the names of the files use a consistent pattern, and locates the files in the audio directory which have names matching "Week_NNN*.zip" where "NNN" is the highest 3-digit number found in the directory. The set is presented to the user for confirmation, and if confirmation is received, the files are processed in this order: Week_NNN.zip, Week_NNN_Rev1.zip, Week_NNN_Rev2.zip, ....
3. The system collects all sets of files in the audio directory whose names begin with "Week_" followed by a 3-digit number and end in ".zip", grouping the sets so that all files with the same 3-digit number are in the same set. The user selects the set to be processed, and the files are processed in the order described in the previous option.
If we go with this approach (rather than having the user upload the files), I would recommend the second option. It's the simplest to implement, and results in the cleanest interface. In the unlikely event that the contents of the audio directory prevented the software from finding the right zip files using this technique (for example, somehow a file from a future batch was put in the directory before the audio from the current batch had been imported), then we could just fall back on the process we've been using all along, in which one of the developers does the upload from the bastion host.
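For reference, a minimal sketch of how option 2 might locate and order the files; the directory path is the one named above, and the function name is made up for illustration, not the eventual implementation:

```python
import glob
import os
import re

AUDIO_DIR = r"D:\cdr\Audio_from_CIPSFTP"  # directory named earlier in this ticket

def latest_week_set(directory=AUDIO_DIR):
    """Return the zip files for the highest week number, base file first, then revisions."""
    pattern = re.compile(r"^Week_(\d{3})(?:_Rev(\d+))?\.zip$", re.IGNORECASE)
    candidates = []
    for path in glob.glob(os.path.join(directory, "Week_*.zip")):
        match = pattern.match(os.path.basename(path))
        if match:
            week = int(match.group(1))
            rev = int(match.group(2) or 0)
            candidates.append((week, rev, path))
    if not candidates:
        return []
    latest = max(week for week, rev, path in candidates)
    # Later revisions supersede earlier clips, so keep ascending _RevN order.
    return [path for week, rev, path in sorted(candidates) if week == latest]
```

The list returned here would be what the user sees on the confirmation page before the import job runs.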
Your thoughts?
By the way, just out of curiosity: why are the files named "Week..." when the week numbers clearly don't match any discernible week-numbering scheme? Wouldn't "Batch..." or "Set..." make more sense?
I have an implementation on DEV, using option 2 of the second approach above, which CIAT can begin testing. To facilitate testing, I have taken the zip file for Week_101, and broken it into lots of small batches, each containing four MP3 files, two each for a couple of GlossaryTermName documents. I have dropped Week_090.zip into the audio directory, so that file will be the one which shows up when you select Audio Import from CIAT/OCCM menu on DEV. There are also Week_091.zip, Week_092.zip, ..., up through Week_100.zip. After we get tired of testing the smaller batches, we can try Week_102, which has a pair of zip files. There's also a larger Week_103, which has a single zip file. Each time you run a successful import job, you'll need to have me drop in the next zip file.
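A rough idea of how the splitting into small test batches could be done (illustrative only; the batch size and output naming are assumptions, not necessarily the script that was actually used):

```python
import zipfile

def split_audio_zip(source_zip, batch_size=4, prefix="Week_Test_"):
    """Break one large audio zip into small test zips of batch_size MP3 files each."""
    with zipfile.ZipFile(source_zip) as src:
        mp3s = [name for name in src.namelist() if name.lower().endswith(".mp3")]
        for start in range(0, len(mp3s), batch_size):
            out_name = "%s%03d.zip" % (prefix, start // batch_size)
            with zipfile.ZipFile(out_name, "w") as dst:
                for name in mp3s[start:start + batch_size]:
                    # Copy each clip byte-for-byte into the smaller archive.
                    dst.writestr(name, src.read(name))
```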
I just imported Week_090.zip. It seems to have worked fine. However, I couldn't find the Audio folder on the FTP site with the zip files you created. I have a feeling that we don't have access to that folder. If you give me the full path, I will try to see if I can access it.
They're not on the FTP server. I created them on the CDR DEV Windows server with a script I wrote to break up "Week_101" into smaller batches for testing. I have attached Week_090.zip to this ticket, as well as Week_091.zip, so you can test with that one now as well.
I ran the import again and everything appears to be fine. I assume that I will have to tell you each time I have finished testing a file and reviewed the documents so that you can load another one?
Right. I've added Week_092. I don't think there's any need to test every one of the small batches I created. I only made so many in case things went wrong for the first few tests. You can, of course, keep testing them all, but let me know when you're ready to move on to Week_102, which has a pair of larger zip files.
I would want to test another small file before I start testing the bigger ones.
I also have a question about the wording of the status report that is
displayed after running the program.
Here are two examples:
CDR755493 Added link to Media doc CDR759345
CDR755493 Added link to Media doc CDR759346
Where it says "Added link to Media doc ....", should it rather say "Added link to glossary term doc"? It sounded a bit confusing to me because the link is created in the glossary doc.
Would it be too much work to create a separate group/permission for both this tool and OCECDR-3606? If it is going to be too much work, then that's okay.
The link is FROM the glossary term name document TO the media document. How about if we say something like:
CDR755493 Added link from this document to Media document CDR759345
(you'll see that wording next time you do a batch)?
I have dropped in Week_093 for your next test.
Go ahead and create a new action and group (you have permission to do that) with whatever names you want, and let me know what the action name is. Be sure to assign at least yourself to that group.
Thanks for the explanation and the changes to the wording.
I created a new action called "AUDIO IMPORT" and a group called "AUDIO IMPORT/SPREADSHEET" and assigned a few CIAT users to the group.
New action plugged into the permissions check.
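Conceptually the gate works like the toy sketch below; the real check goes through the CDR action/permission tables rather than comparing group names directly, so treat the helper here as a stand-in:

```python
ACTION = "AUDIO IMPORT"               # action created above
GROUP = "AUDIO IMPORT/SPREADSHEET"    # group created above

def user_may_import(user_groups):
    """Stand-in for the CDR permission check: allow only members of the new group."""
    return GROUP in user_groups

# A CIAT account assigned to the new group passes; others are turned away.
print(user_may_import({"CIAT", "AUDIO IMPORT/SPREADSHEET"}))  # True
print(user_may_import({"CIAT"}))                              # False
```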
Thanks! I think we can start testing the larger files now. So far, we haven't found any problems with the smaller files.
OK, the two original ZIP files for week 102 are dropped in and ready.
This new set appears to have problems, but it is not clear exactly what the problems are.
Two Media docs were successfully created (759357 and 759358) for English and Spanish respectively. The next line on the status page said "CDR758612 | Added link from this document to media document 759357." However, there is no link in glossary term name 758612. The next line said "CDR758612 | unable to find name node for u'R\xe9gimen OFF' in CDR758612." There is no Spanish/Translated block, so it is understandable that there will be one error, but I don't understand why it failed to create the link in the English block.
[Thought this was posted hours ago, but JIRA doesn't tell you when it's silently timed you out.]
The problem isn't that it failed to create the link in the English block. The software is supposed to back off from making any modifications to a document that it can't process completely correctly, and that's what it's doing. The problem was that the collection of information about "what we've done" wasn't happening in the right way. I've fixed that problem. We'll need to do at least one more test to make sure I haven't broken anything in the process. Ideally, you'd want to remove the Spanish block from one of the two glossary term name documents to recreate the condition which failed in the previous test. I've dropped in another zip file (and suppressed the week 102 files). The two documents in the new test are 756463 and 756464.
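To make the all-or-nothing rule concrete, a toy sketch of the behavior described above (the names and data structures are hypothetical, not the actual importer code):

```python
from collections import namedtuple

Clip = namedtuple("Clip", "term_name language media_doc_id")

def link_clips(doc_id, name_blocks, clips):
    """Link clips to one GlossaryTermName doc only if every clip can be placed.

    name_blocks is a toy stand-in for the parsed document:
    {(term_name, language): linked Media doc id or None}.
    """
    # First pass: verify every clip has a matching name block before touching anything.
    for clip in clips:
        if (clip.term_name, clip.language) not in name_blocks:
            return ["%s | unable to find name node for %r; document left unchanged"
                    % (doc_id, clip.term_name)]
    # Second pass: all checks passed, so it is safe to modify the document.
    report = []
    for clip in clips:
        name_blocks[(clip.term_name, clip.language)] = clip.media_doc_id
        report.append("%s | Added link from this document to Media document CDR%s"
                      % (doc_id, clip.media_doc_id))
    return report
```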
Please add more files for additional testing.
I have one question. In the Media doc, under MediaSource, the DateCreated always seems to be 2014-04-24 for the sets I have been testing. I assume this is the date Vanessa created the media doc, and that you're getting the date from the zip files?
I have added another zip file. The information in the small test zip files was taken from the original week 101 set (see my comment above posted Tuesday, 4 Nov 2014 02:58 PM for a description of how I created the small test sets).
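If the DateCreated is indeed read from the zip files, as guessed above, it would come from each MP3 entry's stored timestamp, roughly like this sketch:

```python
import datetime
import zipfile

def clip_date_created(zip_path, mp3_name):
    """Return the modification date stored in the archive for one MP3 (YYYY-MM-DD)."""
    with zipfile.ZipFile(zip_path) as archive:
        year, month, day = archive.getinfo(mp3_name).date_time[:3]
        return datetime.date(year, month, day).isoformat()
```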
Verified on DEV.
How should I test this one? Would you load some of the zip files like you did on DEV?
Best thing to do would be to add pronunciations to terms that don't have them yet but are already published.
You may be referring to OCECDR-3606. This issue has to do with the audio files themselves when they have been reviewed and ready to be uploaded into the CDR.
Ah, right. I've added a couple of zip files to the directory.
Did this question ever get an answer?
> By the way, just out of curiosity: why are the files named "Week..." when the week numbers clearly don't match any discernible week-numbering scheme? Wouldn't "Batch..." or "Set..." make more sense?
That is how we have named them since the beginning of the audio recording project when the workload of existing terms was divided into several weeks. After the bulk of the work was done we decided to continue using the same naming convention. If we need to change how we are naming them, that is fine with me. I don't think it will hurt anything.
When I run the report, it does not display any files.
Hi Bob:
I am wondering if you've had the chance to look at this. I am not able
to see any files to either review or import.
Not sure what happened. I put a couple of files in D:\cdr\Audio_from_CIPSFTP and tested to verify that they show up in the admin interface (without following through to actually process them, so they'd still be available for your testing) but they weren't there when I just looked.
Any possibility that build/deployment testing might have altered that directory, Alan? (I wouldn't have thought so, but I thought best to check.)
> Any possibility that build/deployment testing might have altered that directory, Alan?
> (I wouldn't have thought so, but I thought best to check.)
No possibility that I can see. Grepping for it under ...\branches\Ampere\AnthillPro didn't find it, except in a diff output saying it was a directory in the live CDR but not in the replacement directories to be deployed, which is as it should be.
OK, third time's a charm. This time the files should stick around, as I've set the file modification time stamps to now.
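Resetting the timestamps so the files look freshly dropped in takes only a couple of lines, for example:

```python
import glob
import os

# Touch every weekly zip in the audio directory so its modification time becomes "now".
for path in glob.glob(r"D:\cdr\Audio_from_CIPSFTP\Week_*.zip"):
    os.utime(path, None)  # None means "use the current time"
```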
I did see the files and was able to import them, but all of them were already processed. So the result I got was "Skipped (already processed)" for all the files. I did check some of them and they had indeed been processed.
Does this need to be tested with new audio files that haven't been processed?
I believe so. The process creates new media docs and links them to the GTNs so the media docs should not exist on QA.
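One plausible way the importer can report "Skipped (already processed)" is by checking whether a Media document already records the clip's source filename; the bookkeeping below is a guess, not taken from the actual code:

```python
def import_status(mp3_name, processed_filenames):
    """Guessed skip logic: do nothing for clips whose filename is already on a Media doc."""
    if mp3_name in processed_filenames:
        return "Skipped (already processed)"
    processed_filenames.add(mp3_name)
    # ... create the Media document and link it to the GlossaryTermName here ...
    return "Imported"
```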
Bob, do you have new files that could be uploaded? Or could we remove the media docs for old audio pronunciations and re-upload the files to see if the media docs are created appropriately? Just trying to close this out. Thanks.
I've added "Week" 106. See if that does what you want.
All the media docs were successfully created, but the program couldn't link them to the GTNs because of missing Translation blocks. So it seems to be working as it should. At this point I have no reason to believe that this won't work in production, even though on QA I did not see a case where the audio was both successfully created and attached to the GTN. I am marking this issue as QA Verified. Thank you!
Verified on QA.
Verified on Stage.
Verified on PROD. Thanks!
File Name | Posted | User |
---|---|---|
2014-11-03 21_24_58-ts.nci.nih.gov_1494 - Remote Desktop Connection.jpg | 2014-11-03 21:34:08 | |
Week_090.zip | 2014-11-04 15:45:05 | |
Week_091.zip | 2014-11-04 15:45:05 |