Issue Number | 4890 |
---|---|
Summary | [Glossary/Media] Modify audio import program |
Created | 2020-09-15 17:31:11 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2020-10-29 17:51:20 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.274801 |
We are now able to include the audio re-recording process (which was a manual process before Leibniz) in the standard audio recording process (OCECDR-4507). However, for the re-recorded documents, we still need to go into each of the documents to capture administrative information. Please modify the import process so that this information will be entered programmatically when the re-recorded documents are imported into the CDR. I have attached the specs in the attached spreadsheet.
I don't understand what "Update Media Doc Title [With English Media Doc title]" means. Also, what is "rr" in "GTN CDRID_en_rr.mp3"?
~oseipokuw will be posting screen shots/examples to illustrate how we are to derive the English media document title from the English media document title.
That was a mistake. Sorry. I thought this was talking about the Spanish document. I have updated the spreadsheet and posted the revised statement. Also, "rr" stands for "re-recorded". It only shows that the file was re-recorded. In essence, entering the filename as is will be fine.
Whew! I thought I had fallen into an M.C. Escher drawing. 😛
This is another ticket for which 20 story points is probably lowballing the estimate. Until now, we have been able to use the global change harness to apply the necessary document changes (basically adding a media link), and we only had to do that for the glossary term name documents. One of the most valuable features of that package is the very carefully crafted logic to ensure that we are preserving the decisions made by the users about which changes to the document should be made publishable. However, to do what is being requested for this ticket, we will have to use a completely different path for updating the documents when a media document is being re-recorded than the path we use for a new media document. That path would avoid the global change harness, and it would have to do so for both the glossary term name documents and the media documents. It is unusual for the software to be making the decision to publish changes which a user has decided should not yet be published, and it's easy to imagine other changes to a document besides linking to a different pronunciation which might get swept up in this blind publishing of changes which were not previously marked as publishable. Are you confident that this is really what you want? If so, do you want the software, when it is dealing with a re-recorded pronunciation, to grab the latest saved copy of the document or the latest version?
Tagging ~mbeckwit so she is aware of the implications of this request.
Can you please remind me what the difference is, between the latest saved copy and the latest version?
If you save changes to a document without checking the box to create a new version, that copy will be different from the latest version. Maybe I don't understand your question. Can you ask it a different way?
That actually answers my question. Volker has also explained it to me a few times 🙂. So, in answer to your question, please use the latest saved copy.
Well, that answers one of my two questions. Can you also explicitly respond to the other, more significant question?
Are you OK with this modification, ~mbeckwit (see comment with analysis above from yesterday at 5:55)?
I got more information from CIAT on the importance of this request and it seems like something that would save a great deal of time:
This is an effort to make our audio re-recording process more efficient. By modifying the audio import program, it will greatly reduce the number of manual steps involved in this process and save us a significant amount of time. The modified program would automate many of the steps involved in this process by updating the data elements in each of the existing audio media docs (media doc titles, content descriptions, dates, comments, and processing statuses) and the GTN docs (dates and comments) as the re-recorded audios are published. The program already links the re-recorded audios to the media docs and GTNs, but this proposed change would update the data elements in each of the docs. We’re hoping this will be possible.
We can discuss this at the CDR meeting tomorrow in terms of development time needed and whether it should be bumped to the next release.
We discussed this ticket in depth at this afternoon's weekly status meeting, and William agreed that the standard global change processing will provide him with what he wants. I am beginning the work for the request.
~oseipokuw: Can you explain why we would set the DateLastModified in the GTN document for a re-recorded MP3 but not when adding a new one? Same question for the TranslatedNameStatusDate.
For Glossary Terms, we only assign Date Last Modified when an existing document is modified so we do not add the element for new ones. I have seen other document types where the element is added for new ones as well but our business process for Glossary Terms only requires us to add the element (or update the date) when an existing document is modified. The TranslatedNameStatusDate element is added when a new term is added and updated when a term is revised. We add/update this date when we change the TranslatedNameStatus element.
I can understand why you would not want to have the software set the DateLastModified value for a Media document it has just created. But I'm not asking about those. I'm asking about the GlossaryTermName documents, which in all cases (re-recordings or first recordings) by definition must have existed already before the spreadsheet was generated to ask Vanessa to record the pronunciations (otherwise there wouldn't be a CDR ID or the other data needed for populating the rows in that spreadsheet).
At the other end of the spectrum, you say you update TranslatedNameStatusDate when the TranslatedNameStatus changes, but you're not having the software change the latter value.
Finally, when you say "when a term is revised" are you referring just to cases in which the term name has been changed or also when the term name stays the same but the pronunciation is corrected in a re-recording of the same name?
At the other end of the spectrum, you say you update TranslatedNameStatusDate when the TranslatedNameStatus changes, but you're not having the software change the latter value.
Please do not update the TranslatedNameStatus either. So the changes should be:
Glossary Term Name |
Add new Comment after Translated Name Status Date: [Approved audio re-recording linked]. Should be the topmost if there are existing comments. |
Add or update DateLastModified with [System Date] |
Make document publishable with the following version comment: [Spn Pub ver, system date, Audio re-recording re-linked] |
I have removed the line that says "Change Translated Name Status Date to [System date] "
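The two remaining Glossary Term Name changes (a new topmost Comment after the status date, and a DateLastModified that is added or updated) could be sketched roughly as follows. This is only an illustration, not the import program's actual code: the `TranslatedName` block tag, the assumption that `DateLastModified` is a direct child of the document root, and the function name `apply_gtn_changes` are all hypothetical. Only the element names Comment, DateLastModified, and TranslatedNameStatusDate and the comment wording come from this ticket.

```python
import datetime
import xml.etree.ElementTree as ET

# Comment wording taken from the requirements quoted in this ticket.
COMMENT_TEXT = "Approved audio re-recording linked"


def apply_gtn_changes(root, name_block_tag="TranslatedName", today=None):
    """Add the re-recording comment and set DateLastModified.

    Assumptions (not confirmed by the ticket): name blocks use the tag
    given in name_block_tag, and DateLastModified is a child of the root.
    """
    today = today or datetime.date.today().isoformat()
    for block in root.iter(name_block_tag):
        tags = [child.tag for child in block]
        if "TranslatedNameStatusDate" not in tags:
            continue
        # Inserting directly after the status date puts the new comment
        # ahead of any comments which already follow that element.
        position = tags.index("TranslatedNameStatusDate") + 1
        comment = ET.Element("Comment")
        comment.text = COMMENT_TEXT
        block.insert(position, comment)
    node = root.find("DateLastModified")
    if node is None:
        node = ET.SubElement(root, "DateLastModified")
    node.text = today
    return root
```

In the real system these edits would be made by the global change harness, which also decides which versions to create.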
Finally, when you say "when a term is revised" are you referring just to cases in which the term name has been changed or also when the term name stays the same but the pronunciation is corrected in a re-recording of the same name?
That refers to when the term name has been changed.
We agreed yesterday that the global change harness will handle version creation, so you'll want to drop the rows which say "Make document publishable with the following version comment: ..."
Also, you haven't responded to my comment about the DateLastModified for all GTN documents (since for this process all of those documents are pre-existing and all are being modified).
We agreed yesterday that the global change harness will handle version creation, so you'll want to drop the rows which say "Make document publishable with the following version comment: ..."
That is right.
Also, you haven't responded to my comment about the DateLastModified for all GTN documents (since for this process all of those documents are pre-existing and all are being modified).
I am not sure I understand your question about this one. Could you elaborate? I think the request is to update DateLastModified for both the English and the Spanish.
We agreed yesterday that the global change harness will handle version creation, so you'll want to drop the rows which say "Make document publishable with the following version comment: ..."
Would it be possible to add the comment to the publishable version that the global change harness will create ?
OK, to recap the conversation:
I asked why you want the DateLastModified only set for the re-recorded names, instead of for all of the GTN documents processed by the script
You responded that you don't use that element for new documents but only for existing documents
I pointed out that while some of the Media documents are old and some are new for this processing, all of the GTN documents already exist, and they're all being modified
The logical conclusion would seem to be that all of the GTN documents should have the DateLastModified set, right?
[Jira isn't chaining the replies correctly]
Would it be possible to add the comment to the publishable version that the global change harness will create ?
Not without rewriting the global change harness, which would push this ticket into a 40-pointer.
We can have any comment we want for the documents saved by the global change system. We've never had a requirement before, though, for having customized comments for each version of each document in the batch.
OK, to recap the conversation:
I asked why you want the DateLastModified only set for the re-recorded names, instead of for all of the GTN documents processed by the script
You responded that you don't use that element for new documents but only for existing documents
I pointed out that while some of the Media documents are old and some are new for this processing, all of the GTN documents already exist, and they're all being modified
Okay got it. The GTNs that are not re-recording are essentially new term names that are getting audio pronunciations for the first time. It is possible that there are some outliers but the majority of them are new terms that have been created weeks before generating the spreadsheet. On the other hand, if it makes things simpler to have the DLM set for all documents processed by the script, I think we can consider that. I don't think it makes a huge difference to have the program set the DLM for all GTNs but we will have to review some of our reports to make sure they don't depend on the date so we know what to expect. So, please let me know.
Okay. No problem.
We can have any comment we want for the documents saved by the global change system. We've never had a requirement before, though, for having customized comments for each version of each document in the batch.
So it means the comment will be applied to all the documents saved but not just the specific pub version? If that is the case, we can modify the comment so it will be applied to all the saved documents.
So adding pronunciation audio for the term isn't a modification? We can do it either way. I'm just trying to make sure we're doing what makes sense the FIRST time we implement the ticket. 😉
It is a modification and I think we can add the DLM for all of them.
I would think the same logic would apply to the TranslatedNameStatusDate element. If it makes sense to set that date to the date the recording was linked from the GTN document for replacement recordings, it should make just as much sense to set that date to the date the recording was linked from the GTN document for new recordings. Do you agree?
We decided to drop this change (earlier comments)
I have removed the line that says "Change Translated Name Status Date to [System date] "
Ah, OK. Doesn't look as if that got removed from the latest version of the requirements specification. I will strip that logic from the code.
Do you really want "- Spanish" appended to the English name (with the space) instead of "-Spanish" the way the existing documents appear?
Can you confirm that we are dropping ", VR Voice" for the Creator element?
Please confirm that you really want me to drop the path name for the week from the SourceFilename element's value.
The ContentDescription element allows multiple occurrences, with at least one required. If I don't find one, I'll throw an exception. If I find more than one should I leave that block alone?
The requirements spec doesn't have anything to say about where to put the new ProcessingStatus elements. Does that mean you don't care where I put them, as long as they're inside the MediaProcessingStatuses block?
Do you really want "- Spanish" appended to the English name (with the space) instead of "-Spanish" the way the existing documents appear?
No. Please append "-Spanish" instead.
Can you confirm that we are dropping ", VR Voice" for the Creator element?
No. Please don't make any changes to the Creator element.
I will be surprised to see more than one content description element for the mp3 files. The images usually have more than one.
If I find more than one should I leave that block alone?
Yes, and report the problem but don't abort the program.
If I don't find one, I'll throw an exception.
Will that abort the program ?
It's difficult to know what questions you're answering, because you've stopped using the Reply links.
Please confirm that you really want me to drop the path name for the week from the SourceFilename element's value.
No. Please don't drop it. We had some ideas before the new filename convention was implemented.
The requirements spec doesn't have anything to say about where to put the new ProcessingStatus elements. Does that mean you don't care where I put them, as long as they're inside the MediaProcessingStatuses block?
Please place the new ProcessingStatus elements on top so they become the topmost.
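A minimal sketch of putting new statuses at the top of the block, assuming the block is available as an ElementTree element. The child element name `ProcessingStatusValue` and the helper name `prepend_statuses` are hypothetical; the ticket only names the ProcessingStatus and MediaProcessingStatuses elements.

```python
import xml.etree.ElementTree as ET


def prepend_statuses(block, values):
    """Insert one ProcessingStatus per value at the top of the block.

    The values keep the order in which they are passed, so the first
    value ends up as the topmost status. ProcessingStatusValue is an
    assumed child element name, used here only for illustration.
    """
    for position, value in enumerate(values):
        status = ET.Element("ProcessingStatus")
        ET.SubElement(status, "ProcessingStatusValue").text = value
        block.insert(position, status)
    return block
```

Passing the values in the desired top-to-bottom order keeps the ordering decision with the caller, which makes it cheap to change later.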
Will that abort the program ?
No, but it will prevent the modified document from being saved. An error is logged and displayed with the job results.
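The ContentDescription rules agreed in this exchange (none found: raise an exception so the document is not saved; more than one: leave the block alone and report it; exactly one: update it) might look something like the sketch below. The function name, the `warnings` list, and the use of `ValueError` are assumptions for illustration, not the import program's actual interface.

```python
import xml.etree.ElementTree as ET


def update_content_description(root, new_text, warnings):
    """Update the single ContentDescription element, if there is exactly one.

    Returns True when the element was updated. Raises ValueError when
    the document has no ContentDescription at all, so that the caller
    can skip saving this document and log the error.
    """
    nodes = list(root.iter("ContentDescription"))
    if not nodes:
        raise ValueError("no ContentDescription element found")
    if len(nodes) > 1:
        # Leave the block untouched and report the problem instead.
        warnings.append(
            f"{len(nodes)} ContentDescription elements found; left unchanged"
        )
        return False
    nodes[0].text = new_text
    return True
```

The caller would catch the exception for each document separately, so one bad document does not stop the rest of the batch.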
... we can modify the comment so it will be applied to all the saved documents.
Have you decided what wording to use for the common comment used for a global change job?
What should the software do if a single Media document is reused in the same job for two different GTN documents, or for two different name blocks within the same GTN document? Or is this another one of those "we'll never do that" cases?
You can have separate comment strings for the global change job used to transform the GTN documents and the global change job used to transform the recycled Media documents.
What should the software do if a single Media document is reused in the same job for two different GTN documents, or for two different name blocks within the same GTN document? Or is this another one of those "we'll never do that" cases?
So far we have no use cases for the scenarios above.
This is ready for some testing on DEV. The modifications are extensive, and I even had to make some changes to the global change harness, which we had never used before for documents with changed blobs. I have done some preliminary testing, but I have only scratched the surface. You will want to check this very thoroughly.
We've done some testing on DEV and so far no major issues. Essentially all the major processes are covered. However, here are a few observations:
In the Media docs, the filenames of the newly created files (as opposed to the re-recordings) follow the new naming convention "Week_2020_43/800019_en.mp3". But the re-recordings do not appear to follow the same filename format but instead showing "798602_es_rr.mp3", for example. This is not a big deal as it doesn't interfere with the process. It is just not consistent so if it can be changed to be consistent with the new filename format, that would be great. Examples: Media Doc ID - CDR 802962. Media Doc ID - 800988
In the media docs for re-recordings, could you please reverse the order of processing statuses so that “Processing Complete” is always on top of “Audio re-recording approved”? Example CDR 800988.
But the re-recordings do not appear to follow the same filename format but instead showing "798602_es_rr.mp3", for example.
I was using the pattern you asked me to use in the latest version of the requirements document.
could you please reverse the order of processing statuses ...
Once again, if you have preferences for details of how a request is implemented, please let us know before we begin implementing.
These changes have been implemented on DEV.
I just finished importing the audios for Week 44. It doesn't look like the documents in the Media Reuse MediaID column were updated, at least judging from the document history report. The new ones appear to have been created and updated correctly.
728116 |
706202 |
713153 |
715166 |
|
708931 |
733277 |
It looks like you left one of the MP3 files out of the zip archive.
KeyError: "There is no item named 'Week_2020_44/44959_es.mp3 ' in the archive"
I will look into modifying the import program to show that failure and stop the program when the global change job aborts.
I have added the code to catch the missing MP3 file and abort the program.
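One way to catch this kind of problem before any documents are touched is to compare the file names listed in the spreadsheet against the archive's contents up front. The sketch below is illustrative only and is not the code that was added to the import program; `find_missing_members` is a hypothetical helper. It also flags names which would match if surrounding whitespace were stripped, since the name quoted in the KeyError above ends with a space.

```python
import zipfile


def find_missing_members(archive, expected_names):
    """Return a message for each expected name not found in the archive.

    archive is an open zipfile.ZipFile; expected_names are the file
    names taken from the spreadsheet rows.
    """
    available = set(archive.namelist())
    problems = []
    for name in expected_names:
        if name in available:
            continue
        hint = ""
        if name.strip() in available:
            # The file is present, but the spreadsheet value has stray
            # leading or trailing whitespace.
            hint = " (matches after stripping whitespace)"
        problems.append(f"{name!r} not found in archive{hint}")
    return problems
```

If the returned list is not empty, the caller can report the messages and stop before starting the global change jobs.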
Can I generate a new spreadsheet for week_44 for testing again?
I think that should work. You might want to first test the change I just made with your existing set to confirm that the software now catches the error, reports it, and stops the job.
Let me know when you're done with that check, and I'll clear the old set from the server and the database.
I keep getting the KeyError: "There is no item named 'Week_2020_44/44959_es.mp3 ' in the archive" with the latest files even after making sure that the mp3 file exists. I wonder if it is a trailing space issue in the spreadsheet instead.
I think you must have missed my previous message. The sequence is:
You test the change I just made with your existing set to confirm that the software now catches the error, reports it, and stops the job.
I clear the old set from the server and the database.
You test with a corrected set.
Yes, I missed it. I thought I would be able to test with a corrected set after testing the fix. The fix worked when I used the same set. So, please proceed to clear the old set so I can test with the corrected set.
OK, I have performed the surgery on the file system and the database. You should be able to test now with a correct set.
The current batch ran successfully. However, I am not sure why newly created media docs - example 803000 - have a Date Created of "2020-10-27" instead of today's date. Also, I didn't see it (803000) in the report I get after the run showing that it was created as part of the most recent batch.
What's the date on the MP3 file? That's where DateCreated comes from. Show me the report, please.
The DateCreated element is in the OriginalSource block, so it doesn't reflect when the CDR Media document was created (and never has).
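To illustrate where such a date can come from: if the date is read from the MP3 file's entry in the weekly zip archive (an assumption; the ticket only says it is "the date on the MP3 file"), the standard library exposes it as `ZipInfo.date_time`. The helper name `member_date` is hypothetical.

```python
import datetime
import zipfile


def member_date(archive, name):
    """Return the stored modification date of an archive member as YYYY-MM-DD.

    archive is an open zipfile.ZipFile. getinfo() raises KeyError when
    the archive has no member with the given name.
    """
    info = archive.getinfo(name)
    year, month, day = info.date_time[:3]
    return datetime.date(year, month, day).isoformat()
```

A date obtained this way reflects when the recording file was last modified, not when the archive was imported, which would explain a DateCreated value earlier than the import date.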
What's the date on the MP3 file? That's where DateCreated comes from. Show me the report, please.
This is the report I was referring to, and in fact the media doc ID is on there. Sorry for the false alarm. I didn't know the date created is taken from the MP3 file, so I was expecting today's date instead. That is good to know. It seems to me that all the issues have been addressed for this ticket.
CDR ID Processing
CDR711295 updated Media doc for CDR373934 ('radical laparoscopic prostatectomy' [en]) from Week_2020_44.zip
CDR713443 updated Media doc for CDR413891 ('stage I distal bile duct cancer' [en]) from Week_2020_44.zip
CDR798064 updated Media doc for CDR796880 ('positivo para ALK' [es]) from Week_2020_44.zip
CDR798846 updated Media doc for CDR796946 ('positivo para ROS1' [es]) from Week_2020_44.zip
CDR802999 created Media doc for CDR799488 ('Rombo syndrome' [en]) from Week_2020_44.zip
CDR803000 created Media doc for CDR799488 ('síndrome de Rombo' [es]) from Week_2020_44.zip
CDR373934 Updating link from this document to Media document CDR711295
CDR413891 Updating link from this document to Media document CDR713443
CDR796880 Updating link from this document to Media document CDR798064
CDR796946 Updating link from this document to Media document CDR798846
CDR799488 Adding link from this document to Media document CDR802999
CDR799488 Adding link from this document to Media document CDR803000
So 803000 was on the report after all?
Yes, as stated above.
Verified on DEV. Thanks!
The first batch was successfully imported and everything appears to be fine apart from this media doc - 736926. The Date Created has today's date 2020-11-10 instead of the date the MP3 file was created (other media docs that were newly created, correctly retained the date the MP3 file was created). Could you please take a look to see why it is showing the date the media doc was updated?
Because that's what you asked us to do for the re-recorded docs in your requirements?
Please use the date the mp3 was created just like the others so there will be consistency.
Sure. Please create a ticket for that change and add it to Newton.
OK. Will do.
Verified on QA. Thanks!
I am closing this ticket because it may take a while to verify all the changes on PROD. Some of the changes have been verified already, but we will have to wait to receive the recordings from Vanessa, and that may take a while. So far, all the changes up to the point of receiving the files from Vanessa have been verified.
File Name | Posted | User |
---|---|---|
Audio recording program proposed updates_Corrected.xlsx | 2020-10-09 14:04:55 | Osei-Poku, William (NIH/NCI) [C] |
Audio recording program proposed updates.xlsx | 2020-09-15 17:31:02 | Osei-Poku, William (NIH/NCI) [C] |
image-2020-10-10-14-23-33-249.png | 2020-10-10 14:23:32 | Kline, Bob (NIH/NCI) [C] |
image-2020-10-29-17-16-29-398.png | 2020-10-29 17:16:30 | Kline, Bob (NIH/NCI) [C] |