CDR Tickets

Issue Number 4946
Summary [Media] Audio import - check for correct mp3 filenames before download
Created 2021-02-23 17:40:10
Issue Type Improvement
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2021-03-16 14:41:52
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.285467
Description

Please look into modifying the audio download program to check that the mp3 filenames match the names provided in the spreadsheet before the files are downloaded. Currently, when the filenames don't match what is in the spreadsheet (that is, there is an error in the naming of the mp3 files), we get an error message only at the point of reviewing the individual mp3 files, which is too late in the process to correct and also makes reviewing the terms more time-consuming. We discussed this briefly in ticket OCECDR-4592.

If possible, we want to be able to see a list of incorrectly named files when attempting to download the file. We will then have the filenames corrected and the files uploaded again before attempting to download again.

Comment entered 2021-02-24 14:13:30 by Kline, Bob (NIH/NCI) [C]

Since the original request hasn't provided all the specifics for how the request should be implemented, I'm going to fill in the blanks here (rather than proceeding with an implementation and then having to redo it to match different requirements).

  • The software will refuse to download a zip file if the path names for the MP3 files in a given zip file do not exactly match the path names given in the File Name column of the Term Names spreadsheet. In other words, it will be considered an error if there are any MP3 files in the zip archive which do not appear in the File Name column of the spreadsheet, and it will be considered an error if there are any MP3 file paths in that column which do not have MP3 files with those paths in the zip file. Path comparison is performed with case sensitivity turned on.

  • If there is more than one zip file matching the week-naming convention available for download, all of the zip files will be checked to verify that the MP3 names in the spreadsheet match the MP3 files in the zip file, and if that check fails for any of the zip files, none of the zip files will be downloaded. This will prevent downstream actions being taken on incomplete sets of audio files. The discrepancies will be reported for all zip files, not just the first.

  • If MP3 path discrepancies are found, all of the zip files will be left in place (so that the report of the problems can be repeated), to be replaced or deleted by CIAT as appropriate.

  • The MP3 path discrepancies will be reported by listing (for each zip archive) all of the MP3 file paths listed in the File Name column of the spreadsheet which do not have a matching MP3 file in the zip archive, and all of the MP3 files found in the zip archive which do not appear in the File Name column of the spreadsheet. (Identifying which path names were intended to match from each set would be impossible to do with certainty in the general case.)

  • The software already rejects MP3 files whose paths do not match the convention adopted for the naming pattern of CDR glossary files. That check will be retained and folded into the reporting of discrepancies described above.

  • The software will not treat two occurrences of the same MP3 path in the spreadsheet as an error.
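
The rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CDR's actual download program: the function names (find_discrepancies, check_all, make_zip) and the sample paths are hypothetical, and the File Name column of the spreadsheet is assumed to have been read into a list of strings already.

```python
import io
import zipfile


def find_discrepancies(zip_file, spreadsheet_paths):
    """Compare MP3 paths in one zip archive with the spreadsheet paths.

    Returns (missing_from_zip, not_in_spreadsheet) as sorted lists.
    The comparison is case-sensitive, as specified in the ticket.
    """
    with zipfile.ZipFile(zip_file) as archive:
        zip_paths = {
            name for name in archive.namelist()
            if name.lower().endswith(".mp3")  # only MP3 members are checked
        }
    # Using a set means a path repeated in the spreadsheet is not an error.
    expected = set(spreadsheet_paths)
    missing_from_zip = sorted(expected - zip_paths)
    not_in_spreadsheet = sorted(zip_paths - expected)
    return missing_from_zip, not_in_spreadsheet


def check_all(archives):
    """Check every archive before downloading any of them.

    archives maps a zip name to (file-like zip, list of spreadsheet paths).
    Returns a dict of zip name -> (missing, extra) for failing archives
    only; an empty dict means all archives passed.
    """
    problems = {}
    for name, (zip_file, paths) in archives.items():
        missing, extra = find_discrepancies(zip_file, paths)
        if missing or extra:
            problems[name] = (missing, extra)
    return problems


def make_zip(names):
    """Build an in-memory zip with empty members (for demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer
```

In this sketch a caller would refuse to download anything whenever check_all returns a non-empty result, and would report both lists for every failing archive rather than stopping at the first one.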

Comment entered 2021-02-25 09:14:11 by Osei-Poku, William (NIH/NCI) [C]

All of the above look good to me. Thanks! I was only going to ask to check the count of mp3 files expected (from spreadsheet) vs what is in the zip file to make sure they match but it looks like that is not necessary as there will still be an error message if the number falls short in the zip files.

Comment entered 2021-02-25 09:54:02 by Kline, Bob (NIH/NCI) [C]

Just checking the counts would be insufficient, as subsequent operations would still fail if the numbers matched but the path names didn't.
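
A small illustration of why a count check alone would not be enough (the file names here are made up for the example):

```python
# Hypothetical paths: same number of files, but one name differs in case.
spreadsheet_paths = ["Week_X/101_en.mp3", "Week_X/102_es.mp3"]
zip_paths = ["Week_X/101_en.mp3", "Week_X/102_ES.mp3"]

counts_match = len(set(spreadsheet_paths)) == len(set(zip_paths))
paths_match = set(spreadsheet_paths) == set(zip_paths)
```

Here counts_match is true while paths_match is false, so a count check would pass an archive that still fails at review time.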

Comment entered 2021-03-04 11:41:01 by Kline, Bob (NIH/NCI) [C]

I inadvertently asked William a question in the wrong ticket, about "the problems/what needs to happen with the Week 2020_51 files on PROD." He responded:

It is the same problem I reported last Thursday. We have a few documents in the Week_2020_51 batch (on PROD) that were named incorrectly in the archive folder so we couldn't review them successfully as we received error messages when we clicked on the audio in the review report to listen to them. The files have now been corrected and a new archive file generated but I need to be able to test on DEV what impact it will have on importing the files into the CDR. I also needed to find out if the new archive will overwrite the existing file and assess the amount of work we need to do manually in order to complete reviewing and importing the files. I was not able to test with the same archive file on DEV so I need to devise a new plan to test this.

I assume it wasn't actually the documents which were named incorrectly, but instead the MP3 filenames.

So my next question would be: can you identify which audio files need to be replaced? And can you provide Week_2020_51_Rev2.zip with just those corrected audio files?

Comment entered 2021-03-04 12:05:52 by Osei-Poku, William (NIH/NCI) [C]

Yes, it is the mp3 filenames that were incorrect. There is a Week_2020_51_Rev1.zip which contains all the corrected names and also other documents that had to be re-recorded. We are waiting for a Rev2 file before I can import them. Should I create another Rev file with the corrected audio filenames?

I did a test on QA with much smaller zip files that mimicked the error on PROD and I was able to successfully import the files into the CDR. The incorrect filename was reported in the results but the file was correctly updated/linked to the CDR documents (I assume from the Rev1 file, which contained the corrected filename). As part of my test, I uploaded the first Week_XXXX_XX.zip file (with the incorrect filenames corrected) on QA and checked the box for the download to overwrite the original file (with the errors) but I am not sure if that had any impact on the final results. I intend to do the same on PROD unless that is not necessary.

Comment entered 2021-03-09 13:57:17 by Osei-Poku, William (NIH/NCI) [C]

We have finished reviewing all the files and are ready to import them into the CDR today. I planned to upload a clean copy of Week_2020_51.zip (with corrections) and download it by checking the overwrite checkbox. I had to do this on QA to have it work but I am not sure if it had any effect on the import. Please let me know if you've had the chance to take a look and whether I should upload the file before importing or not.

Week_2020_51.zip 
Week_2020_51_Rev1.zip 
Week_2020_51_Rev2.zip

Comment entered 2021-03-09 15:18:46 by Kline, Bob (NIH/NCI) [C]

What exactly did you do on QA and when did you do it?

Comment entered 2021-03-09 16:21:21 by Osei-Poku, William (NIH/NCI) [C]

I prepared a file that had at least one mp3 file named wrongly (did not match what was on the spreadsheet). I wanted to mimic the error we encountered on PROD. I then generated and uploaded the zip file to the ftp server, downloaded it to the CDR, reviewed all the terms and checked to see that the one that was named wrongly gave me an error message. I also generated a Rev1 spreadsheet which contained the one mp3 file named wrongly (I had rejected it). I made the correction and generated another zip file (Rev1), uploaded it to the ftp server, downloaded it and reviewed it. At this point, I was ready to import the files into the CDR. But before I imported the files, I took a copy of the first zip file I created (which contained the error mp3 file in the initial step), corrected the filename of the mp3, generated a new zip file, uploaded it to the ftp server and downloaded it to the CDR by checking the overwrite checkbox. I did not have to review any files after that. I just imported the files and it seems to have succeeded. I did this on 2021-03-03.

So my one question is whether, without going through the same process to determine if the import will succeed, it is necessary for me to do the last step of uploading a clean copy of the initial file and using the overwrite checkbox during the download.

Comment entered 2021-03-09 16:32:39 by Kline, Bob (NIH/NCI) [C]

I'm running some tests now. I'm pretty sure that if the most recent MP3 file for each term name/language combination has a correct match between the path of the MP3 file in the spreadsheet and the path of the MP3 in the zipfile, you will be OK. That's what I'm trying to verify now. By "most recent" I mean that if the Spanish MP3 for "fúbar" was mismatched in Week_2020_51.zip and/or Week_2020_51_Rev1.zip, but correctly matched in Week_2020_51_Rev2.zip, the "latest" would be the version in the Rev2 zipfile.

Comment entered 2021-03-09 16:34:58 by Kline, Bob (NIH/NCI) [C]

In other words, if the latest zipfile that has a given MP3 has matching paths, then the fact that the earlier zipfile had messed up the paths wouldn't make any difference, so you wouldn't have to replace Week_2020_51.zip.
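
The "latest zipfile wins" reasoning can be modeled with a short Python sketch. It only illustrates the rule as described in these comments and is not taken from the CDR import code: the function names, the record layout, and the assumption that a zip without a _RevN suffix counts as revision 0 are all hypothetical.

```python
import re


def revision_number(zip_name):
    """Return N for names ending in _RevN.zip, or 0 for the original zip."""
    match = re.search(r"_Rev(\d+)\.zip$", zip_name)
    return int(match.group(1)) if match else 0


def import_should_succeed(records):
    """Decide whether the latest copy of every MP3 has matching paths.

    records is an iterable of (term_name, language, zip_name, paths_match)
    tuples, where paths_match says whether the spreadsheet path and the
    zip path agree for that MP3 in that archive.
    """
    latest = {}
    for term_name, language, zip_name, paths_match in records:
        key = (term_name, language)
        revision = revision_number(zip_name)
        # Keep only the record from the highest-numbered revision.
        if key not in latest or revision >= latest[key][0]:
            latest[key] = (revision, paths_match)
    return all(paths_match for _, paths_match in latest.values())
```

Under this model, a mismatch in Week_2020_51.zip or Week_2020_51_Rev1.zip is harmless as long as the same term name/language combination is correctly matched in Week_2020_51_Rev2.zip.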

Comment entered 2021-03-09 17:10:58 by Kline, Bob (NIH/NCI) [C]

As far as my tests have been able to determine, the import should work correctly. I have attached a spreadsheet showing which MP3 files will be pulled from which ZIP archives.

Comment entered 2021-03-09 17:13:38 by Kline, Bob (NIH/NCI) [C]

Hmm. Jira is rejecting my upload. I'll keep trying. Or maybe just mail it to you.

Comment entered 2021-03-09 17:14:07 by Kline, Bob (NIH/NCI) [C]

Worked this time.

Comment entered 2021-03-10 10:21:45 by Osei-Poku, William (NIH/NCI) [C]

Yes, it worked. I was able to import all of them last night. Thank you!

Comment entered 2021-03-16 14:41:52 by Kline, Bob (NIH/NCI) [C]

Implemented on CDR DEV.

Comment entered 2021-04-30 12:26:03 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thanks!

Comment entered 2021-05-27 09:32:50 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2021-06-23 10:50:55 by Osei-Poku, William (NIH/NCI) [C]

Closing this ticket for now and will report if we run into any problems. Thanks!

Attachments
File Name Posted User
week51.xlsx 2021-03-09 17:13:51 Kline, Bob (NIH/NCI) [C]
