Issue Number | 3886 |
---|---|
Summary | [Summary] Delete recordings of Board meetings |
Created | 2015-03-24 15:54:16 |
Issue Type | New Feature |
Submitted By | Beckwith, Margaret (NIH/NCI) [E] |
Assigned To | alan |
Status | Closed |
Resolved | 2015-06-18 14:19:50 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.157449 |
We would like to be able to have the recordings of the Board meetings automatically and permanently deleted from the CDR after a time period of 3 months from posting them.
I've done some research on this. It doesn't look too hard but we will have to remove some impediments in the CdrServer that were put there on purpose to prevent anyone from deleting a "blob" (in this case an mp3 media file) connected to a document for which a version has been created. Our idea was that, once a version has been created, even a non-publishable version, that's history and changing history is against the law.
I'll put in a way to override that so that the default behavior is still to prevent deletions and an explicit override is required.
I should probably also add something to the Media document to indicate why the expected media is no longer found. One possibility would be to add another "ProcessingStatus" element with "ProcessingStatusValue" = "Media file deleted", the "ProcessingStatusDate" = date of deletion, "EnteredBy" = "FileSweeper", and "Comment" = something like "Media file deleted after expiration date."
If we do that I'll have to add the new ProcessingStatusValue to the Media schema.
Let me know if that's a good or bad idea, or if there's a better one.
Here is how I plan to select meeting recordings to be deleted:
Find all Media documents (i.e. the XML files) that have the element
path:
'/Media/MediaContent/Categories/Category'
and the element value:
'meeting recording'
Among those, find the oldest date entered for either the current
working document (if there are no versions) or the first version (if
there are versions.) These dates are the dates that the Media XML
document (CWD or version) was actually received by the CDR.
Compare that date to today's date. If it's greater than 3 months
older than today's date, delete all of the blobs associated with CWD
and ALL of the versions. Normally there will only be one blob.
However if someone modified the blob and re-saved it, for example to
compress the recording more for easier downloading, it will be
deleted.
Here is the rationale for this approach:
Using the date of the XML document:
We have no dates on the MP3 recordings themselves. All
information about them, including the dates, is stored in or
with the Media XML documents.
Using the oldest Media version:
If the oldest version of the document is more than three
months old then, even if the current working document is only
one month old because someone edited it, the meeting notes
recording is more than three months old and any recording of
the discussion, even an edited one, should be deleted.
Using the CDR database date:
There is a "DateCreated" element in the Media document but
it's a text string. It is possible for someone to make an
error in the date, or the format of the date, that would cause
the date to be incorrectly handled. For example, if someone
typed 2015-01-03 on January 3, 2016 (because they hadn't yet
gotten in the habit of typing 2016), the recording would be
deleted right away.
If different editors typed "2015-01-03", "1/3/15", "Jan 3,
2015", etc., the program won't be confused because it will
only be using dates generated consistently by the CDR
software.
Deleting ALL of the blobs:
Someone might store a blob in the oldest version and then, a
month or two later, extract it, recompress it to make it
smaller, and then store it again. In that case, the first
version may be four months old and the re-compressed version
only two months old. This plan will consider both of them as
really being four months old and delete them both, which seems
to be what we want.
My approach could do something unexpected, for example:
If someone were to create a Media document well before the meeting
and then attach the recording later, after the meeting, the date
of creation of the Media document would make the recording appear
older than it actually is and cause it to be deleted before three
months.
I will assume that this is a lesser danger than the others and I'll
stick with the above plan unless I hear objections.
This comment is just for Bob and Volker, entered here in JIRA so we'll
have a permanent record of our choices and decisions.
However anyone else who has trouble sleeping might find this useful.
Bob and Volker,
Here are three methods for initiating meeting recording deletions, and
three variations for how to delete them.
Initiating Deletions
--------------------
1. The server knows what to do based on the category of blob.
Method:
FileSweeper performs a repDoc with delBlob=True.
cdr::setBlob sees the request to delete the blob.
If there are no versions, it behaves as it always has,
deleting the row in the doc_blob and doc_blob_usage tables.
Else:
If the docId matches a row in the query_term table with:
/Media/MediaContent/Categories/Category ==
'Meeting Recording'
Then:
Find all blob_ids for the CWD and all versions of the
doc_id.
Act on all rows for these blobs, inluding, as needed
depending on how we handle deletions:
doc_blob (blob_id, data)
doc_blob_usage (doc_id, blob_id)
version_blob_usage (doc_id, vernum, blob_id)
Else:
Throw exception, as now.
Advantages:
The hole we're creating is restricted to just this one
application.
No client side software changes are required, just pass
delBlob=True to repDoc.
Disadvantages:
It's another nail in the coffin of server side document type
agnosticism.
2. The client software sends a special blob value.
Method:
A special value is used for a blob that is to eliminate or
replace all versions.
Example, in cdr.py repDoc, parameters:
blob = "** REMOVE ALL BLOB VERSIONS **"
(That would be a manifest constant in cdr.py.)
When the server sees this, it takes the appropriate action
(see below under alternative actions). Otherwise, everything
remains the same - blobs linked to versions will not be
deleted.
Advantages:
The method is general. It can be used for blobs of any types.
No server side document type knowledge is required.
No client side software changes are required, except for the
definition of a new constant.
Disadvantages:
We've introduced a bigger potential hole in the system concept
of never revising the past. However I don't know if we would
ever fall into that hole by accident.
No client side software changes are required (just pass a
different parameter value.)
3. Create new functions in the client and server for this purpose.
Method:
Create new functions with names like "delBlobVersions". One
in cdr.py to create a new transaction. One in the CdrServer
to process it.
Advantages:
This is another way to make it hard to fall into the new hole
by accident. It's a little less opaque than passing a special
blob parameter.
Disadvantages:
More software to write.
The client is independent of cdr.repDoc(), which is probably
not a good thing since it makes sense to update the media
document when its associated blob is removed or altered.
I can think of other variations but those are the main ones. I'm
inclined towards #2.
We could combine the advantages and disadvantages of 1 and 2 by both
using a special string to signal a blob delete and adding doc type
specific code to the server.
With any of the methods we could also add code to the client, for
example in cdr.repDoc, to prevent passing through this hole unless
we're dealing with meeting recordings.
Alternative Actions - Delete or Replace
---------------------------------------
An alternative to deleting rows in the doc_blob, doc_blob_usage, and
version_blob_usage is to replace the blobs.
a. For any of the methods above, instead of deleting the blob related
rows, replace the value of the 'doc_blob.data' column. For
example with strings like:
"Binary object removed after expiration." or
"Binary object expired."
Method:
Select all blob_ids from doc_blob_usage and version_blob_usage
where the doc_id is our id.
For each blob ID,
Replace the data with our string.
Advantages:
Data disappears but not structure. We can still see the
skeleton of what was there.
It's possible for software (and people) to distinguish the
case of a link to a blob which was never made from the case of
a link that was made and then broken. This isn't trivial
however since SQL Server doesn't support selecting or
comparing on binary image data. Method c. is better for this.
Disadvantages:
Not sure that such a message will be useful. Can't select on
it in SQLServer. Where would we display it?
b. Allow programs to specify the deletion string.
Method:
Using method 2, pass a blob with two parts, for example:
blob = "** REMOVE ALL BLOB VERSIONS **::Binary object
removed after expiration"
or:
blob = "** REMOVE ALL BLOB VERSIONS **::Recording removed
after expiration, 2015-05-21"
If no "::..." is found, we would substitute a default message,
probably "Binary object removed after expiration."
Advantages:
More flexbility on messages.
Can distinguish what actually caused the removal, when, or
anything else the program wishes to record.
Disadvantages:
Same disadvantages as method 1.
Opens the door to more variation.
c. Add a column to the doc_blob table.
Method:
New column:
expired datetime null
Replace the "data" column with an empty string (or with text.)
Replace the "expired" column with the datetime of deletion.
Advantages:
Data disappears but not structure. We can still see the
skeleton of what was there.
Allows us to select data that is or is not expired, better
than methods 2 or 3. We can create views for non-expired or
expired data.
Tells us when this happened.
Allows us to coordinate the event with our audit trail info.
Disadvantages:
Need to update the database on all servers, adding the column
to 100,000+ rows. Not really a big deal.
It's up to the client program (FileSweeper in this case) to
track the reason for what happened.
What I Plan to Do
-----------------
Unless I hear better ideas, I plan to implement method 2c.
I vote for dropping the rows over leaving stubs or using another column, using an explicit delBlobVersions command.
What are are you thinking are the advantages of that approach?
I don't like the idea of magic strings that the software has to recognize, and removing the rows means no client software is going to think it has a real media blob and try to show or play it, only to blow up when what the server handed it is a string. As for using a new command, that seems like the most explicit, and as Tim says, "explicit is better ...." A variant - just about as good - would be to add another parameter to the cdr.repDoc() function. Something like delAllBlobVersions=True. I agree that it wouldn't be a bad idea to update the XML document, and I wouldn't object to something along those lines.
> I don't like the idea of magic strings that the software has to
recognize ...
I forgot how much Drupal you've been exposed to.
I'll go with your suggested modification to cdr.repDoc. If we don't do it that way, it will be necessary to run two transactions with two separate entries in the audit trail, which sounds less desirable if we always update the doc.
I plan to always add another ProcessingStatus to the Media document (see the first comment, "13 Apr 2015 05:54 PM"). As things stand, FileSweeper may be the only program in the life of the system that deletes blobs.
Following the principle of least surprises, I've attached a spreadsheet with two worksheets in it. One shows all of the recordings (CDR ID and Media XML document title) that would be deleted if the program were run today. The other shows all of the meeting recordings.
They're sorted by CDR ID, so if you look at the last numbered ID on the first sheet, everything beyond that number on the second sheet would be what remains after deleting the recordings.
My approach could do something unexpected, for example:
If someone were to create a Media document well before the meeting
and then attach the recording later, after the meeting, the date
of creation of the Media document would make the recording appear
older than it actually is and cause it to be deleted before three
months.
This is highly unlikely to happen as we never create a media document before we actually see the recorded file from the meeting. Essentially we have to see the recording before we initiate the process of creating the media doc.
I've finished all of the code and completed end to end integration testing - right up to the point of invoking the functions to modify the database - just logging what would happen instead of making it happen. I stopped at that point, not wanting to modify the database this late at night without a chance to think it out with a fresher mind.
However I think everything really is done and it only requires uncommenting some lines and stripping out some debug outputs to make it operational.
Changes are in:
FileSweeper.py (signficant)
cdr.py (very minor)
CdrDoc.cpp (minor)
CdrBlobExtern.h (very minor)
CdrBlob.cpp (significant)
There are some new legal keywords in the FileSweeper.cfg. I'll need to document those somewhere. I don't currently recall where documentation for the config file exists - but we should update it both for this and for the tier specific additions that were made.
The only change in a live directory on DEV (and only on DEV) is cdr.py. The other changes are in my sandboxes with backups on the h: drive. Nothing is checked in to svn.
I'll see everyone in thee weeks.
All of the testing I did on this task went right up to the point of deleting the recordings but, instead of actually deleting anything, it just printed a message. The problem is that, once I've deleted the meeting recording blobs on DEV, testing is no longer practical on that server.
To avoid that, I plan to do some artificial testing using very old dates that pick off just one or a few meeting docs at a time instead of testing with docs older than 90 days. This will test the actual deletion logic as well as the selection logic. When everything is testing okay, I'll report back.
There are now 179 meeting recordings stored on DEV, all of them older than 90 days - because DEV was refreshed from Prod more than 90 days ago. Hopefully, I'll be able to do a good test and still have more than a 100 or 150 meeting recordings for further testing if needed.
In order to centralize the information in one place (the JIRA ticket),
here is a list of the files that must be checked in to Curie and
included in the release:
Client changes:
cdr.py:
Added parameter to repDoc() delAllBlobVersions=0.
Added logic to insert a "DelAllBlobVersions" element in the
CdrRepDoc transaction sent to the CDR server.
Although I derived this from Ampere, I've ported it to the
version containing all of the cdr-security-remediation changes.
FileSweeper.py:
Added ability to invoke custom sweep routines.
Added the custom sweep routine expireMeetingRecordings()
Added cleanSession() to handle cleanup of session and database
connection objects. These objects were not needed in the old
FileSweeper but are required by expireMeetingRecordings().
Added some new logging routines for more functional logging.
Server changes:
CdrDoc.h:
Changed declaration of checkTitleChange(). No change in
meaning, only resolving an inconsistency between the way the
declaration was made for this function as opposed to other
functions.
CdrDoc.cpp:
Change to cdrPutDoc() to support DelAllBlobVersions in incoming
transaction. This just parses the command and invokes the
appropriate routine in CdrBlob.cpp.
CdrBlobExtern.h:
Added declaration for delAllBlobVersions().
CdrBlob.cpp:
Added significant function delAllBlobVersions() to delete all
blob versions and associated info when requested.
I've done a number of tests deleting old meeting notes, but not yet all of them older than 90 days on DEV. The bug I found after turning on deletions is now fixed.
The newest meeting records on DEV are quite old since the database was last refreshed from PROD much more than 90 days ago. Unless I hear objections in the next few hours, I will do a test to delete most of the remaining meeting minutes recordings. I'll delete all of those older than Jan 1, 2014 (516 days ago.) That will leave 10 recordings on DEV in case anyone needs data for other testing.
I ran the big test and it worked except for one document (CDR0000749760) that William has checked out on DEV. I fixed the code to handle that gracefully by writing an error message to the log file and continuing on. If this happens in production, the recording would be found again on the next run and, if the lock has been removed, the recording will be processed and removed.
Volker has suggested a future modification to the FileSweeper software to send email when errors like this occur rather than just log them - but that's for a future task,
The process took quite a while, about 6.5 minutes to delete around 165 meeting recordings with associated management information, and to update the Media xml. It's possible that a different design would run faster but the big task will only run once on each server. After that the backlog will be gone and only zero, one, or a few meeting recordings would be deleted during a FileSweeper run, taking perhaps 3 seconds each.
I've updated the FileSweeper documentation on Prod to include the changes for sweeping away meeting recordings. The document had been last updated in 2006, so I also made more edits to document other changes that have been made since then, such as the new ability to have different specifications for different servers in a single configuration file.
The updated doc is in the "System Information" section of the documentation.
There were no CDR documents for some of the higher CDR IDs numbers. I
assume those CDR IDs are from PROD? That is, I assume you generated the
spreadsheet against the database on PROD rather than on DEV, is that
right?
Here are three examples.
766125
766194
767233
Apart of the question above everything else looks good. Thanks!
> There were no CDR documents for some of the higher CDR IDs numbers. I assume those CDR IDs are from PROD?
That's right. The lower tier databases were refreshed from Prod quite a long time ago and do not include any of the recent data. Both the Media documents and the recordings themselves are quite old on Dev.
All code is now re-tested and checked in to subversion.
Hi Alan,
I think we're ready to test this on QA. Could you please run your
program to delete recordings older than 1 year or thereabouts? Then
we'll have room to test again (one or more times) in a smaller time
period.
Thanks,
Robin
This appears to be working. It seems to me that the file sweeper program has been scheduled already. I am currently testing some of the older files and getting the following message..
"No blob found for CDR document.... "
So far, I haven't found any problems.
> Could you please run your program to delete recordings older than 1 year or thereabouts?
I'm too late!
I set up the FileSweeper configuration file to be the official one (i.e., the one with a 90 day expiration on meeting recordings) and stored that in the version repository. When Curie was deployed to QA, that became the config file that the regular FileSweeper runs used. FileSweeper runs daily, so on the first run it deleted everything older than 90 days. Fortunately, there were some younger than that.
There are now 3 meeting recordings left on QA:
770665 Screening and Prevention Meeting Recording 03-25-15;meeting recording;MP3
770933 Supportive and Palliative Care Meeting Recording 03-30-15a;meeting recording;MP3
770934 Supportive and Palliative Care Meeting Recording 03-30-15b;meeting recording;MP3
That's a good thing and means that the FileSweeper is working correctly. As a further test, do nothing and check again 90 days after March 25 (or maybe 91 depending on what time of day FileSweeper runs.) That 03-25-15 recording should disappear. Five days later, the other two should disappear.
I think that's probably a good test.
90 days has happened today, so I'm expecting FileSweeper to run tonight and tomorrow you should see the 3-25 recording disappear.
I remember dithering about whether to put some leeway in the code (what does "after" 90 days mean?) So if it doesn't happen tomorrow, check the next day. There will also be a difference between a FileSweeper run at 11:59 pm and one at 12:01 am, so it may be necessary to wait one more day too.
If it still hasn't happened, let me know and I'll take two aspirin and re-examine the code. 🙂
Thanks, Alan! I'll check for that 3/25 file tomorrow and post a comment here.
I forgot to check for the 3/25 file yesterday, but it is still there today.
I just checked this myself.
I misled you (and myself) when I looked at the meeting date mentioned in the title and said that the recordings should be deleted 90 (or 91) days from that date. The software actually looks at the date of creation of the Media document. Those dates are later than the dates in the titles. Here are the actual dates that the Media docs were created:
CDR ID Actual date Title
770665 2015-03-31 Screening and Prevention Meeting Recording
03-25-15
770933 2015-04-08 Supportive and Palliative Care Meeting Recording
03-30-15a
770934 2015-04-08 Supportive and Palliative Care Meeting Recording
03-30-15b
90 days will come around for the first one on June 29. So that's when we need to start looking.
I checked the FileSweeper log file on QA. The last deletions were on Tuesday morning, June 23, at 1:15 am. Files deleted were:
770396 Pediatrics Meeting Recording 03-19-15 (Day1);meeting
recording;MP3
770397 Pediatrics Meeting Recording 03-20-15 (Day2);meeting
recording;MP3
770398 CAM Meeting Recording 03-13-15;meeting recording;MP3
All of them had been created on March 23. That's the 92nd day after they were created. What happens is that the software is looking back 91 days ("after" 90 days), but the sweep occurred at 1:15 am on that day and the files were created in the afternoon, which made them a few hours short of 91 days for full ripeness. So they were swept away the next morning. I expect the next deletion will actually occur early in the morning on July 1.
If it matters, let me know and I can cut off a day or two from the sweep date before we go to production.
OK, sounds good. I confirmed these are all coming up with the "no blob found..." message. I think 92 days is fine. I'm considering this QA verified. Thank you!
Verified this on PROD and it appears to be working correctly. I wasn't able to verify that they are being deleted on the 90th or the 90+ 2 days but all that the ones I expected to be deleted were deleted and the ones I expected to be there, were there. I am leaving this issue still open just in case further testing is needed.
I also verified this on PROD. Closing issue.
File Name | Posted | User |
---|---|---|
MeetingRecordings.xls | 2015-04-14 23:05:35 |
Elapsed: 0:00:00.001340