Issue Number | 4478 |
---|---|
Summary | Streamline CDR databases |
Created | 2018-05-24 07:28:42 |
Issue Type | Improvement |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2019-08-26 12:11:08 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.226647 |
~bryanp has requested that the footprint of the CDR database be significantly reduced in order to improve processing and backup efficiency, and to make it easier to replicate the CDR environment on developers' systems. This would include some or all of the following:
stub out the XML column for archived CDR versions of obsolete document types (i.e., clinical trial documents) in the cdr_archived_versions.dbo.doc_version_xml table
stub out the XML column for the current working copies of such documents in the CDR.dbo.all_docs table
eliminate rows altogether for some or all of these documents
possibly eliminate some of the older versions of Summary documents
eliminate obsolete tables altogether
Eliminating tables and/or rows will require analysis to determine which relational integrity dependencies need to be addressed.
I have attached an Excel workbook identifying the existing tables in
the two CDR databases. Of the 195 base user tables (there is a 196th
table dtproperties
which is incorrectly listed as a user
table but which is actually a Microsoft-generated table), 121 are
obsolete and can be eliminated. In addition, another two tables are
small lookup tables which can be converted to CHECK CONSTRAINTs. This
would leave a total of 72 tables still in production use. The usage of
each of those 72 tables is identified on the first spreadsheet in the
workbook.
document-version-space.xlsx provides the
information ~bryanp requested
on Tuesday about how much space is taken up by the protocol document XML
in the version archive table. The four CDR documents types for clinical
trial information (CTGovProtocol
,
InScopeProtocol
, ScientificProtocolInfo
, and
OutOfScopeProtocol
collectively represent 161,990,356,057
of the total 178,447,474,938 Unicode characters stored in the
xml
column of the
cdr_archived_versions.dbo.doc_version_xml
table, or
91%.
cwd-space-usage-by-doctype.xlsx shows the same
information for the xml
column of the
CDR.dbo.all_docs
table (the "current working documents"). A
little over 53% of the space in that column is used by clinical trial
("protocol") documents.
Here are the dependencies which come into play when considering
removal of rows from the all_doc_versions
or
all_docs
tables.
all_docs
all_doc_versions
pub_proc_doc
version_blob_usage
audit_trail
audit_trail_added_action
audit_trail_added_action
checkout
doc_blob_usage
doc_type
all_doc_versions
all_docs
external_map_type
grp_action
link_target
link_xml
doc_version_label
external_map
external_map_rule
filter_set_member
glossary_term_audio_request
link_fragment
link_net
pub_proc
pub_proc_cg
pub_proc_cg_work
pub_proc_doc
pub_proc_parm
remailer_ids
pub_proc_cg
pub_proc_cg_work
query_term
query_term_pub
ready_for_review
remailer_ids
summary_translation_job summary_translation_job_history
If table B
is indented under table A
then
there is a relational integrity constraint which prevents rows in table
A
from being deleted if any rows in table B
refer to those rows.
Obsolete tables have been dropped on DEV (script attached).
I have stubbed out the protocol version XML in the archive database. Total bytes for the XML values before stubbing was 166.2GB. After stubbing the number was 15.4 (9.27% of the original). Deleting the rows altogether would bring it down to 15.3GB for the versions of the other document types (9.22% of the original). These figures ignore overhead bytes (which cut both ways, since each row contains a fixed amount of overhead storage independent of its data, whereas a database with no table rows has an irreducible base size).
~bryanp: shall I proceed with eliminating the protocol documents altogether? (I'm going to guess that the answer will be "yes" even though the additional immediate gain is almost a rounding error, given your comments yesterday about the need to aggressively eliminate cruft for the benefit of future incarnations of the CDR.) :-)
I have a script for dropping documents we no longer want to keep, including:
all InScopeProtocol
documents
all CTGovProtocol
documents
all ScientificProtocolInfo
documents
all OutOfScopeProtocol
documents
all GENETICSPROFESSIONAL
documents (from before we
used Person
docs for these)
all GlossaryTerm
documents (superceded by
GlossaryTermConcept
and GlossaryTermName
docs)
all css
documents (we no longer use the database to
store the style sheets for XMetaL)
Person
documents other than GPs and board
members
Organization
documents other than PDQ boards and
orgs linked directly by Person
docs we're keeping
Mailer
documents except for summary and GP
mailers
An earlier version of the script had a bug which caused all of the
Mailer
documents to be deleted. I have a separate JIRA
ticket (OCECDR-4502) for ~volker to refresh the DEV databases from PROD
so I can run the script again before we ask the users to check the
results. I will walk through the fixed script with ~volker first (assuming he has bandwidth).
Here's what the numbers will be like on PROD:
!/attachment/ocecdr_117900_2018-07-31 17_16_18-sdlm-20180731170840.xls [Compatibility Mode] - Excel.png!
I will create a report identifying all of the links in CDR documents on PROD pointing to documents which will be removed.
Here is the promised report. Note that quite a few of the
summary->protocol links are to InScopeProtocol
documents.
And here is a snapshot capturing all of the NCT IDs in the CTGovProtocol documents:
Snapshot capturing all of the NCT IDs in the InScopeProtocol documents:
I have exported all of the rows we'll be deleting from
all_docs
and doc_versions
(last publishable
version, as well as the last version, if they're different) to flat
files for future reference and saved a copy on DEV in d:/cdr/backups (as
well as on my GFE laptop and my DEV VM for safe keeping). Just in case.
:-) I didn't bother with the rows in checkout
or the audit
trail tables, but it doesn't seem likely we'd need that information.
Waiting for the ticket from ~oseipokuw for the global change job.
Copy also at
nciis-p401.nci.nih.gov\Group03\OCPL\OCPL_Cross\CDR\cdr-obsolete.7z
I have asked CBIIT to proceed with the requested compacting and reporting for NCI-RITM0130991 (on DEV).
[From ServiceNow ticket]
Sep 05, 2018 12:36:55 EDT - Samuel Mekonnen Additional comments
Hi Bob,
Shrinking Tha databasefile and log file of CDR and cdr_archived_versions completed successfully here is the size before and after Disk usage.
Before
CDR (Datafile) 99.7 GB
CDR Log (File) 6.99 GB
cdr_archived_versions (Data file) 350 GB
cdr_archived Log (File) 30.5 GB
After.
CDR (Datafile) : 73.4
CDR Log (File): 504 KB
cdr_archived_versions (Datafile) 36 GB
cdr_archived_versions Log (File) 504. KB
I'm a little surprised that the CDR database didn't drop more in
size, given that we cut the size taken up by the documents in half, but
the savings for the cdr_archived_versions
table are
impressive. In total, we dropped from 487 GB to 110 GB (77% drop).
Ignoring the log files, which muddy the picture somewhat, the drop is
from 449.7 GB to 109 GB (a still impressive 76% decrease).
Next steps, ~bryanp, are
clear this task with Lakshmi
decide whether to pursue with the users the pruning of older summary versions
I'll defer work on pushing this up the tiers until those two are done.
Spoke with Lakshmi 2/22/2019 and she gave the verbal go-ahead to proceed with running the script for dropping documents we no longer want to keep on the higher tiers, up through PROD, including:
all InScopeProtocol documents
all CTGovProtocol documents
all ScientificProtocolInfo documents
all OutOfScopeProtocol documents
all GENETICSPROFESSIONAL documents (from before we used Person docs for these)
all GlossaryTerm documents (superceded by GlossaryTermConcept and GlossaryTermName docs)
all css documents (we no longer use the database to store the style sheets for XMetaL)
Person documents other than GPs and board members
Organization documents other than PDQ boards and orgs linked directly by Person docs we're keeping
Mailer documents except for summary and GP mailers
I will draft an email with an outline of the planned pruning for Lakshmi and Margaret, so that they can respond to it to provide documented approval.
Sign-off received. See attached approval email.
Amy will communicate with the CCCT Comm Lead about the document deletions from the CDR.
New task for ~bkline - Separate the XML out of the last dump from early Aug 2018 and store.
New task for ~bkline - Separate the XML out of the last dump from early Aug 2018 and store.
Done.
CCCT's Comm Lead communicated the planned streamline with Samantha Finstad, CCCT's Comm Manager, and CCCT thanked OCPL for the update. See attached email.
I am getting ready to apply the CDR database streamlining to QA.
First step is going to be refreshing the database on QA from PROD.
Database refresh script failed. Ticket NCI-RITM0176661 opened with CBIIT.
CBIIT successfully ran the streamline job on STAGE. Waiting for Bob to confirm after vacation and then run the same job on PROD.
The job ran for about 6 hours 15 minutes.
I included this ticket in the writeup of the Joule release on Collaborate, as the work for the streamlining rolled into the home stretch as Joule was deploying (and at least one of the Joule tickets was in support of this project).
File Name | Posted | User |
---|---|---|
2018-07-31 17_16_18-sdlm-20180731170840.xls [Compatibility Mode] - Excel.png | 2018-07-31 17:17:49 | Kline, Bob (NIH/NCI) [C] |
cdr-tables.xlsx | 2018-05-24 07:29:48 | Kline, Bob (NIH/NCI) [C] |
CTGovProtocolNCTIDs.xlsx | 2018-08-02 17:52:14 | Kline, Bob (NIH/NCI) [C] |
cwd-space-usage-by-doctype.xlsx | 2018-05-24 08:34:50 | Kline, Bob (NIH/NCI) [C] |
document-version-space.xlsx | 2018-05-24 07:49:20 | Kline, Bob (NIH/NCI) [C] |
drop-obsolete-tables.py | 2018-07-26 10:38:41 | Kline, Bob (NIH/NCI) [C] |
InScopeProtocolNCTIDs.xlsx | 2018-08-03 10:24:58 | Kline, Bob (NIH/NCI) [C] |
links-to-obsolete-docs-prod.xlsx | 2018-08-02 17:11:30 | Kline, Bob (NIH/NCI) [C] |
RE_ CDR Streamline - document removal.pdf | 2019-04-11 08:20:39 | Dugan, Amy (NIH/NCI) [C] |
RE_ Clinical Trial Protocol documents in the CDR.pdf | 2019-04-11 08:18:54 | Dugan, Amy (NIH/NCI) [C] |
Elapsed: 0:00:00.001977