CDR Tickets

Issue Number 4478
Summary Streamline CDR databases
Created 2018-05-24 07:28:42
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2019-08-26 12:11:08
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.226647
Description

has requested that the footprint of the CDR database be significantly reduced in order to improve processing and backup efficiency, and to make it easier to replicate the CDR environment on developers' systems. This would include some or all of the following:

  1. stub out the XML column for archived CDR versions of obsolete document types (i.e., clinical trial documents) in the cdr_archived_versions.dbo.doc_version_xml table

  2. stub out the XML column for the current working copies of such documents in the CDR.dbo.all_docs table

  3. eliminate rows altogether for some or all of these documents

  4. possibly eliminate some of the older versions of Summary documents

  5. eliminate obsolete tables altogether

Eliminating tables and/or rows will require analysis to determine which relational integrity dependencies need to be addressed.

Comment entered 2018-05-24 07:37:40 by Kline, Bob (NIH/NCI) [C]

I have attached an Excel workbook identifying the existing tables in the two CDR databases. Of the 195 base user tables (there is a 196th table dtproperties which is incorrectly listed as a user table but which is actually a Microsoft-generated table), 121 are obsolete and can be eliminated. In addition, another two tables are small lookup tables which can be converted to CHECK CONSTRAINTs. This would leave a total of 72 tables still in production use. The usage of each of those 72 tables is identified on the first spreadsheet in the workbook.

Comment entered 2018-05-24 08:00:11 by Kline, Bob (NIH/NCI) [C]

document-version-space.xlsx provides the information requested on Tuesday about how much space is taken up by the protocol document XML in the version archive table. The four CDR documents types for clinical trial information (CTGovProtocol, InScopeProtocol, ScientificProtocolInfo, and OutOfScopeProtocol collectively represent 161,990,356,057 of the total 178,447,474,938 Unicode characters stored in the xml column of the cdr_archived_versions.dbo.doc_version_xml table, or 91%.

Comment entered 2018-05-24 09:17:03 by Kline, Bob (NIH/NCI) [C]

cwd-space-usage-by-doctype.xlsx shows the same information for the xml column of the CDR.dbo.all_docs table (the "current working documents"). A little over 53% of the space in that column is used by clinical trial ("protocol") documents.

Comment entered 2018-05-24 09:30:38 by Kline, Bob (NIH/NCI) [C]

Here are the dependencies which come into play when considering removal of rows from the all_doc_versions or all_docs tables.

all_docs
  all_doc_versions
    pub_proc_doc
    version_blob_usage
  audit_trail
    audit_trail_added_action
  audit_trail_added_action
  checkout
  doc_blob_usage
  doc_type
    all_doc_versions
    all_docs
    external_map_type
    grp_action
    link_target
    link_xml
  doc_version_label
  external_map
    external_map_rule
  filter_set_member
  glossary_term_audio_request
  link_fragment
  link_net
  pub_proc
    pub_proc_cg
    pub_proc_cg_work
    pub_proc_doc
    pub_proc_parm
    remailer_ids
  pub_proc_cg
  pub_proc_cg_work
  query_term
  query_term_pub
  ready_for_review
  remailer_ids
  summary_translation_job
  summary_translation_job_history

If table B is indented under table A then there is a relational integrity constraint which prevents rows in table A from being deleted if any rows in table B refer to those rows.

Comment entered 2018-07-26 10:39:34 by Kline, Bob (NIH/NCI) [C]

Obsolete tables have been dropped on DEV (script attached).

Comment entered 2018-07-27 09:24:33 by Kline, Bob (NIH/NCI) [C]

I have stubbed out the protocol version XML in the archive database. Total bytes for the XML values before stubbing was 166.2GB. After stubbing the number was 15.4 (9.27% of the original). Deleting the rows altogether would bring it down to 15.3GB for the versions of the other document types (9.22% of the original). These figures ignore overhead bytes (which cut both ways, since each row contains a fixed amount of overhead storage independent of its data, whereas a database with no table rows has an irreducible base size).

: shall I proceed with eliminating the protocol documents altogether? (I'm going to guess that the answer will be "yes" even though the additional immediate gain is almost a rounding error, given your comments yesterday about the need to aggressively eliminate cruft for the benefit of future incarnations of the CDR.) :-)

Comment entered 2018-07-31 16:30:59 by Kline, Bob (NIH/NCI) [C]

I have a script for dropping documents we no longer want to keep, including:

  • all InScopeProtocol documents

  • all CTGovProtocol documents

  • all ScientificProtocolInfo documents

  • all OutOfScopeProtocol documents

  • all GENETICSPROFESSIONAL documents (from before we used Person docs for these)

  • all GlossaryTerm documents (superceded by GlossaryTermConcept and GlossaryTermName docs)

  • all css documents (we no longer use the database to store the style sheets for XMetaL)

  • Person documents other than GPs and board members

  • Organization documents other than PDQ boards and orgs linked directly by Person docs we're keeping

  • Mailer documents except for summary and GP mailers

An earlier version of the script had a bug which caused all of the Mailer documents to be deleted. I have a separate JIRA ticket (OCECDR-4502) for to refresh the DEV databases from PROD so I can run the script again before we ask the users to check the results. I will walk through the fixed script with first (assuming he has bandwidth).

Comment entered 2018-07-31 17:19:20 by Kline, Bob (NIH/NCI) [C]

Here's what the numbers will be like on PROD:

!/attachment/ocecdr_117900_2018-07-31 17_16_18-sdlm-20180731170840.xls [Compatibility Mode] - Excel.png!

Comment entered 2018-08-02 13:52:42 by Kline, Bob (NIH/NCI) [C]

I will create a report identifying all of the links in CDR documents on PROD pointing to documents which will be removed.

Comment entered 2018-08-02 17:12:16 by Kline, Bob (NIH/NCI) [C]

Here is the promised report. Note that quite a few of the summary->protocol links are to InScopeProtocol documents.

links-to-obsolete-docs-prod.xlsx

Comment entered 2018-08-02 17:52:54 by Kline, Bob (NIH/NCI) [C]

And here is a snapshot capturing all of the NCT IDs in the CTGovProtocol documents:

CTGovProtocolNCTIDs.xlsx

Comment entered 2018-08-03 10:25:25 by Kline, Bob (NIH/NCI) [C]

Snapshot capturing all of the NCT IDs in the InScopeProtocol documents:

InScopeProtocolNCTIDs.xlsx

Comment entered 2018-08-03 16:53:33 by Kline, Bob (NIH/NCI) [C]

I have exported all of the rows we'll be deleting from all_docs and doc_versions (last publishable version, as well as the last version, if they're different) to flat files for future reference and saved a copy on DEV in d:/cdr/backups (as well as on my GFE laptop and my DEV VM for safe keeping). Just in case. :-) I didn't bother with the rows in checkout or the audit trail tables, but it doesn't seem likely we'd need that information. Waiting for the ticket from for the global change job.

Comment entered 2018-08-04 10:12:43 by Kline, Bob (NIH/NCI) [C]

Copy also at
nciis-p401.nci.nih.gov\Group03\OCPL\OCPL_Cross\CDR\cdr-obsolete.7z

Comment entered 2018-08-21 11:16:30 by Kline, Bob (NIH/NCI) [C]

I have asked CBIIT to proceed with the requested compacting and reporting for NCI-RITM0130991 (on DEV).

Comment entered 2018-09-06 07:17:42 by Kline, Bob (NIH/NCI) [C]

[From ServiceNow ticket]

Sep 05, 2018 12:36:55 EDT - Samuel Mekonnen Additional comments
Hi Bob,

Shrinking Tha databasefile and log file of CDR and cdr_archived_versions completed successfully here is the size before and after Disk usage.

Before

  • CDR (Datafile) 99.7 GB

  • CDR Log (File) 6.99 GB

  • cdr_archived_versions (Data file) 350 GB

  • cdr_archived Log (File) 30.5 GB

After.

  • CDR (Datafile) : 73.4

  • CDR Log (File): 504 KB

  • cdr_archived_versions (Datafile) 36 GB

  • cdr_archived_versions Log (File) 504. KB

Comment entered 2018-09-06 07:30:54 by Kline, Bob (NIH/NCI) [C]

I'm a little surprised that the CDR database didn't drop more in size, given that we cut the size taken up by the documents in half, but the savings for the cdr_archived_versions table are impressive. In total, we dropped from 487 GB to 110 GB (77% drop). Ignoring the log files, which muddy the picture somewhat, the drop is from 449.7 GB to 109 GB (a still impressive 76% decrease).

Next steps, , are

  1. clear this task with Lakshmi

  2. decide whether to pursue with the users the pruning of older summary versions

I'll defer work on pushing this up the tiers until those two are done.

Comment entered 2019-02-22 16:25:09 by Dugan, Amy (NIH/NCI) [C]

Spoke with Lakshmi 2/22/2019 and she gave the verbal go-ahead to proceed with running the script for dropping documents we no longer want to keep on the higher tiers, up through PROD, including:

  • all InScopeProtocol documents

  • all CTGovProtocol documents

  • all ScientificProtocolInfo documents

  • all OutOfScopeProtocol documents

  • all GENETICSPROFESSIONAL documents (from before we used Person docs for these)

  • all GlossaryTerm documents (superceded by GlossaryTermConcept and GlossaryTermName docs)

  • all css documents (we no longer use the database to store the style sheets for XMetaL)

  • Person documents other than GPs and board members

  • Organization documents other than PDQ boards and orgs linked directly by Person docs we're keeping

  • Mailer documents except for summary and GP mailers

I will draft an email with an outline of the planned pruning for Lakshmi and Margaret, so that they can respond to it to provide documented approval.

Comment entered 2019-03-07 13:16:13 by Dugan, Amy (NIH/NCI) [C]

Sign-off received. See attached approval email.

Comment entered 2019-03-28 14:06:25 by Dugan, Amy (NIH/NCI) [C]

Amy will communicate with the CCCT Comm Lead about the document deletions from the CDR.

 

New task for - Separate the XML out of the last dump from early Aug 2018 and store.

Comment entered 2019-03-29 10:48:00 by Kline, Bob (NIH/NCI) [C]

New task for  - Separate the XML out of the last dump from early Aug 2018 and store.

Done.

Comment entered 2019-04-11 08:03:36 by Dugan, Amy (NIH/NCI) [C]

CCCT's Comm Lead communicated the planned streamline with Samantha Finstad, CCCT's Comm Manager, and CCCT thanked OCPL for the update. See attached email.

Comment entered 2019-05-23 09:37:24 by Kline, Bob (NIH/NCI) [C]

I am getting ready to apply the CDR database streamlining to QA.

Comment entered 2019-05-23 10:24:17 by Kline, Bob (NIH/NCI) [C]

First step is going to be refreshing the database on QA from PROD.

Comment entered 2019-05-23 10:53:42 by Kline, Bob (NIH/NCI) [C]

Database refresh script failed. Ticket NCI-RITM0176661 opened with CBIIT.

Comment entered 2019-06-07 11:42:08 by Englisch, Volker (NIH/NCI) [C]

CBIIT successfully ran the streamline job on STAGE.  Waiting for Bob to confirm after vacation and then run the same job on PROD.

The job ran for about 6 hours 15 minutes.

Comment entered 2019-08-26 12:11:08 by Kline, Bob (NIH/NCI) [C]

I included this ticket in the writeup of the Joule release on Collaborate, as the work for the streamlining rolled into the home stretch as Joule was deploying (and at least one of the Joule tickets was in support of this project).

Attachments
File Name Posted User
2018-07-31 17_16_18-sdlm-20180731170840.xls [Compatibility Mode] - Excel.png 2018-07-31 17:17:49 Kline, Bob (NIH/NCI) [C]
cdr-tables.xlsx 2018-05-24 07:29:48 Kline, Bob (NIH/NCI) [C]
CTGovProtocolNCTIDs.xlsx 2018-08-02 17:52:14 Kline, Bob (NIH/NCI) [C]
cwd-space-usage-by-doctype.xlsx 2018-05-24 08:34:50 Kline, Bob (NIH/NCI) [C]
document-version-space.xlsx 2018-05-24 07:49:20 Kline, Bob (NIH/NCI) [C]
drop-obsolete-tables.py 2018-07-26 10:38:41 Kline, Bob (NIH/NCI) [C]
InScopeProtocolNCTIDs.xlsx 2018-08-03 10:24:58 Kline, Bob (NIH/NCI) [C]
links-to-obsolete-docs-prod.xlsx 2018-08-02 17:11:30 Kline, Bob (NIH/NCI) [C]
RE_ CDR Streamline - document removal.pdf 2019-04-11 08:20:39 Dugan, Amy (NIH/NCI) [C]
RE_ Clinical Trial Protocol documents in the CDR.pdf 2019-04-11 08:18:54 Dugan, Amy (NIH/NCI) [C]

Elapsed: 0:00:00.001977