EBMS Tickets

Issue Number 392
Summary EBMS database is growing too large
Created 2016-07-07 13:12:48
Issue Type Task
Submitted By henryec
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2016-09-08 11:01:23
Resolution Won't Fix
Path /home/bkline/backups/jira/oceebms/issue.187700
Description

Background:
The EBMS database has grown very large and continues to grow, and its size will likely become a problem in the future. It already holds millions of rows. This issue was highlighted when a new developer was unable to set up EBMS on their development resource due to the size of the database.

Objective:
Find a business and/or technical solution that would keep the database size in check or make it grow more slowly.

Also, work with the users to determine if the business needs everything currently in the database (maybe there is a retention period for different types of data).

Comment entered 2016-07-07 16:12:30 by Kline, Bob (NIH/NCI) [C]

Two points of clarification:

  1. The numbers I pulled out of my head in this morning's standup were way too high: there are approximately 400K articles in the system right now.

  2. For some perspective, the CDR databases (roughly half a terabyte) are roughly two orders of magnitude larger than the EBMS database.

I'm populating a copy of the EBMS DEV DB on my developer VM, and as far as I can tell (it hasn't finished yet), there haven't been any problems.

Comment entered 2016-08-30 11:32:15 by Kline, Bob (NIH/NCI) [C]

As I recall, the business decision was that we should retain the older articles for which decisions have already been made. This ticket asks you to revisit that decision. Should I assign story points for the ticket?

Comment entered 2016-08-30 11:46:08 by Juthe, Robin (NIH/NCI) [E]

We do still need to retain older articles for which decisions have been made. We often refer back to those articles when performing comprehensive reviews of the summaries, and we are now allowing our Board members to access these articles as well. In fact, some of us are retroactively adding articles that are currently cited in our summaries to the database, so this is definitely going to continue to grow.

There may be other types of data (summaries, for example) that could be deleted after a specified time, but I'm not sure how much that would help with the problem. It would be helpful to know if there are other types of data that are contributing to the problem besides articles.

I think it's worth discussing to see if there's anything that can be done from either the business or technical side, so please add whatever story points you feel are necessary. Thanks.

Comment entered 2016-08-31 16:54:19 by Kline, Bob (NIH/NCI) [C]

The EBMS database tables on PROD take up 12.1 GB of space right now. The largest table is the ebms_article table, occupying 10.3 GB. The top six tables in size all carry information about the articles (authors, states, topics, etc.) and represent 11.7 GB, or roughly 97% of the total size of the database. So there are no significant space savings available without getting rid of articles.
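For reference, the proportions quoted above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the three sizes are the figures from this comment, and the constant and function names are made up for illustration, not taken from the EBMS code base.

```python
# Sizes in GB, as quoted in the comment above.
# All names here are hypothetical and exist only for this illustration.
TOTAL_GB = 12.1          # all EBMS tables on PROD
ARTICLE_TABLE_GB = 10.3  # ebms_article alone
TOP_SIX_GB = 11.7        # the six largest tables, all article-related


def share(part_gb, total_gb):
    """Return part_gb as a percentage of total_gb."""
    return 100.0 * part_gb / total_gb


article_share = share(ARTICLE_TABLE_GB, TOTAL_GB)
top_six_share = share(TOP_SIX_GB, TOTAL_GB)
remainder_gb = TOTAL_GB - TOP_SIX_GB  # space used by every other table

print(f"ebms_article: {article_share:.1f}% of total")
print(f"top six tables: {top_six_share:.1f}% of total")
print(f"everything else: {remainder_gb:.1f} GB")
```

On these figures, everything outside the six article-related tables accounts for well under 1 GB, which is why trimming other kinds of data would not noticeably shrink the database.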

Comment entered 2016-09-08 11:01:23 by Kline, Bob (NIH/NCI) [C]

Closed as requested by users.
