CDR Tickets

Issue Number 4437
Summary Investigate Performance of CDR STAGE
Created 2018-03-13 13:35:21
Issue Type Task
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2018-07-19 13:57:30
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.222577
Description

Since our latest CDR Gauss release our STAGE server is our slowest server although it used to be faster than DEV or QA before Gauss.

I want to monitor the performance on all four CDR servers for about 2-3 more weeks before submitting a CBIIT ticket to look into the performance issues.

Comment entered 2018-04-18 17:25:56 by Englisch, Volker (NIH/NCI) [C]

A CBIIT ticket has been submitted to investigate the performance of our publishing jobs on STAGE. Chuck Solie is looking into this.

Comment entered 2018-07-19 13:52:32 by Englisch, Volker (NIH/NCI) [C]

Email message sent to Chuck Solie:

Hi Chuck, (mental note - don't press enter for line breaks). Thank you for picking up this ticket again. Yes, you are correct. This ticket certainly is a low-priority task because it doesn't affect production. However, I wouldn't categorize it under "curiosity" because it is important to know why one of our servers is significantly slower than the others. Given that the servers - especially PROD and STAGE - are supposed to be nearly identical, we see red flags if they behave differently and we would like to get to the bottom of this.

Anyway, you asked whether the runtime is still impacting the production jobs on the different tiers. I checked the logs for our four tiers and noticed that our lower tiers are now processing documents at almost identical, fast times. In fact, between 6/27 and 6/28 something changed that significantly improved performance, not just on the STAGE server but on the DEV and QA servers as well.
Before this change, our nightly job ran on average for 14 minutes on STAGE and 10-11 minutes on DEV and QA. Since 6/28 the runtime has dropped to about 6.5 minutes on all lower tiers.
Processing time for the longer weekend jobs changed from 31-34 minutes to 17-19 minutes on all lower tiers.

Processing time on the PROD server does not appear to have been affected.

Comment entered 2018-07-19 13:57:22 by Englisch, Volker (NIH/NCI) [C]

Summary of my last message:
Something changed on the lower-tier servers on 6/27 that improved processing time on all of them. The nightly publishing job now finishes in under 7 minutes on all three lower tiers, which is about 40% faster than before on QA/DEV and about 55% faster than before on STAGE.
Similar improvements can be seen for the CDR weekend job.
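
The percentages above can be checked with a short calculation. This is only a sketch: the runtimes are the approximate values quoted in the email, and using the midpoints of the quoted ranges (10-11, 31-34, 17-19 minutes) is an assumption made for illustration, not data from the spreadsheet.

```python
def percent_reduction(before_minutes, after_minutes):
    """Return the percentage by which the runtime dropped."""
    return (before_minutes - after_minutes) / before_minutes * 100


# Approximate runtimes in minutes (before, after), as quoted in this ticket.
# Range midpoints are an assumption made for illustration.
runtimes = {
    "nightly, STAGE": (14.0, 6.5),
    "nightly, DEV/QA": (10.5, 6.5),        # midpoint of 10-11
    "weekend, lower tiers": (32.5, 18.0),  # midpoints of 31-34 and 17-19
}

for label, (before, after) in runtimes.items():
    print(f"{label}: {percent_reduction(before, after):.1f}% less time")
```

With these inputs the nightly job comes out at roughly 54% less time on STAGE and 38% less on DEV/QA, in line with the rounded 55% and 40% quoted above, and the weekend job at roughly 45% less.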

I consider this ticket as completed.

Comment entered 2018-07-19 14:21:16 by Englisch, Volker (NIH/NCI) [C]

Adding spreadsheet with publishing performance times. Note that earlier times (prior to 6/8) are identical to PROD times due to database refreshes on lower tiers.

Comment entered 2018-07-19 15:18:21 by Englisch, Volker (NIH/NCI) [C]

I'm including the explanation from CBIIT for the performance increase. I do not understand how virtual machines work, but I'm not convinced by this argument:

The hardware under a virtual host can vary as it is rebooted or migrated between virtualization hardware. The hardware is not all identical in its generation or performance, so I suspect the slower performance was linked to slower CPU/RAM on underlying hardware.

ncias-s1186-v - 6/27 was powered off at 7:55pm and restarted just after 8pm, and then rebooted at 8:35pm during systems maintenance (note: powering off a virtual host and powering it back up, as opposed to rebooting it, increases the likelihood that it will be started on different underlying hardware than it was previously on)

nciws-d141-v - 6/27 was powered off at 6:19pm and restarted at 7:28pm. (note: powering off a virtual host and powering it back up increases the likelihood that it will be started on different underlying hardware than it was previously on)

nciws-q179-v - I don't find any logs of power off/on but suspect a similar change.

Let me know if this is sufficient info for you and whether the ticket is ok to close. Thanks.

Comment entered 2018-07-19 15:19:19 by Englisch, Volker (NIH/NCI) [C]

I'm closing the ticket since the task to improve the performance on STAGE has been completed.

Attachments
File Name Posted User
Pub-Performance.xlsx 2018-07-19 14:19:24 Englisch, Volker (NIH/NCI) [C]
