CDR Tickets

Issue Number 4546
Summary Unable to Run Summary Mailer
Created 2018-11-05 14:59:44
Issue Type Bug
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2018-11-06 16:55:39
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.235548
Description

Both, Bonnie and tried to run summary mailers today. Both times the mailers failed to finish, and the scheduler service restarted as a result.
It appears that the last time a mailer ran successfully was on Oct 30th (Job16808).

Comment entered 2018-11-05 15:20:07 by Kline, Bob (NIH/NCI) [C]

I ran the pending job on QA from the command line and it succeeded. My guess, therefore, is that CBIIT's gremlins did something to the file permissions again (probably on or around Hallowe'en). I have run fix-permissions on d:\Python. Please try again on QA. If it still doesn't work, I'll run fix-permissions on d:\cdr\Mailers and we'll try a third time (I'm not doing both at once, in the hope of finding out more precisely where the breakage was).

Comment entered 2018-11-05 15:25:01 by Englisch, Volker (NIH/NCI) [C]

Thank you!
I was starting to think I was going crazy, since the job finished after 30 minutes! Good to know it was you who pushed the job through.

Wait a minute - I'll submit another mailer.

Comment entered 2018-11-05 15:54:54 by Englisch, Volker (NIH/NCI) [C]

Let me know if/when you need me to submit another mailer.

Comment entered 2018-11-05 16:10:37 by Kline, Bob (NIH/NCI) [C]

Sure, give it another shot. This time I ran fix-permissions on

  • d:/cdr/Mailers

  • d:/cdr/Output/Mailers

  • d:/cdr/Scheduler
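As an illustration of checking the directories one at a time, here is a minimal sketch of a write probe. It is not part of the CDR code base and is not the fix-permissions tool; the function names `can_write` and `probe` are hypothetical. It only tests whether the account running the script can create a file in each directory, so it would have to run under the same account as the scheduler service to say anything about the service's own access.

```python
import tempfile


def can_write(directory):
    """Return True if the current account can create a file in `directory`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers missing directories as well as permission failures.
        return False


def probe(directories):
    """Map each directory path to the result of the write check."""
    return {path: can_write(path) for path in directories}
```

Calling `probe(["d:/cdr/Mailers", "d:/cdr/Output/Mailers", "d:/cdr/Scheduler"])` would return one True/False entry per directory, which narrows down where a permissions problem sits without changing any permissions.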

Comment entered 2018-11-05 16:17:50 by Englisch, Volker (NIH/NCI) [C]

Another mailer has been submitted but the scheduler still crashes.

Comment entered 2018-11-05 16:28:11 by Kline, Bob (NIH/NCI) [C]

Well, looks like we have three solutions we can pursue:

  1. give the two of us login accounts on the upper-tier servers, and a couple of pagers, and the users just call us up and have us log in and run the job by hand;

  2. give the users phone numbers for on-call CBIIT staff, who will log into the servers and run the jobs by hand; or

  3. have CBIIT fix the Windows permissions problems

Your thoughts?

Comment entered 2018-11-05 16:53:51 by Englisch, Volker (NIH/NCI) [C]

I thought you ran the fix-permissions job but it didn't fix the problem.
What is it that makes the scheduler service crash, and why does the mailer job work OK when run from the command line? Why aren't other jobs failing? William ran a couple of hot-fix jobs earlier today without problems.

Comment entered 2018-11-05 20:47:29 by Kline, Bob (NIH/NCI) [C]

Don't know. We can investigate further tomorrow.

Comment entered 2018-11-06 16:55:25 by Englisch, Volker (NIH/NCI) [C]

Bob was able to identify and fix the problem. The change has been copied to DEV and QA.

The following file has been updated:

Please verify on DEV or QA.

Comment entered 2018-11-06 16:56:25 by Englisch, Volker (NIH/NCI) [C]

Adding to perform the testing of the mailers.

Comment entered 2018-11-07 10:37:23 by Juthe, Robin (NIH/NCI) [E]

I've asked Bonnie to try running a summary mailer on QA.

This is also affecting our correspondence mailers. I've tested those on QA and they seem to be working fine.

Comment entered 2018-11-07 10:58:20 by Juthe, Robin (NIH/NCI) [E]

Bonnie ran an AB mailer successfully, although she wasn't able to view it. I think that may just be because the script she uses ("ViewPrintJob...") in the command line is set up to view the mailers on PROD and would need to be adjusted for QA. Is that right? If so, I think we can proceed with moving this to PROD.

Comment entered 2018-11-07 11:08:10 by Englisch, Volker (NIH/NCI) [C]

That is correct! As long as Bonnie can confirm the mailer job finished successfully (and isn't stuck with the status "in process") we're good to go.
Viewing the mailer job wasn't broken and we didn't make any changes to that portion of the process.

I will submit a ticket to CBIIT to update the changes on STAGE and PROD.

Comment entered 2018-11-07 11:30:50 by Englisch, Volker (NIH/NCI) [C]

I've submitted a CBIIT ticket to fix the mailer issue on STAGE and PROD: NCI-RITM0148443

Comment entered 2018-11-07 13:21:00 by Englisch, Volker (NIH/NCI) [C]

The fix has been copied to STAGE and PROD. I tested a mailer on STAGE.

Could you please test a mailer on PROD, and ?

Comment entered 2018-11-07 18:25:46 by Englisch, Volker (NIH/NCI) [C]

The fix did fix the original problem on PROD but there is now another problem of the same type but for a different part of the process. Creating the LaTeX file is now failing but only on PROD.
I am working with CBIIT to have this fixed, since the PROD server should be set up just like the lower-tier servers (in theory).

Comment entered 2018-11-08 13:18:38 by Englisch, Volker (NIH/NCI) [C]

Yesterday's test by Bonnie failed on PROD, but today we tried again and the mailer ran without problems. A case of gremlins.
The issue appears to be fixed - at least for the moment.

Please verify on PROD and close this ticket.

Comment entered 2018-11-08 15:45:55 by Juthe, Robin (NIH/NCI) [E]

Updating the file associations on my computer fixed the problem with .rtf files that I reported in today's meeting, but we can keep this open for another day or so in case we run into any more gremlins. :-)

Comment entered 2018-11-29 13:13:09 by Englisch, Volker (NIH/NCI) [C]

The gremlins are hibernating. No more issues since Nov. 8th.
Closing ticket.
