Issue Number | 4546 |
---|---|
Summary | Unable to Run Summary Mailer |
Created | 2018-11-05 14:59:44 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2018-11-06 16:55:39 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.235548 |
Both Bonnie and ~juther
tried to run summary mailers today. Both times the mailers failed to
finish and the scheduler service restarted as a result of these
mailers.
It appears that the last time a mailer ran successfully was on Oct 30th
(Job16808).
I ran the pending job on QA from the command line and it succeeded. My guess, therefore, is that CBIIT's gremlins did something to the file permissions again (probably on or around Hallowe'en). I have run fix-permissions on d:\Python. Please try again on QA. If it still doesn't work, I'll run fix-permissions on d:\cdr\Mailers and we'll try a third time (I'm not doing both at once, in the hope of finding out where the breakage was with more granularity).
Thank you!
I was just thinking I'm going crazy since the job finished after 30
minutes! Good to know it was you who pushed the job through.
Wait a minute - I'll submit another mailer.
Let me know if/when you need me to submit another mailer.
Sure, give it another shot. This time I ran fix-permissions on
d:/cdr/Mailers
d:/cdr/Output/Mailers
d:/cdr/Scheduler
Another mailer has been submitted but the scheduler still crashes.
Well, looks like we have three solutions we can pursue:
give the two of us login accounts on the upper-tier servers, and a couple of pagers, and the users just call us up and have us log in and run the job by hand;
give the users phone numbers for on-call CBIIT staff, who will log into the servers and run the jobs by hand; or
have CBIIT fix the Windows permissions problems.
Your thoughts?
I thought you ran the fix-permissions job but it didn't fix
the problem.
What is it that makes the scheduler service crash and why is the mailer
job working OK when run from the command line? Why aren't other jobs
failing? William did run a couple hot-fix jobs earlier today without
problems.
Don't know. We can investigate further tomorrow.
Bob was able to identify and fix the problem. The change has been copied to DEV and QA.
The following file has been updated:
Please verify on DEV or QA.
Adding ~juther to perform the testing of the mailers.
I've asked Bonnie to try running a summary mailer on QA.
This is also affecting our correspondence mailers. I've tested those on QA and they seem to be working fine.
Bonnie ran an AB mailer successfully, although she wasn't able to view it. I think that may just be because the script she uses ("ViewPrintJob...") in the command line is set up to view the mailers on PROD and would need to be adjusted for QA. Is that right? If so, I think we can proceed with moving this to PROD.
That is correct! As long as Bonnie can confirm the mailer job
finished successfully (and isn't stuck with the status "in process")
we're good to go.
Viewing the mailer job wasn't broken and we didn't make any changes to
that portion of the process.
I will submit a ticket to CBIIT to update the changes on STAGE and PROD.
I've submitted a CBIIT ticket to fix the mailer issue on STAGE and PROD: NCI-RITM0148443
The fix has been copied to STAGE and PROD. I tested a mailer on STAGE.
Could you please test a mailer on PROD, ~juther and ~fergusonbc?
The fix did fix the original problem on PROD but there is now another
problem of the same type but for a different part of the process.
Creating the LaTeX file is now failing but only on PROD.
I am working with CBIIT to have this fixed since the PROD server should
be set up just like the lower-tier servers (in theory).
Yesterday's test by Bonnie failed on PROD, but today we tried
again and the mailer ran without problems. A case of
gremlins.
The issue appears to be fixed - at least for the moment.
Please verify on PROD and close this ticket.
Updating the file associations on my computer fixed the problem with .rtf files that I reported in today's meeting, but we can keep this open for another day or so in case we run into any more gremlins. :-)
The gremlins are hibernating. No more issues since Nov. 8th.
Closing ticket.