PDQ Issues

Issue Number	4354
Summary	Scheduler submits Job twice
Created	2017-12-20 11:22:22
Issue Type	Bug
Submitted By	Englisch, Volker (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2018-10-18 14:05:56
Resolution	Won't Fix
Path	/home/bkline/backups/jira/ocecdr/issue.218671

Description

On occasion, when we submit a publishing job manually via the scheduler interface that job is submitted twice.

Comment entered 2017-12-20 11:27:46 by Englisch, Volker (NIH/NCI) [C]

This problem doesn't seem to be happening every time one submits a job via the scheduler. When a nightly publishing job is submitted, for instance, the publishing job starts but a few seconds or maybe minutes that same job is submitted again. In the case of a publishing job the second job fails immediately with a message that a job is currently running. This is expected behavior but the second job shouldn't be submitted at all.

Here is the log for the last time the problem showed up:

=========== Opening Log ===========
TIER: STAGE  DATETIME: Mon Dec 18 15:26:57 2017
!2888: Mon Dec 18 15:26:57 2017: SubmitPubJob - Started
!2888: Mon Dec 18 15:26:57 2017: Arguments: ['D:/cdr/publishing/SubmitPubJob.py', '--live', '--interim']
!2888: Mon Dec 18 15:26:57 2017: running in LIVE mode
!2888: Mon Dec 18 15:26:57 2017: running in INTERIM mode
!2888: Mon Dec 18 15:26:57 2017: Checking job queue ...
!2888: Mon Dec 18 15:26:58 2017:    OK to submit
!2888: Mon Dec 18 15:26:58 2017: Submitting publishing job...
!2888: Mon Dec 18 15:27:01 2017: Pub job started as Job15972
!2888: Mon Dec 18 15:27:01 2017: Waiting for publishing job to complete...
 
=========== Opening Log ===========
TIER: STAGE  DATETIME: Mon Dec 18 15:33:17 2017
!3044: Mon Dec 18 15:33:17 2017: SubmitPubJob - Started
!3044: Mon Dec 18 15:33:17 2017: Arguments: ['D:/cdr/publishing/SubmitPubJob.py', '--live', '--interim']
!3044: Mon Dec 18 15:33:17 2017: running in LIVE mode
!3044: Mon Dec 18 15:33:17 2017: running in INTERIM mode
!3044: Mon Dec 18 15:33:17 2017: Checking job queue ...
!3044: Mon Dec 18 15:33:17 2017: 
Interim-Export job is still running.
!3044: Mon Dec 18 15:33:17 2017: Job15972 status: In process
!3044: Mon Dec 18 15:33:17 2017: Job15972 type  : Interim-Export
!3044: Mon Dec 18 15:33:17 2017: *** Standard Failure - Job15972 still in process (Interim-Export: In process)
Traceback follows:
Traceback (most recent call last):
  File "D:/cdr/publishing/SubmitPubJob.py", line 322, in <module>
    (currentJobs[0], pubSubset, currentJobs[2]))
Exception: Job15972 still in process (Interim-Export: In process)
 
Mon Dec 18 15:33:17 2017
=========== Closing Log ===========
 
!2888: Mon Dec 18 15:42:03 2017:     Status: In process (900 sec)
!2888: Mon Dec 18 15:51:05 2017: Publishing job started at 2017-12-18 15:26:58
!2888: Mon Dec 18 15:51:05 2017:          and completed at 2017-12-18 15:50:33
!2888: Mon Dec 18 15:51:20 2017: Push job started as Job15973
!2888: Mon Dec 18 15:53:20 2017:    Pushing job started at 2017-12-18 15:50:33
!2888: Mon Dec 18 15:53:20 2017:          and completed at 2017-12-18 15:52:43
!2888: Mon Dec 18 15:53:20 2017: Submitting Email: OK
 
Mon Dec 18 15:53:20 2017
=========== Closing Log ===========

Comment entered 2018-03-01 19:19:29 by Englisch, Volker (NIH/NCI) [C]

I started a nightly publishing job tonight on QA and ran into this issue again. The job was submitted at 18:25h and I received the following emails in my inbox - obviously from 2 publishing jobs:

18:25h: CBIIT-QA: Nightly Publishing Started
18:26h: CBIIT-QA: *** [CBIIT-QA] SubmitPubJob.py - Standard Failure
18:26h: CBIIT-QA: *** Error in SubmitPubJob.py
18:30h: CBIIT-QA: Status and Error Report for Nightly Publishing
18:30h: CBIIT-QA: Nightly Publishing Finished

Comment entered 2018-05-18 17:00:11 by Kline, Bob (NIH/NCI) [C]

Still happening?

Comment entered 2018-05-22 10:42:43 by Englisch, Volker (NIH/NCI) [C]

Yes, I believe I've seen this once or twice on QA or DEV. I'd have to check the logs to say for sure.

Comment entered 2018-09-14 11:13:16 by Englisch, Volker (NIH/NCI) [C]

Adding one more data point:
This week I repeatedly ran the scheduled job Glossary Translation Job Notification via the Custom Run button of the scheduler. None of those jobs were submitted twice. We may want to keep this in mind and compare the differences between the weekly publishing job and that Glossary Tranlation job.

Comment entered 2018-10-16 09:26:39 by Kline, Bob (NIH/NCI) [C]

Yes, I believe I've seen this once or twice on QA or DEV. I'd have to check the logs to say for sure.

Checking the logs on QA, I see a "still in process" error message on October 8, caused by a publishing job which didn't finish three days before and hadn't been marked with a failure status. The incident before that was on August 20, when Volker tried to start an interim update job manually with a push job still running. Prior to that we had a string of "still in process" errors in June caused by a stalled job which was left for several days without a terminal status. Pretty much the same thing in May when a push job got stalled in the "Verifying" state (presumably because of a Gatekeeper problem).

I also looked at the most recent "still in process" errors in the logs on DEV, and they also looked like they weren't caused by the scheduler.

You might want to go behind me, ~volker, to see if I'm mis-interpreting the log files.

Comment entered 2018-10-16 11:28:51 by Englisch, Volker (NIH/NCI) [C]

Going through the logs I see the event happening on these days:

On STAGE
- Wed Nov 29 10:19:44 2017
- Mon Dec 18 15:26:57 2017
- Mon Dec 18 16:01:20 2017
On QA
- Wed Oct 25 12:38:45 2017
- Mon Dec 04 18:02:13 2017
- Thu Dec 14 18:12:34 2017
- Tue Jan 2 13:17:52 2018
- Wed Jan 10 17:36:11 2018
- Tue Feb 13 10:35:43 2018
- Thu Mar 1 18:18:55 2018
- Mon Apr 2 18:44:29 2018
- Fri Apr 6 10:21:29 2018
- Mon Apr 16 18:36:21 2018
- Mon Aug 20 10:43:20 2018
On DEV
- Fri Jul 28 16:12:36 2017
- Mon Sep 18 15:08:15 2017
- Thu Oct 26 11:11:06 2017
- Wed Jan 24 18:10:17 2018
- Mon Apr 2 18:10:19 2018
- Fri Apr 27 17:01:15 2018
- Mon May 14 18:14:20 2018
- Thu Jun 7 18:09:26 2018
- Fri Jun 8 18:03:28 2018
- Mon Aug 13 16:18:43 2018
- Mon Aug 20 10:06:44 2018
- Mon Oct 1 13:03:09 2018

It is correct that we cannot identify from looking at the logs alone if a job had been submitted a second time by one of us or if the second submission was a "scheduler problem". However, I am 100% certain that at least 90% of the events above were caused by manually submitting a publishing job via the scheduler.

Comment entered 2018-10-17 10:56:56 by Kline, Bob (NIH/NCI) [C]

Honing in more closely on the problem, I created a tool which aggregates relevant entries from several logs, adjusted time stamp formats and time zones (IIE logs using Zulu time and a non-ISO format), and sorted the entries. Here's what I see for yesterday.

2018-10-16 11:30:09.377 [INFO] nightly publishing task started (How did this happen? (1))
2018-10-16 11:30:10 JOB16736 CREATED (interim-export created by the task started a second earlier)
2018-10-16 11:37:17 JOB16737 CREATED (push job for 16736)
2018-10-16 11:39:26-04:00 volker 10.133.9.220 Publishing Job Nightly (custom run from scheduler (2))
2018-10-16 11:47:17 JOB16738 CREATED (republish export job (3))
2018-10-16 11:53:14 JOB16739 CREATED (push job for 16738 - failed because of GK error)
2018-10-16 11:55:25 JOB16740 CREATED (republish export job - also failed)
2018-10-16 12:03:01.285 [INFO] nightly publishing task started (again, how?)
2018-10-16 12:03:02 Job16739 STILL IN PROGRESS
2018-10-16 12:03:03-04:00 volker 10.133.9.220 Publishing Job Nightly
2018-10-16 12:05:27.073 [INFO] nightly publishing task started
2018-10-16 12:05:28 JOB16741 CREATED
2018-10-16 12:11:16 JOB16742 CREATED
2018-10-16 12:12:44-04:00 volker 10.133.9.220 Publishing Job Nightly
2018-10-16 17:21:25-04:00 bkline 10.133.190.1 Emailer Tracking Update
2018-10-16 18:05:00.096 [INFO] nightly publishing task started
2018-10-16 18:05:01 JOB16743 CREATED
2018-10-16 18:14:08 JOB16744 CREATED

(1) This job is regularly scheduled to run at 18:05 Monday through Thursday, and there is no log by IIS of a request to run it from the web scheduler admin interface until more than nine minutes later.
(2) The job which was created eight minutes later is not the nightly export job, but according to IIS, Volker didn't tell the scheduler to run the nightly publishing job until after the push job (16737) had finished.
(3) Presumably requested from the web publishing admin interface.

Comment entered 2018-10-17 11:16:23 by Kline, Bob (NIH/NCI) [C]

Ah-hah! IIS doesn't log the date/time it received the request, but the time it finished. Looks like there's a milliseconds-elapsed field I should be able to use to calculate when the request was received.

Comment entered 2018-10-17 11:42:42 by Kline, Bob (NIH/NCI) [C]

This is more like it. Here's yesterday's aggregated/processed logs:

2018-10-16 11:30:08 volker 10.133.9.220 Publishing Job Nightly (557857 ms)
2018-10-16 11:30:09 nightly publishing task started
2018-10-16 11:30:10 Interim-Export Job 16736 started (Success)
2018-10-16 11:37:17 Push_Documents_To_Cancer.Gov_Interim-Export Job 16737 started (Success)
2018-10-16 11:47:17 Republish-Export Job 16738 started (Success)
2018-10-16 11:53:14 Push_Documents_To_Cancer.Gov_Republish-Export Job 16739 started (Failure)
2018-10-16 11:55:25 Republish-Export Job 16740 started (Failure)
2018-10-16 12:03:00 volker 10.133.9.220 Publishing Job Nightly (2775 ms)
2018-10-16 12:03:01 nightly publishing task started
2018-10-16 12:03:02 Job16739 *** STILL IN PROGRESS ***
2018-10-16 12:05:25 volker 10.133.9.220 Publishing Job Nightly (438378 ms)
2018-10-16 12:05:27 nightly publishing task started
2018-10-16 12:05:28 Interim-Export Job 16741 started (Success)
2018-10-16 12:11:16 Push_Documents_To_Cancer.Gov_Interim-Export Job 16742 started (Success)
2018-10-16 17:21:23 bkline 10.133.190.1 Emailer Tracking Update (1456 ms)
2018-10-16 18:05:00 nightly publishing task started
2018-10-16 18:05:01 Interim-Export Job 16743 started (Success)
2018-10-16 18:14:08 Push_Documents_To_Cancer.Gov_Interim-Export Job 16744 started (Success)

So this STILL IN PROGRESS message looks much less suspicious than it did when the log message lines were not in the right sequence. According to push-job-verifier.log, job 16739 wasn't marked as having failed until 12:05, so it would be correct that the request to kick off the nightly publishing job manually from the scheduler would be rejected. I will go back and perform similar analysis on the earlier STILL IN PROGRESS messages.

Comment entered 2018-10-17 12:31:20 by Kline, Bob (NIH/NCI) [C]

Going back to the 1st of this month, things look more suspicious.

2018-10-01 12:53:27 volker 10.133.9.220 Publishing Job Weekly (completed 12:53:40) (web request)
2018-10-01 12:53:30 weekly publishing task started (scheduler task logs its start)
2018-10-01 12:53:36 Job16708 *** STILL IN PROGRESS *** (correctly notices that the previous day's job is still unresolved)
2018-10-01 13:03:08 weekly publishing task started (*** WHERE DID THIS COME FROM? ***)
2018-10-01 13:03:09 Export Job 16709 started (Success)
2018-10-01 13:09:17 volker 10.133.9.220 Publishing Job Weekly (completed 13:09:21)
2018-10-01 13:09:18 weekly publishing task started
2018-10-01 13:09:20 Job16709 *** STILL IN PROGRESS ***
2018-10-01 13:20:17 Push_Documents_To_Cancer.Gov_Export Job 16710 started (Success)
2018-10-01 18:05:00 nightly publishing task started (our regularly scheduled show)
2018-10-01 18:05:02 Interim-Export Job 16711 started (Success)
2018-10-01 18:13:16 Push_Documents_To_Cancer.Gov_Interim-Export Job 16712 started (Success)

So in this case, Volker submitted one request via the web scheduler admin interface at 12:53:27, and as far as I can tell that resulted in invoking PublishingTask.Perform() twice: once at 12:53:30 (as expected) and a second time approximately 10 minutes later. This isn't exactly the condition reported by this ticket, because the first invocation of PublishingTask.Perform() failed (correctly) and the second invocation succeeded, whereas this ticket is reporting instances in which the first invocation succeeds and the second invocation fails because of the job(s) created by the first invocation. It's conceivable (but hard to verify, given the very thin documentation we're dealing with for the underlying ndscheduler package) that the scheduler is adding some resilience, recognizing that the first attempt failed, and is trying again. Let's give it the benefit of the doubt (particularly since in this case the behavior resulted in a desirable outcome) and I'll keep walking backwards through the logs, continuing this analysis.

Comment entered 2018-10-17 12:55:29 by Englisch, Volker (NIH/NCI) [C]

This isn't exactly the condition reported by this ticket,

This is exactly the situation reported if you consider the job started at 13:03h as the one starting two jobs. The job submission on 12:53h is not related to the one at 13:03h. The two jobs starting at 13:03h and 13:09h are the related ones.
The job starting on 18:05h is the regular nightly publishing job unrelated to the two running around 1pm.

Comment entered 2018-10-17 16:20:46 by Kline, Bob (NIH/NCI) [C]

My understanding of what you were reporting in the original description for this ticket was (paraphrasing): "I click the button in the web scheduler admin interface to run a job by hand and the request is sometimes processed twice." In the log entries above, the ones that say volker 10.133.9.220 Publishing Job Weekly (completed ....) are the ones which represent the "I click the button ..." part. So you clicked that button twice on October 1st: once at 12:53 and once at 1:09. The entries which say nightly|weekly publishing task started represent the first logging entry created by the scheduled task's class when its Perform() method is invoked. This happened four times on that day: twice as a result of your first manual request from the web (once a few seconds after you clicked the button and a second time about ten minutes later), once in response to your second manual job request (a second after you clicked the button), and once for the nightly scheduled job at 6:05pm, unrelated to any manual web page clicks.

So the two invocations you're saying are related (13:03 and 13:09) are actually in response to two different events on the scheduler web page. Does this make sense? To simplify the log entries further:

12:53 - first "run" click on the web page
12:53 - first invocation of the weekly publishing task (failed)
13:03 - second invocation of the weekly publishing task (succeeded)

13:09 - second "run" click on the web page
13:09 - third invocation of the weekly publishing task (failed, 13:03 job still running)

18:05 - invocation of the nightly publishing task (scheduled, no "run" click needed)

If this is still muddy, we can walk through it on the white board together tomorrow.

Comment entered 2018-10-17 16:59:51 by Englisch, Volker (NIH/NCI) [C]

In my opinion, the text for 13:03 and 13:09 should be swapped because:

I submit a job via the scheduler
The job fails because the previous job is still running
I manually fail the job still running (this is not captured in the log file)
I submit a job again via the scheduler (this job will succeed)
The scheduler submits a second job (this job will fail)
The nightly publishing job is starting

While reading your comment I was wondering about the following possible scenario: Is it possible that the "double-submission" only appears after a previously failed job? I'll have to double-check the logs for this.

Yes, let's use a white board to discuss the scenarios.

Comment entered 2018-10-17 17:28:56 by Kline, Bob (NIH/NCI) [C]

In my opinion, the text for 13:03 and 13:09 should be swapped because: ...

You might wish things had happened in such a way that the descriptions could be swapped, but that's not what actually happened. In your fourth bullet you have "I submit a job again via the scheduler (this job will succeed)" but the job which succeeded was created six minutes before you "submit the job again via the scheduler." No?

Comment entered 2018-10-18 09:12:59 by Kline, Bob (NIH/NCI) [C]

I'm beginning to get a somewhat clearer picture of at least part of what's going on. For a successful weekly publishing job (and maybe for a nightly publishing job, but that's less likely), the processing takes too long for IIS to wait for completion when the job is requested manually from the scheduler's web admin interface. So as it turns out, there were actually three clicks on the Run button on October 1, not two clicks. The second click didn't ever get recorded in the IIS logs, because by the time the job completed successfully, IIS had given up. We would probably have seen a record of this in the error logs for a normal web server, but CBIIT has error logging turned off for IIS. We do have evidence for all three requests, however, in our proxy log, which captures quite a bit of information. In the screen shot below you can see the proxy log entries on the left and the logs from the scheduler's implementation of the publishing task on the right. In the proxy log window, the lines which say "Performing 'POST' request ..." are recorded when the request is handed off to the scheduler on port 8888. There are three such lines for the 1st of October. The lines which say "POST /api/v1/executions/..." are recorded when control is returned to the web server and its CGI scripts (and right before the script ends and IIS records the request in the access log). That line is missing for the second POST request, because IIS has abandoned the thread on which the request was being processed. But you can see from the logs on the right side that the publishing job runs to completion (see the arrow running down the right side). IIS can walk away from the request, but it has no way to stop the scheduler from completing the work it was given to do. In any case, all of the requests were handled properly by the scheduler on October 1st. I will continue going back through the logs to try and find when we had the most recent instance of the condition reported by this ticket.

Comment entered 2018-10-18 10:16:13 by Kline, Bob (NIH/NCI) [C]

Learning more about the vagaries of the logging performed by IIS. On August 20, there were two POST requests from the scheduler's web admin page for a nightly publishing job, one submitted at 10:06:43 (finishing at 10:15:01) and the other submitted at 10:12:52 (finishing at 10:12:55). Only the second request was recorded in the IIS access logs, though the proxy log has a full set of log entries for both requests (including the "End Request" log message recorded right before IIS gets control back and should presumably be recording the request in the access log). This case comes closest to the behavior reported by this ticket. The first manual request from the web browser created a nightly publishing job which completed successfully. The second manual request was rejected because of the ongoing job created by the first request. The odd bit about this case is that IIS did not record the first request. There are multiple possible interpretations of the data for this day.

The user (Volker, in this case) clicked the "Run" button twice, once at 10:06:43 and once at 10:12:52. We don't know why IIS did not record the first click.
The user only clicked the "Run" button once, at 10:06:43. The web server gets the request and hands it to the CGI script, which hands the request off to the scheduler service. The job runs successfully, control returns to the proxying CGI script, which completes its logging of the processing of the job and returns a success return code to IIS. In the meantime (some time before 10:12:52) the browser cancels the request for some unknown reason and sends it again. IIS does not log the first request, but for some reason does not stop the CGI script from continuing.
A variation of the previous explanation, but instead of the browser cancelling the request, the web server for some unknown reason loses track of the CGI script that it kicked off for the original request, and starts a second instance of the CGI script to handle the request, and records the request as having been submitted at the point when it launches the second instance of the CGI script.

None of these seems like very likely, compelling explanations but I don't have anything better right now. Going to reach out to ~bryanp and ~learnb to ask if they have any theories.

Comment entered 2018-10-18 14:05:56 by Kline, Bob (NIH/NCI) [C]

After spending a lot of time reviewing/analyzing the logs and discussing the behavior of the browser, the web server, and the scheduler with Bryan and Blair, we have concluded that it is not worth spending any more time on this ticket, as it doesn't happen very often, and (at least for the types of jobs for which we have seen the reported behavior) the effect is benign.

Attachments

File Name	Posted	User
Screenshot 2018-10-18 08.49.26.png	2018-10-18 08:50:58	Kline, Bob (NIH/NCI) [C]

Elapsed: 0:00:00.000382

CDR Tickets