Issue Number | 4970 |
---|---|
Summary | Export Publishing Job Failure - Stop Processing Documents |
Created | 2021-04-16 15:32:41 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2021-04-22 14:09:03 |
Resolution | Cannot Reproduce |
Path | /home/bkline/backups/jira/ocecdr/issue.289125 |
We are occasionally seeing weekly (and nightly) publishing job failures in which the publishing job stops processing documents and doesn't exit. There is no entry in any of our log files indicating an error, nor any indication that the job is still running.
Once a job gets into this state, it is typically stopped by the time limit set in the parent job, SubmitPubJob.py, which is three hours of elapsed time from when the job was started.
It would be nice if we could catch whatever problem is causing the job to idle for three hours without completing successfully.
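For context, here is a minimal sketch of how such an elapsed-time limit might be enforced in the parent job. The three-hour figure comes from the description above; the polling interval and function names are illustrative assumptions, not the actual SubmitPubJob.py code.

```python
import time

MAX_ELAPSED_SECONDS = 3 * 60 * 60   # three-hour limit described above
POLL_SECONDS = 60                   # assumed polling interval

def wait_for_job(job_is_finished):
    """Poll until the job finishes or the elapsed-time limit is exceeded.

    `job_is_finished` is any callable returning True when the job is done.
    """
    started = time.time()
    while not job_is_finished():
        if time.time() - started > MAX_ELAPSED_SECONDS:
            raise RuntimeError("publishing job exceeded the three-hour limit")
        time.sleep(POLL_SECONDS)
```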
We initially thought that these failures of the weekly publishing jobs, in which the job never completes but instead times out after three hours, were happening more frequently, but analysis of the publishing job table shows that over the more recent years (since publishing was redesigned to use multi-processing), these failures have on average occurred only a couple of times per year. (There are also occasionally similar failures of the more frequent nightly publishing jobs, but those failures are less problematic, as they don't involve notification of our database partners, and can even be left to self-correct the following day.)
I have spent several days trying to track down the cause of these failures. Most of the work was done on the CDR QA server, in order to avoid disrupting development work for the Newton release, but eventually I moved to the DEV server, which is the only tier where we have Microsoft's debugger.
The CDR publishing job hands off the actual export of the published documents (running them through the denormalizing export XSL/T filters and writing the results to the file system) to a separate script. It does this by spinning up a configurable number of threads, each of which runs in a loop, selecting the next N documents (where N is also configurable) and launching a new process running the export-docs.py script. That script filters and writes each of the documents, logging its progress as it goes, and then exits with a result code indicating success. I noticed that there were a couple of small holes in the exception-handling coverage in that script, so I closed those holes and added a logging call right before the call to exit(0). This additional logging allowed us to confirm that when the publishing job times out this way, it is because one of these launched export worker processes finished all of its assigned work and invoked exit(0), but the process never stopped, which means that the master process waits forever for this zombie process to report completion (until it is killed by the publishing job's configured timeout). I looked on https://bugs.python.org and haven't found a bug report which matches our behavior. The closest I found on StackOverflow was this post, but as noted in my own comments, it's not an exact match for our problem.
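For readers unfamiliar with that layout, here is a rough sketch of the launcher pattern described above, with hypothetical names and batch sizes; the real publishing code differs in its details, but the overall shape (threads pulling batches and waiting on child processes) is the same.

```python
import subprocess
import sys
import threading

BATCH_SIZE = 50          # assumed stand-in for the configurable "next N documents"
THREAD_COUNT = 4         # assumed stand-in for the configurable thread count
_lock = threading.Lock() # protects the shared list of unprocessed document IDs

def next_batch(pending):
    """Take the next batch of document IDs off the shared list."""
    with _lock:
        batch = pending[:BATCH_SIZE]
        del pending[:BATCH_SIZE]
    return batch

def worker(pending):
    """Loop: grab a batch and launch a worker process for it until none remain."""
    while True:
        batch = next_batch(pending)
        if not batch:
            return
        # The thread blocks here until the child exits; a child that calls
        # exit(0) but never actually terminates leaves this wait hanging.
        result = subprocess.run(
            [sys.executable, "export-docs.py", *map(str, batch)]
        )
        if result.returncode != 0:
            raise RuntimeError(f"export failed for batch starting with {batch[0]}")

def export_all(doc_ids):
    pending = list(doc_ids)
    threads = [threading.Thread(target=worker, args=(pending,))
               for _ in range(THREAD_COUNT)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```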
I can tell from the additional logging I introduced since the previous failure that the process which was still running had only the single (main) thread. According to the Python documentation: "Since exit() ultimately 'only' raises an exception, it will only exit the process when called from the main thread, and the exception is not intercepted." So we've verified the first condition ("called from the main [and in our case, only] thread"). It may be impossible to determine whether some other code intercepts the exception without being able to run the Python interpreter in a debugger.
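As a small, self-contained illustration (not taken from the publishing code) of the two conditions the documentation mentions: sys.exit() just raises SystemExit, which a broad exception handler can swallow, and calling it from a non-main thread ends only that thread.

```python
import sys
import threading

# The SystemExit raised by sys.exit() can be intercepted, in which case
# the process keeps running.
try:
    sys.exit(0)
except BaseException as exc:
    print(f"intercepted {type(exc).__name__}; process still running")

# Called from a non-main thread, sys.exit() ends only that thread; the
# main thread (and the process) carries on.
def worker():
    sys.exit(0)

thread = threading.Thread(target=worker)
thread.start()
thread.join()
print("main thread still alive after the worker called sys.exit()")
```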
I created a pair of scripts in an attempt to put together a repro case I
can use for a bug report. The first script creates ten threads. Each
thread runs an infinite loop which launches the second script, creates a
log entry, and then checks all of the threads (with appropriate locking
to avoid having the threads step on each other) to make sure no launched
process has hung. The second script does a little bit of file system I/O
(database/network I/O would be problematic for a repro case I’d be able
to pass off to Microsoft), logs itself, and exits. I ran it on QA,
letting it launch several hundred processes without a single one failing
to exit when it’s done its work. I also ran it on one of my own Linux
systems, with over 393,000 processes launched with no failures.
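A condensed sketch of that repro pair follows, with hypothetical file names; the real scripts logged more detail and cross-checked all of the threads under a lock, whereas this simplified version just waits on each child with a timeout to detect a hang.

```python
# --- launcher.py: simplified stand-in for the first repro script ---
import logging
import subprocess
import sys
import threading

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
THREADS = 10
HANG_SECONDS = 300   # assumed threshold for declaring a child process hung

def loop():
    while True:
        # Launch the second script; if the child never exits, the timeout
        # below fires and we have reproduced the stuck-process failure.
        child = subprocess.Popen([sys.executable, "child.py"])
        try:
            child.wait(timeout=HANG_SECONDS)
        except subprocess.TimeoutExpired:
            logging.error("process %d failed to exit", child.pid)
            return
        logging.info("process %d exited with code %d", child.pid, child.returncode)

if __name__ == "__main__":
    threads = [threading.Thread(target=loop) for _ in range(THREADS)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

```python
# --- child.py: simplified stand-in for the second repro script ---
import sys
import tempfile

# A little file-system I/O (no database or network), one line of output,
# then exit, mirroring what the export worker does at the end of its run.
with tempfile.NamedTemporaryFile(mode="w") as scratch:
    scratch.write("throwaway content\n")
print("child finished; calling exit(0)", flush=True)
sys.exit(0)
```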
It took me over 24 hours and more than a couple of dozen full export jobs, with 12,822 processes launched, but I was finally able to get the weekly publishing export job to fail on DEV. So now I have process 5092 still showing up in Task Manager on a machine where I have a debugger.
I launched the debugger (as administrator, so I can debug a process which isn't owned by my own account) and attached to the process. I got all the settings adjusted so that (for example) it wasn't just debugging our own code, and it was set to show the disassembly when it didn't have debugging symbols, but the disassembly and stack trace windows are empty. So either the process has come so close to exiting that there are no pending instructions, or something else strange is going on (I used to do this low-level assembly-language debugging much more frequently earlier in my career, but not so much any more, and who knows what additional magic Microsoft has injected, or what I've forgotten in my dotage?).
At this juncture our options would include:
1. rebuild the Python interpreter with debugging symbols and try again to debug
2. rewrite export-docs.py as a C++ (or .NET C#) program
3. port the CDR to Linux (should have done that back at the beginning of the century)
4. live with the bug
None of the options are wildly appealing. Anyone want to weigh in with
an opinion?
After everything that you have done to this point, my vote would be to go with option 4 (live with the bug).
My opinion is based on the following:
The error, if and when it occurs, does not do any damage to the data.
If we catch the error early enough (i.e., before the end of the current week) and are able to rerun the job, the only "damage" is the time it takes to restart the publishing job manually, which is about 60 seconds.
If we don't catch the error early enough, it is fairly easy to recover from the problem on Monday.
I feel that any additional work invested in tracking down the issue, beyond what you have already done, wouldn't be worth the extra benefit. The additional time you might spend going forward with tracking down the issue would likely not be recovered within the next five years of running publishing jobs.
I'm not saying I wouldn't be thrilled if we could find out what the issue is, but the cost/benefit analysis is telling me there are bigger fish to fry.
We have done as much as it was profitable to do. We're just going to live with the problem, as the failures are very difficult to reproduce and happen very infrequently (roughly twice a year on average).