Issue Number | 4970 |
---|---|
Summary | Export Publishing Job Failure - Stop Processing Documents |
Created | 2021-04-16 15:32:41 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2021-04-22 14:09:03 |
Resolution | Cannot Reproduce |
Path | /home/bkline/backups/jira/ocecdr/issue.289125 |
We are occasionally seeing weekly (and nightly) publishing job failures in which the publishing job stops processing documents and doesn't exit. There is no entry in any of our log files indicating an error, nor any indication that the job is still running.
Once a job gets into this state, it is typically stopped by the time limit set in the parent job, SubmitPubJob.py, which is three hours of elapsed time from when the job was started.
It would be nice if we could catch whatever problem is causing the job to idle for three hours without completing successfully.
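For context, here is a minimal sketch of how such an elapsed-time limit might be enforced in the parent job. The three-hour figure comes from the description above; the polling interval and function names are illustrative assumptions, not the actual SubmitPubJob.py code.

```python
import time

MAX_ELAPSED_SECONDS = 3 * 60 * 60   # three-hour limit described above
POLL_SECONDS = 60                   # assumed polling interval

def wait_for_job(job_is_finished):
    """Poll until the job finishes or the elapsed-time limit is exceeded.

    `job_is_finished` is any callable returning True when the job is done.
    """
    started = time.time()
    while not job_is_finished():
        if time.time() - started > MAX_ELAPSED_SECONDS:
            raise RuntimeError("publishing job exceeded the three-hour limit")
        time.sleep(POLL_SECONDS)
```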
We initially thought that these failures of the weekly publishing jobs, in which the job never completes but instead times out after three hours, were happening more frequently, but analysis of the publishing job table shows that over the more recent years (since publishing was redesigned to use multi-processing), these failures have on average occurred only a couple of times per year. (There are also occasionally similar failures of the more frequent nightly publishing jobs, but those failures are less problematic, as they don't involve notification of our database partners, and can even be left to self-correct the following day.)
I have spent several days trying to track down the cause of these failures. Most of the work was done on the CDR QA server, in order to avoid disrupting development work for the Newton release, but eventually I moved to the DEV server, which is the only tier where we have Microsoft's debugger.
The CDR publishing job hands off the actual export of the published documents (running them through the denormalizing export XSL/T filters and writing the results to the file system) to a separate script. It does this by spinning up a configurable number of threads, each of which runs in a loop, selecting the next N documents (where N is also configurable) and launching a new process running the export-docs.py script. That script filters and writes each of the documents, logging its progress as it goes, and then exits with a result code indicating success. I noticed that there were a couple of small holes in the exception-handling coverage in that script, so I closed those holes and added a logging call right before the call to exit(0). This additional logging allowed us to confirm that when the publishing job times out this way, it is because one of these launched export worker processes finished all of its assigned work and invoked exit(0), but the process never stopped, which means that the master process waits forever for this zombie process to report completion (until it is killed by the publishing job's configured timeout). I looked on https://bugs.python.org and haven't found a bug report which matches our behavior. The closest I found on StackOverflow was this post, but as noted in my own comments, it's not an exact match for our problem.
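For readers unfamiliar with that layout, here is a rough sketch of the launcher pattern described above, with hypothetical names and batch sizes; the real publishing code differs in its details, but the overall shape (threads pulling batches and waiting on child processes) is the same.

```python
import subprocess
import sys
import threading

BATCH_SIZE = 50          # assumed stand-in for the configurable "next N documents"
THREAD_COUNT = 4         # assumed stand-in for the configurable thread count
_lock = threading.Lock() # protects the shared list of unprocessed document IDs

def next_batch(pending):
    """Take the next batch of document IDs off the shared list."""
    with _lock:
        batch = pending[:BATCH_SIZE]
        del pending[:BATCH_SIZE]
    return batch

def worker(pending):
    """Loop: grab a batch and launch a worker process for it until none remain."""
    while True:
        batch = next_batch(pending)
        if not batch:
            return
        # The thread blocks here until the child exits; a child that calls
        # exit(0) but never actually terminates leaves this wait hanging.
        result = subprocess.run(
            [sys.executable, "export-docs.py", *map(str, batch)]
        )
        if result.returncode != 0:
            raise RuntimeError(f"export failed for batch starting with {batch[0]}")

def export_all(doc_ids):
    pending = list(doc_ids)
    threads = [threading.Thread(target=worker, args=(pending,))
               for _ in range(THREAD_COUNT)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```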
I can tell from the additional logging I introduced since the previous failure that the process which was still running had only the single (main) thread. According to the Python documentation: "Since exit() ultimately 'only' raises an exception, it will only exit the process when called from the main thread, and the exception is not intercepted." So we've verified the first condition ("called from the main [and in our case, only] thread"). It may be impossible to determine whether some other code intercepts the exception without being able to run the Python interpreter in a debugger.
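As a small, self-contained illustration (not taken from the publishing code) of the two conditions the documentation mentions: sys.exit() just raises SystemExit, which a broad exception handler can swallow, and calling it from a non-main thread ends only that thread.

```python
import sys
import threading

# The SystemExit raised by sys.exit() can be intercepted, in which case
# the process keeps running.
try:
    sys.exit(0)
except BaseException as exc:
    print(f"intercepted {type(exc).__name__}; process still running")

# Called from a non-main thread, sys.exit() ends only that thread; the
# main thread (and the process) carries on.
def worker():
    sys.exit(0)

thread = threading.Thread(target=worker)
thread.start()
thread.join()
print("main thread still alive after the worker called sys.exit()")
```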
I created a pair of scripts in an attempt to put together a repro case I
can use for a bug report. The first script creates ten threads. Each
thread runs an infinite loop which launches the second script, creates a
log entry, and then checks all of the threads (with appropriate locking
to avoid having the threads step on each other) to make sure no launched
process has hung. The second script does a little bit of file system I/O
(database/network I/O would be problematic for a repro case I’d be able
to pass off to Microsoft), logs itself, and exits. I ran it on QA,
letting it launch several hundred processes without a single one failing
to exit when it’s done its work. I also ran it on one of my own Linux
systems, with over 393,000 processes launched with no failures.
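A condensed sketch of that repro pair follows, with hypothetical file names; the real scripts logged more detail and cross-checked all of the threads under a lock, whereas this simplified version just waits on each child with a timeout to detect a hang.

```python
# --- launcher.py: simplified stand-in for the first repro script ---
import logging
import subprocess
import sys
import threading

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
THREADS = 10
HANG_SECONDS = 300   # assumed threshold for declaring a child process hung

def loop():
    while True:
        # Launch the second script; if the child never exits, the timeout
        # below fires and we have reproduced the stuck-process failure.
        child = subprocess.Popen([sys.executable, "child.py"])
        try:
            child.wait(timeout=HANG_SECONDS)
        except subprocess.TimeoutExpired:
            logging.error("process %d failed to exit", child.pid)
            return
        logging.info("process %d exited with code %d", child.pid, child.returncode)

if __name__ == "__main__":
    threads = [threading.Thread(target=loop) for _ in range(THREADS)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

```python
# --- child.py: simplified stand-in for the second repro script ---
import sys
import tempfile

# A little file-system I/O (no database or network), one line of output,
# then exit, mirroring what the export worker does at the end of its run.
with tempfile.NamedTemporaryFile(mode="w") as scratch:
    scratch.write("throwaway content\n")
print("child finished; calling exit(0)", flush=True)
sys.exit(0)
```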
It took me over 24 hours and more than a couple of dozen full export jobs, with 12,822 processes launched, but I was finally able to get the weekly publishing export job to fail on DEV. So now I have process 5092 still showing up in Task Manager on a machine where I have a debugger.
I launched the debugger (as administrator, so I can debug a process which isn't owned by my own account) and attached to the process. I got all the settings adjusted so that (for example) it wasn't just debugging our own code, and it was set to show the disassembly when it didn't have debugging symbols, but the disassembly and stack trace windows are empty. So either the process has come so close to exiting that there are no pending instructions, or something else strange is going on (I used to do this low-level assembly-language debugging much more frequently earlier in my career, but not so much any more, and who knows what additional magic Microsoft has injected, or what I've forgotten in my dotage?).
At this juncture our options would include:
1. rebuild the Python interpreter with debugging symbols and try again to debug
2. rewrite export-docs.py as a C++ (or .NET C#) program
3. port the CDR to Linux (should have done that back at the beginning of the century)
4. live with the bug
None of the options are wildly appealing. Anyone want to weigh in with
an opinion?
After everything that you have done to this point, my vote would be to go with option 4 (live with the bug).
My opinion is based on the following:
The error, if and when it occurs, does not do any damage to the data.
If we catch the error early enough (i.e., before the end of the current week) and are able to rerun the job, the only "damage" is the time it takes to restart the publishing job manually, which is about 60 seconds.
If we don't catch the error early enough, it is fairly easy to recover from the problem on Monday.
I feel that any additional work invested in tracking down the issue, beyond what you have already done, wouldn't be worth the extra benefit. The additional time you might spend going forward with tracking down the issue would likely not be recovered within the next five years of running publishing jobs.
I'm not saying I wouldn't be thrilled if we could find out what the issue is, but the cost/benefit analysis is telling me there are bigger fish to fry.
We have done as much as it was profitable to do. We're just going to live with the problem, as the failures are very difficult to reproduce and happen very infrequently (roughly twice a year on average).