Issue Number | 5379 |
---|---|
Summary | Nightly Publishing Job Failing |
Created | 2025-04-02 18:21:07 |
Issue Type | Task |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Kline, Robert (NIH/NCI) [C] |
Status | Resolved |
Resolved | 2025-04-04 09:24:45 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.500437 |
For the second day in a row our nightly publishing job failed with a "query timeout".
Job failed: ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Query timeout expired (0) (SQLExecDirectW)')
Adding ~bkline as a watcher.
Both times the job failed when running a large query to identify new summaries to be published. I tried to run the query manually and it ran successfully.
Now I'm re-running the nightly publishing job and it appears that the query ran fine. My guess is the publishing job will finish without issue.
Let's assemble a cumulative record of these failures so we'll have something to present to CBIIT if we suspect that the failures might be related to the down-scaling of resources for the database server, and we need to convince them to restore some of those resource levels.
We may want to fold in the failure of the GovDelivery report as reported in OCECDR-5377.
The publishing job failed for the 3rd time in a row:
2025-04-01 18:12:32
2025-04-02 18:12:33
2025-04-03 18:12:28
Bummer! Same problem (database timeout)?
Go ahead and file a ServiceNow ticket so we can get them digging
in first thing in the morning. I'll file the ServiceNow ticket,
assuming we actually need to file one. I'm doing some digging into the
problem. First, a bit of good news: I don't expect today's weekly
publishing job to fail. Looking over the logs for the last several
months, I noticed that the query to select summaries to publish is much
more complicated for the nightly jobs than for the weekly jobs. This is
because we're much more restrictive about which summaries we want to
publish during the week, knowing that our wider net will pick up all of
the publishable summaries on Friday. During the period whose data I
analyzed, the average length of time for the weekly summary selection
query was roughly 2.5 seconds, whereas the query used during the week to
on average just under 80 seconds.SQL Server will give up on that query
after 10 minutes, which is what happened the past three evenings. So as
I say, I'm reasonably confident today's weekly job will succeed.
Meanwhile I'll keep digging and see if I can pull an index or two out of
my hat (or some other trick) to speed things up.
I have installed a new index on all four tiers, speeding up the slow publishing queries by two or three orders of magnitude.
Confirming that last night's publishing job ran successfully. The new index is doing the trick. We should be able to close the ticket if we're not running into any more timeout failures by the end of the week.
Elapsed: 0:00:00.001436