PDQ Issues

Issue Number	5379
Summary	Nightly Publishing Job Failing
Created	2025-04-02 18:21:07
Issue Type	Task
Submitted By	Englisch, Volker (NIH/NCI) [C]
Assigned To	Kline, Robert (NIH/NCI) [C]
Status	Resolved
Resolved	2025-04-04 09:24:45
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.500437

Description

For the second day in a row our nightly publishing job failed with a "query timeout".

Job failed: ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Query timeout expired (0) (SQLExecDirectW)')

Comment entered 2025-04-02 18:22:37 by Englisch, Volker (NIH/NCI) [C]

Adding ~bkline as a watcher.

Comment entered 2025-04-02 18:37:14 by Englisch, Volker (NIH/NCI) [C]

Both times the job failed when running a large query to identify new summaries to be published. I tried to run the query manually and it ran successfully.

Now I'm re-running the nightly publishing job and it appears that the query ran fine. My guess is the publishing job will finish without issue.

Comment entered 2025-04-03 05:37:28 by Kline, Robert (NIH/NCI) [C]

Let's assemble a cumulative record of these failures so we'll have something to present to CBIIT if we suspect that the failures might be related to the down-scaling of resources for the database server, and we need to convince them to restore some of those resource levels.

Comment entered 2025-04-03 11:04:12 by Kline, Robert (NIH/NCI) [C]

We may want to fold in the failure of the GovDelivery report as reported in OCECDR-5377.

Comment entered 2025-04-03 20:12:13 by Englisch, Volker (NIH/NCI) [C]

The publishing job failed for the 3rd time in a row:

2025-04-01 18:12:32
2025-04-02 18:12:33
2025-04-03 18:12:28

Comment entered 2025-04-03 21:01:27 by Kline, Robert (NIH/NCI) [C]

Bummer! Same problem (database timeout)?

Comment entered 2025-04-03 21:02:14 by Kline, Robert (NIH/NCI) [C]

~~Go ahead and file a ServiceNow ticket so we can get them digging in first thing in the morning.~~ I'll file the ServiceNow ticket, assuming we actually need to file one. I'm doing some digging into the problem. First, a bit of good news: I don't expect today's weekly publishing job to fail. Looking over the logs for the last several months, I noticed that the query to select summaries to publish is much more complicated for the nightly jobs than for the weekly jobs. This is because we're much more restrictive about which summaries we want to publish during the week, knowing that our wider net will pick up all of the publishable summaries on Friday. During the period whose data I analyzed, the average length of time for the weekly summary selection query was roughly 2.5 seconds, whereas the query used during the week to on average just under 80 seconds.SQL Server will give up on that query after 10 minutes, which is what happened the past three evenings. So as I say, I'm reasonably confident today's weekly job will succeed. Meanwhile I'll keep digging and see if I can pull an index or two out of my hat (or some other trick) to speed things up.

Comment entered 2025-04-04 09:24:45 by Kline, Robert (NIH/NCI) [C]

I have installed a new index on all four tiers, speeding up the slow publishing queries by two or three orders of magnitude.

https://github.com/NCIOCPL/cdr-server/commit/e330e09

Comment entered 2025-04-09 13:11:49 by Englisch, Volker (NIH/NCI) [C]

Confirming that last night's publishing job ran successfully. The new index is doing the trick. We should be able to close the ticket if we're not running into any more timeout failures by the end of the week.

Elapsed: 0:00:00.001135

CDR Tickets