CDR Tickets

Issue Number 3627
Summary Investigate SQL Server connection failures and improve management of them
Created 2013-07-23 11:17:44
Issue Type Task
Submitted By alan
Assigned To alan
Status Closed
Resolved 2013-08-28 10:44:45
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.110756
Description

We continue to experience failures connecting to SQL Server. We know that at least some of these failures are due to the DBMS being unavailable, e.g., during a reboot or undergoing maintenance in single user mode. For these cases we need to work with CBIIT to come up with a plan that will ensure that outages are scheduled in advance when possible, and reported to CDR staff and users as soon as possible when not.

We don't know if all of the failures are in this category, so we have to find out the true cause of as many of the failures as possible.

If the failures persist, it is also desirable to make the CdrServer more resistant to these failures. It should probably stay alive (right now it will crash after a small number of connection failures), return useful error messages to users ("The CdrServer was unable to connect to the database") and perhaps wait 30 seconds and try again.

Several previous issues reported as failures in different parts of the CDR are actually caused by connection failures, so this issue should replace them. Two to be replaced from the Data Center Migration project are:

OCEDCM-620
OCEDCM-622

Comment entered 2013-07-23 11:30:35 by chengep

I added links to OCEDCM-620 and OCEDCM-622 so that they can be easily referenced, but have marked them as "closed" to simplify tracking. Once we resolve the SQL Server connection failures, we should verify that the symptoms described in the related issues have also been addressed.

Comment entered 2013-07-23 23:09:54 by alan

Min has done some excellent work in researching the errors. She
noticed the following:

1. Not all errors in the CdrServer logs appear in the SQL
Server logs.

Min's speculation about this is that these requests never
reached SQL Server, and that's why the errors don't appear
in the logs. She suggested, for example, that the network
delay was too long. That makes sense to me and also accords
perfectly with the error messages we're getting.

There are 138 errors with state message information recorded
in the CDR logs, including:

135 "Login timeout expired"

3 "General network error"

2. The CDR is the only SQL Server client using ODBC.

This could mean that there is something in the ODBC client
that isn't working perfectly, or something in the ODBC
configuration (like a timeout value) that is shorter than
the value used by our .NET and Java applications.

I did some searching on the Net to try to find out what I can
about timeouts. Several possible causes were cited:

1. Network slowness.

This could be caused by random network congestion. We have
a pretty complicated network and, potentially, a lot of
traffic and, perhaps, a lot of security scanning overhead.

It is possible that we see the problem and other
applications don't if our default ODBC connection timeout
value is lower than the .NET and JDBC timeout values.

If that is the problem, we can either solve it or ameliorate
it by increasing the connection timeout (see the first sketch
after this list). This can apparently be done by calling:

SQLSetConnectAttr(...)

We might also be able to set a higher default value for our
ODBC client.

I like the idea of setting it in our own code because then,
if a new server is configured, we don't have to worry about
remembering to set an elevated default timeout value in the
ODBC configuration. I don't know however whether
connections made under the covers by ODBC as a result of a
dropped connection will always just use the default.

For test purposes, it will be quicker and require no code
changes to just bump the timeout value in the client
configuration if that can be done with the configuration
GUI.

There's really no reason not to try a higher timeout value.
Waiting longer for a connection certainly wastes less time
than crashing!

2. Running out of connections in the connection pool.

I don't understand why this manifests as a timeout. Perhaps
the ODBC client gives a misleading error code. However at
least two putative SQL Server mavens claimed that timeout
errors will result from this.

This should not be happening. If it is, I would think it is
due to a bug in which some function is getting a connection
and not releasing it.

We can put some logging code in our connection routines to
test for it, for example by incrementing a counter and storing
its value in each new Connection object, then decrementing the
counter when the connection is released in the object
destructor. We can store the highest count reached and log it
each time it is exceeded. If the high water mark gets large
before a crash, then we know to look for a resource leak
(see the second sketch after this list).

If the high water mark keeps rising, we may need more
logging code to find the offending routine.
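Here is a rough sketch of the timeout idea. The CdrServer's
connection code is C++ and would call SQLSetConnectAttr with
SQL_ATTR_LOGIN_TIMEOUT directly; the Python/pyodbc snippet below
only illustrates the same knob, and the DSN and credentials in it
are hypothetical, not the CDR's actual configuration.

# Sketch only: raise the ODBC login timeout before connecting.
# The real change would set SQL_ATTR_LOGIN_TIMEOUT via
# SQLSetConnectAttr in the CdrServer's C++ code; pyodbc exposes the
# same attribute through the "timeout" keyword of connect().
import pyodbc

conn = pyodbc.connect(
    "DSN=cdr;UID=cdrsqlaccount;PWD=***",   # hypothetical DSN/credentials
    timeout=60,        # login timeout, in seconds
)
conn.timeout = 60      # separate setting: per-query timeout, in seconds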
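And a rough sketch of the connection-counting idea. The real
bookkeeping would live in the CdrServer's C++ Connection class
(increment in the constructor, decrement in the destructor);
Python is used here only to show the logic compactly, and the
names are hypothetical.

# Sketch only: track live connections and log each new high-water mark.
import logging
import threading

log = logging.getLogger("cdr.connections")

class ConnectionTracker:
    """Counts live connections and logs every new high-water mark."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self._high_water = 0

    def acquired(self):
        # Call where the Connection object is created.
        with self._lock:
            self._active += 1
            if self._active > self._high_water:
                self._high_water = self._active
                log.warning("new connection high-water mark: %d",
                            self._high_water)

    def released(self):
        # Call from the Connection destructor (or release routine).
        with self._lock:
            self._active -= 1

# If the high-water mark keeps climbing before a crash, suspect a
# connection leak in one of the places that obtains a connection.
tracker = ConnectionTracker()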

Ideally I'd like to consult with Bob, who wrote all of our
connection code, before I modify the CdrServer to do the above.
However I can take a stab at it on Thursday without him if we
want to get this started more quickly.

I found one other item of interest in my research that I'd like
to report.

When more than a couple of connection failures occur, the
CdrServer crashes with a Windows/C++ error code indicating an
"unhandled exception".

I have searched high and low for this without success. I
canvassed the code for connection attempts outside of a try block
and found none. I looked for places where a connection attempt
is made but not checked for success. I didn't find any of those.
I performed experiments to see if I could simulate any kind of
exception from a low level divide by zero or null pointer
dereference to several types of C++ exception. Everything was
caught.

It might be possible to pinpoint it if CBIIT were to install some
specialized debugging tools for us, but it's a daunting task
working at a very deep level in the machine language image with a
steep learning curve.

However, if one suggestion I found on the Net is right, the
reason I'm not finding the error is that it's not coming from our
code. It could be that ODBC itself is aborting the process, for
example if it runs out of connections in the pool, or for some
other reason. Digging into the error logs from earlier in July,
when my looping bug blew out the stack, I saw stack overflow
errors that looked like they came from the ODBC client, followed
by similar crash errors reported by the CdrServer.

I'll consult with everyone on Thursday about all of this.

Comment entered 2013-07-23 23:11:57 by alan

I tried adding Min to the watcher list for this issue but someone would have to give her permission. For now, I'll just email her a copy of my analysis, and put her spreadsheet here as an attachment.

Comment entered 2013-07-23 23:14:40 by alan

This is a spreadsheet prepared by Min showing excerpts from the SQL Server logs for July 22 and 23.

Comment entered 2013-07-23 23:19:09 by alan

Extract of the CdrLogErrs file from our production server for the same period covered by Min's spreadsheet in the previous attachment.

Comment entered 2013-07-24 10:04:41 by Gao, Zhuomin (NIH/NCI) [C]

Prod errors since the evening of 7/23: besides "SQL Error: 18056, State: 29" and "network error 0x2746", there is a new type of error which I have not seen before on Prod: Login failed for user 'cdrsqlaccount'. Reason: Current collation did not match the database's collation during connection reset. [CLIENT: 10.133.98.66]

Comment entered 2013-07-24 22:41:27 by alan

That's a weird one. I have no idea what that means.

Comment entered 2013-07-25 11:03:39 by Juthe, Robin (NIH/NCI) [E]

I just received another python error when pulling up publish preview of CDR62863. I'm letting you know in case this is helpful. Here's the error message:
A problem occurred in a Python script.

d:\cdr\Log\tmp_z_6_e.html contains the description of this error.

Comment entered 2013-07-25 23:04:13 by alan

I looked at the description of the error. It sure looks like our problem. The CdrServer error log records a failure connecting to the database at just this time (11:01 am) and the CdrServer bounced itself, which is what broke the existing connection recorded in the description.

There were a ton of errors today and, in fact, a ton every day.

Min and I have been running tests and observations on DEV and aren't seeing any problems. I wrote a script that submitted huge numbers of transactions to the DEV copy of the CdrServer, probably 5-15 per second for a few minutes. It bogged down the whole machine but did not generate a single connection error.
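For reference, a rough sketch of what such a load driver can look
like; submit_transaction() below is a hypothetical stand-in for one
CdrServer request (the real script used the CDR client library).

# Sketch only: hammer the DEV CdrServer with requests for a few minutes
# and count connection errors.
import time

def submit_transaction(n):
    # Hypothetical placeholder: the real script sent a CDR command here.
    pass

errors = 0
end = time.time() + 180                  # run for a few minutes
while time.time() < end:
    for n in range(10):                  # roughly 5-15 requests per second
        try:
            submit_transaction(n)
        except Exception as exc:
            errors += 1
            print("connection error:", exc)
    time.sleep(1)
print("connection errors observed:", errors)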

[At the status meeting I said I hadn't seen any errors today but I was thinking of DEV. I hadn't looked in PROD and mistakenly thought things were going well.]

So it looks like we're going to have to do serious testing on PROD, probably with a modified CdrServer to test with.

I'm guessing that one of two things is happening. Either we have a bug in our code that only affects a few transaction types, and possibly only after a large number of transactions (for example we may be leaking connection objects in one of the six places where we get a connection), or we have a configuration or other issue, very possibly in the network between the PROD host and SQL Server.

I'll report the problems to CBIIT and let them know what I'm thinking we need to try.

Comment entered 2013-07-25 23:12:58 by alan

Incidentally, for the record, the ODBC manager on DEV allowed me to see some characteristics of the connection management.

It appears that our ODBC client is configured to maintain a maximum of 60 connections in its shared pool. It seems to me that we should never need more than half that.

If a connection is inactive for 60 seconds, the driver closes it.

If a connection attempt fails, the driver is configured to wait 120 seconds before trying again. That's what Windows tells us, but it's not what we're seeing. We sometimes reestablish connections very quickly.

Comment entered 2013-07-25 23:14:13 by alan

I notice that the logging done by the CdrServer on DEV is not the same as on PROD. Perhaps we've got different executables running on the two servers.

Comment entered 2013-07-26 09:52:23 by Gao, Zhuomin (NIH/NCI) [C]

I put the Prod SQL connectivity errors in attachment ProdConnectivityErr0726. This was gathered using sys.dm_os_ring_buffers. The record column shows the detailed information about the errors, e.g. (inside the XML, the record time -4 is the actual record time):
<Record id="124" type="RING_BUFFER_CONNECTIVITY" time="536269377">
<ConnectivityTraceRecord>
<RecordType>LoginTimers</RecordType>
<Spid>115</Spid>
<SniConnectionId>5935FF0A-5E1E-430D-AE83-17A071959D3C</SniConnectionId>
<SniConsumerError>17830</SniConsumerError>
<SniProvider>7</SniProvider>
<State>102</State>
<RemoteHost>10.133.98.66</RemoteHost>
<RemotePort>45277</RemotePort>
<LocalHost>10.133.189.16</LocalHost>
<LocalPort>55733</LocalPort>
<RecordTime>7/23/2013 23:50:45.358</RecordTime>
<TdsBuffersInformation>
<TdsInputBufferError>0</TdsInputBufferError>
<TdsOutputBufferError>10054</TdsOutputBufferError>
<TdsInputBufferBytes>210</TdsInputBufferBytes>
</TdsBuffersInformation>
<LoginTimers>
<TotalLoginTimeInMilliseconds>10669</TotalLoginTimeInMilliseconds>
<LoginTaskEnqueuedInMilliseconds>17</LoginTaskEnqueuedInMilliseconds>
<NetworkWritesInMilliseconds>2848</NetworkWritesInMilliseconds>
<NetworkReadsInMilliseconds>2427</NetworkReadsInMilliseconds>
<SslProcessingInMilliseconds>5267</SslProcessingInMilliseconds>
<SspiProcessingInMilliseconds>0</SspiProcessingInMilliseconds>
<LoginTriggerAndResourceGovernorProcessingInMilliseconds>1956</LoginTriggerAndResourceGovernorProcessingInMilliseconds>
</LoginTimers>
</ConnectivityTraceRecord>
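For reference, a rough sketch of pulling these records out of
sys.dm_os_ring_buffers from Python. The DMV query is the standard
way to read the connectivity ring buffer; pyodbc and the connection
string below are only illustrative, not how the attachment was
actually produced.

# Sketch only: read RING_BUFFER_CONNECTIVITY records like the one above.
import pyodbc

QUERY = """
    SELECT CAST(record AS XML) AS record
      FROM sys.dm_os_ring_buffers
     WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY'
"""

conn = pyodbc.connect("DSN=cdr;UID=cdrsqlaccount;PWD=***")  # hypothetical
for (record,) in conn.cursor().execute(QUERY):
    print(record)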

Comment entered 2013-07-26 11:26:27 by Gao, Zhuomin (NIH/NCI) [C]

Attachment ProdConnectivityErr0726.xls shows all SQL Server connectivity errors from 7/17 to the morning of 7/26. It covers all connectivity errors, including errors not logged in the SQL Server error log, and is more comprehensive than the SQL default trace. If the type is “connectionclosed”, the record mostly reflects the client closing the connection early and can be ignored (from a SQL Server troubleshooting point of view). If the type is “error”, in our case it is mostly a login error due to reasons like a password mismatch, and the cause is straightforward to find. The type “LoginTimers” occurs most frequently and is quite difficult to pin to an exact cause.

There are two types of “LoginTimers” errors:

1) <SniConsumerError>17830</SniConsumerError>
<State>102</State>

This is essentially the equivalent of “Network error code 0x2746 occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration.”, as listed in attachments CDRProdDBerr and CDRProdErr0723.

This type of error is more related to the client driver and the network.

2) <SniConsumerError>17830</SniConsumerError>
<State>11</State>

This is essentially the equivalent of “Error: 18056, Severity: 20, State: 29. 2013-07-24 00:30:26.77 spid64 The client was unable to reuse a session with SPID 64, which had been reset for connection pooling. The failure ID is 29.”, as listed in attachments CDRProdDBerr and CDRProdErr0723.

From what I have seen monitoring the Prod SQL Server so far, this type of error is more related to the state of the server immediately before the error occurs. Quite often it happens shortly after “Network error code 0x2746” and can be a side effect of it. Error 18056 with failure ID 29, as explained by a Microsoft CSS engineer (http://blogs.msdn.com/b/psssql/archive/2010/08/03/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-with-spid-which-had-been-reset-for-connection-pooling.aspx), is a bit generic and needs more investigation.

Since the ring buffer gives a stack trace in the XML and detailed state information, can we contact Microsoft for a clearer explanation of the errors?

Comment entered 2013-07-29 15:46:43 by Gao, Zhuomin (NIH/NCI) [C]

More detailed error information about TdsInputBufferError and TdsOutputBufferError can be found in attachment ProdConnectivityErr0729 (all Prod SQL Server connectivity errors from 7/17 through the afternoon of 7/29). Most of the CDR's login timeout errors (remote host 10.133.98.66) are due to 10054 [Connection reset by peer]. So, just as Alan suggested, we can try increasing the CdrServer timeout. What is the current timeout?

The root cause of why some logins take more time needs more investigation.

I checked DEV and still have not seen the same error happening there.

Comment entered 2013-07-29 15:48:56 by alan

On DEV only, I have set a login timeout property of 60 seconds on each connection object just prior to the connection attempt. It may have helped. It's too early to tell I think.

Comment entered 2013-07-30 16:11:57 by Gao, Zhuomin (NIH/NCI) [C]

I added the following columns to all of the login timeout errors in attachment ProdLoginTimer0730:

TotalLoginTimeInMilliseconds
LoginTaskEnqueuedInMilliseconds
NetworkWritesInMilliseconds
NetworkReadsInMilliseconds
SslProcessingInMilliseconds

TotalLoginTime for the CDR seems to vary widely, from 0/1/2 up to 20235 (a value like 93806 came from a login from the p050 prod bastion host, which I don't think is the CDR).

When running queries on Prod, one thing I did notice is that twice today it took me more than 5 seconds to log in. My account is a Windows login, not a SQL login as the cdr account is.

Comment entered 2013-08-01 14:17:34 by Gao, Zhuomin (NIH/NCI) [C]

Attachment DEVLoginTimer0801 shows all DEV login timeouts since 7/20. There are a total of two login timeouts for the CDR on DEV, and both happened last evening.

(IP addresses beginning with 137 seem to belong to something like a security scanner, as Wenling mentioned.)

Comment entered 2013-08-02 12:17:01 by Gao, Zhuomin (NIH/NCI) [C]

New login timeouts on DEV, 8/1:

Notification_time          TotalLoginTimeInMilliseconds
2013-08-01 19:01:08.597    70
2013-08-01 19:00:48.917    41
2013-08-01 17:53:21.957    13
2013-08-01 17:48:30.310    27
2013-08-01 17:48:24.893    16

Comment entered 2013-08-08 20:31:11 by alan

(This comment was also sent out separately as an email to CDR and CBIIT technical staff)

Volker and I tested on DEV tonight. We fed the CdrServer a bad password to force a connection failure, waited some length of time, then restored the correct password. We made three tests, a 45 second delay between setting bad then good passwords, a 2 minute delay, and a 5+ minute delay.

The actual outage times for the CdrServer are longer than the delay times between setting conditions for connection failure and connection success. A sequence of failures will bring down the CdrServer, but resetting the conditions for success won't restart it instantly. The service has to notice the failure of the CdrServer and restart it. Then the CdrServer has to go through all of its initializations. Then it becomes available for clients. The elapsed time between the reinstatement of good conditions for a database connection and the actual processing of incoming requests may be around one minute.

I found a bug in my Python software and fixed it. After that, the behavior was very good. A publishing job survived the 45 second delay and the two minute delay (~3 minute outage), continuing just fine when the CdrServer was operating correctly again. On the 5+ minute outage, the publishing software gave up at the end of 5 minutes and quit - as expected. There is no logging of ordinary transactions, but there is extensive logging of errors that records the number of connection failures, the number of send/recv errors, and the time (to the millisecond) and exception message for each error, together with a message at the end of an error sequence telling when the program was finally successful or finally gave up.

We can set the allowable delay before giving up to anything we want. I don't think it should be a problem to set a 10, 20, 30, or however many minutes delay we wish before aborting publishing. I've set the default delay to 5 minutes (with retries every 3 seconds), but we can change the defaults for all processes and/or change the delay time for a single process as desired.
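The paragraphs above describe the actual Python client changes;
the snippet below is only a schematic reconstruction of the retry
pattern with hypothetical names and the defaults mentioned here
(5-minute limit, 3-second retry interval).

# Sketch only: keep retrying a request while the CdrServer is unreachable.
import logging
import time

log = logging.getLogger("cdr.retry")

def send_with_retry(send, request, max_wait=300, retry_delay=3):
    # send        -- callable that submits one request (hypothetical)
    # max_wait    -- give up after this many seconds (default 5 minutes)
    # retry_delay -- seconds to sleep between attempts (default 3)
    deadline = time.time() + max_wait
    failures = 0
    while True:
        try:
            response = send(request)
            if failures:
                log.info("succeeded after %d failed attempts", failures)
            return response
        except Exception as exc:
            failures += 1
            log.error("attempt %d failed at %.3f: %s",
                      failures, time.time(), exc)
            if time.time() + retry_delay > deadline:
                log.error("giving up after %d attempts", failures)
                raise
            time.sleep(retry_delay)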

I think this will ameliorate the biggest source of failures in publishing and other batch jobs.

I have installed the new software in DEV, QA, and version control. Volker may do some testing on QA tonight. However, I did a comparison of software in d:\cdr\lib\Python vs. version control and there are many differences. I'm not sure we should do much of anything in QA without refreshing the software there from svn.

My next step will be to try to do for the Python ADO-DB connections what I did for the Python CdrServer connections. When that's done, and if it works as well as the other changes I did, I'm hoping we can get all of our batch jobs running with much higher reliability.

Note: This will not do anything for the interactive XMetal or CdrAdmin users. They'll still face work interruptions with the new software. We still need to solve the underlying problem of failing database connections.

I'll post this message in one of the JIRA issues too, probably under the OCECDR rubric.

Comment entered 2013-08-28 10:44:26 by chengep

Based on our 8/27/2013 CDR standup meeting, this item can be closed. CBIIT applied the SQL hotfix and the CDR development team has confirmed that there are no more connection failures. I am closing this item as resolved.

Attachments
File Name Posted User
CdrLogErrsExcerpt01.txt 2013-07-23 23:19:09
CDRProdDBerr.xls 2013-07-23 23:14:40
CDRProdErr0723.xls 2013-07-24 10:04:41
DEVLoginTimer0801.xls 2013-08-01 14:17:34
ProdConnectivityErr0726.xls 2013-07-26 09:52:23
ProdConnectivityErr0729.xls 2013-07-29 15:46:43
ProdLoginTimer0730.xls 2013-07-30 16:11:57
