Issue Number | 2600 |
---|---|
Summary | [Internal] Investigate CDR service failure reports |
Created | 2008-07-25 16:03:49 |
Issue Type | Bug |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2013-07-11 15:27:52 |
Resolution | Won't Fix |
Path | /home/bkline/backups/jira/ocecdr/issue.106928 |
BZISSUE::4225
BZDATETIME::2008-07-25 16:03:49
BZCREATOR::Bob Kline
BZASSIGNEE::Alan Meyer
BZQACONTACT::Lakshmi Grama
Big Brother is occasionally reporting that the CDR service is not running. Please look into the reports to determine if anything serious is going on. You may need to work with John to find out more about how Big Brother determines that a service is running. He may be just invoking http://bach.nci.nih.gov/cgi-bin/cdr/cdrping.py; if so, adding some logging to that script might be a useful first step. As far as I know, users have not reported any problems connected with the interruptions of service Big Brother is describing, so I've set the severity as 'minor' but I put the priority at P4 to make sure this issue doesn't languish too far down in your pile.
BZDATETIME::2008-07-29 17:36:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1
John's script invokes two different scripts on each of the three systems, cdrping.py and dbping.py, both accessible via HTTP in the cgi directory.
cdrping.py invokes cdr.getDoctypes and reports an error if that fails.
dbping talks directly to the CDR database, executing SQL to get a count of documents. However, it sets a three second timeout and reports failure if that timeout expires. Three seconds seems short to me. I can imagine that 99.9% of the time the response comes back within three seconds, but perhaps 0.1% of the time it doesn't.
I propose to set it to 15 seconds on all servers and see if that fixes the problem. I'll also add logging to both programs, writing to the standard \cdr\log\debug.log in the event of failure.
I'll do this after I fix the higher priority problems.
BZDATETIME::2008-08-07 11:20:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
I'm going to put new versions of cdrping and dbping on Mahler.
The new versions log errors with additional information to our
default debug.log.
When I see that they're running without generating new errors,
I'll install them on Bach and Franck.
BZDATETIME::2008-08-07 19:05:43
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3
The modified cdrping and dbping are installed on Bach
and Franck. I've also put them in CVS.
I'm lowering the priority from P4 to P7. We can raise
it again (or close this issue and recreate it) if the
problem recurs. Hopefully, if it recurs, we'll have more
information at that time.
BZDATETIME::2008-08-08 13:54:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
We got a big brother report today saying that the CDR database
was not available on Bach. I checked the log file (debug.log)
and found that the attempt to get a count of documents in
dbping.py timed out.
I had raised the timeout from 3 seconds to 10, but it still
timed out. I don't know why, but I am inclined to accept the
veracity of the timeout claim and assume that SQL Server was
down, busy, or blocked for that 10 second period.
At any rate, it doesn't look like a problem in Big Brother or
in dbping, both of which did what they are supposed to do.
BZDATETIME::2008-10-21 10:46:29
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5
We received the following error report today:
[3204710] bach.nci.nih.gov.dbping red Tue Oct 21 10:14:35 EDT 2008
BACH CDR DB ping is NOT OK.
http://bach.nci.nih.gov/cgi-bin/cdr/dbping.py
HTTP/1.1 200 OK Content-Type: text/plain Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET Date: Tue, 21 Oct 2008 14:14:37 GMT Connection: close
CDR DATABASE UNAVAILABLE_ NO BLOCKERS
Followed by:
[0000000] bach.nci.nih.gov.dbping recovered Tue Oct 21 10:19:26 2008 Problem has been resolved after 591 seconds
I happened to be running a global change at the same time and saw the
following in the log file:
...
2008-10-21 10:13:14: saveDoc(258787, ver='Y' pub='Y' val='Y' new pub)
2008-10-21 10:15:51: Processing CDR0000269292
2008-10-21 10:15:51: saveDoc(269292, ver='Y' pub='Y' val='Y' new pub)
2008-10-21 10:15:56: Processing CDR0000269433
2008-10-21 10:15:58: Processing CDR0000269673
2008-10-21 10:15:59: saveDoc(269673, ver='Y' pub='Y' val='Y' new pub)
2008-10-21 10:16:02: Processing CDR0000270380
2008-10-21 10:16:02: saveDoc(270380, ver='Y' pub='Y' val='Y' new pub)
...
It's hard to say whether the global change was affected by the failure (or even caused it). There was a long delay in processing document 258787, but it turns out that that is a very large document (646,588 bytes in its CdrDoc form), and we already know that validation of such documents takes a very long time.
In any case, the global change did not fail during this period. Clearly, whatever is causing the dbping failure reports is not actually disrupting service on Bach.
The global change continued to run past the point where the resolution occurred. So it appears that running the global change is not an unambiguous cause of failures.
I'm going to try increasing the dbping timeout time yet again, to 15 seconds this time (up from 10).
BZDATETIME::2008-11-13 06:15:17
BZCOMMENTOR::Bob Kline
BZCOMMENT::6
There were at least three dbping failures reported this morning between 2:52 and 3:52, and the CT.gov import job failed with a database timeout exception:
CT.gov import failures encountered; see logs for more information:
Job interrupted: ('Cursor.execute: Query: " SELECT value\n FROM query_term\n WHERE path = '/Organization/OrganizationType'\n AND doc_id = ?" Params: (340962,)', (u'Timeout expired',))
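One way a batch job like the CT.gov import could survive a transient "Timeout expired" error of the kind shown above is a bounded retry. This is a hypothetical sketch, not part of the actual import code; the execute callable and the retry parameters are illustrative.

```python
import time

def run_with_retry(execute, attempts=3, delay_secs=30):
    """Call execute(); if it raises (e.g. a driver timeout error),
    wait delay_secs and retry, up to `attempts` total tries.
    Re-raise the last error if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return execute()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_secs)
```

In practice one would catch only the driver's timeout exception rather than bare Exception, so that genuine query errors still fail fast.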
BZDATETIME::2008-11-20 16:11:18
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7
I did some analysis of the dbping timeout reports in debug.log on Bach. I looked at the first failure on each day. The results may be instructive.
The log file shows 101 distinct days with dbping failures. Of those, 86 occurred between 2:30 and 2:49. Another 7 occurred between 1:30 and 1:49, during the period from Oct 26 - Nov 1 when Python was off by one hour due to the daylight savings time error. So a total of 93 of the 101 first failures of the day occurred during the real clock time of 2:30-2:49.
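The tally described above can be sketched as a small log-analysis helper. This is illustrative only: the timestamp format is assumed from the debug.log excerpts earlier in this issue, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

def first_failures_by_halfhour(timestamps):
    """timestamps: iterable of 'YYYY-MM-DD HH:MM:SS' failure times.
    Keep only the first failure per calendar day, then count those
    first failures per half-hour-of-day bucket (e.g. '02:30')."""
    first_per_day = {}
    for ts in sorted(timestamps):  # lexical sort == chronological here
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        first_per_day.setdefault(dt.date(), dt)
    buckets = Counter()
    for dt in first_per_day.values():
        buckets["%02d:%02d" % (dt.hour, 0 if dt.minute < 30 else 30)] += 1
    return buckets
```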
I looked at the SQL Server Agent jobs on Bach and found that the closest match to these times is the RSSImport program. In fact the relationship is quite close.
If we look at the lines in RSSImport.log that have the message
SiteImporter running in real mode
we find a close correspondence with the dbping failures.
I then looked at the days when the first failure occurred before 2:30 that weren't daylight savings time problems. I didn't check every one, but among the ones I checked, it looks like they had a dbping error before 2:30 and then another one around 2:30, or else there was no RSS job run that night.
The next closest match to our timeframe is the DB Backup Job for DB Maintenance, which runs at 2:00 am. That may be the cause, or it may be that it is still running when RSSImport runs and the interaction of the two is the cause.
At any rate, those are my best leads so far.
Why RSSImport would cause such a problem I don't know. Is it
possible that it locks the document table, or some other SQL
Server resource that blocks the dbping command from completing?
BZDATETIME::2008-11-20 16:22:38
BZCOMMENTOR::Bob Kline
BZCOMMENT::8
(In reply to comment #7)
> Why RSSImport would cause such a problem I don't know. Is it
> possible that it locks the document table, or some other SQL
> Server resource that blocks the dbping command from completing?
I won't ever give a categorical "no" to a question like this, but it seems unlikely, given that all of the heavy lifting is done by the common SiteImporter module, which is used with all of our other protocol data exchange partners except NLM. I've looked at the code specific to RSS and I don't see any interaction with the database. I'd lean more heavily toward your theory about the backup overlapping with the RSS import job.
BZDATETIME::2008-11-20 16:41:34
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9
The RSSImport program runs for about an hour and 20 minutes or
so. I think we might get good information by trying the
following two experiments.
1. Switch the time of RSSImport with a batch job that takes a
similar amount of time, or maybe even less time.
If we still see the problem around 2:30, or whenever the new
job starts writing to the database, then we have a smoking
gun.
2. Try moving all jobs away from the time period where errors have occurred.
The errors seem to more or less stop by around 3:30 am. We would need to not run jobs at least between 2:30 and 3:30, and maybe between 2:00 and 3:30.
If the problem disappears then we have good reason to believe
it's the backups.
Assuming the backups are the problem, we have several choices:
1. Ignore it.
The only "problem" we have is the dbping report. The jobs
seem to run fine.
2. "Fix" dbping to handle this.
We could try a still longer timeout (we're at 15 seconds now),
or somehow check to see if backups are running.
3. Don't run any of our jobs during the backup.
This might make the backup run faster and our jobs run faster,
so the total time lost for batch jobs might not be
significant.
It's also possible that two CDR batch jobs running
concurrently will interfere with each other less, and run
in less total time, than running one during the backup.
BZDATETIME::2008-12-03 10:02:35
BZCOMMENTOR::Bob Kline
BZCOMMENT::10
(In reply to comment #9)
> 2. Try moving all jobs away from the time period where errors have
> occurred.
>
> The errors seem to more or less stop by around 3:30 am. We
> would need to not run jobs at least between 2:30 and 3:30, and
> maybe between 2:00 and 3:30.
I moved some jobs to try this experiment.
RSS 2:22 -> 4:00
COG 2:45 -> 4:30
CTGov 3:00 -> 5:00
BZDATETIME::2008-12-16 17:40:27
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11
(In reply to comment #10)
> ...
> I moved some jobs to try this experiment.
> ...
After the change, the dbping reports were shifted. We're
now seeing them occurring between 4:12 and 4:52, with a
few outliers at later times. So shifting the jobs to
eliminate conflicts doesn't seem to be the answer.
It appears that something is happening during the RSS
Download that causes this relatively insignificant event.
I've changed dbping on Bach to now wait 45 seconds instead
of 15 before declaring an error. We'll see if that cuts
down on spurious log entries.
BZDATETIME::2009-03-31 17:23:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12
I've made some changes to the dbping program to try to gather
more information about what's going on. The changes are:
After a database failure (usually a timeout), the program
will now:
Write the error messages and blocker information to a new
log file, d:\cdr\Log\dbping.log.
Add the output of the Windows tasklist.exe command, which
lists all programs that are running.
Create an empty file named d:\cdr\Log\dbpingErrFlag as a
way of remembering that an error occurred.
Otherwise, when there is no failure, the program will:
If the dbpingErrFlag file exists:
Remove the error flag file.
Add the output of the tasklist command to the
logfile.
Errors will be recorded each time a failure occurs. If, for
example, the blockage lasts 15 minutes and there are three dbping
invocations during that time, there will be three task lists
recorded in the log file, plus one more recorded after the first
successful following invocation.
I chose tasklist.exe after also considering d:\bin\pstat.exe,
d:\bin\ps.exe, and d:\cygwin\bin\ps.exe. I couldn't get any of
them to do what I wanted, which is to show me the complete
command line that started a program, so I arbitrarily chose
tasklist.
Without a complete command line we can tell that python is
running, but can't tell what python is doing - which is something
we would like to know. But it might still be better than
nothing. We'll find out.
I've also dropped back from 45 seconds to 30 seconds on the
timeout limit. 30 is probably a reasonable limit and 45 didn't
seem to help us. We used to use 10 seconds.
I've tested by running from the command line on Mahler. I'll
test on Bach and install it as the official dbping there.
BZDATETIME::2009-05-12 19:35:32
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13
I have looked at this again.
My plan for running a task listing program failed. It took some
research to find out why. I don't know the specific reason, but
after chasing a few wild geese, I believe the general reason is
that any program that lists tasks must call an internal Windows
system routine that will only run if the caller has
administrative permission.
The problem is not with the file permissions on the task listing program itself, nor is it with file permissions on the shell used to execute the program. Rather, the permission problem occurs at a lower level, within a system call made by any of the task listers. It may be in the "Windows Management Instrumentation" (WMI) package, which apparently runs as a COM object and is picky about who it allows to talk to it about what.
tasklist.exe is the least successful of the programs. It returns no output at all. d:\bin\ps.exe (original source unknown) will produce a row of headers, but no output. d:\cygwin\bin\ps.exe will actually produce output, but apparently only for jobs running under the account of the program that calls it. The parameter requesting information on all Windows processes, which works when I run it, does not work when IIS runs it.
As a person with administrative privileges, I can run these
programs and see all the tasks running on the system. But the
account running IIS, and hence dbping, cannot. I'm going to
disable the task lister now in the program.
I could get around this by running a scheduled job, multiple
times, under my own account around the time of interest, but I
think it's not going to be that productive. Since the fruit is
not low hanging, I'm not going to bother climbing the tree to get
what I don't expect to be very nutritious anyway.
BZDATETIME::2011-08-11 17:30:53
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14
(In reply to comment #0)
> Big Brother is occasionally reporting that the CDR service is not running.
> Please look into the reports to determine if anything serious is going on.
Just wondering:
Is this still a problem? We're not using BB anymore, are we?
BZDATETIME::2011-08-11 17:40:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15
(In reply to comment #14)
> (In reply to comment #0)
> > Big Brother is occasionally reporting that the CDR service is not running.
> > Please look into the reports to determine if anything serious is going on.
>
> Just wondering:
> Is this still a problem? We're not using BB anymore, are we?
It appears that NAGIOS is invoking dbping.py and the errors are still occurring as they were in the past, at the same times as they did in the past. So the issue is unresolved.
We have still not found the cause, nor have we found any harm associated with the database timeouts that dbping reports.
So I'm inclined to leave this issue open and leave the priority low.
This issue is irrelevant in the CBIIT environment.