Issue Number | 3854 |
---|---|
Summary | CTGov Download Failure |
Created | 2015-01-07 08:52:36 |
Issue Type | Bug |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2015-03-05 13:55:03 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.144422 |
The CTGov download job has failed two days in a row with the following error message:
Failure in main CTGov download loop: ('Cursor.execute: Query: " SELECT COUNT(*)\n FROM query_term\n WHERE path = '/CTGovProtocol/IDInfo/NCTID'\n AND value = ?" Params: (u'NCT00088166',)', (u'Query timeout expired',))
Assigning this ticket to Alan because people coming back from vacation always get the new issues. :-)
Adding a data point to this ticket: The download job is running on DEV every night with the same queries but did not fail on either of these two days.
The download job (and the import job) failed this morning too:
Failure in main CTGov download loop: ('Cursor.execute: Query: " SELECT value\n FROM query_term\n WHERE path = '/InScopeProtocol/CTGovOwnershipTransferInfo'\n + '/CTGovOwnerOrganization'\n AND doc_id = ?" Params: (659192,)', (u'Query timeout expired',))
This is a mystery.
I looked at the CTGovDownload.log file. Everything was running
perfectly on January 8, 9, and 10 until, on each of those days, a
database query timed out.
On Thursday Jan 8 typical records took less than 1 second to process
until 5:38:35 am. Then a simple query that looked no different from the
others and should have returned in much less than a second had still not
returned at the end of 25 minutes.
The same thing happened at 6:07:36 am on Friday Jan 9 and at 5:53:33 am
on Saturday Jan 10.
The CdrServer did not record any errors at those times and I saw no
other log entries from other programs showing problems at those times.
In each case, after the timeout/failure, the CTGovImport program ran
normally. On the theory that all of us humans were asleep, or at any
rate not at work, at the times of the errors, it would seem that
whatever had caused the problem was temporary and corrected itself with
no human interaction.
The queries that failed did not involve the same query each time or the
same record each time. I've included them below. I ran each of them by
hand and had no problems at all. I therefore think it likely that the
failure was at the database or network level and had nothing to do with
the specifics of the queries or the data.
At this point I don't know what else to look at in the CDR for more
information. My guess is that something went wrong in the database
(perhaps a reboot of the database server or restart of the DBMS?), or
maybe in the network. I will ask the CBIIT WEBTEAM and DBATEAM to look
in their log files at those specific times and see if they show
anything.
I'm not real hopeful that we're going to learn what happened. However,
the program completed successfully on Jan 11 and 12, so we can cross our
fingers and hope that whatever it was, it won't happen again.
Here are the queries that timed out:
SELECT COUNT(*)
FROM query_term
WHERE path = '/CTGovProtocol/IDInfo/NCTID'
AND value = 'NCT00088166'
SELECT value
FROM query_term
WHERE path =
'/InScopeProtocol/CTGovOwnershipTransferInfo/CTGovOwnerOrganization'
AND doc_id = 659192
SELECT COUNT(*)
FROM query_term
WHERE path = '/CTGovProtocol/IDInfo/NCTID'
AND value = 'NCT00448357'
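
For anyone who wants to repeat the by-hand check, here is a minimal sketch of a timing harness for the three queries. It is only an illustration: it uses an in-memory sqlite3 database as a stand-in for the production DBMS, and the table layout, the doc_id values 1 and 2, and the "Example Org" value are assumptions made up for the demo, not data from the CDR. The helper names build_demo_db and time_queries are hypothetical as well.

```python
import sqlite3
import time


def build_demo_db():
    """Create an in-memory stand-in for query_term (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE query_term (doc_id INTEGER, path TEXT, value TEXT)"
    )
    # Demo rows only: doc_ids 1 and 2 and 'Example Org' are invented.
    rows = [
        (1, "/CTGovProtocol/IDInfo/NCTID", "NCT00088166"),
        (2, "/CTGovProtocol/IDInfo/NCTID", "NCT00448357"),
        (659192,
         "/InScopeProtocol/CTGovOwnershipTransferInfo/CTGovOwnerOrganization",
         "Example Org"),
    ]
    conn.executemany("INSERT INTO query_term VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# The three queries from this ticket, parameterized.
QUERIES = [
    ("SELECT COUNT(*) FROM query_term WHERE path = ? AND value = ?",
     ("/CTGovProtocol/IDInfo/NCTID", "NCT00088166")),
    ("SELECT value FROM query_term WHERE path = ? AND doc_id = ?",
     ("/InScopeProtocol/CTGovOwnershipTransferInfo/CTGovOwnerOrganization",
      659192)),
    ("SELECT COUNT(*) FROM query_term WHERE path = ? AND value = ?",
     ("/CTGovProtocol/IDInfo/NCTID", "NCT00448357")),
]


def time_queries(conn, queries):
    """Run each query and return (sql, rows, elapsed_seconds) tuples."""
    results = []
    for sql, params in queries:
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        results.append((sql, rows, elapsed))
    return results


if __name__ == "__main__":
    for sql, rows, elapsed in time_queries(build_demo_db(), QUERIES):
        print("%.6f s  %r  %s" % (elapsed, rows, sql))
```

Pointed at the real database, a harness like this would only confirm what the manual runs showed: the queries are fast in isolation, so a slow run would point at contention or infrastructure rather than at the SQL.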
I've created two JIRA issues for CBIIT to ask them to check their log files:
WEBTEAM-5330
DBATEAM-1524
I've made Volker and Bob watchers on the two issues.
The download program has failed for the past two days.
Just to keep things up to date here, the following has happened:
Bob discovered an inefficiency in the way the CdrServer updates the query_term tables. It mainly affects documents with large numbers of query terms - which includes the CTGov and CTRP protocols. He has a fix for that that might resolve the whole problem.
CBIIT has also offered to loan us a couple of additional CPU cores for two weeks. That would increase our Prod CdrServer from 2 to 4 CPUs. I think we should take them up on their offer. If it makes a significant difference in performance we'll want to analyze what it is that is affected and decide whether we can optimize whatever it is, or if not, whether we want to ask CBIIT to make the additional horsepower permanent.
Another update for the record. The query term indexing optimization and the addition of two CPUs to the server occurred on 2/17/2015 between 7 and 8 pm.
So far, the batch jobs seem to be running faster. The contention problem that is believed to have caused the download failure may have been reduced to the point that timeout failures will not occur again. We'll see.
Haven't had any crashes since the optimization was promoted.
Confirmed. There have not been any crashes or periods where the CDR has been slow to the extent that users cannot save or log in. Thank you! Closing issue.