PDQ Issues

Issue Number	3854
Summary	CTGov Download Failure
Created	2015-01-07 08:52:36
Issue Type	Bug
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	alan
Status	Closed
Resolved	2015-03-05 13:55:03
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.144422

Description

The CTGov download job has failed two days in a row with the following error message:

Failure in main CTGov download loop: ('Cursor.execute: Query: " SELECT COUNT(*)\n FROM query_term\n WHERE path = '/CTGovProtocol/IDInfo/NCTID'\n AND value = ?" Params: (u'NCT00088166',)', (u'Query timeout expired',))

Comment entered 2015-01-07 17:15:40 by Englisch, Volker (NIH/NCI) [C]

Assigning this ticket to Alan because people coming back from vacation always get the new issues. :-)

Comment entered 2015-01-07 17:28:22 by Englisch, Volker (NIH/NCI) [C]

Adding a data point to this ticket: The download job is running on DEV every night with the same queries but did not fail on either of these two days.

Comment entered 2015-01-08 11:40:27 by Osei-Poku, William (NIH/NCI) [C]

The download job (and the import job) failed this morning too:

Failure in main CTGov download loop: ('Cursor.execute: Query: " SELECT value\n FROM query_term\n WHERE path = '/InScopeProtocol/CTGovOwnershipTransferInfo'\n + '/CTGovOwnerOrganization'\n AND doc_id = ?" Params: (659192,)', (u'Query timeout expired',))

Comment entered 2015-01-12 15:19:20 by alan

This is a mystery.

I looked at the CTGovDownload.log file.  Everything was running
perfectly on January 8, 9, and 10 until, on each of those days, a
database query timed out.

On Thursday Jan 8 typical records took less than 1 second to process
until 5:38:35 am.  Then a simple query that looked no different from the
others and should have returned in much less than a second had still not
returned at the end of 25 minutes.

The same thing happened at 6:07:36 am on Friday Jan 9 and at 5:53:33 am
on Saturday Jan 10.

The CdrServer did not record any errors at those times and I saw no
other log entries from other programs showing problems at those times.
In each case, after the timeout/failure, the CTGovImport program ran
normally.  On the theory that all of us humans were asleep, or at any
rate not at work, at the times of the errors, it would seem that
whatever had caused the problem was temporary and corrected itself with
no human interaction.

The queries that failed did not involve the same query each time or the
same record each time.  I've included them below.  I ran each of them by
hand and had no problems at all.  I therefore think it likely that the
failure was at the database or network level and had nothing to do with
the specifics of the queries or the data.

At this point I don't know what else to look at in the CDR for more
information.  My guess is that something went wrong in the database
(perhaps a reboot of the database server or restart of the DBMS?), or
maybe in the network.  I will ask the CBIIT WEBTEAM and DBATEAM to look
in their log files at those specific times and see if they show
anything.

I'm not real hopeful that we're going to learn what happened.  However,
the program completed successfully on Jan 11 and 12, so we can cross our
fingers and hope that whatever it was, it won't happen again.

Here are the queries that timed out:

SELECT COUNT(*)
  FROM query_term
 WHERE path = '/CTGovProtocol/IDInfo/NCTID'
   AND value = 'NCT00088166'

SELECT value
  FROM query_term
 WHERE path =
       '/InScopeProtocol/CTGovOwnershipTransferInfo/CTGovOwnerOrganization'
   AND doc_id = 659192

SELECT COUNT(*)
  FROM query_term
 WHERE path = '/CTGovProtocol/IDInfo/NCTID'
   AND value = 'NCT00448357'
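
For anyone who wants to repeat the by-hand check, here is a minimal sketch that runs the three queries in parameterized form and times each one. It is not the CDR code: it assumes an in-memory SQLite table as a stand-in for the real database, an assumed (doc_id, path, value) layout for query_term, and made-up doc_id values for the sample rows. The helper name time_query is hypothetical.

```python
import sqlite3
import time

# Stand-in for the production database: an in-memory SQLite table with an
# ASSUMED (doc_id, path, value) layout. The real query_term schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_term (doc_id INTEGER, path TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO query_term VALUES (?, ?, ?)",
    [
        # The doc_id values 1 and 2 are invented sample data.
        (1, "/CTGovProtocol/IDInfo/NCTID", "NCT00088166"),
        (2, "/CTGovProtocol/IDInfo/NCTID", "NCT00448357"),
    ],
)

NCTID_PATH = "/CTGovProtocol/IDInfo/NCTID"
OWNER_PATH = ("/InScopeProtocol/CTGovOwnershipTransferInfo"
              "/CTGovOwnerOrganization")

# The three queries from the ticket, with the literals moved into parameters.
QUERIES = [
    ("SELECT COUNT(*) FROM query_term WHERE path = ? AND value = ?",
     (NCTID_PATH, "NCT00088166")),
    ("SELECT value FROM query_term WHERE path = ? AND doc_id = ?",
     (OWNER_PATH, 659192)),
    ("SELECT COUNT(*) FROM query_term WHERE path = ? AND value = ?",
     (NCTID_PATH, "NCT00448357")),
]


def time_query(connection, sql, params):
    """Run one query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = connection.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


results = [time_query(conn, sql, params) for sql, params in QUERIES]
for (sql, params), (rows, elapsed) in zip(QUERIES, results):
    print(f"{elapsed:.6f}s  {rows}  {params}")
```

Against the real server the connection and placeholder style would come from whatever database module is in use; only the timing pattern carries over.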

Comment entered 2015-01-12 15:48:40 by alan

I've created two JIRA issues for CBIIT to ask them to check their log files:

WEBTEAM-5330
DBATEAM-1524

I've made Volker and Bob watchers on the two issues.

Comment entered 2015-01-21 14:01:47 by Osei-Poku, William (NIH/NCI) [C]

The download program has failed for the past two days.

Comment entered 2015-02-12 10:38:56 by alan

Just to keep things up to date here, the following has happened:

Bob discovered an inefficiency in the way the CdrServer updates the query_term tables. It mainly affects documents with large numbers of query terms, which include the CTGov and CTRP protocols. He has a fix that might resolve the whole problem.

CBIIT has also offered to loan us a couple of additional CPU cores for two weeks. That would increase our Prod CdrServer from 2 to 4 CPUs. I think we should take them up on their offer. If it makes a significant difference in performance we'll want to analyze what it is that is affected and decide whether we can optimize whatever it is, or if not, whether we want to ask CBIIT to make the additional horsepower permanent.

Comment entered 2015-02-19 22:41:54 by alan

Another update for the record. The query term indexing optimization and the addition of two CPUs to the server occurred on 2/17/2015 between 7 and 8 pm.

So far, the batch jobs seem to be running faster. The contention problem that is believed to have caused the download failure may have been reduced to the point that timeout failures will not occur again. We'll see.

Comment entered 2015-03-05 13:55:03 by Kline, Bob (NIH/NCI) [C]

Haven't had any crashes since the optimization was promoted.

Comment entered 2015-03-24 14:47:25 by Osei-Poku, William (NIH/NCI) [C]

Confirmed. There have not been any crashes or periods where the CDR has been slow to the extent that users cannot save or log in. Thank you! Closing issue.
