CDR Tickets

Issue Number 3561
Summary CDR Server Broken?
Created 2012-11-08 16:18:14
Issue Type Bug
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2013-12-05 14:28:39
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107889
Description

BZISSUE::5258
BZDATETIME::2012-11-08 16:18:14
BZCREATOR::Volker Englisch
BZASSIGNEE::Alan Meyer
BZQACONTACT::William Osei-Poku

It seems that in the past 4-6 weeks we've run into a few "strange" (unexplainable) filter failures of the form:
Unexpected exception caught.

Each time we ran into these problems we "fixed" them by restarting the CDR service.
We may want to identify what changed in the past 8 weeks and see whether those changes are causing our server to run less stably.

I'm marking this as a P6 due to the ongoing EBMS tasks.

Comment entered 2012-11-08 17:13:50 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-08 17:13:50
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

After talking to Alan we're thinking that maybe a memory leak is to blame for this problem. This may have been introduced with the permaTarg changes.

In the coming week or two I will check the memory usage of the CDRServer once a day to collect a few more data points.

Currently, the memory usage is 82,400k (this is our baseline).
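
A check along these lines once a day would be enough, assuming psutil is available on the server and the process shows up as CDRServer.exe (for now I'm simply reading the value off Task Manager):

    # Hypothetical daily check of the CDRServer.exe working set; psutil
    # and the process name are assumptions, not part of our current setup.
    import datetime
    import psutil

    def log_cdrserver_memory(logfile="cdrserver_memory.log"):
        for proc in psutil.process_iter(["name", "memory_info"]):
            if proc.info["name"] == "CDRServer.exe":
                rss_kb = proc.info["memory_info"].rss // 1024
                with open(logfile, "a") as fp:
                    fp.write("%s pid=%d %dK\n" % (
                        datetime.datetime.now().isoformat(), proc.pid, rss_kb))

    if __name__ == "__main__":
        log_cdrserver_memory()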

Comment entered 2012-11-13 12:00:23 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-13 12:00:23
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

We only have a very small sample at this point, but the server was rebooted yesterday and I think we can already see a trend in the memory usage of CDRServer.exe:

09/11 84,000k
12/11 145,800k
Reboot
13/11 15,800k

Comment entered 2012-11-13 13:42:48 by alan

BZDATETIME::2012-11-13 13:42:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

Okay, I think I know where to look for the problem. I'll work on it sometime, though I'm not certain just when right now.

Comment entered 2012-11-13 14:09:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-13 14:09:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

In the meantime I will continue to monitor the memory usage for a little while and set myself a reminder to restart the CDRService as needed to avoid publishing failures. I'm confident that if we do this every other week we should be OK for now.

Comment entered 2012-11-15 23:13:19 by alan

BZDATETIME::2012-11-15 23:13:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

The last thing that was changed in the CdrServer was the
PermaTargId processing, so I concentrated on that as the first
place to look for a memory leak. So far, I haven't found
anything.

Looking on Bach, all of the PermaTargIds were added on the same
day, August 8, 2012. No PermaTargIds were deleted except for
one that I did before that date just to test that the software
worked on Bach. But the apparent memory leak has occurred after
a reboot that took place well after that date. So it would
appear that, if there is a leak, it is not in, or at least is not
limited to, the addition or deletion of PermaTargIds.

The next most likely place would be in the validation of
PermaTargIds, so I validated two records over and over again, one
with and one without PermaTargIds. I couldn't tell just by
monitoring memory usage by eye that anything unusual was going
on. Memory went up and down in similar ways in both documents
and the average memory allocation at the end didn't look
different from the average at the beginning.

It looks like there's no low-hanging fruit here. I'm going to
have to figure out how to climb the tree.

I've changed the priority of this task from 6 to 5.

Comment entered 2012-11-16 10:56:49 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-16 10:56:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

(In reply to comment #5)
> The next most likely place would be in the validation of
> PermaTargIds, so I validated two records over and over again, one
> with and one without PermaTargIds. I couldn't tell just by
> monitoring memory usage by eye that anything unusual was going
> on.

On which server did you do this?

I see the memory usage for the CDRServer.exe on
BACH: 56,200 K
FRANCK: 12,700 K
MAHLER: 156,000 K
MAHLER: 1,278,200 K

All three machines were rebooted 3 days ago, and I would have expected the CDRServer on BACH to use the most memory since that system is used the most. The one thing I can come up with that may have been used a lot on MAHLER is the PublishPreview report, but I wouldn't think it's been used more on MAHLER (with just myself as a user) than by the collection of CIAT users working on BACH.

Comment entered 2012-11-16 12:13:57 by alan

BZDATETIME::2012-11-16 12:13:57
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

(In reply to comment #6)

> On which server did you do this?

I started a new instance of the CdrServer on Mahler listening on port 2020. It's still running. I did that because I wanted to be sure that the only transactions being processed were the ones that I initiated.

My testing was done with XMetaL. I loaded up Adult Hodgkins Lymphoma, which has no PermaTargIds, and Adult Non-Hodgkins Lymphoma, which has a number of them. Using Task Manager, I couldn't see any difference in the pattern of memory usage between 20 or so validations of the first document and a similar number of the second one, nor between validations done with schema only and with schema + link (where PermaTargId validation is done).

Comment entered 2012-11-16 12:19:54 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2012-11-16 12:19:54
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

Is it possible that analysis of the patterns of transactions recorded in the debug_log table on Mahler (where the heavy memory usage seems to be concentrated) might be useful?
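
Something along these lines might be a starting point, assuming we connect with pyodbc; the column names here (recorded, msg) are guesses and would need to be checked against the actual table:

    # Rough starting point for looking at transaction patterns in debug_log;
    # the column names (recorded, msg) are guesses for illustration only.
    import collections
    import pyodbc

    def summarize_debug_log(dsn="DSN=cdr;Trusted_Connection=yes"):
        cursor = pyodbc.connect(dsn).cursor()
        cursor.execute("""\
            SELECT CONVERT(VARCHAR(10), recorded, 120), msg
              FROM debug_log
             WHERE recorded >= DATEADD(DAY, -7, GETDATE())""")
        counts = collections.Counter()
        for day, msg in cursor.fetchall():
            # Bucket rows by day and by the first word of the message, which
            # should be enough to see which commands dominate each day.
            first_word = msg.split()[0] if msg and msg.strip() else ""
            counts[(day, first_word)] += 1
        for key, count in counts.most_common(25):
            print(key, count)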

Comment entered 2012-11-16 12:30:07 by alan

BZDATETIME::2012-11-16 12:30:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

(In reply to comment #8)
> Is it possible that analysis of the patterns of transactions recorded in the
> debug_log table on Mahler (where the heavy memory usage seems to be
> concentrated) might be useful?

I'll check it out, probably on Monday.

Comment entered 2012-11-21 11:15:41 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-21 11:15:41
BZCOMMENTOR::Volker Englisch
BZCOMMENT::10

Another data point - I'm not sure if this is significant:
It seems that the absolute increase in memory isn't constant, but the rate of increase is: the memory usage grows by roughly 30% from day to day. Obviously, this compounding makes the problem get worse the longer the server has been running (a rough projection is sketched below).
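
As a rough projection of what that compounding means, assuming a steady 30% day-over-day growth from the 84,000k baseline (the actual rate varies):

    # Back-of-the-envelope projection at a steady 30% daily growth rate,
    # starting from the 84,000K reading; illustrative numbers only.
    usage_kb = 84000
    for day in range(1, 8):
        usage_kb *= 1.30
        print(f"day {day}: {usage_kb:,.0f}K")
    # After a week the projection already exceeds 500,000K.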

In the interest of having a nice and uneventful Thanksgiving holiday I will restart the CDRServer tonight after publishing has finished, to ensure this memory problem will not interfere with our weekend publishing job.

Comment entered 2013-03-28 10:19:05 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2013-03-28 10:19:05
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11

One more data point:
Since we turned off the CTRP Protocol import the nightly memory increase dropped significantly. The increase is usually around 20k - 50k but was only around 4k - 5k in the past two days.
I don't know if the reason for this drop is the import itself or the fact that we're now publishing far fewer documents at night.

Comment entered 2013-04-05 10:02:35 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2013-04-05 10:02:35
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12

(In reply to comment #11)
> One more data point:
> Since we turned off the CTRP Protocol import the nightly memory increase
> dropped significantly.

... and since we turned the CTRPImport back on last night, the memory increase from yesterday to today was 50,000k, while it was 4,200k the day before.

Comment entered 2013-04-09 16:27:58 by alan

BZDATETIME::2013-04-09 16:27:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

Here's a little progress report on this task.

By enabling and disabling various features of the publishing process I got very strong evidence that the memory leak is happening in the XSLT filtering process. It could be in Sablotron, our XSLT processor, or in the callback functions that we provide to Sablotron to perform CDR-specific functions. When filtering is run normally, there is a memory leak during publishing. When filtering is circumvented so that everything else runs during publishing, there appears to be no leak.

This is not where I expected the problem to be. We've been filtering ever since the start of the CDR. Maybe it has to do with the increasing size of our documents. In any case I now know (or think I know) where to concentrate further efforts.
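
For what it's worth, this kind of bracketing can be scripted; a rough sketch, using psutil to watch the server's working set, with send_request() as a hypothetical stand-in for whichever publishing feature is being toggled on or off (not the actual procedure I used):

    # Minimal sketch of bracketing a leak: repeat one feature many times and
    # compare the CDRServer.exe working set before and after; send_request()
    # is a hypothetical stand-in for the feature under test.
    import psutil

    def find_cdrserver():
        for proc in psutil.process_iter(["name"]):
            if proc.info["name"] == "CDRServer.exe":
                return proc
        raise RuntimeError("CDRServer.exe not running")

    def working_set_kb(proc):
        return proc.memory_info().rss // 1024

    def measure_growth(send_request, iterations=100):
        server = find_cdrserver()
        before = working_set_kb(server)
        for _ in range(iterations):
            send_request()              # e.g. one filter or validation request
        return working_set_kb(server) - before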

Comment entered 2013-04-12 01:11:27 by alan

BZDATETIME::2013-04-12 01:11:27
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

In my last Bugzilla episode I reported narrowing down the general location of the memory leak. I then reported verbally that I had found and fixed two leaks.

That turned out to be false. I found two and fixed one. The other fix didn't work. I know what is leaking (a Term object representing a node in the terminology tree). However, many hours in the debugger have shown me much more about how not to fix the leak than how to fix it.

It's in the caching of Terms. Mike Rubenstein warned me not to do that, but I didn't listen. Full of hubris, I produced a cache structure that did what it was intended to do. As I recall, it saved 10-20 minutes from a full publishing run.

Unfortunately, however, the treatment of different kinds of terms has to be different, and I think I may be leaking Terms of type "TermType", but I'm not sure of that. I'm afraid I may have to relearn all of the Term hierarchy and its TermTypes to figure out what is going wrong.

But I'm having fun. I haven't experienced a knock-down drag-out Visual Studio debugging session in years.

Comment entered 2013-04-12 08:27:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2013-04-12 08:27:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15

(In reply to comment #14)
> It's in the caching of Terms. Mike Rubenstein warned me not to do that, but I
> didn't listen. Full of hubris, I produced a cache structure that did what it
> was intended to do. As I recall, it saved 10-20 minutes from a full publishing
> run.

Please remember that the term caching has never been turned on in production. It's only available on Mahler.

Comment entered 2013-04-12 10:42:11 by alan

BZDATETIME::2013-04-12 10:42:11
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

I did not remember that.

It puts a new light on things.

Comment entered 2013-07-11 13:54:54 by alan

I will work with Volker next Tuesday (July 16) to see how the memory looks on DEV. If it's still high, I'll put in a ticket with the CBIIT web team to monitor the CdrServer memory.

Comment entered 2013-07-16 20:43:17 by alan

We still don't really know whether the memory leak is still present and/or whether it has been reduced.

Database connection failures keep bringing down the CDR Server (15 times yesterday, 19 times today), so there isn't time for the leak to manifest itself. So further work on this depends on getting the connection problems resolved.

Volker has filed another ticket with the CBIIT database team.

Comment entered 2013-09-23 10:55:34 by Kline, Bob (NIH/NCI) [C]

This seems very similar to OCECDR-3587. What's the current assessment?

Comment entered 2013-09-23 20:07:39 by alan

I think the two issues turned out to be the same, though that does not seem to have been realized when this one was created. So I see no objection to closing one of them.

It is my understanding that there is still a memory leak. If I understand Volker's reports, the efforts I made to find and plug a memory leak were successful but incomplete. It now takes longer for the memory to grow, but it still does grow.

Having spent a lot of hours attempting to find the memory leak, I am strongly tempted to install a weekly CdrServer restart as part of the publishing job. It would not be hard to do, would probably not impact any CDR services, and would eliminate the issue as a practical problem. I'm not happy about that. I'd rather solve the problem than work around it in this way. But I think a restart will save us a lot of additional work.
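
The restart step itself would be trivial; a minimal sketch, assuming the Windows service name is "Cdr" (that would have to be checked) and that the publishing account has the rights to stop and start it:

    # Hypothetical restart step for the weekly publishing job; the service
    # name "Cdr" is an assumption and must match the real service name.
    import subprocess
    import time

    def restart_cdr_service(service="Cdr"):
        subprocess.check_call(["net", "stop", service])
        time.sleep(10)                  # allow the service to shut down fully
        subprocess.check_call(["net", "start", service])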

Comment entered 2013-12-05 14:28:39 by Kline, Bob (NIH/NCI) [C]

Issue OCECDR-3587 addresses the restart suggestion. Closing this ticket.
