Issue Number | 3853 |
---|---|
Summary | CDR is slow and freezes |
Created | 2015-01-06 08:59:57 |
Issue Type | Task |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2018-01-18 15:03:26 |
Resolution | Cannot Reproduce |
Path | /home/bkline/backups/jira/ocecdr/issue.144339 |
The CDR has been very slow this morning. It takes forever to save and freezes sometimes. It occasionally reports the byte count error message.
I filed a ticket with the web team:
https://tracker.nci.nih.gov/browse/WEBTEAM-5257
... and marked it "Blocker" (since we can't change the priority once a ticket has been filed - something we might wish we could do if the problem persists).
How's performance now?
It seems to be working well now. Thank you!!
XMetaL does seem to be pretty responsive but some of us are having trouble getting publish preview to come up in the Bastion. I ran PP of the same summary outside of the Bastion and it came up pretty quickly. I can put in a separate ticket for this if you think this is a separate issue.
The PP problem appears to have started yesterday and didn't get fixed. We weren't able to run PP for most of yesterday and it continues today too.
Thanks. Robin said she's had the same problems since yesterday, too. Also to correct my previous comment: I can't run PP in the Bastion OR on my desktop. It was the B/U report that came up so easily on my desktop :-)
I believe Blair is working on tracking down some problems on production with Percussion. I've added him as a watcher so he can chime in with any relevant information he might have.
William, am I correct when I say that the PP issue is not limited to Summaries and DIS? I'm having the feeling that other document types are also affected.
Robin B reported the PP report to be very slow around noon yesterday but said later in the afternoon it was working OK again. Around 3pm or so both William and Robin said again that the PP report was timing out.
Maybe some snow got into the cable and the electrons are spinning.
(Sorry, a bad physicist joke :-) )
William, am I correct when I say that the PP issue is not limited to Summaries and DIS? I'm having the feeling that other document types are also affected.
It affects mostly regular summaries and DISs. Gen. Prof documents for example, are fine.
Publish preview seems to be back now.
The Percussion problems we were experiencing earlier got cleared up some time this afternoon. For the other performance problems, I'm leaving the WEBTEAM ticket open for a while so they can try and track down the cause. I think you should be able to close this ticket now, though.
It seems there were no problems logging in this morning but some users have reported that it is taking them a long time to open and save documents.
Reported to CBIIT.
Volker:
Please make yourself a watcher on https://tracker.nci.nih.gov/browse/WEBTEAM-5257 (JIRA doesn't let me do it) so you can follow up with any questions David might have.
Thanks.
CDR has been stable since I reported this problem and everything appears to be back to normal. Hopefully it will stay this way tomorrow morning.
Some of us are having problems logging into the CDR this morning. We are getting stuck on the screen with the XMetaL banner.
We've been having the same problem this morning. It started around 6:30 am and appeared to have gotten better at about 7:40 am. It is much worse now for many users.
FYI - I just received an email from David Do stating that he was looking into it.
FYI -
From: Do, David (NIH/NCI) [C] david.do@nih.gov
Sent: Thursday, January 08, 2015 11:15 AM
To: Osei-Poku, William; Osei-Poku, William (NIH/NCI) [C]
Subject: RE: Latency issue on CDR server
I am working with the server team to find out why the performance has decreased lately. Once I have more info, I will let you know.
Thanks,
David
This problem appears to have been resolved. No one has complained about this issue for more than a week now. So, I am marking it as resolved and will close it.
The problems resurfaced this morning. The CDR has been slow and freezes often. Importing citations freezes XMetal as well as saving and retrieving documents.
We are having problems this morning with the CDR freezing and at least one user is now unable to log in.
Reported to CBIIT. I have added David Do as a watcher for this ticket.
Unfortunately, the problems may be back. One user reported repeatedly receiving a message to the effect of "XMetaL has stopped working" or "XMetal is unavailable" while working last night between 7PM-9PM. The only option available to her was to close the program. She gave up at 9PM after this happened several times.
Another user just reported that she received the "XMetaL has stopped working" message a few minutes ago and was promptly kicked out, losing her changes.
Assigning this ticket to Alan per request from Erika.
Questions:
Is the problem ongoing?
Are other users affected?
Did the problem occur on or before 8:22 am and stop after that? 8:22 am is when the CTGovImport job ended this morning.
Looking in the ClientRefresh.log file I see a steady stream of attempted XMetal logins this morning. The number looks about the same as on other days. I'm hoping that they represent users logging on successfully, not failures.
The issues seem to be sporadic and they are affecting multiple users at different times.
Robin B (who reported the problem this morning) said she got the error after 9am. I haven't heard of anyone else having login problems this morning.
Sharon reported receiving the error last week.
[2nd version, some typos fixed]
I found three XMetal crashes so far in the Prod bastion host event log for today. They were at 09:03:14, 09:21:52, and 10:46:44. All were "0xc0000005" exceptions, access violations. There were five crashes Monday, none last Friday, one last Thursday, and I stopped looking before that, but I'm guessing that this is an under-reported problem and people have gotten used to the crashes and our inability to fix them, and so just soldier on.
Sorting the events to just look at "Application Errors", I found lots and lots of XMetal crashes over time. It's not a recent problem. I looked at about 15 of them. All were access violations. About half of the ones I looked at came from jscript.dll, but the rest were all over the place: one in xmetal45.exe, one in cdr.dll, two in wkscli.dll, and one each in comctl.dll, ntdll.dll, audiodev.dll, and rtutils.dll.
I looked for patterns in the dates and times. Were they almost all on Tuesdays? All in the mornings? But I didn't see any pattern. They occurred on different days of the week and different times of day.
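The manual pattern search described above could be repeated over a larger sample with a short script. This is only a sketch under stated assumptions: it assumes the "Application Error" events have been exported to CSV with two columns named `timestamp` and `faulting_module` (both column names are hypothetical, as is the `tally_crashes` helper), and the sample rows are made up for illustration rather than copied from the real event log.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def tally_crashes(csv_text):
    """Count crashes per faulting module, per weekday, and per hour of day.

    Assumes a CSV export with hypothetical columns 'timestamp'
    (YYYY-MM-DD HH:MM:SS) and 'faulting_module'.
    """
    by_module, by_weekday, by_hour = Counter(), Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        when = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        # Module names are case-insensitive on Windows, so normalize them.
        by_module[row["faulting_module"].lower()] += 1
        by_weekday[when.weekday()] += 1  # 0 = Monday ... 6 = Sunday
        by_hour[when.hour] += 1
    return by_module, by_weekday, by_hour


# Made-up rows for illustration only; not taken from the real event log.
SAMPLE = """timestamp,faulting_module
2015-03-02 09:03:14,jscript.dll
2015-03-02 14:21:52,rtutils.dll
2015-03-04 10:46:44,JScript.dll
"""

modules, weekdays, hours = tally_crashes(SAMPLE)
for name, count in modules.most_common():
    print(f"{name}: {count}")
```

If one module, weekday, or hour dominates the counts, that would be a pattern worth chasing; a flat distribution would be consistent with what the manual review found.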
I looked to see if other programs crashed. There were PIV card login crashes and Microsoft Word crashes, but most were XMetal.
It's possible that there are bugs in XMetal and in our own C++ and Javascript code that account for all this, and I don't think we can rule that out. But the fact that these things happen all over the place could conceivably be due to something underlying the process on the bastion host. Maybe there's one or more problems in the remote terminal server, the registry, the network connections, or somewhere else. We don't see a lot of other programs crashing, but there aren't a lot of other programs running. I'm not sure that this bastion host is used for other systems besides the CDR.
If we're lucky, the problem is either in XMetal 4.5, or in the bastion host, or in the complex CBIIT network and maintenance environment. If I had to bet, I'd bet on it being there rather than in our impeccable program code. 🙂 I'm hoping the problem will disappear when we switch to XMetal 9 on users' desktop computers. If it doesn't then we've got some nitty gritty work to do.
I'm going to go back to my regularly scheduled tasks for now and revisit this after Bob, Volker, Erika, Bryan or others weigh in.
The approach I'm taking here is that we have a problem that is NOT due to the slowdowns caused by batch jobs running over into the work day. I think we had those kinds of problems. But I think we have other problems that are not due to that, or at least not due to contention for the resources that we knew were blocking progress in the past.
One thing we can all do is try to recall exactly what we were doing when a crash occurs. For example:
"I was trying to start the program. It failed after the login screen. Nothing came up."
"I had pressed Alt-E to see the list of validation errors."
"I had inserted a new empty paragraph in a Summary and it froze immediately after clicking the insert button."
"I clicked the button to save a new version."
It takes a person with a prodigious awareness of his or her environment to be able to do that. However, if there is a bug in our code or in XMetal, we might get clues to help us reproduce the failures and fix them (if in our code) or work around them (if in XMetal).
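If it helps, recollections like the ones above could be collected in a consistent shape so that repeated actions stand out. The following is a minimal sketch, not an existing CDR tool: the `CrashReport` fields, the `most_common_actions` helper, and the sample entries (including the user labels) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CrashReport:
    """One user recollection of what was happening when XMetal crashed."""
    user: str      # who reported it (hypothetical labels below)
    when: str      # free-text date and time, as the user remembers it
    action: str    # the last thing the user did before the crash
    doc_id: str = ""          # CDR document ID, if one was open
    reproduced: bool = False  # True if repeating the action crashed again


def most_common_actions(reports):
    """Return (action, count) pairs, most frequent first."""
    counts = Counter(r.action.strip().lower() for r in reports)
    return counts.most_common()


# Hypothetical entries for illustration only.
reports = [
    CrashReport("user-a", "Mon 9:05 am", "Save new version"),
    CrashReport("user-b", "Tue 2:30 pm", "Insert empty paragraph"),
    CrashReport("user-a", "Wed 8:50 am", "save new version "),
]

for action, count in most_common_actions(reports):
    print(f"{count} x {action}")
```

Normalizing the action text before counting means small differences in how people type the same action do not hide a repeat.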
Sharon reported receiving an error message last night, although she did not get kicked out of the CDR. Here are the details:
On 3/3/15 at 11:34 pm, I
1) Check out box was checked and Document type was Summary from previous
searches. Typed Retinoblastoma into the CDR search Doc title box. Hit
the Enter key to search.
2) Received a message that said “sendCommand: unexpected failure”
I closed the search box and started over from the beginning.
Not quite what happened before as the CDR didn’t shut down and boot me out.
Sharon
I just received (at about 9:12am on 3/10/15) the message "XMetaL 4.5 has stopped working" when I tried logging into the CDR. It happened right after I entered my login credentials and before the CDR opened. I attached a screenshot of the specific error message.
As near as I can tell, this crash of XMetal is unrelated to anything having to do with the batch processes that caused slowness in the past. It looks to me like they finished before the latest errors occurred.
Looking in the event logs for the past half dozen days I see a lot of such crashes, 1-4 per day. About 80% of them are in rtutils.dll. As near as I can tell from looking things up on the web, this program is involved in networking and remote access.
My pure speculation is that one of the following is occurring:
There is a problem in the remote terminal server software, registry, or other configuration on the bastion host.
There is a bug in XMetal 4.5.
There is something in XMetal that occasionally runs afoul of remote terminal services. The version of XMetal that we're running was written for an earlier operating system and for single user operation.
There is a bug in our code.
Right now, we've got no support for XMetal, and no Windows systems administrators available to us to help with rtutils or similar low level components in our very complicated environment.
I know that these crashes are an annoyance but my sense of what we should do is to put this issue on hold and wait for the CDR Security Remediation deployment which is intended to eliminate both the bastion host and the older version of XMetal. It will also give us access to current support from the XMetal vendor. That has not always been helpful in the past but sometimes they come through.
I will attach some screenshots of representative event log errors to the issue so that, if the problem continues after we install XMetal 9, it will be possible to see whether anything has changed.
Grasping at straws here, I have attached a screen shot ("RemoteDesktopVersion.jpg") of the version of the remote desktop client on my workstation. I found some crumbs of information on the web saying that older remote desktop clients might cause errors.
To see your own version, bring up Remote Desktop Connection (Start > All Programs > Accessories > Remote Desktop Connection) and click the little computer icon in the upper left corner. Is it old?
Like I said ... grasping at straws.
Alan, I have the same remote desktop connection version as you, from 2012.
I think it makes sense to go through with the security fixes deployment and see if that helps.
In the meantime, should I continue to report problems as we have them? Sharon just had XMetaL close on her after entering her password (same as happened to me earlier this week). She entered her password and received the "XMetaL has stopped working" message. This happened at 11:54am on 3/13.
I think we can stop gathering information at this point. Thanks for recording what we've got.
I don't remember people reporting these problems when we were at 6116, though it's possible that that says more about my memory than about the history of the problem. But I'm hopeful that it has to do with the CBIIT environment with bastion hosts, newer operating systems, more complicated networks, and multi-user installations of XMetal. If things work out for us, we'll have XMetal running in users' desktops soon and all of the funky issues in the CBIIT environment will be invisible to XMetal.
I'll go ahead and put this issue on hold and we'll see if it needs to be re-opened or if the problem goes away.
Now that the cdr-security-remediation deployment is complete and everyone is running with XMetal 9.0 on their own workstations, I think we should start reporting any errors in XMetal that might indicate that the problems we had in XMetal 4.5 on the bastion host might still be present.
We're particularly looking for errors in which a small window opens on the screen with a message like:
"XMetal90.exe has stopped working", followed by an offer to close or debug the program.
Our hope is that these aren't happening any more. If they are, we may need to re-open this issue. If not, great, we lucked out and can close it.
Unfortunately, Robin B reported that she got the "XMetaL has stopped working" message this morning at about 8:30am when she was working in a summary document (CDR0000763567). The CDR grayed out and she was forced to reboot, losing about an hour of unsaved work. She was adding markup and marking the changes as approved, but she doesn't recall precisely the action she was performing at the time.
This is similar to (maybe the same as?) the type of issue we are tracking in OCECDR-3896, so I'm not sure which issue to post this comment in.
What a bummer!
It appears to me that 3896 is the same as this one. We'll probably need to look at the event logs for the affected machines. Let's discuss it at the status meeting.
As discussed in our status meeting, this is going to be a very hard problem to fix unless we can find a way to reliably reproduce the error. If we know how to make it happen, then we can either trace the problem into our code or the XMetal code and either fix it (if it's ours) or send enough information to Just Systems (the XMetal vendor) to enable them to fix it.
So, for the immediate future, it's important for everyone who experiences the errors to try their best to recall exactly what they did just before it happened. If there is someone who gets the error more than anyone else, then that person is doing something more often than others and we may be able to make some guesses based on that.
I recommend that, if anyone experiences an XMetal crash, restart XMetal and try to do the exact same thing again, making notes about what is being done. If it crashes again, Bingo! We might have what we need.
We haven't been successful in the past in figuring out how to reproduce the error. It's possible that the cause is very subtle and not due to a specific sequence of actions before the crash, so I think this issue could be with us for quite a while.
Just wanted to report that Victoria received the "XMetaL has stopped working" message four times in a row last Friday while she was working (and re-working) on the same paragraph since she lost some work each time it happened. She was working in the Renal Cancer Treatment HP summary. At least two of the times happened when she was backspacing quickly. The final time happened when she saved the document.
Closed in status meeting; hasn't been a problem in a while.
Victoria reported getting the "XMetaL has stopped working" message again on Tuesday, 10/18. Here's her message:
I was working on CDR0000763238 (Head and Neck Cancers, not published) and when I went to save the doc, I got an “XMetaL has stopped working” message and when I clicked on OK to close the message, XMetaL closed. I went back in and my changes are there, although I didn’t try to re-save and name it because I wasn’t sure if you would want to take a look first.
Volker asked me what I meant by "name it." The message popped up after I hit the save button but before I got the "Save CDR Document" screen that would have allowed me to save and name a version, so that's what I meant. Also, I wanted to mention that this is one of the documents that shows the "Rules checking cannot be turned on" message before it opens, in case that is relevant. Thanks.
I thought that maybe, since almost the entire SummaryMetaData block is missing, this problem could have been a result of a "confused" validation. However, I've copied the document to STAGE and was able to save it without any problems.
Assigned to ~volker with story points for determining whether this is a "can't reproduce" ticket.
Just thought I would add a comment because we continue to occasionally get the "sendCommand: unexpected failure" error message when saving a document in XMetaL. Upon trying to save it again, we get a message saying that we do not have the document checked out which is odd since it IS checked out and we are working in it. Upon closing and reopening the CDR, all seems to be fine and the document does in fact save with the changes even though we are receiving an error message. In case it is helpful for error log purposes, I just received these messages today at about 9:25am when saving the Genetics of Breast/Gyn summary document (CDR62855) on PROD. I'll attach a document with the specific messages.
This doesn't seem to be happening anymore. We can reopen this issue if needed.
File Name | Posted | User |
---|---|---|
RemoteDesktopVersion.jpg | 2015-03-10 11:46:54 | |
screenshot-1.png | 2015-03-10 09:16:31 | Juthe, Robin (NIH/NCI) [E] |
sendcommand failure message.docx | 2017-07-14 09:36:57 | Juthe, Robin (NIH/NCI) [E] |
XMetalApplicationFailure01.jpg | 2015-03-10 11:55:52 | |
XMetalApplicationFailure02.jpg | 2015-03-10 11:56:03 | |