Issue Number | 3593 |
---|---|
Summary | Investigate CDR login/access issue |
Created | 2013-03-21 16:55:11 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2013-07-12 14:41:02 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107921 |
BZISSUE::5292
BZDATETIME::2013-03-21 16:55:11
BZCREATOR::William Osei-Poku
BZASSIGNEE::Bob Kline
BZQACONTACT::Margaret Beckwith
Sometime last week, and again early this morning, users were not able to log into the CDR/XMetaL. In most cases the program gets stuck on the CDR logo and never finishes loading for user access. Other times, when users do successfully log into the CDR, tasks like saving a document or retrieving a report take far too long.
Below are some of the relevant email exchanges from this morning.
--------------
On 3/21/2013 9:18 AM, Osei-Poku, William wrote:
> We are now able to log in.
When I bounced the services, I noticed that the CT.gov import job, which was still running, had been killed, so I restarted the job by hand. I had been able to log in successfully myself, but that was before restarting the CT.gov import. After digging around in SQL Server with Qian for a while, killing blocking SQL Server tasks and watching them come back, I killed the CT.gov import job and the last locking SQL task. So something in the CT.gov import is blocking other work. I'm not going to restart the job until we figure out why.
--
Bob Kline
I wouldn't have thought that bouncing the server would solve the problem, because the CDR server is far away from the critical memory usage that's causing problems.
William, in the last week or so you have reported connection problems a few times. Maybe these events are all related.
Thanks,
Volker
___________________________________
Volker Englisch
NCI OCE - Communications Technology Branch
Contractor: Sapient Government Services
NCI: 301-496-0102
------Original Message
From: Osei-Poku, William William.Osei-Poku@icfi.com
Sent: Thursday, March 21, 2013 8:46 AM
To: Kline, Robert (NCI)
Cc: Englisch, Volker (NIH/NCI) [C]; Beckwith, Margaret (NIH/NCI) [E];
Juthe, Robin (NIH/NCI) [E]; Alan Meyer
Subject: RE: Unable to access CDR/Xmetal
It doesn't seem to have solved the problem. We're still experiencing the same symptoms.
------Original Message
From: Bob Kline bkline@rksystems.com
Sent: Thursday, March 21, 2013 8:28 AM
To: Osei-Poku, William
Cc: Englisch, Volker (NIH/NCI) [C]; Beckwith, Margaret (NIH/NCI) [E];
Juthe, Robin (NIH/NCI) [E]; Alan Meyer
Subject: Re: Unable to access CDR/Xmetal
On 3/21/2013 8:10 AM, Osei-Poku, William wrote:
>
> Hi Volker,
>
> We are not able to access the CDR/XMetaL on the NIH VPN. After typing
> in the login information, the XMetaL logo will come on and stay on the
> screen for a long time and never successfully launch XMetaL. Sometimes
> it will launch successfully after many tries, but it will be very slow
> during saves. We've experienced this problem intermittently since last
> week. This happens on both the NIH VPN and the ICF-NIH site-to-site VPN.
>
>
I bounced the CDR and SQL Server services. Please try again.
--
Bob Kline
http://www.rksystems.com
mailto:bkline@rksystems.com
BZDATETIME::2013-03-21 19:29:17
BZCOMMENTOR::Bob Kline
BZCOMMENT::1
I did some investigation and found that when the blocks were preventing users from working in the CDR this morning, the CTGovImport job was trying to save a new version of CDR475774 (with updates from NCT00324805). The attempt was made twice: the first beginning around 7:26 this morning and lasting the better part of an hour, until the job failed because I had turned off the CDR server; the second with the restarted job I kicked off manually and then killed when William reported that CIAT was continuing to experience problems.
For a long time the largest document in the system was about half a megabyte (one of the summaries). This CTGovProtocol document has ballooned from roughly that size to approximately 3.5 MB in the past few months, presumably from site information pulled in from CTRP. I will turn off the CT.gov nightly jobs until we can figure out more about exactly where the bottlenecks are.
We were able to run the save job on Franck successfully, with almost no other work being done on the server, and it took just about exactly an hour. I'm not sure how long it would have taken on Bach if we had let the job run indefinitely (or whether it would have finished at all): we don't know exactly how much faster (or slower) Bach is than an unloaded Franck. Alan is going to do some debugging to see what he can discover. In the meantime we might consider creating a patched version of the import job which skips documents over a certain size threshold, and running that until we can figure out how to get the really big documents processed without preventing CIAT from doing its work.
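A minimal sketch of what such a size-threshold patch might look like, assuming the import loop has the raw XML string in hand; the constant, helper name, and log format below are hypothetical, not taken from the actual import job:

    # Hypothetical size gate for the CT.gov import loop; the constant, the
    # helper name, and the logging are illustrative, not taken from the
    # actual ImportCTGovProtocols.py.
    MAX_DOC_BYTES = 1000000  # skip anything over ~1 MB until the bottleneck is fixed

    def small_enough_to_import(nct_id, doc_xml, log):
        """Return True if the trial document is under the size threshold."""
        size = len(doc_xml.encode("utf-8"))
        if size > MAX_DOC_BYTES:
            log.write("skipping %s: %d bytes exceeds threshold\n" % (nct_id, size))
            return False
        return True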
BZDATETIME::2013-03-21 23:18:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::2
I suppressed the import job, but left the download job in place. Are there really 1,784 sites actively participating in this protocol? Just glancing at some of the sites, starting from the bottom[1], I see what looks very suspiciously like duplicate entries for the same site, with conflicting statuses. For example, site #1784 is listed as Coney Island Hospital (CDR30160) with status "Closed to Accrual," but site #1783 is the same org (same address, same investigator, same contact) with the status "Administratively Complete." There are lots of these, many with the statuses "Active" and "Closed to Accrual." How can a site be simultaneously active and closed? Is it possible CTRP has some serious bugs in their export software? I suspect that if it weren't for these duplicates the CT.gov import process might complete successfully.
[1] http://bach.nci.nih.gov/cgi-bin/cdr/QcReport.py?DocId=CDR0000475774&Session=guest
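A minimal sketch of the kind of duplicate check being described, assuming one site block per location with an organization name and a status; the element names are guesses at the CTGovProtocol layout, not taken from the schema:

    # Sketch of the duplicate check described above. The element names
    # (CTRPLocation, Name, SiteStatus) are assumptions about the
    # CTGovProtocol layout, not verified against the CDR schema.
    from collections import defaultdict
    import xml.etree.ElementTree as etree

    def sites_with_multiple_entries(doc_xml):
        """Map each org name to the list of statuses it appears with."""
        root = etree.fromstring(doc_xml)
        statuses = defaultdict(list)
        for site in root.iter("CTRPLocation"):             # hypothetical name
            org = site.findtext(".//Name") or ""           # hypothetical path
            status = site.findtext(".//SiteStatus") or ""  # hypothetical path
            statuses[org].append(status)
        return dict((org, s) for org, s in statuses.items() if len(s) > 1)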
BZDATETIME::2013-03-21 23:29:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3
I did all of this before seeing Bob's comment #2, but it is
especially interesting in the light of what he found.
I turned on the CdrTiming machinery, modifying it to write to a log file. I then experimented a bit with a short record, then loaded up XMetaL with CDR0000475774 (length: 4,901,554). That was the cdr String length of the XML as the CdrDoc constructor sees it. The length reported by SQL Server is 6,884,902. I haven't even thought about why there is a discrepancy - we have enough other mysteries at the moment.
I then did the following:
Store the doc, with validation: total time = 104.110 seconds
Validate the doc, links and schema: total time = 38.016 seconds
Store with validation and versioning: total time = 59.125 seconds
I believe that the last two went faster because SQL Server had cached information needed for the command from the first run.
I expected something to stand out as slow, but nothing did. Here are the big tasks, with the first run first and the second run (with caching making it faster) in parentheses.
Fully construct the CdrDoc object = 10.735 (10.672)
Link validation = 20.281 (15.391)
Total validation = 27.297 (22.500)
Total query term processing = 50.125 (4.938)
I thought the big winner from caching would be link validation
but it wasn't. The big winner was query term processing. I was
storing the same 4.9 million character document I got from the
server, so the query term processing should have involved zero
updates. The speed increase had to be caching of the search for
the old terms.
So what went wrong when we ran it this afternoon? These times are waaaay better than what we saw. My guess is that when we ran this afternoon we were replacing a document with a new one that was totally different from the point of view of the query term processing. Small insertions in strategic places in the document would cause the node_loc value to be different for every succeeding sibling and descendant element. We had 50 seconds of query term processing when there were no changes, so I can easily imagine an hour of query term changes with CTGovProtocols like the one we processed today.
I think the first thing for us to do is to go over the query_term_defs for CTGovProtocols. If we can significantly pare those down, we can get a big, quick win with no programming changes. This isn't a logic optimization issue - I think we've got this process pretty well optimized. We have to see whether the users, or the publishing system, or the link validation, or whatever, are actually using all of those query terms.
I analyzed the query terms generated for this big document.
There are 40,122 query terms.
Of those, 29,609 are in the subtree of:
/CTGovProtocol/CTRPInfo/CTRPLocation/...
Almost all of the rest are in:
/CTGovProtocol/Location/...
with the great majority of those in:
/CTGovProtocol/Location/Facility/...
I could be wrong about what's happening. So before we do the
analysis of what can be pared away, we might want to take a meat
axe to the query term defs on Franck just to see if this really
is the issue.
What we might do is:
1. Get a huge CTGovProtocol from Bach.
2. Store it on Franck, getting timing info. It should be very bad.
3. Replace that doc with the earlier version - also very bad.
4. Bounce SQL Server on Franck to totally clear caches.
5. Blow away the big query term defs that account for lots of the terms.
6. Store the big record again. If I'm right it should be much faster.
The timing data is on Franck in d:\cdr\log\Timing.log. I wrote timestamps with each log item so we can conceivably correlate it with anything else going on on the machine.
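A rough sketch of how a per-path tally like the one in the next comment might be produced, assuming the CDR query_term table has doc_id and path columns and a pyodbc-style cursor; both are assumptions:

    # Rough reconstruction of the per-path breakdown. The query_term table
    # layout (doc_id, path) is an assumption about the CDR schema, and the
    # "?" parameter style assumes a pyodbc-style cursor.
    from collections import Counter

    def query_term_breakdown(cursor, doc_id, depth=2):
        cursor.execute("SELECT path FROM query_term WHERE doc_id = ?", (doc_id,))
        counts = Counter()
        for (path,) in cursor.fetchall():
            parts = path.split("/")               # leading "" plus element names
            prefix = "/".join(parts[:depth + 2])  # e.g. /CTGovProtocol/Location/Facility
            counts[prefix] += 1
        for prefix, n in counts.most_common():
            print("%7d  %s" % (n, prefix))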
BZDATETIME::2013-03-22 00:13:40
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
Here are the details that I used to derive the data in my last comment. The column on the left has the number of query terms, i.e., index entries, for the XML field displayed on the right.
This may help people to see if the really big indexing that we've done is useful.
Attachment CTGovProtocol_QueryTermBreakdown.txt has been added with description: Breakdown of query terms for CTGovProtocol CDR0000475774
BZDATETIME::2013-03-22 11:19:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
(In reply to comment #2)
> Just glancing at some of the sites, starting from the bottom[1], I see
> what looks very suspiciously like duplicate entries for the same site,
> with conflicting statuses.
It's even worse. I've extracted just the organization and its status and I see organizations with 3, 4, and 5 entries even with identical status values. This is just one sample of one organization:
Sutter Cancer Center at Roseville Medical Center (CDR34845)  Active
Sutter Cancer Center at Roseville Medical Center (CDR34845)  Closed to Accrual
Sutter Cancer Center at Roseville Medical Center (CDR34845)  Active
Sutter Cancer Center at Roseville Medical Center (CDR34845)  Closed to Accrual
Sutter Cancer Center at Roseville Medical Center (CDR34845)  Active
BZDATETIME::2013-03-22 11:28:26
BZCOMMENTOR::Bob Kline
BZCOMMENT::6
(In reply to comment #5)
> It's even worse. ....
Or better, depending on your point of view. In the long run, the more bogus duplication they've introduced into the documents, the more likely it is that the CDR server will be able to handle the document saves for the CT.gov import job once CTRP has fixed their software (and for the CTRP import job, for that matter; it's been stretching out long enough to cause other production problems). The bad news, of course, is that CTRP doesn't exactly turn on a dime when it comes to fixing bugs, so "long run" may have the emphasis on the "long" part. We'll need to have a discussion soon about what to do in the meantime.
Margaret:
Should we have a conference call to talk about our options?
BZDATETIME::2013-03-22 11:34:02
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::7
Sigh....I guess we do need to have a call with CTRP. Just when I thought things were ending with them.
BZDATETIME::2013-03-22 11:37:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::8
For this trial, we have 1,784 blocks for sites, representing 1,067 unique organizations, or a ballooning of the number of sites by 67%.
BZDATETIME::2013-03-22 11:44:39
BZCOMMENTOR::Bob Kline
BZCOMMENT::9
(In reply to comment #7)
> Sigh....I guess we do need to have a call with CTRP.
Or possibly two calls: one with CTRP, and one without them, to figure out what we should do while we're waiting for them to fix the problem. Unless you think we need a third "get our ducks in a row" meeting first, the sequence should probably be (1) talk to CTRP, tell them about the problem, find out when they can get it fixed; (2) based on what we find out from CTRP, do some triage on our side to figure out:
- how long we need to keep the CT.gov import job turned off
- whether we should run a modified CT.gov import job
- whether to turn off the CTRP import (probably yes)
- whether we should pull the bogus trial docs from CG
A "modified CT.gov import job" might be one that skips trials over a certain size. Or it could skip trials that have duplicate site blocks (though that might eliminate a good percentage of the trials).
BZDATETIME::2013-03-22 11:59:35
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::10
(In reply to comment #8)
> For this trial, we have 1,784 blocks for sites, representing 1,067 unique
> organizations, or a ballooning of the number of sites by 67%.
A look at ClinicalTrials.gov shows exactly 1,067 sites for this trial:
http://www.clinicaltrials.gov/ct2/show/NCT00324805?term=NCT00324805&rank=1
So CTRP is sending us 1,784 sites, we are getting another 1,067 sites from CT.gov, and we are storing all of them in the same trial. No wonder it takes so long to open this trial in the CDR.
It also looks like we are getting the same sites from CT.gov for all CTRP trials, so it doesn't look like we would miss anything if we turned off the CTRP service and got all the sites from CT.gov.
BZDATETIME::2013-03-22 12:22:22
BZCOMMENTOR::Bob Kline
BZCOMMENT::11
(In reply to comment #10)
> So CTRP is sending us 1,784 sites, we are getting another 1,067 sites from
> CT.gov, and we are storing all of them in the same trial. ....
Right. We're storing all of the sites twice: once from the document we get from NLM, and once for the document we get from CTRP. We knew that was going to happen when we decided not to discard the CT.gov Location blocks even if we have site information for the trial from CTRP (which may turn out to be a wiser decision than we originally imagined), though I don't think we were imagining trials with more than a thousand sites. So that would be 2,134 site blocks for this trial. As if that weren't enough of a strain on our processing resources, CTRP has inflated that number to 2,851 site blocks.
> It also looks like we are getting the same sites from CT.gov for all CTRP
> trials, so it doesn't look like we would miss anything if we turned off
> the CTRP service and got all the sites from CT.gov.
That would certainly solve the problem.
BZDATETIME::2013-03-22 12:26:30
BZCOMMENTOR::Bob Kline
BZCOMMENT::12
(In reply to comment #11)
> (In reply to comment #10)
> > It also looks like we are getting the same sites from CT.gov for all
> > CTRP trials, so it doesn't look like we would miss anything if we
> > turned off the CTRP service and got all the sites from CT.gov.
>
> That would certainly solve the problem.
Well, not all by itself. We would have to do something to strip out the existing CTRP information.
BZDATETIME::2013-03-22 12:41:37
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::13
I think we should have an internal conversation first before we talk to CTRP. I am confused. I thought the whole reason we are getting the sites from CTRP is that they are updated by RSS in their system, but not in CT.gov. Otherwise, why wouldn't we have just taken the trial info from CT.gov? I think I am forgetting some key element here...
BZDATETIME::2013-03-22 12:49:32
BZCOMMENTOR::Bob Kline
BZCOMMENT::14
(In reply to comment #13)
> I think we should have an internal conversation first before we talk to CTRP.
Sure, the sooner the better.
> I am confused. ....
This whole project has been one big pile of confusion from the start. :-)
BZDATETIME::2013-03-25 13:51:09
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::15
Can we meet tomorrow for a few minutes to talk about this? I am thinking maybe we should shut down the CTRP import, but I want to get Lakshmi's input on this. I sent her a link to this issue and also added her as a CC. I could meet before 10 AM or after 2:00 PM tomorrow.
BZDATETIME::2013-03-25 14:11:55
BZCOMMENTOR::Bob Kline
BZCOMMENT::16
(In reply to comment #15)
> Can we meet tomorrow for a few minutes to talk about this? I am thinking
> maybe we should shut down the CTRP import, but I want to get Lakshmi's
> input on this. I sent her a link to this issue and also added her as a
> CC. I could meet before 10 AM or after 2:00 PM tomorrow.
Sure. I have a 2:30-3:30 meeting with Jonathan but should be free otherwise. The sooner the better.
BZDATETIME::2013-03-25 14:15:22
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::17
We'll see if Lakshmi can meet at 3:30. I did let Kim know there was an issue and she said:
"I can’t say I am aware of the specific issue, but I know that there are plans to make an update to the feed we send to you; it was dependent on our last release which came out late last week. It could be related."
BZDATETIME::2013-04-04 16:41:26
BZCOMMENTOR::Bob Kline
BZCOMMENT::18
CTRP has fixed the problem. I took a look at the latest download set, and the document sizes are significantly reduced, and I didn't see any duplicate sites. Volker is turning the CTRP import job back on. He'll monitor the job this evening, and if it completes without problems before he turns in, he'll turn on the CT.gov jobs, too.
BZDATETIME::2013-04-04 17:10:03
BZCOMMENTOR::Volker Englisch
BZCOMMENT::19
(In reply to comment #18)
> Volker is turning the CTRP import job back on.
This is done. The CTRP import will run again tonight.
BZDATETIME::2013-04-08 12:12:25
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::20
(In reply to comment #18)
> CTRP has fixed the problem. I took a look at the latest download set, and
> the document sizes are significantly reduced, and I didn't see any
> duplicate sites. Volker is turning the CTRP import job back on. He'll
> monitor the job this evening, and if it completes without problems before
> he turns in, he'll turn on the CT.gov jobs, too.
It was my understanding that the CT.gov download service would be turned on on Friday, since the CTRP service was successfully turned on. Did the CT.gov jobs get turned on? It doesn't look like it, since we still can't see any of the trials we imported about three weeks ago.
BZDATETIME::2013-04-08 12:16:36
BZCOMMENTOR::Bob Kline
BZCOMMENT::21
(In reply to comment #20)
> Did the CTGov jobs get turned on?
I assume so. Volker was going to do it Thursday night, but the CTRP job ran too long, and he would have held off turning it on the next day to avoid problems with the weekly publishing job. I expect it will run tonight. Can you confirm, Volker?
BZDATETIME::2013-04-08 12:52:03
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::22
(In reply to comment #21)
> (In reply to comment #20)
> > Did the CTGov jobs get turned on?
>
> Can you confirm, Volker?
Volker is out today and tomorrow. Is it possible for you or someone else to check?
BZDATETIME::2013-04-08 13:17:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::23
Just checked. The instruction to import was still commented out. I have uncommented the instruction, so it should run tonight.
BZDATETIME::2013-04-08 13:19:25
BZCOMMENTOR::Alan Meyer
BZCOMMENT::24
I also investigated and was just about to say that the job ran successfully this morning. But a more accurate statement would be that the job ran successfully but didn't do what we would normally expect it to do.
BZDATETIME::2013-04-08 13:58:11
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::25
Thank you Bob and Alan. Hopefully, it will run without any problems tonight.
BZDATETIME::2013-04-09 11:48:09
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::26
(In reply to comment #25)
> Thank you Bob and Alan. Hopefully, it will run without any problems
> tonight.
It doesn't look like the jobs ran successfully. We still don't see trials that were imported weeks ago.
BZDATETIME::2013-04-09 13:09:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::27
There was one more modification which needed to be made to the SQL Server scheduler. I expect it will run tonight.
BZDATETIME::2013-04-09 14:46:13
BZCOMMENTOR::Volker Englisch
BZCOMMENT::28
Which was the other modification???
I bet I had the changes done in my sandbox on Friday, but it was getting late and I was rushing to finish up a lot of things before the weekend when I messed up my sandbox by overwriting every modified file.
:-(
Most likely I wouldn't have thought about the "other" modification.
BZDATETIME::2013-04-09 15:11:19
BZCOMMENTOR::Bob Kline
BZCOMMENT::29
The answer to the "what do I do if step 1 [the download script] succeeds?" question in the scheduler's configuration had been set to "jump to step 3 [exit with failure]" so I set it back to "jump to step 2 [the import script]."
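For reference, a sketch of scripting that scheduler fix against SQL Server Agent; the job and step names are hypothetical, and msdb.dbo.sp_update_jobstep's parameters should be double-checked against the server's SQL Server version:

    # Hypothetical repair of the "on success" action for step 1 of the
    # nightly job, via pyodbc against msdb. Job and step identifiers are
    # assumptions; on_success_action = 4 means "go to the step named in
    # on_success_step_id".
    import pyodbc

    conn = pyodbc.connect("DSN=cdr;Trusted_Connection=yes", autocommit=True)
    conn.execute(
        "EXEC msdb.dbo.sp_update_jobstep "
        "@job_name = N'CTGovDownloadImport', "  # hypothetical job name
        "@step_id = 1, "                        # the download script
        "@on_success_action = 4, "              # jump to a specific step...
        "@on_success_step_id = 2"               # ...the import script
    )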
BZDATETIME::2013-04-10 08:18:14
BZCOMMENTOR::Bob Kline
BZCOMMENT::30
The CT.gov import job is running. Right now it's processing CDR475774, which is the document which caused the original problems. It has slimmed down considerably, but it's still over 2MB. This document has been churning along since 7:39. We have a choice to let it finish, or to abort the import job. I'll hold off until William has weighed in.
BZDATETIME::2013-04-10 08:31:38
BZCOMMENTOR::Bob Kline
BZCOMMENT::31
The document finished processing at 8:16, so it took 37 minutes. It's now processing CDR478976, weighing in at roughly 1.8 MB. It will be useful to see how much difference the smaller size makes in processing time.
BZDATETIME::2013-04-10 08:33:33
BZCOMMENTOR::Bob Kline
BZCOMMENT::32
(In reply to comment #31)
> The document finished processing at 8:16, so it took 37 minutes. It's now
> processing CDR478976, weighing in at roughly 1.8 MB. It will be useful to
> see how much difference the smaller size makes in processing time.
CDR475774 was 2.3 MB, so this one is roughly 4/5 the size.
BZDATETIME::2013-04-10 08:43:27
BZCOMMENTOR::Robin Juthe
BZCOMMENT::33
(In reply to comment #32)
> (In reply to comment #31)
> > The document finished processing at 8:16, so it took 37 minutes. It's
> > now processing CDR478976, weighing in at roughly 1.8 MB. It will be
> > useful to see how much difference the smaller size makes in processing
> > time.
>
> CDR475774 was 2.3 MB, so this one is roughly 4/5 the size.
The CDR is noticeably slower this morning. I am having trouble saving the behemoth (Breast/Ovarian) right now.
Christina also reported the following:
The CDR took >10 minutes to load (after several tries) and then timed out on me as I was saving the Endocrine summary. As far as I can tell, the CDR is still trying to save my change to the Endocrine summary and check it in...15 minutes later. At least one other person is having problems, too. I let William know about it earlier today.
BZDATETIME::2013-04-10 08:46:50
BZCOMMENTOR::Robin Juthe
BZCOMMENT::34
(In reply to comment #33)
> (In reply to comment #32)
> > (In reply to comment #31)
> > > The document finished processing at 8:16, so it took 37 minutes.
> > > It's now processing CDR478976, weighing in at roughly 1.8 MB. It
> > > will be useful to see how much difference the smaller size makes in
> > > processing time.
> >
> > CDR475774 was 2.3 MB, so this one is roughly 4/5 the size.
>
> The CDR is noticeably slower this morning. I am having trouble saving the
> behemoth (Breast/Ovarian) right now.
>
> Christina also reported the following:
>
> The CDR took >10 minutes to load (after several tries) and then timed out
> on me as I was saving the Endocrine summary. As far as I can tell, the
> CDR is still trying to save my change to the Endocrine summary and check
> it in...15 minutes later. At least one other person is having problems,
> too. I let William know about it earlier today.
The B/O summary took ~3 minutes to save.
BZDATETIME::2013-04-10 08:58:05
BZCOMMENTOR::Bob Kline
BZCOMMENT::35
(In reply to comment #33)
> The CDR is noticeably slower this morning. I am having trouble saving the
> behemoth (Breast/Ovarian) right now.
>
> Christina also reported the following:
>
> The CDR took >10 minutes to load (after several tries) and then timed out
> on me as I was saving the Endocrine summary. As far as I can tell, the
> CDR is still trying to save my change to the Endocrine summary and check
> it in...15 minutes later. At least one other person is having problems,
> too. I let William know about it earlier today.
You might want to negotiate with Laura and/or Chuck to get some time allocated (a day or two) to work on some of the possible measures we've discussed: eliminating the sites stored in NLM's portion of the document, at least for the docs where the sites look like roughly the same set we're getting from CTRP; perhaps skipping over the really big documents, maybe processing them on the weekends; etc.
BZDATETIME::2013-04-10 09:57:52
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::36
I will talk to Laura about getting some of Bob's time to address this.
I spoke with Lakshmi about various options for making these documents smaller, and she agreed that we could strip out the NLM sites from the trials. She did recommend that we make sure we have all of the sites from CTRP for the transferred trials before stripping them out. She is already on this issue.
BZDATETIME::2013-04-10 10:30:27
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::37
(In reply to comment #30)
> The CT.gov import job is running. Right now it's processing CDR475774,
> which is the document which caused the original problems. It has slimmed
> down considerably, but it's still over 2MB. This document has been
> churning along since 7:39. We have a choice to let it finish, or to abort
> the import job. I'll hold off until William has weighed in.
I think you can abort the import job. Users can barely do anything in the CDR. What time does the import job start at night? I am wondering if you can start it earlier tonight so that, hopefully, it will finish before 9:00 AM tomorrow morning.
BZDATETIME::2013-04-10 10:39:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::38
(In reply to comment #37)
> I am wondering if you can start it earlier tonight so that, hopefully, it
> will finish before 9:00 AM tomorrow morning.
I don't think we can start the CTGov import job too early because it could interfere with the CTRP import job. Maybe those jobs could be run on alternate days?
BZDATETIME::2013-04-10 10:52:05
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::39
(In reply to comment #38)
> (In reply to comment #37)
> > I am wondering if you can start it earlier tonight so that, hopefully,
> > it will finish before 9:00 AM tomorrow morning.
>
> I don't think we can start the CTGov import job too early because it
> could interfere with the CTRP import job. Maybe those jobs could be run
> on alternate days?
We certainly want the CT.gov import job to run daily, but we can have the CTRP import job run only on weekends. The CDR appears to be fine now for most users.
BZDATETIME::2013-04-10 13:02:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::40
The login and sluggishness issue has resurfaced. I (and at least one other user) am unable to log into either Mahler or Bach.
BZDATETIME::2013-04-10 13:17:19
BZCOMMENTOR::Bob Kline
BZCOMMENT::41
(In reply to comment #40)
> The login and sluggishness issue has resurfaced. I (and at least one
> other user) am unable to log into either Mahler or Bach.
The service had been stopped on Mahler for the memory leak investigation work. I restarted it.
BZDATETIME::2013-04-10 13:25:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::42
(In reply to comment #41)
> (In reply to comment #40)
> > The login and sluggishness issue has resurfaced. I (and at least one
> > other user) am unable to log into either Mahler or Bach.
>
> The service had been stopped on Mahler for the memory leak investigation
> work. I restarted it.
Mahler is back to normal. The problem on Bach is looking urgent. Maybe we should follow William's advice here and kill the job that is hanging up Bach, and then leave it off until we discuss it at the meeting tomorrow.
Can we tell from the log files how close we are to finishing?
BZDATETIME::2013-04-10 14:28:36
BZCOMMENTOR::Bob Kline
BZCOMMENT::43
(In reply to comment #42)
> Mahler is back to normal. The problem on Bach is looking urgent. Maybe we
> should follow William's advice here and kill the job that is hanging up
> Bach, and then leave it off until we discuss it at the meeting tomorrow.
>
> Can we tell from the log files how close we are to finishing?
As of 30 seconds ago, 78 transfers had been processed, another 2,938 updates were done, and there are 28 more trials with the status "import requested." Shall I abort?
BZDATETIME::2013-04-10 14:30:21
BZCOMMENTOR::Bob Kline
BZCOMMENT::44
(In reply to comment #36)
> I will talk to Laura about getting some of Bob's time to address this.
>
> I spoke with Lakshmi about various options for making these documents
> smaller, and she agreed that we could strip out the NLM sites from the
> trials. She did recommend that we make sure we have all of the sites from
> CTRP for the transferred trials before stripping them out. She is already
> on this issue.
She is, but her Bugzilla account is set to send her no email notifications, so you'll have to communicate any updates she should be aware of to her directly.
BZDATETIME::2013-04-10 14:36:23
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::45
(In reply to comment #43)
> > Can we tell from the log files how close we are to finishing?
>
> As of 30 seconds ago, 78 transfers had been processed, another 2,938
> updates were done, and there are 28 more trials with the status "import
> requested." Shall I abort?
Please do not abort. We seem to be okay now.
BZDATETIME::2013-04-10 15:26:46
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::46
Hi Bob. I talked to Laura and she said she was going to talk to you about putting time on the CDR project to get this problem fixed. We can discuss what needs to be done at the meeting tomorrow. I will ask Lakshmi to follow this issue.
BZDATETIME::2013-04-10 15:38:31
BZCOMMENTOR::Bob Kline
BZCOMMENT::47
(In reply to comment #46)
> Hi Bob. I talked to Laura and she said she was going to talk to you about
> putting time on the CDR project to get this problem fixed. We can discuss
> what needs to be done at the meeting tomorrow. I will ask Lakshmi to
> follow this issue.
OK. I'm already looking into pruning the NLM sites. Let me know if you want me to hold off any work on this until after tomorrow's meeting.
BZDATETIME::2013-04-10 16:37:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::48
(In reply to comment #43)
> ... and there are 28 more trials with the status "import requested."
It looks like those 28 got their "import requested" status after the import job had started, so they weren't going to be picked up by this morning's job anyway.
BZDATETIME::2013-04-10 18:26:12
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::49
Are you going to turn off the import job to avoid the login problem tomorrow morning? We've had enough trials come in to keep us busy for a while, so it won't hurt anything to turn it off temporarily if that will help.
BZDATETIME::2013-04-10 18:49:59
BZCOMMENTOR::Bob Kline
BZCOMMENT::50
During this morning's CT.gov import job, which started at 6:59 and finished at 1:50, there were 34 documents which took a minute or more to process. The document which took the most time was CDR692475, clocking in at 43:20 of processing time. These 34 documents took a total of 4 hours and 22 minutes, well over half of the time it took to do the whole job. They are listed on the first sheet of the attached Excel workbook, sorted with the heavy hitters at the top. The second sheet lists all 26,043 active CTGovProtocol documents in the system, sorted with the largest documents at the top (CDR692475 is the biggest, at 3,867,609 characters, which could be even more bytes, since those are Unicode characters).
One discouraging discovery is that we still have duplicate sites in many - though not all - of the CTRP blocks. That biggest document, for example, has 1,421 duplicates; it is also the one that took over 43 minutes to process. So I guess they got rid of the dups in the ones we were watching closely (for example, CDR475774), but since we didn't examine every single document, we didn't notice that they hadn't eliminated all of the duplicates. Part (possibly all) of the problem is that CTRP isn't sending us all of the documents. For example, NCI-2011-02623 isn't in last night's set (in fact, they haven't sent us that one since March 6). That's the one that corresponds to our biggest document.
Attachment ctgov-import-problems.xls has been added with description: Results of analysis of this morning's import job
BZDATETIME::2013-04-10 19:11:03
BZCOMMENTOR::Bob Kline
BZCOMMENT::51
(In reply to comment #49)
> Are you going to turn off the import job to avoid the login problem
> tomorrow morning?
I can do that if you want.
BZDATETIME::2013-04-10 21:16:23
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::52
(In reply to comment #51)
> (In reply to comment #49)
> > Are you going to turn off the import job to avoid the login problem
> > tomorrow morning?
>
> I can do that if you want.
Yes. Please turn it off temporarily.
BZDATETIME::2013-04-10 23:28:40
BZCOMMENTOR::Bob Kline
BZCOMMENT::53
(In reply to comment #52)
> Yes. Please turn it off temporarily.
Done.
BZDATETIME::2013-04-11 08:51:46
BZCOMMENTOR::Bob Kline
BZCOMMENT::54
I just sent Charles a message with the attached report. The message said:
Charles:
Some of the trials have indeed been corrected, but we still have 88 trials with duplicate sites. As you can see from the attached spreadsheet, we have a couple of variations on the problem. Some of these trials have been exported with duplicate sites since the fix was announced to have been put in place. Others have not been exported since the fix was installed, but should have been, because the removal of the duplicate sites would have resulted in a changed document. We have had to turn off publishing of all updates to the CTRP trials until the problems are completely resolved, because the processing of the trials with duplicate sites was crippling our production environment.
Bob
Attachment ctrp-trials-that-still-have-duplicate-sites.xls has been added with description: Report for Charles of trials that still have duplicate sites
BZDATETIME::2013-04-11 10:46:23
BZCOMMENTOR::Bob Kline
BZCOMMENT::55
The only really interesting new information in this report is the addition of the six trials for which CTRP has sent us corrected XML documents, but which we can't import yet because of mapping gaps. So we have a total of 94 CTGovProtocol documents with duplicate sites. CTRP has fixed six of them, and still needs to fix the other 88.
Attachment duplicate-sites.xls has been added with description: Trials that still have duplicate sites in the CTGovProtocol document
BZDATETIME::2013-04-11 12:10:29
BZCOMMENTOR::Alan Meyer
BZCOMMENT::56
We have reason to believe that processing of the huge CTGov and CTRP protocols is limited by the number of query term index entries that get updated when a new version is stored. The problem is that inserting or deleting even a single site changes the indexing order, which requires a database update of the index terms for every subsequent site.
I was talking to Lakshmi this morning about another topic (CDR documentation) and mentioned this to her. She questioned whether we need to continue to index CTGov or CTRP protocol sites.
Bob and I have discussed a number of ways to turn this indexing off, but before doing any serious design work on the problem we first need to know if turning off indexing for sites is okay, or whether it harms functionality that we need.
Are there any thoughts on that?
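If turning the indexing off does turn out to be safe, a dry-run sketch of the removal, assuming a query_term_def table keyed by path (the table and column names are assumptions about the CDR schema):

    # Hypothetical dry run for removing the site-subtree index definitions.
    # The query_term_def table and its path column are assumptions; nothing
    # here should run against production without review.
    SITE_SUBTREES = (
        "/CTGovProtocol/CTRPInfo/CTRPLocation/%",
        "/CTGovProtocol/Location/%",
    )

    def list_site_index_defs(cursor):
        for pattern in SITE_SUBTREES:
            cursor.execute("SELECT path FROM query_term_def WHERE path LIKE ?",
                           (pattern,))
            for (path,) in cursor.fetchall():
                print("candidate for removal: %s" % path)
            # After review, the actual removal would be something like:
            # cursor.execute("DELETE FROM query_term_def WHERE path LIKE ?",
            #                (pattern,))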
BZDATETIME::2013-04-15 14:16:34
BZCOMMENTOR::Bob Kline
BZCOMMENT::57
Next steps:
[X] Implement global change to strip NLM sites where CTRP sites exist
[ ] Test global change
[ ] Run global change in production
[X] Implement changes to CT.gov import program/filter
[ ] Test changes to CT.gov import program/filter
[ ] Get permission from CBIIT to deploy changes to production
[ ] Deploy CT.gov import changes to production
[ ] Get CTRP to fix their exports [Lakshmi to communicate with CTRP]
[ ] Test the results of CTRP's fixes
[ ] Adjust schedules of import jobs to minimize impact on workday
[ ] Turn CTRP import back on
[ ] Turn CT.gov import back on
[ ] Verify that the affected data are correct
[ ] Monitor impact of reduced document sizes on import performance
[ ] Determine whether further modifications to protocol documents are needed
I have done the implementation parts. To get rolling with the testing, CIAT should review the test results[1].
[1] http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2013-04-15_11-49-28
BZDATETIME::2013-04-15 16:44:37
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::58
(In reply to comment #57)
> Next steps:
> I have done the implementation parts. To get rolling with the testing,
> CIAT should review the test results[1].
>
> [1] http://bach.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2013-04-15_11-49-28
I have looked at several trials in the results and compared them with the CDR records in XMetaL, and they all seem fine. The global change appears to be doing exactly what is expected.
BZDATETIME::2013-04-15 16:46:19
BZCOMMENTOR::Bob Kline
BZCOMMENT::59
Shall we proceed with the live-mode run on Bach, or should I do a live run on Franck first?
BZDATETIME::2013-04-15 17:18:57
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::60
(In reply to comment #59)
> Shall we proceed with the live-mode run on Bach, or should I do a live
> run on Franck first?
Let's do a live-mode run on Franck first.
BZDATETIME::2013-04-16 10:19:53
BZCOMMENTOR::Bob Kline
BZCOMMENT::61
I just noticed that on both Bach and Franck someone had created a query term definition for the path '/CTGovProtocol/CTRPInfo'. That would cause an enormous amount of work during document save processing for the larger documents, because the DOM processor has to build up a composite string for indexing from all the children, grandchildren, and great-grandchildren of the thousands of site blocks under CTRPInfo. I have removed the definition on both servers. I am very eager to see what effect this has on the document save times.
BZDATETIME::2013-04-17 15:19:16
BZCOMMENTOR::Bob Kline
BZCOMMENT::62
(In reply to comment #60)
> > ... should I do a live run on Franck first?
>
> Let's do a live-mode run on Franck first.
Done. Please examine the CTGovProtocol documents on Franck to verify that they look OK. The job took 22 hours and 17 minutes, so we'll want to schedule the run on Bach carefully (and we may want to revisit Alan's idea of masking out indexing of the site information).
Has Lakshmi had any luck lighting a fire under CTRP? The job took so long largely because we still have all those duplicate sites in the documents.
BZDATETIME::2013-04-18 08:28:21
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::63
We have reviewed a lot of the trials on Franck and didn't find any problems. All the ones we checked looked good so it looks like we are ready for a live run on Bach.
BZDATETIME::2013-04-18 09:00:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::64
(In reply to comment #63)
> We have reviewed a lot of the trials on Franck and didn't find any
> problems. All the ones we checked looked good so it looks like we are
> ready for a live run on Bach.
I'm planning on running it after the weekly publishing job. Volker's going to send me a note when that's done.
BZDATETIME::2013-04-22 10:58:22
BZCOMMENTOR::Bob Kline
BZCOMMENT::65
I ran the CTRP download and import jobs over the weekend by hand. I'm doing the analysis of the current state of the CTGovProtocol documents and hope to have some results to post later in the day.
BZDATETIME::2013-04-22 15:22:10
BZCOMMENTOR::Bob Kline
BZCOMMENT::66
(In reply to comment #65)
> I ran the CTRP download and import jobs over the weekend by hand. I'm
> doing the analysis of the current state of the CTGovProtocol documents
> and hope to have some results to post later in the day.
One of the first things I did was to write a report identifying all of the CTGovProtocol documents which still have more than one location with the same CDR org ID, and I found 371 trials. Some of these (the minority) are the result of the fact that we can't import the latest site information we got from CTRP because of new mapping gaps. These are noted by a "Y" in the "MAPPING GAPS?" column on the first sheet of the attached report.
I drilled down more closely into one of the other trials, and (for this trial at least, and I suspect for most or all of the others in this category) it appears that we have mapped more than one of CTRP's PO IDs to the same organization. For some of these it looks like the organizations really are different (for example, PO ID 250307 for Saint Mary's Hospital at 1044 Kabel Avenue in Rhinelander WI, and PO ID 252065 for Marshfield Clinic at James Beck Cancer Center, 2251 North Shore Drive, also in Rhinelander; it seems possible that someone saw they were in the same city and just mapped them to the same CDR org doc - I don't see any evidence in the CDR27374 document that the name of this org is or ever was "Marshfield Clinic" or "James Beck Cancer Center"). For others, it looks like CTRP and/or CTEP are just being sloppy: not really curating the organization records, and failing to notice that they're dealing with minor variations in organization names. That defeats our efforts to present a list of distinct options for cancer patients looking for clinical trials in which they might participate, and also cripples our efforts to get the document sizes down to a manageable level.
I have posted details for the trials which have multiple sites with the same CDR org ID, as well as the detailed information on the duplicates in CDR66727. I will wait for instructions on how to proceed. One discouraging data point: the CTRP import job only imported sites for 523 trials, but took more than 8 1/2 hours.
Attachment ctrp-dups.xls has been added with description: Analysis of weekend's CTRP import job
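A sketch of the shape of that duplicate-org report, assuming the query_term table indexes an org reference per location; the path constant and column names are hypothetical:

    # Sketch of the report described above: find CTGovProtocol documents
    # whose indexed sites contain the same CDR org ID more than once. The
    # query_term layout and the org-ref path are assumptions.
    from collections import defaultdict

    ORG_REF_PATH = "/CTGovProtocol/Location/Facility/Name/@cdr:ref"  # hypothetical

    def trials_with_duplicate_orgs(cursor):
        cursor.execute("SELECT doc_id, value FROM query_term WHERE path = ?",
                       (ORG_REF_PATH,))
        counts = defaultdict(lambda: defaultdict(int))  # doc_id -> org -> count
        for doc_id, org_ref in cursor.fetchall():
            counts[doc_id][org_ref] += 1
        return dict((doc_id, [org for org, n in orgs.items() if n > 1])
                    for doc_id, orgs in counts.items()
                    if any(n > 1 for n in orgs.values()))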
BZDATETIME::2013-04-28 20:46:46
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::67
I believe I've fixed all but one of the mapping gap problems that prevented you from importing the latest files from CTRP; the remaining one is a country record on one of the pages, the result of a typo from CTRP. I couldn't fix it because I would have had to create a new country record with the typo in the country name in order to get it off the mapping gaps page. I will email CTRP tomorrow morning and hopefully they will fix it ASAP. Meanwhile, I believe we can continue with the next steps as we discussed in last Thursday's CDR meeting. Hopefully, that one country record will not prevent you from importing a lot of the trials.
BZDATETIME::2013-04-29 08:53:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::68
I ran the CTRP import job by hand last night. Only one trial was skipped.
[X] Implement global change to strip NLM sites where CTRP sites exist
[X] Test global change
[X] Run global change in production
[X] Implement changes to CT.gov import program/filter
[ ] Test changes to CT.gov import program/filter
[ ] Get permission from CBIIT to deploy changes to production
[ ] Deploy CT.gov import changes to production
[X] Get CTRP to fix their exports [Lakshmi to communicate with CTRP]
[X] Test the results of CTRP's fixes
[ ] Adjust schedules of import jobs to minimize impact on workday
[ ] Turn CTRP import back on
[ ] Turn CT.gov import back on
[ ] Verify that the affected data are correct
[ ] Monitor impact of reduced document sizes on import performance
[ ] Determine whether further modifications to protocol documents are needed
I am going to run the CTRP import job on Franck, followed by the CT.gov import job with the changes to the import program and filter so CIAT can test the results of those changes.
BZDATETIME::2013-04-29 09:42:57
BZCOMMENTOR::Bob Kline
BZCOMMENT::69
(In reply to comment #68)
> I am going to run the CTRP import job on Franck, followed by the CT.gov
> import job with the changes to the import program and filter so CIAT can
> test the results of those changes.
I ran the CTRP download job and had started the CTRP import job when I realized that we'd have the same mapping gap problem on Franck as we had on Bach. So perhaps a better plan would be to work from a fresh copy of the data on Bach. Would it be possible to refresh Franck sometime today, Volker?
BZDATETIME::2013-04-29 11:19:24
BZCOMMENTOR::Volker Englisch
BZCOMMENT::70
(In reply to comment #69)
> Would it be possible to refresh Franck sometime today, Volker?
Yes, it's possible and yes, it's done.
BZDATETIME::2013-04-29 11:28:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::71
(In reply to comment #70)
> (In reply to comment #69)
> > Would it be possible to refresh Franck sometime today, Volker?
>
> Yes, it's possible and yes, it's done.
Thanks, Volker. I'm going to start a CT.gov import job on Franck in a few minutes. Can you tell me when the backup was taken from which the refresh was done (so I can figure out whether I need to run last night's download job on Franck)?
BZDATETIME::2013-04-29 11:41:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::72
(In reply to comment #71)
> Can you tell me when the backup was taken from which the refresh was done
The time stamp of the backup file is 4/29/2013 12:52am.
BZDATETIME::2013-04-30 15:02:27
BZCOMMENTOR::Bob Kline
BZCOMMENT::73
I ran the CT.gov import job on Franck. A total of 2,264 trial documents were added, transferred, or updated. Of these, 72 had CTRP sites, so any Location blocks in the documents downloaded from NLM were dropped. Another 1,907 documents have NLM sites in the CTGovProtocol XML, which means 285 trials have no sites from either source. No trial document in the job has both CTRP and NLM sites, which is the goal of this change. Please take a look and make sure everything looks as expected. I will post a complete list of the trial documents processed by the job.
Attachment has-ctrp-sites.txt has been added with description: Trials for which NLM sites were dropped if present in incoming clinical_trial XML doc
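The actual rule lives in the XSLT import filter (CDR0000349690) and ImportCTGovProtocols.py; the sketch below only illustrates the behavior being tested, with assumed element names:

    # Illustration only: the real change is in the XSLT import filter
    # (CDR0000349690) and ImportCTGovProtocols.py. Element names here are
    # assumptions about the CTGovProtocol layout.
    import xml.etree.ElementTree as etree

    def strip_nlm_sites(doc_xml):
        """Drop NLM Location blocks when the document has CTRP sites."""
        root = etree.fromstring(doc_xml)
        if root.find(".//CTRPInfo/CTRPLocation") is not None:
            for parent in list(root.iter()):
                for location in parent.findall("Location"):
                    parent.remove(location)
        return etree.tostring(root)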
BZDATETIME::2013-04-30 15:03:45
BZCOMMENTOR::Bob Kline
BZCOMMENT::74
Attachment import-job-1188.txt has been added with description: List of all documents process by CT.gov import job on Franck
BZDATETIME::2013-05-01 17:10:09
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::75
We've reviewed several records and they all look good. Please proceed to do the next step.
BZDATETIME::2013-05-01 18:08:44
BZCOMMENTOR::Bob Kline
BZCOMMENT::76
(In reply to comment #75)
> We've reviewed several records and they all look good. Please proceed to
> do the next step.
The next step is permission from CBIIT to make the change to the import code. I have asked Erika what the status is of the change request I submitted back on April 11.
The step after that is to determine what the modified schedule should be for the import jobs. We can discuss that at tomorrow's status meeting.
BZDATETIME::2013-05-06 10:15:29
BZCOMMENTOR::Bob Kline
BZCOMMENT::77
I have turned the CT.gov import job back on and modified the schedule so that the download/import jobs start at 3 AM Monday through Saturday, skipping Sunday. The CTRP jobs have been moved to 11 PM Saturday night, so they should be done well before the Monday morning CT.gov jobs start. I ran the CT.gov job over the weekend and it took a little over two hours, a considerable improvement over what we had been seeing recently. This morning's scheduled job didn't have anything to do (and we might consider eliminating the Monday morning job).
I noticed you have disabled the CTRP import job using line-by-line comments in the jobmaster Python script, Volker, so I'll let you do the re-enabling, lest I uncomment a line which was supposed to stay commented out.
Let me know if you need the CTRP jobs run before Saturday, William, and if so, when they should happen, and I'll do a manual run.
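A minimal sketch of the weekday gating described above, as a jobmaster script might express it; this is an illustration, not the actual JobmasterCTRP.py logic:

    # Hypothetical weekday gate matching the schedule above: CT.gov jobs
    # Monday-Saturday, CTRP jobs Saturday night only.
    import datetime

    def jobs_for_today(now=None):
        now = now or datetime.datetime.now()
        jobs = []
        if now.weekday() != 6:                         # skip Sunday (Monday == 0)
            jobs.append("ctgov-download-import")       # 3 AM run
        if now.weekday() == 5:                         # Saturday
            jobs.append("ctrp-download-import")        # 11 PM run
        return jobs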
BZDATETIME::2013-05-06 11:58:37
BZCOMMENTOR::Bob Kline
BZCOMMENT::78
Volker: CBIIT gave us permission to install this change on their servers, so you can go ahead and update the filter (CDR0000349690.xml) and the import script (ImportCTGovProtocols.py) on CBIIT Mahler (and CBIIT Franck if it exists).
BZDATETIME::2013-05-06 14:39:35
BZCOMMENTOR::Volker Englisch
BZCOMMENT::79
(In reply to comment #77)
> I noticed you have disabled the CTRP import job using line-by-line
> comments in the jobmaster Python script, Volker, so I'll let you do the
> re-enabling, lest I uncomment a line which was supposed to stay commented
> out.
I updated the following script and copied it to FRANCK, BACH, and CBIIT-DEV:
JobmasterCTRP.py - R11676
BZDATETIME::2013-05-24 14:46:29
BZCOMMENTOR::Volker Englisch
BZCOMMENT::80
The following program and filter have been copied to the CBIIT-MAHLER:
CDR0000349690 - Import CTGovProtocol - R11692
ImportCTGovProtocols.py - R11691
We decided at the CDR status meeting to close this issue since the issue has been resolved.
File Name | Posted | User |
---|---|---|
ctgov-import-problems.xls | 2013-04-10 18:49:59 | |
CTGovProtocol_QueryTermBreakdown.txt | 2013-03-22 00:13:40 | |
ctrp-dups.xls | 2013-04-22 15:22:10 | |
ctrp-trials-that-still-have-duplicate-sites.xls | 2013-04-11 08:51:46 | |
duplicate-sites.xls | 2013-04-11 10:46:23 | |
has-ctrp-sites.txt | 2013-04-30 15:02:27 | |
import-job-1188.txt | 2013-04-30 15:03:45 |