CDR Tickets

Issue Number 3044
Summary [CTGov] Global to copy Citations
Created 2009-12-17 13:29:25
Issue Type Improvement
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2010-02-23 15:13:43
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107372
Description

BZISSUE::4720
BZDATETIME::2009-12-17 13:29:25
BZCREATOR::William Osei-Poku
BZASSIGNEE::Alan Meyer
BZQACONTACT::William Osei-Poku

In OCECDR-3021 a schema change and an import software change were made to include the RelatedPublications and PublishedResults blocks in the CTGovProtocol document. We need a global change to copy all Citations (RelatedPublications and PublishedResults) for trials that have already been converted and that had Citations in their InScopeProtocol documents. Not all trials will have these blocks.

Comment entered 2009-12-24 18:12:38 by alan

BZDATETIME::2009-12-24 18:12:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

I have written the global change program and run it on Franck in
test mode. The data on Mahler wasn't up to date enough for a
useful test run.

Testing revealed a number of related changes that need to be made
before running the global:

1. Add new Citation link control table entries for:

CTGovProtocol//Citation -> Citation
CTGovProtocol//RelatedCitation -> Citation

2. Modify the CTGovProtocol schema to support the attribute
"PdqKey" in the Citation element.

A possible alternative would be to drop the PdqKey attribute
from the citations when we add them.

I made the link control table changes on Franck (but not yet
Mahler or Bach) but left the CTGovProtocol schema alone for now.
I ran the test with validation, so the schema validation error
will show in "Errs" files in the test results page.

My global change program will copy the PublishedResults and
RelatedPublications elements into an existing PDQAdminInfo
element, if there is one. If no PDQAdminInfo exists, I create
one and put the results in it. This did not happen for any
current working documents but did happen with some publishable
versions, viz:

66987
68713
257255
257580
333213
422431
422432
423204
428447
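
The copy step described above can be sketched roughly as follows. This
is not the actual Request4720.py code. It is a minimal illustration
using Python's standard xml.etree.ElementTree, and it assumes that
PDQAdminInfo is a direct child of the CTGovProtocol root and that the
citation blocks can sit anywhere inside the InScopeProtocol. The
function name copy_citations is made up for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Blocks named in this ticket that carry the citations.
CITATION_BLOCKS = ("PublishedResults", "RelatedPublications")

def copy_citations(inscope_xml, ctgov_xml):
    """Return (CTGov XML, changed flag). Illustrative sketch only."""
    inscope = ET.fromstring(inscope_xml)
    ctgov = ET.fromstring(ctgov_xml)

    # Collect the citation blocks from the InScopeProtocol (assumed layout).
    blocks = [el for el in inscope.iter() if el.tag in CITATION_BLOCKS]
    if not blocks:
        return ctgov_xml, False  # nothing to copy; leave the document alone

    # Reuse an existing PDQAdminInfo, or create one if the document has none.
    admin = ctgov.find("PDQAdminInfo")
    if admin is None:
        admin = ET.SubElement(ctgov, "PDQAdminInfo")

    for block in blocks:
        # deepcopy keeps child elements and attributes such as PdqKey
        admin.append(copy.deepcopy(block))

    return ET.tostring(ctgov, encoding="unicode"), True
```

Because the blocks are copied whole, any PdqKey attribute on a Citation
comes along with them, which is why the schema question in item 2
matters.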

Finally, according to my program, the great majority of protocols
do not have any PublishedResults or RelatedPublications. Of the
1080 documents processed, 814 of the current working documents
did not get updated with these elements.

Test results are in: 2009-12-24 17:22:40

A log file is attached. The log contains some additional
information to aid in testing. For each document processed, a
line like the following will appear:

2009-12-24 17:22:52: For CTGov doc=63391: fetching: cdr:CDR0000063391/75
or:
2009-12-24 17:24:09: For CTGov doc=393403: fetching: cdr:CDR0000485418

For the first one, CTGov doc 63391 was associated with version 75
of the same document. This was one of the new style conversions
where the same CDR ID was retained for the CTGov replacement of
the InScope doc.

For the second one, CTGov doc 393403 was associated with
InScopeProtocol 485418.
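
For anyone who wants to check the log mechanically, here is a small
sketch of a parser for the two line shapes shown above. The regular
expression is inferred only from those two sample lines, so other lines
in the real log may not match; the name parse_fetch_line is
hypothetical.

```python
import re

# Matches lines like:
#   2009-12-24 17:22:52: For CTGov doc=63391: fetching: cdr:CDR0000063391/75
# The "/75" version suffix is optional.
LOG_LINE = re.compile(
    r"^(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"For CTGov doc=(?P<ctgov>\d+): fetching: "
    r"cdr:CDR(?P<source>\d+)(?:/(?P<version>\d+))?$"
)

def parse_fetch_line(line):
    """Return (ctgov_id, source_id, version or None), or None if no match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    version = int(m.group("version")) if m.group("version") else None
    return int(m.group("ctgov")), int(m.group("source")), version
```

When the first two numbers are equal, the line describes a new style
conversion that kept the same CDR ID. When they differ, the second
number is the separate InScopeProtocol document.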

This is ready for testing.

Comment entered 2009-12-24 18:12:38 by alan

Attachment Request4720.log has been added with description: Log file from test run on Franck

Comment entered 2009-12-28 11:36:23 by alan

BZDATETIME::2009-12-28 11:36:23
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

(In reply to comment #1)
> ...
> Testing revealed a number of related changes that need to be made
> before running the global:
>
> 1. Add new Citation link control table entries for:
>
> CTGovProtocol//Citation -> Citation
> CTGovProtocol//RelatedCitation -> Citation

I propose to go ahead and do this on Bach and Mahler. I don't
see any possible harm.

> 2. Modify the CTGovProtocol schema to support the attribute
> "PdqKey" in the Citation element.
>
> A possible alternative would be to drop the PdqKey attribute
> from the citations when we add them.

What should I do about this?

On the theory that we never throw away information, and on the
theory that whatever we have in the CTGovProtocol should exactly
duplicate the source data in an InScopeProtocol, at least
initially, I propose to modify the schema to add PdqKey as an
attribute.

Citations are currently defined as instances of "LinkedValue" in
the CTGovProtocol schema. If we do add PdqKey, we have a choice:

a. Add the optional attribute "PdqKey" to all LinkedValues.

Other LinkedValues are:

Condition
Gene
Diagnosis
ExclusionCriteria
PDQPerson
InterventionType
InterventionNameLink
PoliticalSubUnit_State
Country

b. Use the existing "CitationLink" type defined in
CdrCommonBase.

To do this, we'd have to include the CommonBase in the
CTGovProtocol schema - which we don't do now. I assume there
is a reason for that and that conflicts would arise if we
did.

c. Create a new type for this, defined the same way as
CitationLink in CdrCommonBase.

My first inclination is to use (c), defining a new type like
CitationLink, but I don't know if that's the best approach, and I
don't know for sure that it is better than dropping PdqKey in the
global instead of keeping it.

Comment entered 2009-12-28 11:58:06 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-12-28 11:58:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

I'd go with (a), but (c) is OK, too.

Comment entered 2009-12-28 13:16:30 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2009-12-28 13:16:30
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::4

I have looked at the logs and the test results. I did not see anything unusual. The only issues I saw were the ones with the PdqKey attribute, which produced the error messages.

Comment entered 2009-12-28 20:42:43 by alan

BZDATETIME::2009-12-28 20:42:43
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

(In reply to comment #2)
...
> > 1. Add new Citation link control table entries for:
> >
> > CTGovProtocol//Citation -> Citation
> > CTGovProtocol//RelatedCitation -> Citation
>
> I propose to go ahead and do this on Bach and Mahler. I don't
> see any possible harm.
...

It turns out someone already took care of this on Bach. Mahler
and Franck are now up to date with the same link table values.

Comment entered 2009-12-28 20:44:37 by alan

BZDATETIME::2009-12-28 20:44:37
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

(In reply to comment #3)
> I'd go with (a), but (c) is OK, too.

We have a vote from Bob. (a) and (c) both seem OK to me too.

I'll wait for a final decision from Lakshmi or Margaret before
actually making the change.

Comment entered 2010-01-14 13:15:46 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-01-14 13:15:46
BZCOMMENTOR::Bob Kline
BZCOMMENT::7

Lakshmi and Margaret decided to go with option (a): add PdqKey as an optional attribute to

Comment entered 2010-01-14 13:17:26 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-01-14 13:17:26
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

Lakshmi and Margaret decided to go with option (a): add PdqKey as an optional
attribute to the LinkedValue type. [comment got truncated by premature submission]

Comment entered 2010-01-28 23:04:15 by alan

BZDATETIME::2010-01-28 23:04:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

I did the following tonight:

1. Modified the CTGovProtocol schema on all three servers, adding
a second attribute to the type LinkedValue with:

name = "PdqKey"
type = "string"
minOccurs = "0"

This makes PdqKey a legal attribute in Citation links and in
any of the other nine link elements named in comment #2.

The CdrCommonBase schema defines the PdqKey type as
"NotEmptyString". For historical reasons, this turns out to
be the same thing as "string". I didn't want either to
include CdrCommonBase in the CTGovProtocol schema, where it
doesn't belong, or to define a new type of NotEmptyString
within CTGovProtocol. Just using plain string seemed the
least confusing approach.

It was with some trepidation that I updated Bach, but the
change should not break anything and I tested first on Franck
by running a global change (see 2 below.) It seemed to me
that the danger of forgetting to update Bach was greater than
the danger of performing the update.

The revised schema is committed in version control.

2. Ran the Request4720.py global change on Franck.

There were no validation errors. The log file is attached.

The following documents all contained new PdqKey attributes
after running the global:

CDR0000065208
CDR0000066283
CDR0000075692
CDR0000075754
CDR0000077691
CDR0000650230
CDR0000650233
CDR0000650235
CDR0000650236
CDR0000650238
CDR0000650239
CDR0000650242
CDR0000650669
CDR0000650671
CDR0000651985
CDR0000652000
CDR0000652413
CDR0000652420
CDR0000652676
CDR0000652677
CDR0000652678
CDR0000652843
CDR0000652877
CDR0000653659
CDR0000653677
CDR0000656381
CDR0000660151

I suggest that William QC the output of the global (see Runtime:
2010-01-28_21-55-44.) If everything looks okay, the next step is
probably to run in test mode on Bach and then run live.
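
As an aid to QC, a list like the one above could be rebuilt by scanning
each document for elements that carry the new attribute. The following
is only a sketch using the standard library, assuming the attribute
appears unqualified as "PdqKey" in the serialized XML. The function
name elements_with_pdqkey is hypothetical and is not part of
Request4720.py.

```python
import xml.etree.ElementTree as ET

def elements_with_pdqkey(doc_xml):
    """Return [(tag, PdqKey value)] for every element carrying a PdqKey."""
    root = ET.fromstring(doc_xml)
    return [(el.tag, el.get("PdqKey"))
            for el in root.iter()
            if el.get("PdqKey") is not None]
```

Since the attribute was added to the LinkedValue type, it may legally
appear on any of the link elements, not only Citation, so the sketch
reports the element name alongside the value.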

Comment entered 2010-01-28 23:04:15 by alan

Attachment Request4720.log has been added with description: Log file from another test run on Franck

Comment entered 2010-02-04 11:35:22 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-02-04 11:35:22
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::10

(In reply to comment #9)
> Created an attachment (id=1852) [details]
> Log file from another test run on Franck

>
> I suggest that William QC the output of the global (see Runtime:
> 2010-01-28_21-55-44.) If everything looks okay, the next step is
> probably to run in test mode on Bach and then run live.

I have reviewed many records and did not find anything wrong. Please run in test mode on Bach.

Comment entered 2010-02-04 21:32:06 by alan

BZDATETIME::2010-02-04 21:32:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I have run the global in test mode on Bach. Output is in the
directory named "2010-02-04_20-39-40". There were no errors.
One document could not be processed: CDR465522 was checked out
to Bob at the time of the run.

Here is the list of docs that got a PdqKey as an attribute for an
included Citation:

CDR0000065208
CDR0000066283
CDR0000075692
CDR0000075692
CDR0000075692
CDR0000075754
CDR0000075754
CDR0000077691
CDR0000077691
CDR0000650230
CDR0000650233
CDR0000650233
CDR0000650235
CDR0000650235
CDR0000650236
CDR0000650238
CDR0000650239
CDR0000650242
CDR0000650669
CDR0000650671
CDR0000650671
CDR0000651985
CDR0000652000
CDR0000652413
CDR0000652413
CDR0000652413
CDR0000652420
CDR0000652676
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652677
CDR0000652678
CDR0000652843
CDR0000652877
CDR0000653659
CDR0000653677
CDR0000656381
CDR0000660151

The log file is attached.

Comment entered 2010-02-04 21:32:06 by alan

Attachment Request4720.log has been added with description: Log file from test run on Bach

Comment entered 2010-02-05 11:57:50 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-02-05 11:57:50
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::12

I believe it is OK to run in live mode on Bach.

Just to clarify, the protocols in the file are listed irrespective of whether the global will modify the document or not, right? In other words, the records in the file list every document that the global is looking at, not necessarily the ones that would be affected (modified) by the global? I am asking this question because I looked at a few records (InScope) that did not have citations.

Comment entered 2010-02-05 12:36:24 by alan

BZDATETIME::2010-02-05 12:36:24
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

(In reply to comment #12)
> I believe it is OK to run in live mode on Bach.

It took about 25 minutes to run in test mode and will take longer
in live mode. On the theory that people are leaving early today,
I'll start it around 3 pm so as to not interfere too much with
people getting work done, but to finish well before weekly
publishing starts.

Volker:

Note that 304 docs will be published because of this. I assume
that's no problem. Let me know if it is and I'll hold off.

> Just to clarify, the protocols in the file are listed irrespective of whether
> the global will modify the document or not, right? In other words, the records
> in the file list every document that the global is looking at, not necessarily
> the ones that would be affected (modified) by the global? I am asking this
> question because I looked at a few records (InScope) that did not have
> citations.

William,

That's correct. The global looks at every document that was
transferred. The majority (1133) either have no citations or
already had the citations carried over into the CTGov doc, but
I don't find that out until after I run the filter on it.
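
In outline, the selection works like the sketch below: every candidate
is transformed, and only the ones whose output differs from the input
count as modified. This is a simplified illustration and not the real
global change framework. The callables fetch, transform and save, and
the name run_global, are hypothetical stand-ins.

```python
def run_global(doc_ids, fetch, transform, save, live=False):
    """Apply transform to every candidate; save only the ones that change.

    fetch, transform and save are caller-supplied callables (hypothetical).
    Returns (changed_ids, unchanged_ids).
    """
    changed, unchanged = [], []
    for doc_id in doc_ids:
        original = fetch(doc_id)
        updated = transform(original)
        if updated == original:
            # No citations to copy, or they were already carried over.
            unchanged.append(doc_id)
            continue
        changed.append(doc_id)
        if live:
            save(doc_id, updated)  # in this sketch, test mode never saves
    return changed, unchanged
```

This is why the file lists every document examined: whether a document
is affected is only known after the transform has been run on it.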

Comment entered 2010-02-05 15:20:09 by alan

BZDATETIME::2010-02-05 15:20:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

I started the global change in live mode on Bach at 3:18.

I'll post the log file when it's available.

Comment entered 2010-02-05 16:58:04 by alan

BZDATETIME::2010-02-05 16:58:04
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

Here is the log file from today's run.

Comment entered 2010-02-05 16:58:04 by alan

Attachment Request4720b.log has been added with description: Log file for live run on Bach

Comment entered 2010-02-05 16:58:46 by alan

BZDATETIME::2010-02-05 16:58:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

Marking the task as resolved-fixed, ready for final QA.

Comment entered 2010-02-08 18:27:06 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-02-08 18:27:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::17

(In reply to comment #16)
> Marking the task as resolved-fixed, ready for final QA.

I am getting "Document does not conform to DTD and XML Schema" error message for some of the trials. These are mostly the trials in comment # 9. I think it has something to do with the PdqKey attribute.

Comment entered 2010-02-18 23:05:33 by alan

BZDATETIME::2010-02-18 23:05:33
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18

(In reply to comment #17)
> (In reply to comment #16)
> > Marking the task as resolved-fixed, ready for final QA.
>
> I am getting "Document does not conform to DTD and XML Schema" error message
> for some of the trials. These are mostly the trials in comment # 9. I think it
> has something to do with the PdqKey attribute.

William,

This works for me, please test it again.

I think what probably happened is that I may have forgotten
to rebuild the XMetal DTD after updating the schema. So
schema validation would work in the global change program,
but XMetal would get confused. If that's what happened,
I'm guessing that someone else, probably Volker, came along
after me and rebuilt the DTD for some other change, thereby
fixing the problem.

If I'm right, it will now work for you too.

If it doesn't work, please post the CDR ID of a document that
it fails on.

Comment entered 2010-02-23 15:13:43 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-02-23 15:13:43
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::19

(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > Marking the task as resolved-fixed, ready for final QA.
> >
> > I am getting "Document does not conform to DTD and XML Schema" error message
> > for some of the trials. These are mostly the trials in comment # 9. I think it
> > has something to do with the PdqKey attribute.
>
> William,
>
> This works for me, please test it again.
>
> I think what probably happened is that I may have forgotten
> to rebuild the XMetal DTD after updating the schema. So
> schema validation would work in the global change program,
> but XMetal would get confused. If that's what happened,
> I'm guessing that someone else, probably Volker, came along
> after me and rebuilt the DTD for some other change, thereby
> fixing the problem.
>
> If I'm right, it will now work for you too.
>
> If it doesn't work, please post the CDR ID of a document that
> it fails on.

I tried a few more and they all appear to be working. Also, I did not see any problems with the global. I am therefore closing this global. Thank you!

Attachments
File Name Posted User
Request4720.log 2010-02-04 21:32:06
Request4720.log 2010-01-28 23:04:15
Request4720.log 2009-12-24 18:12:38
Request4720b.log 2010-02-05 16:58:04
