Issue Number | 4511 |
---|---|
Summary | Add nct_id attribute to ProtocolRef elements |
Created | 2018-08-07 11:47:55 |
Issue Type | Improvement |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2018-10-29 12:08:13 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.230580 |
(I went ahead with putting this ticket in myself, so we can get the ball rolling.)
We are removing the protocol documents from the CDR. Currently the denormalization filter uses the protocol documents to look up the NCT ID so that it can be exported during publication. Since those documents (and the query_term rows extracted from them) will no longer be available, this task will
add an nct_id
attribute to the schema definition of
the ProtocolRef
element
populate the new attribute with NCT IDs using a global change script
drop the code in the Summary denormalization filter which adds
the nct_id
attribute to the ProtocolRef
element
We have a couple of different approaches for how to handle missing NCT IDs:
remove the ProtocolRef
markup when the NCT ID is not
present, making the nct_id
attribute required (and remove
the template code in the cleanup filter which strips the
ProtocolRef
markup when the nct_id
attribute
is not present); or
make the nct_id
optional, leaving the cleanup filter
code to strip the ProtocolRef
markup when the
nct_id
is empty
We have a couple of options for dealing with the
cdr:href
attributes:
drop them during the global change modification; or
leave the existing attribute, making them optional in the schema
My recommendations are (a) that we make the nct_id
attribute optional, deferring stripping of ProtocolRef
markup until the downstream cleanup filter (thus preserving the existing
comment and other attributes); and (b) that we drop the
cdr:href
attributes from the documents and the schema
(avoiding problems with the link tables for meaningless links to
documents which will no longer exist).
We could have moved everything to ExternalRef
elements,
but that would have a couple of disadvantages:
losing semantic information, no longer identifying the elements as references to protocols; and
losing the other existing attributes (e.g., comments, review information, etc.)
I've marked the ticket as Critical, as it's blocking the work on the database cleanup.
What are your thoughts, ~oseipokuw and ~mbeckwit?
Your recommended approach looks good to me. I do not see any downside to this approach. I like the fact that we get to see the NCT ID in XMetal and the existing comments are preserved. Will the new nct_id attribute be editable? We are currently adding new protocol links as external refs which works fine but there is no way to run reports against that and do periodic reviews of the protocol links. It looks like these changes may allow us to overcome that limitation but that is a conversation for another day. 🙂
Yes, the nct_id
attribute will be editable.
The schema change has been installed on DEV, and the global change was run in test mode.
https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2018-08-07_18-18-41
Some trials (e.g., EORTC-10801) don't have NCT IDs, but the text
content of the ProtocolRef
element should still have enough
evidence to keep looking.
Does this global change take care of all of the links of concern for OCECDR-4478?
Yes, it does. The test results look good. Please run in live mode on DEV.
Live mode has completed on DEV. Please review.
Verified on DEV. Please run in test mode on QA.
It will be good to run a publishing job on QA or STAGE, after testing, so we can review the results and compare with what is currently on Cancer.gov.
Test mode is complete on QA.
Verified. Please run in live mode on QA.
Sure. I have installed the temporary schema, adding the optional
nct_id
attribute to the ProtocolRef
element,
and making the cdr:href
attribute optional. Next steps (I
think) are:
~volker runs a Summary publishing job on QA
I run the global change job on QA in live mode
I install the permanent schema, removing the
cdr:href
attribute from ProtocolRef
I modify the DTDs on CDR QA to remove the href
attribute from the ProtocolRef
element
I install the modifications to CDR0000335424.xml (Denormalization Filter: Summary) on QA
~oseipokuw reviews the results of the live global change on QA
~volker makes whatever comparable changes to the DTD on GateKeeper QA as are necessary
~volker runs a second Summary publishing job on QA and does the before/after comparison
We proceed to the upper tiers
Does this sound right? Ideally, step #1 can be finished in time for me to take care of the next four steps before heading out of town. Is that feasible, ~volker?
(EDIT): I added another step above, adjusting the summary denormalization filter so it no longer tries to map CDR IDs to NCT IDs (now that the documents represented by those CDR IDs are gone).
(SECOND EDIT): I added yet another step for ~oseipokuw to take a look at the results of the live global change on QA.
One little footnote to all of this: bear in mind that since the protocol documents have been removed from QA, all of the ProtocolRef markup will be stripped from the exported summaries by the "before" job. Just didn't want any shocking surprises. :-)
Step #1: Completed.
Steps #2-5 completed.
Over to you, ~oseipokuw, for step #6, then ~volker for #7.
Step # 6 Completed. The changes look good on QA.
After ~volker takes care of steps #7 and #8, and I'm back from vacation, we can
install the interim schema (adding nct_id
) on
STAGE
run the global change job on STAGE in live mode
review the results of the global change
install the permanent schema, removing the cdr:href
attribute from ProtocolRef
modify the DTDs on STAGE (with CBIIT's assistance)
install the modifications to CDR0000335424.xml (Denormalization Filter: Summary) and CDR0000315588.xml (Module: Vendor Cleanup Templates) on STAGE
run a test summary publishing job on STAGE
repeat the above steps on PROD
Actual removal of the protocol documents from the upper tiers will happen separately.
Sound good?
We need to split step 7 into (a) and (b). We need to update the licensee DTD (pdq.dtd) and the WCMS DTD (pdqCG.dtd) as well as the DTD on Gatekeeper. I can take care of that but likely not before the publishing job ran at 4pm.
Should you disable the weekly run on QA, then?
Good point, the job will fail anyway. Maybe I will make the change and manually start the weekly publishing job.
~bkline, the href attribute is now empty. Was that intended?
<ProtocolRef href="" nct_id="NCT00363909">NCCTG-N05C9</ProtocolRef>
When I first implemented the global change, I set the cdr:href attribute to an empty string. But then I changed the code to remove the attribute altogether. Some processors view an empty attribute value as semantically equivalent to the case in which the attribute is absent. But I'm not sure that's true of all processors (including validators). The CDR document no longer has the cdr:href attribute, so it would be interesting to find out where the href attribute is coming from in the vendor output. I'd be inclined to make whatever changes to the publishing filters to make sure the attribute is not inserted when there's no cdr:href attribute.
Found it! It's actually the cleanup filter (CDR315588) that adds the href element. Since the cdr:href was mandatory we didn't have to check if it existed or not and so it sets the href attribute to whatever exists for @href, which is blank.
I can make the change while you're gone.
Where can I find your filter changes?
Created new branch and removed mandatory href attribute from
ProtocolRef element in both DTD files - pdq.dtd and
pdqCG.dtd:
https://github.com/NCIOCPL/cdr-publishing/commit/5737b68
Steps #7,8: Completed.
The diff output looks as expected.
The schema change is on STAGE and the global change has been run in live mode on that tier. ~oseipokuw can review the results of the global change, while ~volker and I take care of replacing the filters and DTDs and running a test summary publishing job on STAGE..
Filters CDR0000335424.xml and CDR0000315588.xml have been replaced on STAGE.
The DTD files have been updated. I'm running a test summary publishing job now.
Run this on PROD after hours some evening next week (first notifying the user) so that the republishing of summaries that happens as a result of the global change coincides with the republishing for OCECDR-4221.
While testing OCECDR-4513, I found out that the QC report wasn't
working for Organization documents with links to deleted person records
because the CIPS Contact Person element is linking to non existent
Person records. Examples
36714, 34786, 29115. I am not sure how many Org. records are in this
situation. It would be helpful to look into this before we move to
PROD.
There are 379 rows in the query_term table on DEV for organizations whose CIPS contact person documents have been removed. Shall we modify the QC report so that it doesn't fail for this condition?
Yes. Thank you!
OCECDR-4536 has been created for this enhancement. Are there any other reasons to hold off on running the job for this ticket (OCECDR-4511) on PROD?
No. We can proceed with this on PROD.
Checking in with you, too, ~volker: if I run the global change on PROD tonight, can the filter and DTD changes get put into place?
Aren't the DTD changes part of the ticket to CBIIT?
The schema modification and the global change don't involve CBIIT. Are you thinking of the ticket I gave them to patch the publishing software (for another JIRA issue) to enable hotfix/remove jobs?
Ah, yes! That's what I was thinking about.
I guess we'll have to submit another ticket for CBIIT.
So should I hold off on running the global change?
Go ahead and run it. I'll submit a ticket to have the DTD replaced before Friday evening.
~volker will put in a ticket this afternoon for CBIIT to install a new version of the DTD on PROD (and ask Cameron to help make sure it gets done in time for tomorrow's publishing job) and Bob will run the global change on PROD tomorrow morning.
DTD changes are in place. The global change job has been started on PROD in live mode.
~volker: Looking back over the comments for this ticket I see that some filter changes are involved as well. Could you make sure that the filters are as they should be on PROD for this ticket?
Thanks!
I have messed things up. I had forgotten that there is a schema change which should have been installed before the global change. In fact, the way it was supposed to work, a temporary schema change was to have been made, making the ProtocolRef's cdr:ref attribute optional and adding the nct_id attribute as also optional. Then after the global change was run, the permanent schema change would remove the cdr:ref attribute. So the attempts to save new publishable versions for the summaries resulted in unpublishable summaries. After the global change finishes, I will install the new schema (with the nct_id attribute and without the cdr:ref attribute) and then I will create a script which identifies those versions and re-save them as really publishable. Will keep you posted. I'll keep my eyes out for any summaries which are locked and have to be handled manually (I'm hoping there won't be too many of those).
Thanks Bob! I don't think many summaries would be locked either. The only problem this is causing now is preventing PP from running successfully for some summaries so, we will wait for you to install the new schema before running PP.
I've updated the filter change I made to the Vendor filter. Did you want me to take care of your filter changes as well or did you already copy those?
Do you want me to take care of your filter changes ...
Yes, please.
The following filters have been copied to PROD:
CDR315588: Module: Vendor Cleanup Templates
https://github.com/NCIOCPL/cdr-server/commit/d84ea47
CDR335424: Denormalization Filter: Summary
https://github.com/NCIOCPL/cdr-server/commit/2ec4255
CBIIT has copied the DTD files for PROD to the CDR server:
pdq.dtd, pdqCG.dtd
https://github.com/NCIOCPL/cdr-publishing/commit/5737b68
CBIIT has copied the Gatekeeper DTD to PROD as well:
pdq.dtd (copy of pdqCG.dtd)
https://github.com/NCIOCPL/wcms-gatekeeper/commit/1775ec89
All of the summaries which had cdr:ref attributes in ProtocolRef elements have been processed. The schema is still in its intermediate state (allowing but not requiring the cdr:ref attributes). On Monday I will take care of installing the permanent schema change, dropping the cdr:ref attribute from the ProtocolRef element.
The schema change replacing the cdr:ref
attribute with
an nct_id
attribute on ProtocolRef
elements
has been install on the production server. The change will show up in
XMetaL the next time you launch it. Please verify that the summaries are
as they should be and if so close this ticket. The modified summaries
were published successfully over the weekend.
It looks like some of the older versions of the affected summaries still have the DTD errors when you try to retrieve them. For example CDR0000062896 - Older versions like 593 and a lot more all come up with the DTD error. Another example is CDR0000062910 - Version 501.
Well, yes, we didn't transform every version of every summary. The global change harness only ensures that the most recent version, the most recent publishable version (if different from the latest version), and the CWD conform to the changes requested for the schema. There has never been a requirement that the DTDs loaded into XMetaL somehow roll themselves back to earlier versions on the user's workstation when she loads an older version of a document created against a different set of validation requirements.
In the past when a change would invalidate previous versions, we did not delete the element from the schema. We made it redundant. We probably should have done the same thing here.
It's an attribute, not an element, but yes, we can do that, assuming ~mbeckwit or ~juther have no objections to giving up the constraints which would ensure the documents don't have obsolete data.
I'm fine with this. We need to be able to pull up earlier versions of the summaries.
I have installed the more permissive version of the schema on DEV. Please test to make sure everything is as you want it and nothing else is broken, and then we can march up the tiers.
We've looked at a few summaries on DEV and all checked out fine without any problems. We also check PP and QC reports. There were no problems either.
And you looked at older versions?
Yes. Several older versions.
Installed on QA.
Verified on QA. However, PP appears to be missing all its formatting on QA. No left/right navigating bars etc. Not sure if it is related to this change.
I cannot confirm a problem with PP on QA. Are you certain the browser window is big enough and you're not displaying the tablet version of the report?
Yes, the screen size is big enough.
The request for installing the more permissive schema can be tracked
in OCECDR-4539, and the publish preview problems are very unlikely to
have been caused by the ProtocolRef
attribute juggling
tracked by this ticket. Let's keep this ticket focused on the original
request to add nct_id
attributes to the
ProtocolRef
elements in the summary documents.
How are you running the PP report? Could you paste the URL for the report?
I agree. ~oseipokuw, you may want to submit a new ticket for the PP issue.
Will do.
It should now be okay to install the changes on PROD.
Verified on PROD. Thanks!
File Name | Posted | User |
---|---|---|
DTD error.JPG | 2018-10-29 19:44:55 | Osei-Poku, William (NIH/NCI) [C] |
PP_withoutNAV.JPG | 2018-11-01 11:21:41 | Osei-Poku, William (NIH/NCI) [C] |
Elapsed: 0:00:00.001759