CDR Tickets

Issue Number 4511
Summary Add nct_id attribute to ProtocolRef elements
Created 2018-08-07 11:47:55
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2018-10-29 12:08:13
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.230580
Description

(I went ahead with putting this ticket in myself, so we can get the ball rolling.)

We are removing the protocol documents from the CDR. Currently the denormalization filter uses the protocol documents to look up the NCT ID so that it can be exported during publication. Since those documents (and the query_term rows extracted from them) will no longer be available, this task will

  1. add an nct_id attribute to the schema definition of the ProtocolRef element

  2. populate the new attribute with NCT IDs using a global change script

  3. drop the code in the Summary denormalization filter which adds the nct_id attribute to the ProtocolRef element

We have a couple of different approaches for how to handle missing NCT IDs:

  1. remove the ProtocolRef markup when the NCT ID is not present, making the nct_id attribute required (and remove the template code in the cleanup filter which strips the ProtocolRef markup when the nct_id attribute is not present); or

  2. make the nct_id optional, leaving the cleanup filter code to strip the ProtocolRef markup when the nct_id is empty

We have a couple of options for dealing with the cdr:href attributes:

  1. drop them during the global change modification; or

  2. leave the existing attribute, making them optional in the schema

My recommendations are (a) that we make the nct_id attribute optional, deferring stripping of ProtocolRef markup until the downstream cleanup filter (thus preserving the existing comment and other attributes); and (b) that we drop the cdr:href attributes from the documents and the schema (avoiding problems with the link tables for meaningless links to documents which will no longer exist).

We could have moved everything to ExternalRef elements, but that would have a couple of disadvantages:

  1. losing semantic information, no longer identifying the elements as references to protocols; and

  2. losing the other existing attributes (e.g., comments, review information, etc.)

I've marked the ticket as Critical, as it's blocking the work on the database cleanup.

What are your thoughts, and ?

Comment entered 2018-08-07 13:33:38 by Osei-Poku, William (NIH/NCI) [C]

Your recommended approach looks good to me. I do not see any downside to this approach. I like the fact that we get to see the NCT ID in XMetal and the existing comments are preserved. Will the new nct_id attribute be editable? We are currently adding new protocol links as external refs which works fine but there is no way to run reports against that and do periodic reviews of the protocol links. It looks like these changes may allow us to overcome that limitation but that is a conversation for another day. 🙂

Comment entered 2018-08-07 14:34:59 by Kline, Bob (NIH/NCI) [C]

Yes, the nct_id attribute will be editable.

Comment entered 2018-08-08 07:57:32 by Kline, Bob (NIH/NCI) [C]

The schema change has been installed on DEV, and the global change was run in test mode.

https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2018-08-07_18-18-41

Some trials (e.g., EORTC-10801) don't have NCT IDs, but the text content of the ProtocolRef element should still have enough evidence to keep looking.

Does this global change take care of all of the links of concern for OCECDR-4478?

Comment entered 2018-08-08 12:18:26 by Osei-Poku, William (NIH/NCI) [C]

Yes, it does. The test results look good. Please run in live mode on DEV.

Comment entered 2018-08-08 15:21:16 by Kline, Bob (NIH/NCI) [C]

Live mode has completed on DEV. Please review.

Comment entered 2018-08-09 11:51:46 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Please run in test mode on QA.

It will be good to run a publishing job on QA or STAGE, after testing, so we can review the results and compare with what is currently on Cancer.gov.

Comment entered 2018-08-09 15:31:16 by Kline, Bob (NIH/NCI) [C]

Test mode is complete on QA.

Comment entered 2018-08-09 16:20:38 by Osei-Poku, William (NIH/NCI) [C]

Verified. Please run in live mode on QA.

Comment entered 2018-08-09 16:36:45 by Kline, Bob (NIH/NCI) [C]

Sure. I have installed the temporary schema, adding the optional nct_id attribute to the ProtocolRef element, and making the cdr:href attribute optional. Next steps (I think) are:

  1. runs a Summary publishing job on QA

  2. I run the global change job on QA in live mode

  3. I install the permanent schema, removing the cdr:href attribute from ProtocolRef

  4. I modify the DTDs on CDR QA to remove the href attribute from the ProtocolRef element

  5. I install the modifications to CDR0000335424.xml (Denormalization Filter: Summary) on QA

  6. reviews the results of the live global change on QA

  7. makes whatever comparable changes to the DTD on GateKeeper QA as are necessary

  8. runs a second Summary publishing job on QA and does the before/after comparison

  9. We proceed to the upper tiers

Does this sound right? Ideally, step #1 can be finished in time for me to take care of the next four steps before heading out of town. Is that feasible, ?

(EDIT): I added another step above, adjusting the summary denormalization filter so it no longer tries to map CDR IDs to NCT IDs (now that the documents represented by those CDR IDs are gone).

(SECOND EDIT): I added yet another step for to take a look at the results of the live global change on QA.

Comment entered 2018-08-09 16:39:41 by Kline, Bob (NIH/NCI) [C]

One little footnote to all of this: bear in mind that since the protocol documents have been removed from QA, all of the ProtocolRef markup will be stripped from the exported summaries by the "before" job. Just didn't want any shocking surprises. :-)

Comment entered 2018-08-09 17:05:03 by Englisch, Volker (NIH/NCI) [C]

Step #1: Completed.

Comment entered 2018-08-09 17:28:51 by Kline, Bob (NIH/NCI) [C]

Steps #2-5 completed.

Over to you, , for step #6, then for #7.

Comment entered 2018-08-10 09:31:54 by Osei-Poku, William (NIH/NCI) [C]

Step # 6 Completed. The changes look good on QA.

Comment entered 2018-08-10 15:02:12 by Kline, Bob (NIH/NCI) [C]

After takes care of steps #7 and #8, and I'm back from vacation, we can

  1. install the interim schema (adding nct_id) on STAGE

  2. run the global change job on STAGE in live mode

  3. review the results of the global change

  4. install the permanent schema, removing the cdr:href attribute from ProtocolRef

  5. modify the DTDs on STAGE (with CBIIT's assistance)

  6. install the modifications to CDR0000335424.xml (Denormalization Filter: Summary) and CDR0000315588.xml (Module: Vendor Cleanup Templates) on STAGE

  7. run a test summary publishing job on STAGE

  8. repeat the above steps on PROD

Actual removal of the protocol documents from the upper tiers will happen separately.

Sound good?

Comment entered 2018-08-10 15:16:29 by Englisch, Volker (NIH/NCI) [C]

We need to split step 7 into (a) and (b). We need to update the licensee DTD (pdq.dtd) and the WCMS DTD (pdqCG.dtd) as well as the DTD on Gatekeeper. I can take care of that but likely not before the publishing job ran at 4pm.

Comment entered 2018-08-10 15:25:26 by Kline, Bob (NIH/NCI) [C]

Should you disable the weekly run on QA, then?

Comment entered 2018-08-10 15:29:06 by Englisch, Volker (NIH/NCI) [C]

Good point, the job will fail anyway. Maybe I will make the change and manually start the weekly publishing job.

Comment entered 2018-08-10 15:47:45 by Englisch, Volker (NIH/NCI) [C]

, the href attribute is now empty. Was that intended?

<ProtocolRef href="" nct_id="NCT00363909">NCCTG-N05C9</ProtocolRef>
Comment entered 2018-08-10 16:29:36 by Kline, Bob (NIH/NCI) [C]

When I first implemented the global change, I set the cdr:href attribute to an empty string. But then I changed the code to remove the attribute altogether. Some processors view an empty attribute value as semantically equivalent to the case in which the attribute is absent. But I'm not sure that's true of all processors (including validators). The CDR document no longer has the cdr:href attribute, so it would be interesting to find out where the href attribute is coming from in the vendor output. I'd be inclined to make whatever changes to the publishing filters to make sure the attribute is not inserted when there's no cdr:href attribute.

Comment entered 2018-08-10 16:44:56 by Englisch, Volker (NIH/NCI) [C]

Found it! It's actually the cleanup filter (CDR315588) that adds the href element. Since the cdr:href was mandatory we didn't have to check if it existed or not and so it sets the href attribute to whatever exists for @href, which is blank.

I can make the change while you're gone.
Where can I find your filter changes?

Comment entered 2018-08-10 16:48:33 by Kline, Bob (NIH/NCI) [C]
Comment entered 2018-08-13 16:39:13 by Englisch, Volker (NIH/NCI) [C]

Created new branch and removed mandatory href attribute from ProtocolRef element in both DTD files - pdq.dtd and pdqCG.dtd:
https://github.com/NCIOCPL/cdr-publishing/commit/5737b68

Comment entered 2018-08-13 17:34:30 by Englisch, Volker (NIH/NCI) [C]

Steps #7,8: Completed.

The diff output looks as expected.

Comment entered 2018-08-21 13:25:57 by Kline, Bob (NIH/NCI) [C]

The schema change is on STAGE and the global change has been run in live mode on that tier. can review the results of the global change, while and I take care of replacing the filters and DTDs and running a test summary publishing job on STAGE..

Comment entered 2018-08-21 13:40:28 by Kline, Bob (NIH/NCI) [C]

Filters CDR0000335424.xml and CDR0000315588.xml have been replaced on STAGE.

Comment entered 2018-08-21 15:36:14 by Englisch, Volker (NIH/NCI) [C]

The DTD files have been updated. I'm running a test summary publishing job now.

Comment entered 2018-08-23 14:14:48 by Kline, Bob (NIH/NCI) [C]

Run this on PROD after hours some evening next week (first notifying the user) so that the republishing of summaries that happens as a result of the global change coincides with the republishing for OCECDR-4221.

Comment entered 2018-09-12 16:03:58 by Osei-Poku, William (NIH/NCI) [C]

While testing OCECDR-4513, I found out that the QC report wasn't working for Organization documents with links to deleted person records because the CIPS Contact Person element is linking to non existent Person records. Examples
36714, 34786, 29115. I am not sure how many Org. records are in this situation. It would be helpful to look into this before we move to PROD.

Comment entered 2018-10-15 11:15:18 by Kline, Bob (NIH/NCI) [C]

There are 379 rows in the query_term table on DEV for organizations whose CIPS contact person documents have been removed. Shall we modify the QC report so that it doesn't fail for this condition?

Comment entered 2018-10-15 11:40:33 by Osei-Poku, William (NIH/NCI) [C]

Yes. Thank you!

Comment entered 2018-10-15 11:49:25 by Kline, Bob (NIH/NCI) [C]

OCECDR-4536 has been created for this enhancement. Are there any other reasons to hold off on running the job for this ticket (OCECDR-4511) on PROD?

Comment entered 2018-10-16 14:40:32 by Osei-Poku, William (NIH/NCI) [C]

No. We can proceed with this on PROD.

Comment entered 2018-10-16 17:16:48 by Kline, Bob (NIH/NCI) [C]

Checking in with you, too, : if I run the global change on PROD tonight, can the filter and DTD changes get put into place?

Comment entered 2018-10-16 17:46:10 by Englisch, Volker (NIH/NCI) [C]

Aren't the DTD changes part of the ticket to CBIIT?

Comment entered 2018-10-16 17:59:54 by Kline, Bob (NIH/NCI) [C]

The schema modification and the global change don't involve CBIIT. Are you thinking of the ticket I gave them to patch the publishing software (for another JIRA issue) to enable hotfix/remove jobs?

Comment entered 2018-10-16 18:14:21 by Englisch, Volker (NIH/NCI) [C]

Ah, yes! That's what I was thinking about.

I guess we'll have to submit another ticket for CBIIT.

Comment entered 2018-10-16 19:17:21 by Kline, Bob (NIH/NCI) [C]

So should I hold off on running the global change?

Comment entered 2018-10-17 11:54:00 by Englisch, Volker (NIH/NCI) [C]

Go ahead and run it. I'll submit a ticket to have the DTD replaced before Friday evening.

Comment entered 2018-10-25 13:58:40 by Kline, Bob (NIH/NCI) [C]

will put in a ticket this afternoon for CBIIT to install a new version of the DTD on PROD (and ask Cameron to help make sure it gets done in time for tomorrow's publishing job) and Bob will run the global change on PROD tomorrow morning.

Comment entered 2018-10-26 08:45:42 by Kline, Bob (NIH/NCI) [C]

DTD changes are in place. The global change job has been started on PROD in live mode.

Comment entered 2018-10-26 08:49:38 by Kline, Bob (NIH/NCI) [C]

: Looking back over the comments for this ticket I see that some filter changes are involved as well. Could you make sure that the filters are as they should be on PROD for this ticket?

Thanks!

Comment entered 2018-10-26 09:29:02 by Kline, Bob (NIH/NCI) [C]

I have messed things up. I had forgotten that there is a schema change which should have been installed before the global change. In fact, the way it was supposed to work, a temporary schema change was to have been made, making the ProtocolRef's cdr:ref attribute optional and adding the nct_id attribute as also optional. Then after the global change was run, the permanent schema change would remove the cdr:ref attribute. So the attempts to save new publishable versions for the summaries resulted in unpublishable summaries. After the global change finishes, I will install the new schema (with the nct_id attribute and without the cdr:ref attribute) and then I will create a script which identifies those versions and re-save them as really publishable. Will keep you posted. I'll keep my eyes out for any summaries which are locked and have to be handled manually (I'm hoping there won't be too many of those).

Comment entered 2018-10-26 09:57:22 by Osei-Poku, William (NIH/NCI) [C]

Thanks Bob! I don't think many summaries would be locked either. The only problem this is causing now is preventing PP from running successfully for some summaries so, we will wait for you to install the new schema before running PP.

Comment entered 2018-10-26 13:14:58 by Englisch, Volker (NIH/NCI) [C]

I've updated the filter change I made to the Vendor filter. Did you want me to take care of your filter changes as well or did you already copy those?

Comment entered 2018-10-26 13:18:05 by Kline, Bob (NIH/NCI) [C]

Do you want me to take care of your filter changes ...

Yes, please.

Comment entered 2018-10-26 13:52:35 by Englisch, Volker (NIH/NCI) [C]

The following filters have been copied to PROD:

Comment entered 2018-10-26 14:11:51 by Englisch, Volker (NIH/NCI) [C]

CBIIT has copied the DTD files for PROD to the CDR server:

Comment entered 2018-10-26 14:32:20 by Englisch, Volker (NIH/NCI) [C]

CBIIT has copied the Gatekeeper DTD to PROD as well:

Comment entered 2018-10-26 17:03:30 by Kline, Bob (NIH/NCI) [C]

All of the summaries which had cdr:ref attributes in ProtocolRef elements have been processed. The schema is still in its intermediate state (allowing but not requiring the cdr:ref attributes). On Monday I will take care of installing the permanent schema change, dropping the cdr:ref attribute from the ProtocolRef element.

Comment entered 2018-10-29 12:08:13 by Kline, Bob (NIH/NCI) [C]

The schema change replacing the cdr:ref attribute with an nct_id attribute on ProtocolRef elements has been install on the production server. The change will show up in XMetaL the next time you launch it. Please verify that the summaries are as they should be and if so close this ticket. The modified summaries were published successfully over the weekend.

Comment entered 2018-10-29 19:45:56 by Osei-Poku, William (NIH/NCI) [C]

It looks like some of the older versions of the affected summaries still have the DTD errors when you try to retrieve them. For example CDR0000062896 - Older versions like 593 and a lot more all come up with the DTD error. Another example is CDR0000062910 - Version 501.

Comment entered 2018-10-29 20:10:13 by Kline, Bob (NIH/NCI) [C]

Well, yes, we didn't transform every version of every summary. The global change harness only ensures that the most recent version, the most recent publishable version (if different from the latest version), and the CWD conform to the changes requested for the schema. There has never been a requirement that the DTDs loaded into XMetaL somehow roll themselves back to earlier versions on the user's workstation when she loads an older version of a document created against a different set of validation requirements.

Comment entered 2018-10-30 07:56:55 by Osei-Poku, William (NIH/NCI) [C]

In the past when a change would invalidate previous versions, we did not delete the element from the schema. We made it redundant. We probably should have done the same thing here.

Comment entered 2018-10-30 14:13:06 by Kline, Bob (NIH/NCI) [C]

It's an attribute, not an element, but yes, we can do that, assuming or have no objections to giving up the constraints which would ensure the documents don't have obsolete data.

Comment entered 2018-10-31 11:49:58 by Juthe, Robin (NIH/NCI) [E]

I'm fine with this. We need to be able to pull up earlier versions of the summaries.

Comment entered 2018-10-31 12:37:57 by Kline, Bob (NIH/NCI) [C]

I have installed the more permissive version of the schema on DEV. Please test to make sure everything is as you want it and nothing else is broken, and then we can march up the tiers.

Comment entered 2018-10-31 13:42:41 by Osei-Poku, William (NIH/NCI) [C]

We've looked at a few summaries on DEV and all checked out fine without any problems. We also check PP and QC reports. There were no problems either.

Comment entered 2018-10-31 14:14:24 by Kline, Bob (NIH/NCI) [C]

And you looked at older versions?

Comment entered 2018-10-31 14:16:52 by Osei-Poku, William (NIH/NCI) [C]

Yes. Several older versions.

Comment entered 2018-10-31 17:32:30 by Kline, Bob (NIH/NCI) [C]

Installed on QA.

Comment entered 2018-11-01 10:42:58 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. However, PP appears to be missing all its formatting on QA. No left/right navigating bars etc. Not sure if it is related to this change.

Comment entered 2018-11-01 11:12:45 by Englisch, Volker (NIH/NCI) [C]

I cannot confirm a problem with PP on QA. Are you certain the browser window is big enough and you're not displaying the tablet version of the report?

Comment entered 2018-11-01 11:22:01 by Osei-Poku, William (NIH/NCI) [C]

Yes, the screen size is big enough.

Comment entered 2018-11-01 11:37:42 by Kline, Bob (NIH/NCI) [C]

The request for installing the more permissive schema can be tracked in OCECDR-4539, and the publish preview problems are very unlikely to have been caused by the ProtocolRef attribute juggling tracked by this ticket. Let's keep this ticket focused on the original request to add nct_id attributes to the ProtocolRef elements in the summary documents.

Comment entered 2018-11-01 11:39:53 by Englisch, Volker (NIH/NCI) [C]

How are you running the PP report? Could you paste the URL for the report?

Comment entered 2018-11-01 11:40:49 by Englisch, Volker (NIH/NCI) [C]

I agree. , you may want to submit a new ticket for the PP issue.

Comment entered 2018-11-01 11:43:44 by Osei-Poku, William (NIH/NCI) [C]

Will do.

Comment entered 2018-11-01 11:44:17 by Osei-Poku, William (NIH/NCI) [C]

It should now be okay to install the changes on PROD.

Comment entered 2018-11-13 16:34:24 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Attachments
File Name Posted User
DTD error.JPG 2018-10-29 19:44:55 Osei-Poku, William (NIH/NCI) [C]
PP_withoutNAV.JPG 2018-11-01 11:21:41 Osei-Poku, William (NIH/NCI) [C]

Elapsed: 0:00:00.001759