PDQ Issues

Issue Number	5097
Summary	Redirect links to SVPC summaries in documents sent to the data partners
Created	2022-01-26 12:19:45
Issue Type	Improvement
Submitted By	Kline, Bob (NIH/NCI) [C]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2022-02-16 21:46:27
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.309693

Description

We have a new requirement for the filters which assemble the summaries sent to the data partners to modify links to SVPC content so that those links point instead to the documents the partners get. So, for example, if legacy document CDR1000 is replaced on the web site by SVPC nodes for CDR1001, CDR1002, and CDR1003, all the links to CDR1001, CDR1002, and CDR1003 will need to be replaced with links to CDR1000. This will need to happen not just in partner-only summary documents but every document sent to the partners.

One approach would be to expose a new custom XSL/T callback function which takes the CDR ID of a link target and returns

the CDR ID of the legacy document which the link target replaces if the target is an SVPC document
the link target passed to the function if the link target is not an SVPC document, but is a published CDR document
an empty string if the link target is an SVPC document, but is not itself included by any vendor-only summary document

The third choice handles the edge case in which an SVPC document is created, linked by other content, but not yet included by a vendor-only summary document. In theory, the users should not allow this to happen, but there is nothing in the CDR software to prevent this condition.

The filter which processes the documents which are sent to the PDQ data partners identifies all of the links which might be to summaries and invokes this new callback function so that it can modify or remove the links as appropriate.

Another approach might be to pass in a map of SVPC summary IDs to partner document IDs to the filter. This would eliminate the need to implement a new XSL/T callback function, and would alleviate some of the performance hit that this new requirement will probably introduce, but it might increase the complexity of the XSL/T code somewhat, though not to a great extent, I don't think, assuming the logic to do the lookup is encapsulated in a named template. In fact, it might be possible to wrap all of the link transformation into a single named template. I'll leave those decisions to the assignee for this ticket. 😉

Comment entered 2022-01-26 13:18:56 by Englisch, Volker (NIH/NCI) [C]

We will also need to think about how to handle links from legacy documents to partner documents, don't we? Or will those not exist because they will have been manually corrected to point to SVPC docs?

Comment entered 2022-01-26 13:49:51 by Kline, Bob (NIH/NCI) [C]

That's a data problem, to be addressed by the users, not the software.

Comment entered 2022-01-26 17:42:33 by Osei-Poku, William (NIH/NCI) [C]

Yes, this is part of the plan. We will have to clean up the links (in other legacy documents pointing to the the partner documents (and vice versa))prior to taking the legacy documents off cancer.gov, and make decisions as to whether the links go to the new SVPC summaries or not. We go through this process each time a summary needs to be blocked and removed from cancer.gov.

Comment entered 2022-02-01 16:17:13 by Englisch, Volker (NIH/NCI) [C]

Question for ~mbeckwit and ~juther:

I will continue with Bob's example from above where he presented the following scenario:

CDR1000 - Partner document
CDR1001, CDR1002, CDR1003 - SVPC document set

As mentioned in the description, if there is a legacy document that is linking to one of the SVPC documents, the filter needs to replace that link to link instead to the Partner ID because our PDQ partners will not receive the SVPC documents

CDR5000 - Legacy document
CDR5000 links to CDR1003
This will need to change to
CDR5000 links to CDR1000

The software can easily search in the CDR to find the "parent" (a.k.a. partner document) for CDR1003 as long as there is a one-to-one relationship and the SVPC can only have one parent. If we have multiple legacy documents linking to a single SVPC we won't know which would be the correct partner document to link to.

CDR2000 - Partner document
CDR2001, CDR2002, CDR1003 - SVPC document set

CDR5000 (legacy) links to CDR1003: Parent doc could be CDR1000 or CDR2000
CDR6000 (legacy) links to CDR1003: Parent doc could be CDR1000 or CDR2000

How should we handle a link from a legacy document to a SVPC document? Can we assume that a SVPC document will only be included within one partner document?

Comment entered 2022-02-03 11:57:12 by Englisch, Volker (NIH/NCI) [C]

Can we assume that a SVPC document will only be included within one partner document?

The answer to that question is: No! I can see that the SVPC document "What is Liver Cancer?" is included in all three partner documents for treatment, prevention, and screening. So, we will need to figure out how to handle these type of links in the partner output.

Comment entered 2022-02-03 12:14:28 by Kline, Bob (NIH/NCI) [C]

Good catch!

Comment entered 2022-02-08 19:12:39 by Englisch, Volker (NIH/NCI) [C]

I have modified the following filter to convert SummaryRef elements pointing to a SVPC document to ExternalRef elements:

CDR609947 - Vendor Filter: Convert CG to Public Data

I have tested the situation of legacy documents pointing to a SVPC document. ~oseipokuw since I have blocked all test versions of SVPC documents would it be possible to add some SummaryRef links for SVPC to SVPC links for testing or should I instead unblock some of those test documents?

Comment entered 2022-02-08 21:03:48 by Osei-Poku, William (NIH/NCI) [C]

Hi ~volker, Yes please unblock the test documents since the real SVPC documents have been finalized and are ready for client review. We cannot used them for testing anymore. We may have to create a test partner document as well. If you let me know when I can start testing, I will create the test documents if you haven't created them already.

Comment entered 2022-02-16 21:46:08 by Englisch, Volker (NIH/NCI) [C]

The following filter has been modified in order to convert the SummaryRef elements pointing to a SVPC document to an ExternalRef element:

CDR609947.xml - Vendor Filter: Convert CG to Public Data
https://github.com/NCIOCPL/cdr-server/commit/f9394d557

I have tested all combination that I could think of (Legacy to SVPC, Legacy to Legacy, SVPC to Legacy, SVPC to SCPC, etc. ) and everything looks good on DEV. More tests will be run on the QA server as part of our publishing jobs.

Comment entered 2022-02-23 14:15:27 by Osei-Poku, William (NIH/NCI) [C]

Hi ~volker I generated the vendor filter output for this partner document -CDR0000805713 and I see a few SummaryRref elements in the form

<SummaryRef href="CDR0000805710" url="/types/liver/what-is-liver-cancer/diagnosis">Liver Cancer Diagnosis</SummaryRef>

Should I be looking for ExternalRefs with the full URL path in those cases?

Comment entered 2022-02-23 14:42:48 by Englisch, Volker (NIH/NCI) [C]

As I mentioned earlier, creating the XML Vendor output is not sufficient because the vendor output is what's created for Cancer.gov output. This XML output needs to be transformed one more time with the filter CDR609947 (Vendor Filter: Convert CG to Public Data). This will produce the XML output that gets copied to the SFTP server for our PDQ partners and this is the output that has the SummaryRef elements to SVPC documents converted to ExternalRef elements.

Comment entered 2022-02-23 15:14:01 by Osei-Poku, William (NIH/NCI) [C]

Thanks for the clarification, Volker! I used the above parameters and still see at least 2 summary refs in the XML which are not external refs. Are they OK to be Summary Refs?

The one I reported earlier is no longer there, as it seems to have been transformed into external refs as expected.

Comment entered 2022-02-23 15:31:07 by Englisch, Volker (NIH/NCI) [C]

Yes, you will still see SummaryRefs. The filter only converts those SummaryRefs that are pointing to SVPC documents since those documents are not part of the PDQ summary set of documents. If the SummaryRef points to a legacy summary - a document that is part of the set of documents provided to the PDQ partners - there is no need to convert the element. The partner can continue to use those links just as before.

Comment entered 2022-02-23 15:48:17 by Osei-Poku, William (NIH/NCI) [C]

OK. Thanks for the explanation, Volker. My follow up question is, the URLs to the legacy documents on cancer.gov will no longer be available (on cancer.gov) as the summaries will be removed from cancer.gov (that is what I have heard talked about). Is it still OK to provide those URLs ?

Comment entered 2022-02-23 16:13:33 by Englisch, Volker (NIH/NCI) [C]

The weekly publishing job checks the status of the legacy document that is being linked to. As long as the target document exists it's OK to link to it. Once it has been removed from Cancer.gov the filter will recognize the change, adjust URL and convert the SummaryRef to an ExternalRef. In short, yes, it's OK to list those URLs.

Comment entered 2022-02-23 16:30:23 by Osei-Poku, William (NIH/NCI) [C]

OK. Thanks! Will it be the same for the SummaryURL? Or we need to change that manually?

Comment entered 2022-02-23 17:37:12 by Englisch, Volker (NIH/NCI) [C]

The SummaryRef's href attribute points to a summary within the CDR via its CDR-ID. That document has its own assigned SummaryURL to which we can link. When that document happens to be a document that the PDQ partners don't receive we won't link to it via the CDR-ID but via its URL.

The SummaryURL does not link to anything. Its xref attribute tells the user how to reach it (and specifies the document location within the Drupal hierarchy). When this document disappears from Cancer.gov (because it turned into a partner document), we have no way of retrieving an alternate URL to be used like we do with the SummaryRef. Therefore, the SummaryURL is an element that will have to be maintained manually as we had discussed earlier.

Comment entered 2022-02-23 18:03:12 by Osei-Poku, William (NIH/NCI) [C]

Therefore, the SummaryURL is an element that will have to be maintained manually as we had discussed earlier.

I do have questions about when the URL needs to be updated manually and to what new URL to use. I will ask these questions during the standup next week. Thanks!

Comment entered 2022-02-23 18:04:01 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thanks!

Comment entered 2022-02-23 18:29:16 by Englisch, Volker (NIH/NCI) [C]

I can only answer the "when": The URL needs to be updated initially when the target document doesn't exist on Cancer.gov anymore. That would be when the legacy document gets replaced by SVPC summaries.

I'll leave the "what" for others to answer.

Comment entered 2022-02-23 19:40:48 by Osei-Poku, William (NIH/NCI) [C]

The weekly publishing job checks the status of the legacy document that is being linked to. As long as the target document exists it's OK to link to it. Once it has been removed from Cancer.gov the filter will recognize the change, adjust URL and convert the SummaryRef to an ExternalRef. In short, yes, it's OK to list those URLs.

Which of the summary URLs will the filter adjust the URL to? It looks like the new adjusted URL will be the ideal URL to use for the Summary URL.

Comment entered 2022-02-24 11:39:10 by Kline, Bob (NIH/NCI) [C]

Is this question still hanging? (I'm looking for confirmation that all of the Oersted loose ends have been resolved so that I can deploy the release to QA.)

Comment entered 2022-02-24 12:19:33 by Englisch, Volker (NIH/NCI) [C]

~oseipokuw , it seems I wasn't clear with my explanation of why the SummaryURL element has to be maintained manually, so let me try again:

This ticket addresses the issue of a SummaryRef pointing to a SVPC document. A SummaryRef element contains the href attribute, which is a CDR-ID:

<SummaryRef cdr:href="CDR0000062893"/>

The filter checks that summary and identifies that it is a summary that's not part of the PDQ Partner document set, therefore a SummaryRef link isn't really appropriate to be used but we do know how to access this document, not via the CDR-ID but via it's URL. The filter can pick up the URL from the SummaryURL element of the document CDR62893 and instead of presenting a "CDR link" replace it with a URL (ExternalRef) for that document.

Now let's take the SummaryURL element of a "partner" document (i.e. CDR1000) with it's xref attribute containing the URL to itself:

<SummaryURL cdr:xref="https://www.cancer.gov/somwhere/on/cg"/>

Let's assume the filter could identify if the link is dead and you want to replace it. What information would you use to replace this URL? You can find the CDR-ID of the document containing this dead URL but you already know the CDR-ID because it's the document you're processing. It's CDR1000. You would need another identifier that could indicate something like: "If the URL for CDR1000 is dead you need to go to CDR1002 for its replacement."

We don't have anything like that and therefore any update of the SummaryURL attribute is a manual process.

Comment entered 2022-02-24 12:25:01 by Osei-Poku, William (NIH/NCI) [C]

~bkline You may proceed to deploy on QA. I have already marked this ticket as DEV verified and the answer to my question should not change the status of this ticket.

Comment entered 2022-02-24 12:37:38 by Osei-Poku, William (NIH/NCI) [C]

Thanks ~volker I understand the part where we have to update the URL manually. I am just not sure with which URL to update it with knowing very well that at some point, the original/starting URL will no longer be valid and we probably don't want to make up a non-existent Summary URL.

Is it OK to update the summary URL with the Overview summary URL, for example? In the case of the Liver Cancer summaries, it will be https://www.cancer.gov/types/liver/what-is-liver-cancer

The drawback of this is we will have two documents with same URLs and may run into a potential link problems in pub preview and QC reports.

Comment entered 2022-02-25 13:21:18 by Osei-Poku, William (NIH/NCI) [C]

Hi ~volker I have done some testing on DEV using some of the SVPC docs you created for testing (please note, some are blocked now). I added the Summary URL from the Partner summary CDR0000256491 to the following SVPC summaries CDR0000805703, CDR0000805702 and CDR0000805704 to simulate a case where more than one summary document has the same Summary URL. So, in this case , all four summaries have the same URL.

With all four summaries having the same Summary URL, I did not run into any validation issues in XMetal. The documents were able to be validated and made publishable without any issues.

I created links (SummaryRef and SrummaryFragRef) between the two SVPC summaries (CDR0000805703 and CDR0000805702)

I still did not get any validation errors in the summaries in XMetal.

The links (inter linking between the two SVPC summaries with the same URL ) worked fine in QC reports. However, the links did not work in pub preview as expected. The links went to a some type of a filter page. This happened before I made the summaries publishable. Once I made the summaries publishable, the link appeared to go to the right URL. More testing is needed to confirm this.

I did not test using the WillReplace element because without them, the documents were valid in XMetal, and I was able to generate pub preview and QC reports for them without any problems. So, it looks to me that we can avoid the use of the WillReplace element. What I don't know is whether publishing these documents to cancer.gov will work as expected.

Comment entered 2022-02-25 13:40:37 by Kline, Bob (NIH/NCI) [C]

Yes, ~volker and I noticed a flaw in the custom validation rule late yesterday afternoon. It does its check for duplicate URLs before the save action, which means that for the first save a duplicate URL will escape detection. I think you'll find that you're not able to save a subsequent publishable version for either document unless one of them has the WillReplace element present. This flaw should be corrected in a subsequent release.

Comment entered 2022-02-25 13:41:53 by Kline, Bob (NIH/NCI) [C]

More accurately, I should say ~volker found the flaw, and I diagnosed and confirmed it, so he gets most of the credit here. 🙂

Comment entered 2022-02-25 14:57:00 by Osei-Poku, William (NIH/NCI) [C]

Sure. I saved several publishable versions of the the test documents (without the WillReplace element) and never ran into a validation issue. Could you confirm where to expect the validation problem? Is it in XMetal or during publication of the summaries?

Comment entered 2022-02-25 14:59:04 by Englisch, Volker (NIH/NCI) [C]

Do you remember how you had saved the documents once you updated the URL, ~oseipokuw ?

As Bob mentioned we saw something similar. I'm not certain what we saw was identical. I will try to replicate what you're describing.

By the way, please make sure to remove the PatientVersionOf element from all documents. This element is used to retrieve the information for the HP/patient toggle for legacy documents but should not be part of the SVPC or partner documents.

Comment entered 2022-02-25 15:09:02 by Osei-Poku, William (NIH/NCI) [C]

Do you remember how you had saved the documents once you updated the URL, ~oseipokuw ?

After replacing the URL, I click on the CDR Save icon to save the document. After generating pub preview, I also click on the CDR Save icon and make the document publishable before running pub preview again.

Comment entered 2022-02-25 15:19:12 by Kline, Bob (NIH/NCI) [C]

~oseipokuw can you tell us what the CDR IDs are for the pair of summaries with the same URL?

Comment entered 2022-02-25 15:24:30 by Osei-Poku, William (NIH/NCI) [C]

All of these documents have the same URL

Partner summary

CDR0000256491

SVPC summaries

CDR0000805703

CDR0000805702

CDR0000805704

Comment entered 2022-02-25 15:50:41 by Kline, Bob (NIH/NCI) [C]

Well, that's interesting, because only two of those four have a summary URL in the query_term table, and they have different URLs.

Comment entered 2022-02-25 16:06:40 by Englisch, Volker (NIH/NCI) [C]

FYI - I have a SQL query in our Query Interface on DEV called "SVPC Partner Pairs" to show partner documents and their linked SVPC modules.

Comment entered 2022-02-25 16:45:20 by Kline, Bob (NIH/NCI) [C]

Volker just told me these documents are on DEV, not QA. So I'm confused. I thought the whole purpose of deploying Oersted to QA was so you could move your base of operations for Oersted to that server, and leave DEV for everything else which is in flight. I see the query terms rows on DEV.

Comment entered 2022-02-25 16:50:30 by Kline, Bob (NIH/NCI) [C]

The reason the custom validation rule isn't hitting those documents is that all four of them are blocked.

Comment entered 2022-02-25 17:19:34 by Kline, Bob (NIH/NCI) [C]

As for the "flaw" in the custom validation rule, a good case can be made that it doesn't present much of a window for the wrong behavior to kick in, given that it never happens in the production world that the steps of (1) get the summary URL (2) create the new summary document (3) save the document (4) create a publishable version ALL get shoved into the same initial Save action. Isn't the standard process more like this?

Assign the summary URL (in consultation with Lindsay et al.)
Create the new summary document
Save the document
Edit the document
Save the document
Many repetitions of steps 4 and 5
Obtain the approval for publishing the summary
Create a publishable version

So in the real production world there's no opportunity to sneak in two summaries with the same URL, right?

Comment entered 2022-02-25 17:24:41 by Osei-Poku, William (NIH/NCI) [C]

Volker just told me these documents are on DEV, not QA. So I'm confused. I thought the whole purpose of deploying Oersted to QA was so you could move your base of operations for Oersted to that server, and leave DEV for everything else which is in flight. I see the query terms rows on DEV.

Yes, but we are waiting for the final edits of the documents to completed before copying them to the QA. So, we currently don't have any SVPC documents on QA.

Comment entered 2022-02-25 17:35:43 by Osei-Poku, William (NIH/NCI) [C]

So in the real production world there's no opportunity to sneak in two summaries with the same URL, right?

Not during the creation of a completely new summary. On PROD, we encounter same URL for two summaries during

"Cloning" a summary for comprehensive revision to replace content in the existing summary
Using the English XML for translation (until the Spanish URL is ready is some cases)

Comment entered 2022-02-25 18:10:50 by Osei-Poku, William (NIH/NCI) [C]

The reason the custom validation rule isn't hitting those documents is that all four of them are blocked.

This is still on DEV.

I unblocked this partner document CDR0000569404, added the summary URL from this SVPC summary CDR0000805709 and made CDR0000569404 publishable (not exactly in this order). I did not add the WillReplace element to the SVPC summary and I still could not get the validation rule to come up.

Comment entered 2022-02-25 21:17:58 by Kline, Bob (NIH/NCI) [C]

Well, if the the custom validation rule is broken, it's been broken since Alan implemented it over 13 years ago, because it's pretty much unchanged from then.

Comment entered 2022-02-28 09:53:00 by Kline, Bob (NIH/NCI) [C]

I created a script to verify that Alan's custom rule behaves as I described it above.

#!/usr/bin/env python3

from argparse import ArgumentParser
from lxml import etree
from cdrapi.docs import Doc, Link
from cdrapi.users import Session

TYPES = "schema", "links"

def make_doc(session):
    summary = etree.Element("Summary", nsmap=Doc.NSMAP)
    meta_data = etree.SubElement(summary, "SummaryMetaData")
    etree.SubElement(meta_data, "SummaryType").text = "Treatment"
    etree.SubElement(meta_data, "SummaryAudience").text = "Patients"
    etree.SubElement(meta_data, "SummaryLanguage").text = "English"
    etree.SubElement(meta_data, "SummaryDescription").text = "yada yada"
    etree.SubElement(meta_data, "SummaryURL").set(Link.CDR_XREF, "/same/url")
    pdq_board = etree.SubElement(meta_data, "PDQBoard")
    etree.SubElement(pdq_board, "Board").set(Link.CDR_REF, "CDR0000028557")
    topics = etree.SubElement(meta_data, "MainTopics")
    etree.SubElement(topics, "Term").set(Link.CDR_REF, "CDR0000038801")
    etree.SubElement(summary, "SummaryTitle").text = "Test Summary"
    etree.SubElement(summary, "AltTitle", TitleType="Short").text = "TS"
    etree.SubElement(summary, "SummarySection")
    doc = Doc(session, xml=etree.tostring(summary), doctype="Summary")
    doc.save(version=True, val_types=TYPES)
    print(f"{doc.cdr_id} has {len(doc.errors)} errors on first save")
    return doc

def save_again(doc):
    doc.save(version=True, val_types=TYPES)
    for error in doc.errors:
        error = " ".join(error.message.split()[1:])
        print(f"{doc.cdr_id} (second save): {error}")

parser = ArgumentParser()
parser.add_argument("--session", "-s", required=True)
opts = parser.parse_args()
session = Session(opts.session)
doc1 = make_doc(session)
doc2 = make_doc(session)
save_again(doc1)
save_again(doc2)
print(f"errors from deleting {doc1.cdr_id}: {doc1.delete()}")
print(f"errors from deleting {doc2.cdr_id}: {doc2.delete()}")
print(etree.tostring(doc1.root, pretty_print=True, encoding="Unicode"))

This is the output.

CDR0000808550 has 0 errors on first save
CDR0000808551 has 0 errors on first save
CDR0000808550 (second save): More than one Summary has the same SummaryURL with no WillReplace element
CDR0000808551 (second save): More than one Summary has the same SummaryURL with no WillReplace element
errors from deleting CDR0000808550: None
errors from deleting CDR0000808551: None

<Summary xmlns:cdr="cips.nci.nih.gov/cdr">
  <SummaryMetaData>
    <SummaryType>Treatment</SummaryType>
    <SummaryAudience>Patients</SummaryAudience>
    <SummaryLanguage>English</SummaryLanguage>
    <SummaryDescription>yada yada</SummaryDescription>
    <SummaryURL cdr:xref="/same/url"/>
    <PDQBoard>
      <Board cdr:ref="CDR0000028557"/>
    </PDQBoard>
    <MainTopics>
      <Term cdr:ref="CDR0000038801"/>
    </MainTopics>
  </SummaryMetaData>
  <SummaryTitle>Test Summary</SummaryTitle>
  <AltTitle TitleType="Short">TS</AltTitle>
  <SummarySection cdr:id="_1"/>
</Summary>

As you can see, the first save of each document fails to find the duplicate URL, because at that point there aren't two rows in the query_term table for the URL. Subsequent saves do not pass validation, because the custom validation rule correctly detects the duplication. Alan assumed (and I think he was right) that it was very unlikely that an attempt to save a publishing version of a summary would happen on the first save of the document with its URL, and that it was not worth making the code for the custom validation rule as complicated as it would need to be in order to detect that edge case.

Comment entered 2022-02-28 11:51:11 by Kline, Bob (NIH/NCI) [C]

Just FYI, that's about as small a summary document as you can get to pass validation, and it's what I use for such repro tests.

Comment entered 2022-03-01 16:40:10 by Osei-Poku, William (NIH/NCI) [C]

To be safe, I proceeded to add the WillReplace element to all the SVPC summaries.

Comment entered 2022-05-26 13:49:27 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA and PROD. Thanks!

Comment entered 2022-05-26 13:50:36 by Osei-Poku, William (NIH/NCI) [C]

We decided eventually to remove the WilReplace element because it was blocking the SVPC summaries from being published. We will address the implication in a different Ohm ticket.

Attachments

File Name	Posted	User
image-2022-02-25-13-15-24-717.png	2022-02-25 13:15:25	Osei-Poku, William (NIH/NCI) [C]
image-2022-02-25-15-50-34-435.png	2022-02-25 15:50:35	Kline, Bob (NIH/NCI) [C]
PartnerSummaryRef.PNG	2022-02-23 15:13:56	Osei-Poku, William (NIH/NCI) [C]
Screen Shot 2022-02-23 at 14.38.18.png	2022-02-23 14:41:21	Englisch, Volker (NIH/NCI) [C]

Elapsed: 0:00:00.000683

CDR Tickets