Issue Number | 5045 |
---|---|
Summary | Modify vendor filters for PDQ content partners data |
Created | 2021-10-15 13:09:51 |
Issue Type | New Feature |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2022-01-24 10:23:56 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.300653 |
The current PDQ patient liver cancer summaries will be split into multiple SVPC summaries, each published as its own page on cancer.gov. We are currently modifying the summary schema so that multiple individual SVPC summaries can be assembled into one CDR master partner summary for the PDQ content partners (OCECDR-5043). The existing CDR summary documents will no longer be updated (but the individual SVPC summaries will be); however, their CDR IDs will be reused for the content assembled for partners. The vendor filters therefore need to be modified so that they use the master partner summary document to assemble the related summaries into one summary for the content partners while reusing the existing CDR ID.
We are currently working on creating a new SVPC summary schema ...
What I remember from yesterday's discussions is that we decided we would go with the users' suggestion for sticking with the Summary schema for the master partner documents.
That is right, and I think OCECDR-5043 will take care of that. Thanks for pointing that out.
We're going to allow a SummarySection with a section type of "Introductory Text" which will not require a section title for the SVPC documents.
What do we expect the partner output to look like for this section?
The partner document will not include an "Introductory Text" section
The partner document will include an "Introductory Text" section without a section title
The partner document will include an "Introductory Text" section with a canned section title
If we do display the intro text section there are a few options, too:
The partner document will include a manually maintained "Introductory Text" section.
The partner document will include a canned intro text section
The partner document will include some combination of intro text sections imported from the modules
I'm listing the elements in the partner XML output using the name, label, or text of "PDQ":
Attribute /Summary/@LegacyPDQID
Element SummaryEditorialBoard
--> PDQ Adult Treatment Editorial Board
Section "About This PDQ Summary"
That's fewer than I expected. I have removed the attribute LegacyPDQID and the "About This PDQ Summary" section will be removed.
Will we continue to include the SummaryEditorialBoard element for the partner documents?
The PDQ board information will be removed from the single-view summaries sent to the data partners.
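For illustration only, here is a minimal Python sketch of the two removals mentioned above (the LegacyPDQID attribute and the "About This PDQ Summary" section). The real change lives in the vendor filters, not in Python. The function name strip_pdq_markers is hypothetical, and the sketch assumes that top-level sections are SummarySection children of the root and carry their title in a Title child element.

```python
import xml.etree.ElementTree as ET

ABOUT_TITLE = "About This PDQ Summary"


def strip_pdq_markers(summary):
    """Drop the LegacyPDQID attribute and the 'About This PDQ Summary' section.

    Assumption: section titles are stored in a Title child element.
    """
    # pop() with a default does nothing if the attribute is absent.
    summary.attrib.pop("LegacyPDQID", None)
    # findall() returns a list, so removing while looping is safe.
    for section in summary.findall("SummarySection"):
        if (section.findtext("Title") or "").strip() == ABOUT_TITLE:
            summary.remove(section)
    return summary
```

Other attributes and sections are left untouched, which matters because the output for SVPC and legacy summaries must not change.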
Question for ~oseipokuw:
We had decided to make the top-level section titles optional so that we can have one-section SVPC summaries without a section title. These would then display only the document title (e.g. "What is Liver Cancer?"). We had then said that, for the partner document, we would want to use the document title for the "missing" top-level section title. This should only be done if an intro text (without title) exists or if a single section (no intro text) without title exists. We do not want to replace the document title for the top-level section title if multiple sections without title exist.
We did not specify, however, what the filters should do in such a case of multiple missing section titles. I see these options:
Replace the missing section title with the document title for the first section of the SVPC and do nothing with any further missing titles
Replace any subsequent missing title with some canned text (e.g. SECTION TITLE MISSING), or
Fail publishing of the summary
We do not want to replace the document title for the top-level section title if multiple sections without title exist.
Hi Volker ~volker, My understanding from our discussions is that there should not be multiple sections without a section title since there is really no use case for that. That is, the top-level section title is made optional only for the first (or only) section and no more. So, if we have multiple sections without section titles, then we should get a validation error message.
We had modified the schema to make the top-level section title optional. I don't know if a rule can be set up in the schema so that only the first top-level section title is optional. Besides that, if you're asking for a validation rule to check for multiple missing section titles, it would still be possible to run the PP report on the CWD of the document even with multiple empty titles, because there's no requirement to validate a document before running the PP report.
I think I will fail the filter process if I come across a document without a section title unless it's the first section.
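A rough Python sketch of that rule, for discussion only; the actual logic belongs in the vendor filters. It assumes the document title is a SummaryTitle child of the root and that section titles are Title children of each top-level SummarySection. Those element names, the function name fill_section_titles, and the MissingSectionTitle exception are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


class MissingSectionTitle(Exception):
    """A top-level section other than the first has no title."""


def fill_section_titles(summary):
    """Apply the title rule to the top-level sections of `summary` in place."""
    doc_title = (summary.findtext("SummaryTitle") or "").strip()
    for position, section in enumerate(summary.findall("SummarySection")):
        title = section.find("Title")
        if title is not None and (title.text or "").strip():
            continue  # section already has a usable title
        if position > 0:
            # Only the first (or only) section may lack a title.
            raise MissingSectionTitle(
                f"top-level section {position + 1} has no title"
            )
        if title is None:
            title = ET.Element("Title")
            # Keep a leading SectMetaData block ahead of the new title.
            has_meta = len(section) > 0 and section[0].tag == "SectMetaData"
            section.insert(1 if has_meta else 0, title)
        title.text = doc_title
    return summary
```

Raising an exception here stands in for failing the filter process. Because the check runs at filter time, it also covers documents that were never validated before the PP report was run.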
There exists a new validation error for the partner output of legacy documents (CDR62955) that will need to be addressed. The error is part of the "About this PDQ Summary" section.
Can you elaborate? Is this a DTD validation error? Schema validation?
I don't see any errors in pub_proc_doc on CDR DEV, nor does the Filter interface show any validation errors using Vendor Summary Set filters. Do you know which change might have triggered the error?
Yes, it was a DTD validation error for the final vendor output pointing to an invalid namespace (Vendor Summary Set + Vendor Filter: Convert CG to Public Data) against pdq.dtd.
I must have fixed that problem with the latest changes because I'm unable to recreate it myself at this point.
The XML created for a partner document now includes a "DateLastModified" element, set to the latest date found in the DateLastModified elements across the various imported SVPC documents.
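As a sketch of the "latest date" selection described above (not the filter code itself), assuming the dates are stored as ISO YYYY-MM-DD strings and that each imported SVPC document is available as a parsed element; the function name latest_date_last_modified is hypothetical:

```python
from datetime import date
import xml.etree.ElementTree as ET


def latest_date_last_modified(modules):
    """Return the newest DateLastModified across `modules`, or None.

    Assumption: values are ISO dates (YYYY-MM-DD).
    """
    dates = []
    for module in modules:
        for node in module.iter("DateLastModified"):
            text = (node.text or "").strip()
            if text:
                dates.append(date.fromisoformat(text))
    return max(dates).isoformat() if dates else None
```

Documents without a DateLastModified element are simply skipped, so one unmodified module does not blank out the date for the whole partner document.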
All necessary filter changes have been implemented. Two additional potential changes (the "About this PDQ Summary" info and a decision regarding top-level section titles) are handled as part of the linked tickets.
Given that PP is not the best tool to use to test the changes in this ticket, will the XML from the Show CDR Document XML report be enough to test the changes?
No, that tool will not help you. The tool allows you to view the XML output submitted to Drupal, but because we're not sending the partner output to Drupal, we're not storing the XML in the pub_proc_cg table.
There are two options for viewing the XML:
Using the admin filter interface
https://cdr-dev.cancer.gov/cdrFilter.html
Publishing a document and viewing the version saved in the file system
I believe the first option would be much easier on the eye because the browser provides some styling to the XML.
When using the first option you want to
Enter the CDR-ID
Remove the doc version "lastp" to run the filter for the cwd
Enter "set:Vendor Summary Set" for Filter 1 and
Click Submit
This won't get all the way to the documents provided to our partners, because we run an additional "clean-up" filter to strip data that is created for CG but must not appear in the partner output. If you want to go that extra step, click the "+" icon on the filter form and enter the filter ID "609947" in the newly created field named "Filter 2".
Please let me know if you need help running the filter set and creating the XML.
Please note that it will be important not only to test these changes for partner documents but it will be equally important to ensure the XML output for SVPC and legacy summary documents hasn't changed!
Thanks, I was able to generate the partner XML from the admin filter interface. For partner summary 805713, I see that the intro tags are included. Is that OK? They appear to be the only SVPC tags included in the partner XML for this summary.
<SectMetaData>
<SectionType>Introductory Text</SectionType>
</SectMetaData>
I will continue to test...
Yes, that's OK. The SectMetaData block is part of the DTD and is included in the partner output.
I see that the Summary Abstract, Keywords and Main Topics are included in the XML. It may be helpful to retrieve this information from the linked SVPC summaries instead of duplicating it in the partner documents. The Summary Abstract could be a challenge since it may be different for each SVPC summary (include all of them in different para tags?). Also, there are likely to be duplicates in the Keywords and Main Topics, but my guess is that you can remove the duplicates in the filter.
As you rightfully acknowledge, creating a single element SummaryAbstract from multiple SVPC documents could be a challenge. For that reason alone it should be maintained in the partner document unless you can give me concrete processing instructions on how to create this information for the partner document. Isn't this element basically created when the document gets created and rarely changes?
For the SummaryKeyWord and MainTopic it is much more straightforward to collect all values from the SVPC documents and append a single list to the partner document.
In that case, I think we should proceed with updating the filters to copy the keywords and the main topics from the SVPC summaries to the partner documents. I think we can manually copy the summary abstracts from the legacy documents to the partner documents for now. Hopefully, by the time we start working on completely new topics a decision would have been made about providing the SVPC summaries to the partners as is.
~oseipokuw, are you OK with me starting to work on this additional change now? If you're currently (i.e. today) testing the filters you might see some errors or messages you wouldn't expect.
Yes, it is OK to start working on it.
I have modified the filters to collect the MainTopics and SummaryKeyWords in the partner document output. You hadn't mentioned if the order of the items is significant but you did say you wanted to have duplicates removed, so I am displaying the combined lists sorted and deduped.
Additionally, I'm using the latest DateLastModified to display for the partner document. I'm still testing, ~oseipokuw, but PublishPreview should be working again.
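The sorted, deduplicated merge described above could be sketched in Python roughly as follows. This only illustrates the idea and is not the XSLT used in the filters. The function name merge_terms is hypothetical, and the sketch assumes each term's display text is the text content of the element (including any nested child element).

```python
import xml.etree.ElementTree as ET


def merge_terms(modules, tag):
    """Collect the text of every `tag` element across `modules`,
    drop duplicates and blanks, and return the values sorted."""
    terms = {
        "".join(node.itertext()).strip()
        for module in modules
        for node in module.iter(tag)
    }
    terms.discard("")
    # Plain string sort; exact-match dedup only (case-sensitive).
    return sorted(terms)
```

For example, merge_terms(svpc_docs, "SummaryKeyWord") would return one combined keyword list for the partner document, where svpc_docs is a list of parsed SVPC documents. Since the order was not specified as significant, sorting gives a stable output that is easy to compare between test runs.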
It looks like Secondary Topics are added to summaries in some cases. I don't think that is the case for the Liver Cancer summaries, but we should consider adding the secondary topics as well. Also, the Main Topics block is required in the summary schema. We should probably consider making it optional, just as we did for the summary block/element. This is also not necessarily needed now, since we are not creating the partner documents (just reusing the existing summaries). I will mention these in the standup tomorrow morning and, if it is agreed, I will create/update Jira tickets appropriately. Other than these observations, I think the changes are looking good on DEV.
The MainTopic is required in the summary schema because the DTD for the XML partner output requires the element. If you want to make this element optional in the SVPCs you probably want to maintain the information manually in the partner document or risk an empty data set and therefore fail validation of the partner document.
In addition, I just talked to a partner who is considering using the MainTopics information to identify "similar" summaries. In other words, I would be careful before deciding to make mandatory elements optional. That kind of change probably requires advance notification of our partners.
We agreed to create a new ticket to take care of the Secondary Topics. I will create the ticket and add it to the release independent pile.
Verified on DEV. Thanks!
Verified on QA and PROD. Thanks!