CDR Tickets

Issue Number 5055
Summary Modify Publishing Module cdrpub
Created 2021-10-20 21:04:15
Issue Type Task
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2021-11-19 11:59:32
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.300974
Description

Our publishing process needs to be modified to handle processing the three different summary types (legacy, partner, SVPC) properly.

This requires a decision on how to store partner documents.  Typically, the published XML of our documents is captured in a table called "pub_proc_cg", which provides a snapshot of what's on Cancer.gov.  With the partner documents not published to Cancer.gov, we need to decide whether they should still be captured along with all other XML output.

We may need to make similar decisions on how to save the output on disk for our partners, since partners will not receive SVPC summaries.
Depending on the solution chosen, we may need to implement additional changes to the CG2Public.py script and/or the sftp-export-data.py script.
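
The routing constraints described above can be sketched as a small lookup. This is only an illustration, not the cdrpub implementation: the names SummaryType and destinations are hypothetical, and the sketch assumes that legacy summaries continue to go to both Cancer.gov and the partners and that SVPC summaries go to Cancer.gov. The ticket itself states only that partner documents are not published to Cancer.gov and that partners will not receive SVPC summaries.

```python
from enum import Enum


class SummaryType(Enum):
    """The three summary types named in the ticket (hypothetical enum)."""
    LEGACY = "legacy"
    PARTNER = "partner"
    SVPC = "SVPC"


def destinations(summary_type):
    """Return the set of outputs a summary of this type is sent to."""
    targets = set()
    # Partner-only summaries are not published to Cancer.gov.
    if summary_type is not SummaryType.PARTNER:
        targets.add("cancer.gov")
    # Partners do not receive SVPC summaries.
    if summary_type is not SummaryType.SVPC:
        targets.add("partners")
    return targets
```

Whether partner documents should also be recorded in pub_proc_cg is the open decision; the sketch covers only where each document is delivered.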

Comment entered 2021-11-01 14:40:11 by Kline, Bob (NIH/NCI) [C]

I have considered two approaches the publishing software could use to determine what kind of summary document is being published. One is to index the new top-level attributes and have cdrpub.py look in the query_term_pub table. The other is to have your publishing filter set preserve those attributes and strip them out on the way to the pile for the data partners. I was initially leaning toward the first approach, as it would mean less work for you (and your filters). The drawback is that it introduces a tiny window between the time the job is created and the time the documents are actually processed, during which another publishable version of the summary could be created with different values for the attributes. The probability of that happening is not very high, but the risk does exist. So what would you say to modifying the filters to support the second approach?
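
A minimal sketch of the second approach, under stated assumptions: the attribute names SVPC_ATTR and PARTNER_ATTR, the value "Yes", and the function classify_and_strip are all hypothetical, since the ticket does not name the real attributes or the cdrpub code. The idea it illustrates is that the summary type is read from the filtered document itself, so the decision cannot drift from the version actually being published.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names; the ticket does not name the real ones.
SVPC_ATTR = "SVPC"
PARTNER_ATTR = "PartnerOnly"


def classify_and_strip(filtered_xml):
    """Return (summary_type, xml_for_partners) for one filtered document."""
    root = ET.fromstring(filtered_xml)
    # Decide the type from the attributes the filter set preserved.
    if root.get(SVPC_ATTR) == "Yes":
        summary_type = "SVPC"
    elif root.get(PARTNER_ATTR) == "Yes":
        summary_type = "partner"
    else:
        summary_type = "legacy"
    # Remove the marker attributes before writing the partner copy.
    for name in (SVPC_ATTR, PARTNER_ATTR):
        root.attrib.pop(name, None)
    return summary_type, ET.tostring(root, encoding="unicode")
```

With the first approach, the type would come from a separate lookup in query_term_pub, which is where the timing window arises.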

Comment entered 2021-11-02 11:05:50 by Englisch, Volker (NIH/NCI) [C]

I have modified the publishing filter set to preserve the new summary attributes for SVPC-related documents.

Comment entered 2021-11-19 11:59:32 by Kline, Bob (NIH/NCI) [C]

The changes necessary for pushing the new fields and for custom handling of SVPC and partner-only summaries have been implemented and installed on DEV.

Comment entered 2022-06-17 11:52:50 by Englisch, Volker (NIH/NCI) [C]

The ticket has been moved to production.  Minor issues with the publishing process will be handled with a new ticket in the Ohm sprint (OCECDR-5111).

Closing ticket.
