Issue Number | 5055 |
---|---|
Summary | Modify Publishing Module cdrpub |
Created | 2021-10-20 21:04:15 |
Issue Type | Task |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2021-11-19 11:59:32 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.300974 |
Our publishing process needs to be modified to handle processing the three different summary types (legacy, partner, SVPC) properly.
This requires a decision on how to store partner documents. Typically, the published XML of our documents is captured in a table called "pub_proc_cg", which provides a snapshot of what's on Cancer.gov. Since the partner documents are not published to Cancer.gov, we need to decide whether they should still be captured along with all other XML output.
We may need to make similar decisions on how to save the output on
the disk for our partners since partners will not receive SVPC
summaries.
Depending on the solution chosen we may need to implement additional
changes to the CG2Public.py script and/or sftp-export-data.py
script.
~volker I have considered a couple of different approaches for the
publishing software to use for determining what kind of summary document
is being published. One was to index the new top-level attributes and
have cdrpub.py look in the query_term_pub table. The other would be to
have your publishing filter set preserve those attributes and strip them
out on the way to the pile for the data partners. I was initially leaning
toward the first approach, as it would mean less work for you (and your
filters) to take care of. But the drawback to that approach is that it
introduces a tiny window of opportunity, between the time the job is
created and the time the documents are actually processed, for another
publishable version of the summary to be created with different values
for the attributes. Not a very high probability for such an occurrence,
but the risk does nevertheless exist. So what would you say to modifying
the filters to support the second approach?
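For illustration, here is a minimal sketch of the second approach: read the type from the filtered document itself, then strip the attributes from the copy bound for the data partners. It is not the actual cdrpub.py code. The attribute names (`svpc`, `partner-only`), their `Yes` values, the `Summary` root element, and the helper names are all hypothetical, since the ticket does not name them. The routing table also assumes that legacy summaries go to both Cancer.gov and the partners.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names; the ticket does not say what they are called.
TYPE_ATTRIBUTES = ("svpc", "partner-only")

# Assumed routing: SVPC summaries are not sent to partners, partner
# summaries are not published to Cancer.gov, legacy summaries go to both.
DESTINATIONS = {
    "legacy": ("cancer.gov", "partners"),
    "svpc": ("cancer.gov",),
    "partner": ("partners",),
}


def classify_summary(root):
    """Return 'svpc', 'partner', or 'legacy' for a filtered summary root."""
    if root.get("svpc") == "Yes":
        return "svpc"
    if root.get("partner-only") == "Yes":
        return "partner"
    return "legacy"


def strip_type_attributes(root):
    """Remove the routing attributes before the document goes to partners."""
    for name in TYPE_ATTRIBUTES:
        root.attrib.pop(name, None)
    return root


def route(xml_text):
    """Classify one filtered document and prepare the partner copy, if any."""
    root = ET.fromstring(xml_text)
    kind = classify_summary(root)
    destinations = DESTINATIONS[kind]
    partner_xml = None
    if "partners" in destinations:
        stripped = strip_type_attributes(root)
        partner_xml = ET.tostring(stripped, encoding="unicode")
    return kind, destinations, partner_xml
```

Because the type is read from the same filtered XML that is about to be pushed, a newer publishable version created after the job started cannot change the routing decision, which is the risk the query_term_pub lookup would carry.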
I have modified the publishing filter set to preserve the new summary attributes for SVPC-related documents.
The changes necessary for pushing the new fields and for custom handling of SVPC and partner-only summaries have been implemented and installed on DEV.
The ticket has been moved to production. Minor issues with the publishing process will be handled in a new ticket (OCECDR-5111) in the Ohm sprint.
Closing ticket.