Issue Number | 3908 |
---|---|
Summary | Provide interface for bypassing push optimization |
Created | 2015-05-14 15:53:03 |
Issue Type | Improvement |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2016-03-01 20:17:34 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.161168 |
It should be possible for any publishing job involving a push of the data to GateKeeper to suppress the comparison of filtered documents with what was previously sent to GateKeeper. Normally the results of this comparison are used to decide when we can skip pushing some of the documents. Sometimes this is an inappropriate optimization (for example, the previous push was to a different GateKeeper instance). Some of the support for this enhancement is likely already in place.
[python (CDR publish module/option on the publishing job interface)]
I've been reading the code in cdrpub.py, PublishDocs.py, and looking at the tables.
Some approaches to this are to:
1. Use the Republish.py / RepublishDocs.py program pair, which already has this capability, with limitations.
2. Add a function like RepublishDocs.CdrRepublisher.__adjustPubProcCgTable() to cdrpub.py to enable cdrpub to force a push to a GateKeeper target regardless of the contents of pub_proc_cg. This also has limitations.
3. Write a new function that invokes the GateKeeper Status Request, then modify RepublishDocs.py and/or cdrpub.py to use the results of that request instead of pub_proc_cg if so ordered by the CDR publishing user.
I don't know if that third option is practical. If it is, it might resolve a number of problems that I don't think can be handled perfectly with options 1 or 2.
We can attempt a simulation of option 3 by joining the pub_proc_cg, pub_proc, pub_proc_parm, pub_proc_doc, and doc_version tables to make estimates of what's on any particular server that we publish to, but it's a complex and expensive query that might turn out to be too slow or too fragile.
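The join described above could be prototyped along these lines; this is a rough sketch using simplified stand-ins for two of the tables, not the real CDR schema (the actual query would also involve pub_proc_parm, pub_proc_cg, and doc_version, which is part of why it could be slow or fragile):

```python
import sqlite3

# Simplified stand-ins for pub_proc and pub_proc_doc; columns and sample
# data are illustrative assumptions, not the real CDR schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pub_proc (id INTEGER, status TEXT);
    CREATE TABLE pub_proc_doc (pub_proc INTEGER, doc_id INTEGER,
                               doc_version INTEGER);
    INSERT INTO pub_proc VALUES (1, 'Success'), (2, 'Success');
    INSERT INTO pub_proc_doc VALUES (1, 100, 3), (2, 100, 4), (1, 200, 1);
""")

# Estimate the latest version of each document a target server has
# received, considering only successful jobs.
rows = con.execute("""
    SELECT d.doc_id, MAX(d.doc_version)
      FROM pub_proc_doc d
      JOIN pub_proc p ON p.id = d.pub_proc
     WHERE p.status = 'Success'
     GROUP BY d.doc_id
     ORDER BY d.doc_id
""").fetchall()
```

The real query would additionally have to filter by the target server (from the job parameters), which multiplies the joins involved.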
I don't fully understand your option 3. What is the purpose of using the GK Status Request when you're trying to publish to a lower tier server?
Wouldn't it be possible to just skip the diff portion of cdr2gk.py and handle each document as if the result of the diff was positive?
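The suggestion above amounts to something like this minimal sketch, where diff_result is a hypothetical helper (not the actual cdr2gk.py code) deciding whether a filtered document differs from what GateKeeper last received:

```python
def diff_result(new_xml, last_pushed_xml, bypass=False):
    """Return True when the document should be treated as changed.

    bypass -- hypothetical flag; when True, skip the comparison and
    handle every document as if the diff was positive.
    """
    if bypass:
        return True
    return new_xml != last_pushed_xml
```

With bypass set, every document would be pushed regardless of its last-sent copy.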
We also recently discovered that our Re-publish publishing option does not run multi-threaded. If a document type is published using our standard option, i.e. Summary-Export, it runs multi-threaded, but the same documents submitted as a Re-publish take about 4-5 times longer to process because they run in a single thread.
The problem is that only the Re-publish option allows us to submit all published documents to GateKeeper regardless of whether there were changes.
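The performance gap described above can be illustrated in miniature (this is not the CDR code; push_doc is a hypothetical stand-in for the per-document filter-and-push work):

```python
from concurrent.futures import ThreadPoolExecutor

def push_doc(doc_id):
    # Stand-in for per-document work that, in the real system, spends
    # most of its time waiting on I/O and so overlaps well in threads.
    return doc_id

doc_ids = list(range(8))

# Multi-threaded, as in the standard Summary-Export path:
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(push_doc, doc_ids))

# Single-threaded, as in the Re-publish path:
serial = [push_doc(i) for i in doc_ids]
```

Both paths produce the same results; the single-threaded one simply cannot overlap the per-document latency, which is consistent with the 4-5x slowdown observed.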
The 5-point estimate is based on the assumption that a flag gets placed on the publishing interface that skips the optimization.
I'm working on this again.
As we know, the logic of the publishing program is anything but simple.
I'll probably come up with a plan and then ask for a walk-through with Bob and Volker to be sure the plan covers all of the cases before modifying any code.
I guess what Bob and I were thinking was to add a flag to the push job that sets the variable needsPush to True whenever the flag is set, and therefore skips over the diff portion of the new files.
needsPush is set in cdrpub.py on lines 1452/1465 and 1797/1810.
That's an approach I hadn't thought of. It looks simpler than what I was looking at.
Thanks.
I had been planning to add a new parameter "ForcePush" with a default value of No, but Bob thought that we might have an unused parameter that fills the bill.
It turns out, as Bob thought, that the "CheckPushedDocs" parameter already exists but is unused. It is referenced in cdrpub.py but only to set the write-only variable __checkPushedDocs = "Yes". A grep of all of the Python code in lib/Python, Publishing, and cgi-bin/cdr shows no other reference to the parameter.
It also turns out that all 16 of the Push subsets in Primary.xml have CheckPushedDocs = "Yes". No other subsets in Primary.xml have the parameter, and no other subsets in either of the other publishing control documents have it. That's exactly what we'd want for ForcePush.
So we have the option of activating this parameter as follows:
CheckPushedDocs = "Yes" means check the documents to push against the existing pub_proc_cg table.
CheckPushedDocs = "No" means skip the check and push regardless of pub_proc_cg.
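Interpreted in code, the two values would look roughly like this; the sketch assumes the job's parameters arrive as a simple dict (the real code reads them from pub_proc_parm), with "Yes" as the default to preserve current behavior:

```python
def compare_against_pub_proc_cg(parms):
    # Hypothetical helper: True means perform the usual comparison of
    # filtered documents against pub_proc_cg; "Yes" is the default, so
    # existing jobs are unaffected.
    return parms.get("CheckPushedDocs", "Yes") == "Yes"
```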
That doesn't seem to me to be quite as mnemonic as "ForcePush", but it isn't a big stretch.
We would still want to edit the XML document to add a ParmInfoHelp element for CheckPushedDocs. The only current value is "This option appears to be ignored".
I'm going to proceed with this. If anyone has a better idea let me know. The logic would be the same regardless, it's only the name of the parameter that would change.
Although I can see the argument, I don't exactly like the name. It sounds like we've sent a document to GK and then check something. I also see the confusion that the parameter ForcePush could cause, given the fact that we're already using a similarly named column.
I suggest another name for the parameter instead:
PushAllDocs
I like that.
I will do the following:
Remove references to CheckPushedDocs in cdrpub.py. It's currently unused.
Replace all occurrences of "CheckPushedDocs" with "PushAllDocs" in the control document.
Change all of the default values from "Yes" to "No". The user will change the value to "Yes" to force a push.
Add an appropriate ParmInfoHelp for PushAllDocs.
Modify the publishing code to skip the compare and push everything if PushAllDocs = "Yes".
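The net effect of the steps above can be sketched as follows; the names here are illustrative assumptions, not the actual cdrpub.py internals (parms stands for the job's parameter dict, pub_proc_cg for a mapping of document IDs to the XML last sent to GateKeeper):

```python
def needs_push(doc_id, new_xml, pub_proc_cg, parms):
    # PushAllDocs = "Yes" skips the compare and pushes everything;
    # the default of "No" keeps the existing optimization.
    if parms.get("PushAllDocs", "No") == "Yes":
        return True
    return pub_proc_cg.get(doc_id) != new_xml
```

With the default parameters, an unchanged document is skipped as before; setting PushAllDocs to "Yes" forces it out regardless.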
I have implemented all of the above and, with Volker's help, tested with a one document Hotfix on DEV.
Modifications to Primary.xml, the publishing control document (CDR0000000178) are in svn and in the DEV database.
Modifications to cdrpub.py are in svn and the live DEV cdr\lib\Python directory.
I thought seriously about refactoring some of the duplicate and extraneous code that occurs at and around the two places in cdrpub.py that were affected ( __createWorkPPC() and __createWorkPPCHE() ) but decided that was outside the scope of this task.
Marking this as Resolved Fixed.
This is finished as expected.
There is, however, a minor improvement one could make. Alan and I had
talked about adding a note to the page that's used to release a
submitted push job. The option to bypass the push optimization can only
be specified as part of the push job submission. This means a publishing
job that automatically submits the push job has to be killed and
restarted manually in order to skip the optimization. This information
should be displayed on the push job release page. Otherwise, if one
forgets this information and pushes the job, the optimization will take
place and the publishing job would need to be restarted in order to
receive a new document set to be pushed without the optimization.
I hot-fixed a summary that I used for the Darwin smoke test. As expected, this summary failed to be pushed. Then I pushed this summary again after setting the parameter PushAllDocs to Yes and the summary was successfully pushed to Gatekeeper.
Closing ticket.