CDR Tickets

Issue Number 3908
Summary Provide interface for bypassing push optimization
Created 2015-05-14 15:53:03
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2016-03-01 20:17:34
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.161168
Description

It should be possible for any publishing job involving a push of the data to GateKeeper to suppress the comparison of filtered documents with what was previously sent to GateKeeper. Normally the results of this comparison are used to decide when we can skip pushing some of the documents. Sometimes this is an inappropriate optimization (for example, the previous push was to a different GateKeeper instance). Some of the support for this enhancement is likely already in place.

[python (CDR publish module/option on the publishing job interface)]
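
For orientation, the optimization in question boils down to something like the sketch below. The function and variable names are illustrative stand-ins, not the actual cdrpub.py code.

    def document_needs_push(filtered_xml, last_pushed_xml):
        """Illustrative only: the push optimization in its current form.

        A document is sent to GateKeeper only when its freshly filtered XML
        differs from the copy recorded for the previous push (or when it has
        never been pushed at all).  This ticket asks for a way to skip the
        comparison and push everything regardless.
        """
        if last_pushed_xml is None:
            return True                          # never pushed before
        return filtered_xml != last_pushed_xml   # push only if something changed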

Comment entered 2015-11-12 23:35:00 by alan

I've been reading the code in cdrpub.py and PublishDocs.py, and looking at the tables.

Some approaches to this are to:

  1. Use the Republish.py / RepublishDocs.py program pair, which already has this capability - with limitations

  2. Add a function like RepublishDocs.CdrRepublisher.__adjustPubProcCgTable() to cdrpub.py to enable cdrpub to
    force a push to a Gatekeeper target regardless of the contents of pub_proc_cg. This also has limitations.

  3. Write a new function that invokes the Gatekeeper Status Request, then modify RepublishDocs.py and/or
    cdrpub.py to use the results of that request instead of pub_proc_cg if so ordered by the CDR publishing user.

I don't know if that third option is practical. If it is, it might resolve a number of problems that I don't think can be handled perfectly with options 1 or 2.

We can attempt a simulation of option 3 by joining the pub_proc_cg, pub_proc, pub_proc_parm, pub_proc_doc, and doc_version tables to make estimates of what's on any particular server that we publish to, but it's a complex and expensive query that might turn out to be too slow or too fragile.
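
For what it's worth, the simulation might take roughly the shape below. The table names are the ones listed above, but the column names, the parameter name, and the join conditions are assumptions rather than the real CDR schema; this is only meant to illustrate why the query gets complicated.

    # Hypothetical sketch of a "what is already on this GateKeeper target"
    # estimate.  Columns and predicates are guesses, not the actual schema.
    ESTIMATE_QUERY = """\
    SELECT ppd.doc_id, MAX(pp.completed) AS last_successful_push
      FROM pub_proc_doc ppd
      JOIN pub_proc pp       ON pp.id = ppd.pub_proc
      JOIN pub_proc_parm ppp ON ppp.pub_proc = pp.id
      JOIN pub_proc_cg cg    ON cg.id = ppd.doc_id
      JOIN doc_version dv    ON dv.id = ppd.doc_id
                            AND dv.num = ppd.doc_version
     WHERE pp.status = 'Success'
       AND ppp.parm_name = 'GKServer'   -- assumed parameter name
       AND ppp.parm_value = ?           -- the GateKeeper target in question
     GROUP BY ppd.doc_id
    """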

Comment entered 2015-11-13 00:28:10 by Englisch, Volker (NIH/NCI) [C]

I don't fully understand your option 3. What is the purpose of using the GK Status Request when you're trying to publish to a lower tier server?
Wouldn't it be possible to just skip the diff portion of cdr2gk.py and handle each document as if the result of the diff was positive?

Comment entered 2015-11-17 16:06:53 by Englisch, Volker (NIH/NCI) [C]

We also recently discovered that our Re-publish publishing option does not run multi-threaded. Therefore, if a document type is published using our standard option (i.e. Summary-Export), it runs multi-threaded, but the same documents submitted as a Re-publish take about 4-5 times longer to process because they run in a single thread.
The problem is that only the Re-publish option allows us to submit all documents published to Gatekeeper, regardless of whether there were changes or not.

Comment entered 2016-02-16 14:16:13 by henryec

The 5-point estimate is based on the assumption that a flag gets placed on the publishing interface that skips the optimization.

Comment entered 2016-02-26 00:33:25 by alan

I'm working on this again.

As we know, the logic of the publishing program is anything but simple.

I'll probably come up with a plan and then ask for a walk-through with Bob and Volker to be sure the plan covers all of the cases before modifying any code.

Comment entered 2016-02-26 11:08:52 by Englisch, Volker (NIH/NCI) [C]

I guess what Bob and I were thinking was to add a flag to the push job that sets the variable needsPush to True whenever the flag is set, thereby skipping the diff portion for the new files.
needsPush is set in cdrpub.py on line 1452/1465 and 1797/1810.
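
If I'm reading that right, the change at each of those assignment sites would amount to something like the placeholder sketch below; pushAllDocs and the XML variables are stand-ins, and only needsPush is a name taken from the real code.

    # Placeholder illustration of the proposed override at the needsPush
    # assignment sites; not actual cdrpub.py code.
    pushAllDocs = True                             # would come from the new job flag
    filteredXml, lastPushedXml = "<a/>", "<b/>"    # stand-ins for the real documents

    if pushAllDocs:
        needsPush = True                           # force the push, skip the diff
    else:
        needsPush = filteredXml != lastPushedXml   # existing comparison, simplified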

Comment entered 2016-02-27 20:05:20 by alan

That's an approach I hadn't thought of. It looks simpler than what I was looking at.

Thanks.

Comment entered 2016-03-01 17:03:44 by alan

I had been planning to add a new parameter "ForcePush" with a default value of No, but Bob thought that we might have an unused parameter that fills the bill.

It turns out, as Bob thought, that the "CheckPushedDocs" parameter already exists but is unused. It is referenced in cdrpub.py but only to set the write-only variable __checkPushedDocs = "Yes". A grep of all of the Python code in lib/Python, Publishing, and cgi-bin/cdr shows no other reference to the parameter.

It also turns out that all 16 of the Push subsets in Primary.xml have CheckPushedDocs = "Yes". No other subsets in Primary.xml have the parameter, and no subsets in either of the other publishing control documents have it. That's exactly what we'd want for ForcePush.

So we have the option of activating this parameter as follows:

CheckPushedDocs = "Yes"
Means check the documents to push against the existing pub_proc_cg table.

CheckPushedDocs = "No"
Means skip the check and push regardless of pub_proc_cg.

That doesn't seem to me to be quite as mnemonic as "ForcePush", but it isn't a big stretch.

We would still want to edit the XML document to add a ParmInfoHelp element for CheckPushedDocs. The only current value is "This option appears to be ignored".

I'm going to proceed with this. If anyone has a better idea, let me know. The logic would be the same regardless; it's only the name of the parameter that would change.

Comment entered 2016-03-01 17:20:44 by Englisch, Volker (NIH/NCI) [C]

Although I can see the argument, I don't exactly like the name. It sounds like we send a document to GK and then check something afterwards. I also see the confusion that the parameter ForcePush could cause, given the fact that we're already using a similarly named column.
I suggest another name for the parameter instead: PushAllDocs

Comment entered 2016-03-01 17:25:36 by alan

I like that.

I will do the following:

  • Remove references to CheckPushedDocs in cdrpub.py. It's currently unused.

  • Replace all occurrences of "CheckPushedDocs" with "PushAllDocs" in the control document.

  • Change all of the default values from "Yes" to "No". The user will change the value to "Yes" to force-push the documents.

  • Add an appropriate ParmInfoHelp for PushAllDocs.

  • Modify the publishing code to skip the compare and push everything if PushAllDocs = "Yes" (see the sketch below).
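
A rough sketch of how that last item could behave, assuming the job parameters arrive as a simple mapping (the real cdrpub.py plumbing differs, so treat every name other than PushAllDocs as a placeholder):

    def push_all_docs_requested(job_parms):
        """Return True when the job asked to bypass the push optimization."""
        return job_parms.get("PushAllDocs", "No") == "Yes"

    # When this returns True, every filtered document is treated as needing
    # a push; otherwise the usual compare against pub_proc_cg decides which
    # documents are sent to GateKeeper.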

Comment entered 2016-03-01 20:16:59 by alan

I have implemented all of the above and, with Volker's help, tested with a one-document Hotfix on DEV.

Modifications to Primary.xml, the publishing control document (CDR0000000178), are in svn and in the DEV database.

Modifications to cdrpub.py are in svn and the live DEV cdr\lib\Python directory.

I thought seriously about refactoring some of the duplicate and extraneous code that occurs at and around the two places in cdrpub.py that were affected (__createWorkPPC() and __createWorkPPCHE()) but decided that was outside the scope of this task.

Comment entered 2016-03-01 20:17:34 by alan

Marking this as Resolved Fixed.

Comment entered 2016-03-23 10:24:17 by Englisch, Volker (NIH/NCI) [C]

This is finished as expected.
There is, however, a minor improvement one could make. Alan and I had talked about adding a note to the page that's used to release a submitted push job. The option to bypass the push optimization can only be specified as part of the push job submission, which means a publishing job that automatically submits the push job has to be killed and restarted manually in order to skip the optimization. This information should be displayed on the push job release page. Otherwise, if one forgets and releases the job anyway, the optimization will take place and the publishing job would need to be restarted in order to produce a new document set to be pushed without the optimization.

Comment entered 2016-05-13 14:32:48 by Englisch, Volker (NIH/NCI) [C]

I hot-fixed a summary that I used for the Darwin smoke test. As expected, this summary failed to be pushed. Then I pushed this summary again after setting the parameter PushAllDocs to Yes and the summary was successfully pushed to Gatekeeper.

Closing ticket.
