PDQ Issues

Issue Number	3910
Summary	Add more specific identification to "source" field in communication with GateKeeper
Created	2015-05-19 13:19:15
Issue Type	New Feature
Submitted By	Kline, Bob (NIH/NCI) [C]
Assigned To	alan
Status	Closed
Resolved	2016-03-10 16:19:32
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.161487

Description

Currently, we use "CDR" as the "source" value when we push documents to GateKeeper. This causes problems when two different instances of the CDR try to push to the same GateKeeper instance, because of conflicting job ID values. Modify the software that communicates with GateKeeper so that it includes the tier as part of the source value (e.g. "CDR-QA"). Provide the ability to override the default value from the publishing interface (for example, "CDR-Blair's-VM").

Note: GK distinguishes jobs by the ID and source fields. We would add code to append the tier identifier to the standard "CDR" identifying string and also allow a user to override the value at run time.
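The rule described in the note can be sketched as follows. This is only an illustration, not the code in the publishing system: the function name `build_source` and its parameters are hypothetical, and upper-casing the tier is an assumption.

```python
def build_source(tier, override=None, base="CDR"):
    """Return the "source" value to send to GateKeeper.

    `tier` is the hosting tier (e.g. "QA"). `override` is an optional
    value supplied at run time; when it is non-empty it replaces the
    default. Both names are hypothetical, chosen for this sketch.
    """
    if override and override.strip():
        return override.strip()
    # Assumption: the default is the base string, a hyphen, and the tier.
    return "%s-%s" % (base, tier.strip().upper())


print(build_source("qa"))                    # CDR-QA
print(build_source("QA", "CDR-Blair's-VM"))  # CDR-Blair's-VM
```

A blank override falls back to the tier-based default, so an empty field in a publishing form would not produce an empty source.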

[python (Publishing System - modify the control document)]

Comment entered 2015-06-09 20:29:39 by alan

Added this to the Darwin sprint, in accordance with a decision at the last status meeting that the three new issues concerning publishing should all be part of Darwin.

Comment entered 2016-02-23 22:53:49 by alan

I've added Aarti, Blair, and Volker to this issue.

~ShahAj ~LearnB
Volker has probably already discussed this with you but I want to be sure you are on board with the plan before I start modifying code.

Currently the CDR sends requests to Gatekeeper using the following beginning XML:

<Request xmlns='http://www.cancer.gov/webservices/'>
 <source>CDR</source>

Some requests are for "RequestStatus" instead of "Request".

Do you think that anything will break in Gatekeeper if we change the source values, for example to include the tier:

   <source>CDR-PROD</source>
   <source>CDR-QA</source>

or alternatively, to include arbitrary strings:

   <source>CDR-BlairVM</source>
   <source>CDR-DEV-just-testing</source>

Our plan would be to combine one of these source strings plus a job ID in a Gatekeeper request in order to disambiguate possibly identical job IDs that originated on different servers, and then to query Gatekeeper about those different jobs later using the same source strings.
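A minimal sketch of that disambiguation, using an in-memory dictionary as a stand-in for Gatekeeper's job records. The function names and the job ID 1001 are invented for illustration; nothing here reflects how Gatekeeper actually stores its data.

```python
# Hypothetical stand-in for Gatekeeper's record of push jobs.
jobs = {}


def register_job(source, job_id, description):
    """Record a job under the (source, job_id) pair."""
    key = (source, job_id)
    if key in jobs:
        raise ValueError("job %r already exists for source %r" % (job_id, source))
    jobs[key] = description


def job_status(source, job_id):
    """Look up a job later using the same source string and job ID."""
    return jobs.get((source, job_id))


# The same job ID from two servers no longer collides.
register_job("CDR-PROD", 1001, "push from production")
register_job("CDR-BlairVM", 1001, "push from a developer VM")

print(job_status("CDR-PROD", 1001))     # push from production
print(job_status("CDR-BlairVM", 1001))  # push from a developer VM
```

With a bare "CDR" source on both servers, the second `register_job` call would raise, which is the kind of conflict the source string is meant to prevent.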

Thanks

Comment entered 2016-02-23 23:42:32 by alan

~volker
So far, I've found five functions in cdr2gk.py that send <source>CDR</source> elements to Gatekeeper:

initiateRequest()
sendDataProlog()
sendDocument()
sendJobComplete()
requestStatus()

The first four are all called from cdrpub.py. The last one is called from PublishingService.py and ReverifyPushJob.py.

I'd like the two of us to spend some time on Thursday going over these to see which ones (probably all of them) are affected by this change and where we want them to find the relevant source strings.

Comment entered 2016-02-24 08:25:30 by Shah, Aarti (NIH/NIDDK) [C] [X]

~Alan -

GK only uses the Source string for display purposes. The change you are planning to make should not affect document processing.

Thanks.

Comment entered 2016-02-24 09:19:46 by Learn, Blair (NIH/NCI) [C]

Just to be clear, the Source string isn't used in processing. It is used for differentiating between identical job IDs from different sources. So from GateKeeper's perspective, this amounts to putting the field to its intended use.

Comment entered 2016-02-25 18:40:33 by alan

~bkline

Volker and I have gone through this issue looking at what code is affected and what will meet our publishing needs.

It turns out that there is a significant difference between two possible implementations of this requirement depending on whether the user is allowed to edit the "source" field.

Here are the two approaches:

1. Send a source field that always and only contains "CDR-tier" (CDR-DEV, CDR-QA, CDR-STAGE, or CDR-PROD).
2. Present the CDR-tier string to the user and allow him to make arbitrary changes.

Method 1 can be implemented in a single module, cdr2gk.py. There are five places in that module that need to change, but in each case they can use the same automatically generated tier specific constant. It's a quick and easy change.

It meets the entire need for enabling CDR publishing on lower tiers to work properly after a lower tier Gatekeeper database refresh from Gatekeeper PROD. Nothing else is required to solve that existing problem. If the CDR and Gatekeeper DEV tiers are out of sync on job numbers, the CDR can continue to send lower numbered publishing job IDs to GK without worrying about the fact that GK already has some higher numbered IDs. Fiddling with odd and even numbers, or other hacks, is not needed.

If a new CDR instance is created talking to the same CDR database as, say, DEV, and it uses "DEV" as its tier identifier, there is still no conflict. Jobs started on the new instance will get the next available DEV job ID and will be visible on both DEV and the new instance, and vice versa.

If the new instance has its own CDR database but talks to the same Gatekeeper used by another CDR instance, then it should be given its own \etc\cdrtier.rc value, for example "Blair", "Bob", "TestVM", etc. in order to avoid a collision of the push jobs.
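As a rough sketch of Method 1, the tier name can be read once and turned into a single source string that every request builder reuses. The helper names below are hypothetical, and the sketch assumes the tier file holds nothing but the tier name. A temporary file stands in for the real file so the example can run anywhere.

```python
import os
import tempfile


def read_tier(path):
    """Return the tier name stored in the file at `path`, stripped."""
    with open(path) as fp:
        return fp.read().strip()


def source_for(path):
    """Build the "source" string once, for reuse by all request builders."""
    return "CDR-" + read_tier(path)


# Demonstration: a temporary file stands in for \etc\cdrtier.rc.
with tempfile.NamedTemporaryFile("w", suffix=".rc", delete=False) as fp:
    fp.write("TestVM\n")
try:
    SOURCE = source_for(fp.name)
finally:
    os.remove(fp.name)

print(SOURCE)  # CDR-TestVM
```

Giving a separate instance its own tier value changes `SOURCE` for every request it sends, with no change to the calling programs.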

Method 2 is potentially more flexible, but it is significantly more complicated to implement. At least three programs must be modified:

cdr2gk.py
cdrpub.py
publishing.py

I'm not sure if any of the reports must be modified, possibly not.

The cdrpub and publishing programs will have to communicate new information to cdr2gk using either a new calling parameter or a new database entry.

Depending on why we are doing this (i.e., the use case), we might want a permanent store of the decision that was made.

Is it okay to implement method 1, or do you still want method 2? If 2, do you need permanent storage of the name, e.g., in pub_proc_parm?

Do we have a use case for method 2 that isn't adequately covered above?

Comment entered 2016-03-02 00:21:29 by alan

I have implemented Method 1 as described above.

To test, I wrote a test driver that invoked each of the routines that construct the soap transactions but, instead of submitting them to Gatekeeper, I logged the transactions and then inspected them by eye. They looked right to me. I also parsed a few with Firefox and they were okay.
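A test driver along those lines might look like the sketch below. It is not the driver that was actually used: `build_request` is a hypothetical stand-in for the real request builders, it produces only the opening elements shown earlier in this ticket rather than a full SOAP envelope, and the parsing check replaces the inspection by eye.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NS = "http://www.cancer.gov/webservices/"


def build_request(source, tag="Request"):
    """Hypothetical stand-in for one of the request-building routines."""
    return "<%s xmlns='%s'><source>%s</source></%s>" % (
        tag, NS, escape(source), tag)


def check_transaction(xml_text, expected_source):
    """Parse a captured transaction and verify its source element."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    node = root.find("{%s}source" % NS)
    return node is not None and node.text == expected_source


# Capture the transactions instead of submitting them.
captured = [build_request("CDR-DEV", tag)
            for tag in ("Request", "RequestStatus")]

print(all(check_transaction(t, "CDR-DEV") for t in captured))  # True
```

Parsing each captured transaction catches malformed XML as well as a wrong source value, which a visual check could miss.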

I've put the updated code (cdr2gk.py) in svn and in the live \cdr\lib\Python directory on DEV, but have not marked the issue as Resolved yet. Perhaps on Thursday I can do a live test with Volker to be certain it is doing what he needs and doing it correctly.

Comment entered 2016-03-02 13:04:50 by Englisch, Volker (NIH/NCI) [C]

Since we've recently refreshed CDR on DEV, we won't be able to truly test this until one of the DEV Gatekeeper databases is refreshed from PROD. I suggest moving this to QA and testing there.

Comment entered 2016-03-02 13:10:06 by Englisch, Volker (NIH/NCI) [C]

I ran a hot-fix and can confirm that the source field is correctly populated on Gatekeeper.
The part we really want to test is when the Job-IDs are colliding.

By the way, are we keeping the source field as CDR on PROD or will that change to CDR-PROD? We may need to let the WCMS team know in case they are running SQL reports by source if that field is changing.

Comment entered 2016-03-02 14:09:04 by alan

> By the way, are we keeping the source field as CDR on PROD or will that change to CDR-PROD?

You had mentioned that before but I thought it was more consistent to make it CDR-PROD.

However if that's inconvenient I'll change it tomorrow.

Comment entered 2016-03-10 16:17:30 by alan

We decided at a status meeting to leave the production source field as "CDR-PROD".

Comment entered 2016-03-10 16:19:32 by alan

We decided at the status meeting today that the last test - how well this will work when document ID's clash between versions - is more a test of Gatekeeper than the CDR. So I am marking this issue as Resolved Fixed.

Comment entered 2016-03-10 17:32:37 by Englisch, Volker (NIH/NCI) [C]

I can see on the Gatekeeper DEV server that the updated source is listed.
http://gatekeeper-blue-dev.cancer.gov/admin/RequestHistory/RequestHistory.aspx

I'm marking this ticket as verified.

Comment entered 2016-04-06 18:40:08 by Englisch, Volker (NIH/NCI) [C]

> We decided at the status meeting today that the last test - how well this will work when document ID's clash between versions

I've confirmed that the modification prevents the push job from failing in the event that the push-jobID already exists on the server. On DEV, there are two jobs with a CDR jobID=13489 distinguished by the source field.

Since this modification is an improvement for the lower tiers only because we will never have identical jobIDs on PROD, we may actually want to close this ticket.

Comment entered 2016-05-13 14:02:35 by Englisch, Volker (NIH/NCI) [C]

It turns out the source also changed on PROD to CDR-PROD. I had expected to keep the PROD value as is but I can confirm this is working.

Closing ticket.
