PDQ Issues

Issue Number	4631
Summary	[General] Ability to Preserve Testing/Training Documents on DEV
Created	2019-06-13 12:20:25
Issue Type	Improvement
Submitted By	Juthe, Robin (NIH/NCI) [E]
Assigned To	Englisch, Volker (NIH/NCI) [C]
Status	Closed
Resolved	2021-01-29 20:09:16
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.245439

Description

We'd like to have the ability to preserve documents to be used in testing/training on the DEV tier. Based on our discussion in the last CDR meeting, this could potentially be done by adding an element or attribute to the schema. This would identify which documents should be preserved (actually restored) after a refresh. We would also need to add documents to the toolkit that currently exists for restoring things on DEV. Once we have a firm plan, I'll add additional tickets for the schema change and CSS, if necessary.

Comment entered 2021-01-21 19:02:19 by Englisch, Volker (NIH/NCI) [C]

We discussed that the following document types should be included in this effort of preserving individual documents:

Summary
Drug Information Summary
Glossary Term Name
Glossary Term Concept
Media
Terminology

The schema for these document types will need to be modified to add the new top-level attribute @KeepAtRefresh.
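
As a rough illustration of how a tool could recognize a marked document, here is a minimal sketch. It assumes the attribute sits on the document's root element and that the value "Yes" marks a document for preservation. The value and the helper name `keep_at_refresh` are assumptions made for this example and are not taken from the schema.

```python
import xml.etree.ElementTree as ET


def keep_at_refresh(xml_text, flag_value="Yes"):
    """Return True if the root element carries KeepAtRefresh=flag_value.

    flag_value is an assumed marker value, not confirmed by the ticket.
    """
    root = ET.fromstring(xml_text)
    return root.get("KeepAtRefresh") == flag_value


# Hypothetical sample documents, for illustration only.
marked = '<Summary KeepAtRefresh="Yes"><SummaryTitle>Test</SummaryTitle></Summary>'
unmarked = "<Summary><SummaryTitle>Test</SummaryTitle></Summary>"
```

Calling `keep_at_refresh(marked)` returns True, while the unmarked sample returns False.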

Comment entered 2021-01-25 20:19:01 by Englisch, Volker (NIH/NCI) [C]

The following files have been updated:

PullDevData.py
PushDevData.py
cdr_dev_data.py
Drug.xml
GlossaryTermConcept.xml
GlossaryTermName.xml
Media.xml
SummarySchema.xml
TerminologySchema.xml

I have updated the schemas for the document types identified and loaded the changes to DEV. I would like ~oseipokuw to prepare a couple of test documents for the different document types. So far, I have been testing with summaries and DIS.

The process to restore the test documents relies on document titles being unique within a document type. For summary documents, however, this uniqueness cannot be guaranteed. There are currently about 34 documents with identical titles. Some of these are valid (the English and Spanish names are identical), some seem to be blocked documents, and a few may just be test documents on PROD, although those are listed with an active status. It may be worth taking a look at those.
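
One way to list the titles that break this uniqueness assumption is sketched below. The function name and the sample IDs and titles are made up for illustration. In practice the (ID, title) pairs would come from the CDR database, which this sketch does not attempt to query.

```python
from collections import defaultdict


def find_duplicate_titles(docs):
    """Map each title shared by more than one document to its sorted IDs.

    docs is an iterable of (doc_id, title) pairs for one document type.
    """
    by_title = defaultdict(list)
    for doc_id, title in docs:
        by_title[title].append(doc_id)
    return {t: sorted(ids) for t, ids in by_title.items() if len(ids) > 1}


# Hypothetical sample data, not real CDR documents.
sample = [(1, "Breast Cancer Treatment"), (2, "Melanoma"), (3, "Melanoma")]
duplicates = find_duplicate_titles(sample)
```

Here `duplicates` contains only the title shared by documents 2 and 3.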

Comment entered 2021-01-29 20:09:07 by Englisch, Volker (NIH/NCI) [C]

The changes have been saved on github:

Comment entered 2021-02-01 17:42:27 by Englisch, Volker (NIH/NCI) [C]

Note to myself:

Don't forget to add the new attribute to the query term definitions: "//@KeepAtRefresh"

Comment entered 2021-02-10 19:50:19 by Osei-Poku, William (NIH/NCI) [C]

I have marked the following sample documents with the new attribute on DEV.

SUMMARIES:

CDR0000062907

CDR0000574548

GTC

CDR0000619369

CDR0000619365

GTN

CDR0000613891

CDR0000788711

DIS

CDR0000700207

CDR0000717920

MEDIA

CDR0000686617

CDR0000415518

TERMINOLOGY

CDR0000664135

CDR0000657275

Comment entered 2021-02-19 18:05:02 by Englisch, Volker (NIH/NCI) [C]

During the refresh process I ran into two issues. While the script PullDevData.py was running, processing stopped because prohibited documents (those whose English and Spanish versions use identical titles) had been included. This failure was expected, given the test documents I had prepared for my earlier tests.

The second failure was a bug in the code related to processing Spanish documents.

Those two issues have been corrected, but the actual database refresh failed because the stored procedure that uses the production backup file for the update is not available.

I submitted a DB ticket for CBIIT and will continue with the refresh on DEV once the problem has been resolved.

Comment entered 2021-02-25 12:29:42 by Englisch, Volker (NIH/NCI) [C]

The database refresh process completed. ~oseipokuw, you can check whether the modified documents listed above have been properly preserved.

Comment entered 2021-03-08 14:32:38 by Osei-Poku, William (NIH/NCI) [C]

They look good. The documents marked were correctly preserved. I must have forgotten to mark the 2 GTC documents to keep at refresh. Hopefully, I will have the chance to test these on QA.

Comment entered 2021-03-10 21:44:42 by Englisch, Volker (NIH/NCI) [C]

"I must have forgotten to mark the 2 GTC documents to keep at refresh."

I wish that were the case, but you did not forget to mark the GTC documents. As discussed earlier, during the refresh our software must use the unique document title to determine whether a document already exists and should be updated, because the CDR IDs will almost certainly differ for more recently created documents. If the software doesn't find a match for a saved document, that document is restored as a new document. This is what happened with the GTC documents. The preserved documents are CDR804096 and CDR804095.
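
The update-or-add decision described here can be sketched roughly as follows. This is not the code in PushDevData.py or cdr_dev_data.py. The function `plan_restore`, its data structures, and the sample titles and ID are assumptions chosen to show the matching rule.

```python
def plan_restore(preserved, refreshed):
    """Decide, for each preserved document, whether to update or add it.

    preserved: dict mapping title -> XML saved from DEV before the refresh
    refreshed: dict mapping title -> CDR ID found in the refreshed database
    Returns a list of (action, title, doc_id); doc_id is None for "add".
    """
    plan = []
    for title in sorted(preserved):
        doc_id = refreshed.get(title)  # exact title match only
        action = "update" if doc_id is not None else "add"
        plan.append((action, title, doc_id))
    return plan


# Hypothetical data: one title matches, the other does not.
plan = plan_restore(
    {"Aspirin": "<doc/>", "New test doc": "<doc/>"},
    {"Aspirin": 1001},
)
```

Because the match is an exact string comparison, any difference in the stored title sends the document down the "add" path, which is how the two GTC documents ended up with new CDR IDs.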

The documents weren't recognized because the title must be converted (replacing newlines with '\n') during the process. I will need to modify the refresh process to handle GTC titles slightly differently from those of the other document types.

Comment entered 2021-03-12 22:18:27 by Englisch, Volker (NIH/NCI) [C]

The following programs have been modified to take care of the issue we experienced with the GTC documents:

PullDevData.py
PushDevData.py
cdr_dev_data.py

It is still possible for a GTC document to be added as a new document because the program doesn't recognize its title. This would happen if the title includes excessive white space and the document on PROD had not yet been processed with the modified GTC DocTitle filter (see OCECDR-4955), while the copy to be restored on DEV had been processed with the new DocTitle filter. However, this will be rare and will become less likely over time.

I will have to restore the data on DEV again in order to test the modified tools from start to finish.

Comment entered 2021-03-18 16:29:18 by Englisch, Volker (NIH/NCI) [C]

We've discussed this and will wait for the weekend backup before testing the changes by running through the refresh process again.

Comment entered 2021-06-24 11:56:03 by Englisch, Volker (NIH/NCI) [C]

"The process to restore the test documents relies on document titles being unique within a document type. For summary documents, however, this uniqueness cannot be guaranteed. There are currently about 34 documents with identical titles. Some of these are valid (the English and Spanish names are identical), some seem to be blocked documents, and a few may just be test documents on PROD, although those are listed with an active status. It may be worth taking a look at those."

We had decided to stop the restore process if one of those few summary documents sharing identical titles with another summary had been marked to be preserved. It's the user's responsibility not to pick a document that cannot be restored successfully.
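
A minimal sketch of such a guard is shown below, assuming the check runs before any document is restored. The exception class, the function name, and the sample titles are hypothetical and do not reflect the actual implementation.

```python
from collections import Counter


class DuplicateTitleError(Exception):
    """Raised when a document marked for preservation has a shared title."""


def check_preservable(marked_titles, all_titles):
    """Stop if any marked title occurs more than once in the doc type.

    marked_titles: titles of documents flagged to be preserved
    all_titles: every title of that document type (duplicates included)
    """
    counts = Counter(all_titles)
    ambiguous = sorted(t for t in set(marked_titles) if counts[t] > 1)
    if ambiguous:
        raise DuplicateTitleError(
            "cannot restore documents with non-unique titles: "
            + ", ".join(ambiguous)
        )


# Hypothetical titles: "Melanoma" is shared by two documents.
titles = ["Breast Cancer Treatment", "Melanoma", "Melanoma"]
check_preservable(["Breast Cancer Treatment"], titles)  # passes silently
```

Calling `check_preservable(["Melanoma"], titles)` raises `DuplicateTitleError`, which halts the process before an ambiguous match can be made.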

Comment entered 2021-06-24 12:06:16 by Englisch, Volker (NIH/NCI) [C]

The step to add the @KeepAtRefresh attribute has been added to the documentation for refreshing the DB.

Comment entered 2021-07-16 18:44:04 by Englisch, Volker (NIH/NCI) [C]

I reran a complete CDR DB refresh to test all modifications from start to finish. The diff report that runs as the final step of the refresh listed two GTC documents that did not get updated. In other words, these should have been preserved but weren't. For both documents the title (the first 50 characters of the definition) had changed, so a new document was created for each because the restore process was unable to match the documents by their titles.

The file CDR619385 had a title on PROD that had extra white space cleaned up while the document on DEV had not been cleaned up and therefore still included the extra white space. The title comparison could not identify these documents as being identical and created CDR805689 as the replacement.

The file CDR619431 had a title "A drug used in the treatment of chronic pain." on DEV but the word "chronic" had been deleted on PROD. This caused the creation of document CDR805690.
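
The two mismatches can be reproduced with a small sketch. It assumes, as stated above, that a GTC title is the first 50 characters of the definition. The whitespace sample strings are invented, and the `normalized` helper is shown only for comparison. It is not the DocTitle filter fix referenced in OCECDR-4955.

```python
def gtc_title(definition, length=50):
    """Derive a title from the start of the definition text."""
    return definition[:length]


def normalized(title):
    """Collapse runs of white space (illustrative helper only)."""
    return " ".join(title.split())


# Case 1: same wording, different white space (invented sample text).
dev_ws = gtc_title("A  drug used\n in the treatment of pain.")
prod_ws = gtc_title("A drug used in the treatment of pain.")

# Case 2: a word was deleted on PROD (title quoted from this ticket).
dev_edit = gtc_title("A drug used in the treatment of chronic pain.")
prod_edit = gtc_title("A drug used in the treatment of pain.")

exact_ws_match = dev_ws == prod_ws
loose_ws_match = normalized(dev_ws) == normalized(prod_ws)
loose_edit_match = normalized(dev_edit) == normalized(prod_edit)
```

A whitespace-insensitive comparison would pair up the first case, but no title comparison can recover the second, because the wording itself changed.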

Another issue has been discussed with Bob. Some of the control values of the 'ctl' table will have to be populated from a GitHub repository. I have added a note to the documentation to ensure this step will be done as part of the next refresh.

I am satisfied with the modified refresh process.

Comment entered 2021-08-10 20:44:52 by Englisch, Volker (NIH/NCI) [C]

This ticket is still open because the changes can't be moved to the master branch without a release. I'm moving this ticket to the Ohm release.

Comment entered 2021-10-21 18:58:23 by Englisch, Volker (NIH/NCI) [C]

Manually moving schema changes from branch cdr4631-refresh-dev to branch ohm

https://github.com/NCIOCPL/cdr-server/commit/47e849

Comment entered 2022-06-08 12:55:32 by Osei-Poku, William (NIH/NCI) [C]

Hi ~volker, do we need to retest this on DEV? I think we already tested this in an earlier release, but there were some other things you needed to do, which is why you added it to Ohm.

Comment entered 2022-06-08 14:58:07 by Englisch, Volker (NIH/NCI) [C]

That's correct! There is one library function that requires a release (Ohm) to be copied. There's no need to test this on DEV again.

Comment entered 2022-06-08 15:30:03 by Osei-Poku, William (NIH/NCI) [C]

Thanks!

Verified on DEV.
