Issue Number | 4631 |
---|---|
Summary | [General] Ability to Preserve Testing/Training Documents on DEV |
Created | 2019-06-13 12:20:25 |
Issue Type | Improvement |
Submitted By | Juthe, Robin (NIH/NCI) [E] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2021-01-29 20:09:16 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.245439 |
We'd like to have the ability to preserve documents to be used in testing/training on the DEV tier. Based on our discussion in the last CDR meeting, this could potentially be done by adding an element or attribute to the schema. This would identify which documents should be preserved (actually restored) after a refresh. We would also need to add documents to the toolkit that currently exists for restoring things on DEV. Once we have a firm plan, I'll add additional tickets for the schema change and CSS, if necessary.
We discussed that the following document types should be included in this effort of preserving individual documents:
Summary
Drug Information Summary
Glossary Term Name
Glossary Term Concept
Media
Terminology
The schema for these document types will need to be modified to add the new top-level attribute @KeepAtRefresh.
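A minimal sketch of how a tool might detect the marker. It assumes the attribute sits on the document's root element and that any non-empty value counts as set. The value "Yes", the element names, and the function name `keep_at_refresh` are illustrative assumptions, not taken from the schema changes or the scripts.

```python
import xml.etree.ElementTree as ET


def keep_at_refresh(doc_xml: str) -> bool:
    """Return True if the root element carries a KeepAtRefresh attribute.

    Assumption: any non-empty value marks the document for preservation.
    """
    root = ET.fromstring(doc_xml)
    return bool(root.get("KeepAtRefresh", "").strip())


# Hypothetical sample documents; the element names are placeholders.
marked = '<Summary KeepAtRefresh="Yes"><SummaryTitle>Test</SummaryTitle></Summary>'
unmarked = "<Summary><SummaryTitle>Test</SummaryTitle></Summary>"

print(keep_at_refresh(marked), keep_at_refresh(unmarked))
```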
The following files have been updated:
PullDevData.py
PushDevData.py
cdr_dev_data.py
Drug.xml
GlossaryTermConcept.xml
GlossaryTermName.xml
Media.xml
SummarySchema.xml
TerminologySchema.xml
I have updated the schemas for the identified document types and loaded the changes to DEV. I would like ~oseipokuw to prepare a couple of test documents for each of the document types. So far, I have been testing with summaries and DIS.
The process to restore the test documents relies on document titles being unique within a document type. For summary documents, however, this uniqueness cannot be guaranteed. There are currently about 34 documents with identical titles: some of these are valid (the English and Spanish names are identical), some seem to be blocked documents, and a few may just be test documents on PROD that are listed with an active status. It may be worth taking a look at those.
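The kind of check that would surface such titles can be sketched as follows. This is only an illustration under assumed inputs: `find_duplicate_titles` and the sample rows are hypothetical, and the real data would come from the CDR database rather than a Python list.

```python
from collections import defaultdict


def find_duplicate_titles(docs):
    """Group document IDs by title and return only titles used more than once.

    `docs` is an iterable of (doc_id, title) pairs for a single document type.
    """
    ids_by_title = defaultdict(list)
    for doc_id, title in docs:
        ids_by_title[title].append(doc_id)
    return {t: ids for t, ids in ids_by_title.items() if len(ids) > 1}


# Made-up sample rows, not real CDR documents.
sample = [(1, "Retinoblastoma"), (2, "Retinoblastoma"), (3, "Melanoma")]
duplicates = find_duplicate_titles(sample)
print(duplicates)
```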
The changes have been saved on GitHub:
Note to myself:
Don't forget to add the new attribute to the query term definitions: "//@KeepAtRefresh"
I have marked the following sample documents with the new attribute on DEV.
SUMMARIES:
CDR0000062907
CDR0000574548
GTC
CDR0000619369
CDR0000619365
GTN
CDR0000613891
CDR0000788711
DIS
CDR0000700207
CDR0000717920
MEDIA
CDR0000686617
CDR0000415518
TERMINOLOGY
CDR0000664135
CDR0000657275
During the refresh process I ran into two issues. While running the script PullDevData.py, processing stopped due to the inclusion of prohibited documents (those for which the English and Spanish versions use identical titles). This failure was expected, given the test documents I had prepared for my earlier tests.
The second failure was a bug I found in the code which is related to processing Spanish documents.
Those two issues have been corrected, but the actual database refresh failed because the stored procedure that uses the production backup file for the update is not available.
I submitted a DB ticket for CBIIT and will continue with the refresh on DEV once the problem has been resolved.
The database refresh process completed. ~oseipokuw, you can now check whether the modified documents listed above have been properly preserved.
They look good. The documents marked were correctly preserved. I must have forgotten to mark the 2 GTC documents to keep at refresh. Hopefully, I will have the chance to test these on QA.
I must have forgotten to mark the 2 GTC documents to keep at refresh.
I wish that were the case, but you did not forget to mark the GTC documents. As discussed earlier, during the refresh of the documents our software must use the unique document title to determine whether a document already exists and should be updated, because the CDR-IDs will almost certainly differ for more recently created documents. If the software doesn't find a match for a saved document, that document is restored as a new document. This has happened with the GTC documents. The preserved documents are CDR804096 and CDR804095.
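The update-or-add decision described here can be sketched roughly as below. This is not the code in PushDevData.py or cdr_dev_data.py; `plan_restore`, its arguments, and the sample titles and IDs are hypothetical and only illustrate why a title that fails to match leads to a new document.

```python
def plan_restore(preserved, existing):
    """Decide, per preserved document, whether to update or add it.

    preserved: dict mapping title -> XML saved before the refresh
    existing:  dict mapping title -> document ID on the refreshed tier
    Returns a list of (action, title, doc_id); doc_id is None for adds.
    """
    plan = []
    for title in preserved:
        doc_id = existing.get(title)  # exact title match only
        action = "update" if doc_id is not None else "add"
        plan.append((action, title, doc_id))
    return plan


# Made-up data: the second title differs only by an embedded newline,
# so the exact comparison misses it and the document would be added anew.
preserved = {"Title A": "<Doc/>", "Title B\nsecond line": "<Doc/>"}
existing = {"Title A": 101, "Title B second line": 102}
plan = plan_restore(preserved, existing)
print(plan)
```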
The documents weren't recognized because the title must be converted (replacing newlines with '\n') during the process. I will need to modify the refresh process to handle GTC titles slightly differently from the rest of the bunch.
The following programs have been modified to take care of the issue we experienced with the GTC documents:
PullDevData.py
PushDevData.py
cdr_dev_data.py
It is still possible for a GTC document to be added as a new document because the program doesn't recognize its title. This would happen when the title on PROD includes excessive white space that has not been processed with the modified GTC DocTitle filter (see OCECDR-4955), while the document to be restored on DEV did get processed with the new DocTitle filter. However, this will be rare and will become less likely over time.
I will have to restore the data on DEV again in order to test the modified tools from start to finish.
We've decided to wait for the weekend backup before testing the changes by running through the refresh process again.
The process to restore the test documents relies on document titles being unique within a document type. For summary documents, however, this uniqueness cannot be guaranteed. There are currently about 34 documents with identical titles: some of these are valid (the English and Spanish names are identical), some seem to be blocked documents, and a few may just be test documents on PROD that are listed with an active status. It may be worth taking a look at those.
We had decided to stop the restore process if one of those few summary documents sharing identical titles with another summary had been marked to be preserved. It's the user's responsibility not to pick a document that cannot be restored successfully.
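A guard of this kind might look like the sketch below. It assumes the marked titles and the full list of summary titles are already available as plain lists; `AmbiguousTitleError` and `check_marked_titles` are hypothetical names, not functions from the existing scripts.

```python
from collections import Counter


class AmbiguousTitleError(Exception):
    """Raised when a marked document's title is not unique in its type."""


def check_marked_titles(marked_titles, all_titles):
    """Stop early if any marked title is shared by more than one document."""
    counts = Counter(all_titles)
    for title in marked_titles:
        if counts[title] > 1:
            raise AmbiguousTitleError(
                f"{title!r} is used by {counts[title]} documents; "
                "it cannot be restored reliably"
            )


# Made-up titles for illustration.
titles_on_tier = ["Retinoblastoma", "Retinoblastoma", "Melanoma"]
check_marked_titles(["Melanoma"], titles_on_tier)  # passes silently
```

Failing before any document is written keeps the tier in a known state, which matches the decision to put the responsibility for picking restorable documents on the user.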
The step to add the @KeepAtRefresh attribute has been added to the documentation for refreshing the DB.
I reran a complete CDR DB refresh to test all modifications from start to finish. The diff report that runs as the final step of the refresh listed two GTC documents that did not get updated. In other words, these should have been preserved but weren't. For both documents the title (the first 50 characters of the definition) had changed, so a new document was created for each because the restore process was unable to match the documents by title.
The file CDR619385 had a title on PROD that had extra white space cleaned up while the document on DEV had not been cleaned up and therefore still included the extra white space. The title comparison could not identify these documents as being identical and created CDR805689 as the replacement.
The file CDR619431 had a title "A drug used in the treatment of chronic pain." on DEV but the word "chronic" had been deleted on PROD. This caused the creation of document CDR805690.
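Both mismatches can be reproduced with a small sketch, assuming the title is simply the first 50 characters of the definition and titles are compared for exact equality. The helper `gtc_title`, the PROD wording with "chronic" removed, and the position of the extra space are assumptions made for illustration.

```python
def gtc_title(definition: str, limit: int = 50) -> str:
    """Derive a title from the first `limit` characters of a definition."""
    return definition[:limit]


dev = gtc_title("A drug used in the treatment of chronic pain.")
# Assumed PROD wording after the word "chronic" was deleted.
prod = gtc_title("A drug used in the treatment of pain.")
# Same text with an extra space that was never cleaned up.
spaced = gtc_title("A drug  used in the treatment of chronic pain.")

print(dev == prod)    # edited definition no longer matches
print(dev == spaced)  # extra white space also breaks the exact match
# Collapsing runs of white space makes the second pair comparable again.
print(" ".join(spaced.split()) == dev)
```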
Another issue has been discussed with Bob. Some of the control values of the 'ctl' table will have to be populated from a GitHub repository. I have added a note to the documentation to ensure this step will be done as part of the next refresh.
I am satisfied with the modified refresh process.
This ticket is still open because the changes can't be moved to the master branch without a release. I'm moving this ticket to the Ohm release.
Manually moving schema changes from branch cdr4631-refresh-dev to branch ohm
Hi ~volker, do we need to retest this on DEV? I think we already tested this in an earlier release, but there were some other things you needed to do, which is why you moved it to Ohm.
That's correct! There is one library function that requires a release (Ohm) to be copied. There's no need to test this on DEV again.
Thanks!
Verified on DEV.