Import of Genetics Professional Records Into the CDR
Table of Contents
1 Overview
This proposal recommends using the existing Person document type to store records imported from the CIAT Genetics Professional system. This document describes the mapping which will be used for this conversion.
2 Use of Existing Documents
There are approximately 600 GP records in the CIAT database. For about eight percent of these records matches can be found with existing Person documents based on automated comparison of name and address information. It is likely that with some additional manual research more matches could be identified. Unless ICRDB decides to ignore these existing documents and create independent documents for the conversion, we will want to create a spreadsheet in which the matches to be used for the conversion are recorded with IDs and indication of which name and address information show be preserved in the case of discrepancies.
3 Mapping to CDR Document Elements
This section of the document identifies how each element in the Person schema will be populated for a newly created or updated document during conversion, indicating those elements and blocks of elements which will be omitted (or preserved in the case of the use of an existing document). For portions of the schema which will be modified for the purpose of accommodating the GP information, these modifications are identified here as well.
3.1 PersonNameInformation
3.1.1 GivenName
Mapped from tblMain.First_Name, with trailing middle initial stripped off. We'll leave names with spaces where the last "word" is not a capital letter followed by a period (e.g., Michele Bruno or H. James), following the practice for persons currently entered into the CDR with such names.
3.1.2 MiddleInitial
Mapped from tblMain.Init, which contains initials for all of the words in the person's name except the surname. For all but five of the records in the current database the contents of this column could be generated programmatically from the words in the First_Name column. For those which don't have predictable values in the Init column, we will generate a report so that the name elements in the document can be reviewed and fixed manually. For the others, if the First_Name column matches the pattern <word> <initial> we will put the word in the GivenName element and the initial in the MiddleInitial element; otherwise we will put the entire content of the First_Name column into the GivenName element and omit the MiddleInitial element.
3.1.3 SurName
Mapped from tblMain.LastName.
3.1.4 Prefix
Omitted (no comparable data in CIAT database).
3.1.5 GenerationSuffix
Mapped from tblMain.Suffix (with normalization to strip out spaces and punctuation so the values match the schema).
3.1.6 ProfessionalSuffix
Included if any degrees are listed for the GP; otherwise omitted.
3.1.6.1 StandardProfessionalSuffix
Mapped from tblDegree.Degree if the value being imported matches one of the valid values in the schema for standard professional prefixes (after normalization to strip out spaces and punctuation). The DegreeSeq column in the tblDegree table will be used to order the elements if there are multiple standard degrees for the person.
3.1.6.2 CustomProfessionalSuffix
Mapped from tblDegree.Degree for values which (after normalization) don't match the list of valid values in the schema. See previous note about ordering of multiple values.
3.1.7 NameFormat
Omitted (no comparable data in CIAT database).
3.1.8 Comment
"Converted from CIAT database on <date of conversion>."
3.2 ProfessionalInformation
This element will always be created or preserved during the conversion, since at least some of the children of GeneticsProfessionalDetails will be present for all records converted.
3.2.1 PDQBoardMembershipDetails
Preserved when merging GP information into an existing Person document; otherwise omitted.
3.2.2 Physician Details
Preserved when merging GP information into an existing Person document; otherwise omitted.
3.2.3 GeneticsProfessionalDetails
Included in all converted documents.
3.2.3.1 AdministrativeInformation
Included in all converted documents.
3.2.3.1.1 Directory
Included in all converted documents, assuming we come up with a usable approach for populating the Include element (besides, it's required by the schema).
I haven't found any indication of a status column in the GP database, so I assume that we'll always populate this element with the value of "Include" provided the criteria in the schema comment are met. We'll need to discuss the wording of these criteria, as there is a little ambiguity currently. Here's what the comment currently says:
If the person is marked "Include" for the Genetics Directory the person should have at least 1 GeneticsBoardCertification element, for each CertificationStatus=Eligible the person should have EligibilityYear. These rules should be forced outside the schema. The appropriate error messages should pop-up in case any of these validations fail.
I see a value of "Delete" in the lookup table for event type, but the tblHistory table which records events has no rows with links to the "Delete" value, so presumably CIAT just wipes out the data for GPs who are no longer active or alive - not my favorite way to archive older data.
I also thought at first that the 'On_Line' and 'Off_Line' values in the lEvents table would provide some useful information about which records were active and which were inactive, but that was a dead-end as well, since those values are never used in the audit history table.
We'll use the "GPImport" CDR account here.
Populated with the date of the conversion.
"Imported from CIAT GP database."
3.2.3.1.2 PreferredContactMode
Mapped from tblMain.ContactBy (but first we'll need to have CIAT tell us what the magic values "1" and "2" mean in this column). If the record has a value which means contact electronically, we'll include this element and populate it with the value "Electronic"; otherwise we'll omit the element.
3.2.3.2 GeneticsSpecialtyInformation
This block will be included for all converted documents.
3.2.3.2.1 GeneticsSpecialtyName
Omitted in conversion, as there is no good match with the incoming data. It seems odd that this element is both required and that only one occurrence is allowed. Is a GP really required to have a specialty? Is it really impossible for a GP to have more than one special area of expertise? Let's discuss this element and decide whether modifications need to be made here.
3.2.3.2.2 GeneticsBoardCertification
Mapped from tblBoard.Code -> lBoard.Desc; we'll need to add a couple of values to the schema type ("Oncology Nursing" and "Molecular Genetic Pathology").
Mapped from tblBoard.Board_Certified_Verified (1 -> Yes, 0 -> No).
Valid values are "Certified" and "Eligible"; set to "Certified" if tblBoard.Board_Certified is true; otherwise, if tblBoard.Board_Eligible is true, set to "Eligible"; if both are false (as is currently true for 94 of the rows in the tblBoard table) are we supposed to just discard the information about the board? If so, what does the presence of these rows in the database mean?
Not sure what to do with this element, because the closest match in the CIAT database is simply Board_Year in the tblBoard table, with no allusion to eligibility. Do we add a BoardYear element, using that for the year of certification, mapping the existing data into this element, and keep EligibilityYear to record when a GP would be eligible for certification (in future data maintenance)? Why is eligibility so significant in this context?
"Converted from CIAT database on <date of conversion>."
3.2.3.2.3 StateCertification
There aren't many rows with real values (about a dozen) so I don't really have a good feel for whether this type of information is actively maintained. If not we can move this to the LegacyGeneticsData block.
Mapped from tblMain.Certificat_State [sic]; link to PoliticalSubUnit document.
Mapped from tblMain.Certificat_Number [sic].
3.2.3.3 InterdisciplinaryTeamMember
Mapped from tblMain.Team_Member.
3.2.3.4 GeneticsTeamServices
Not sure exactly what the presence of "Team" implies in the name of this element, but I assume this should be mapped from tblExp_Provided.Code -> lExp_Provided.Desc, which is the closest match in the incoming data; we will need to expand the list of valid values in the schema to accommodate the following values:
- Patient cancer risk assessment
- Genetic susceptibility testing (including information on limitations, specific tests available, and regulations concerning testing procedures such as CLIA and CAP/ACMG)
- Appropriate pre- and post-test counseling and informed consent (including ethical, legal, and social issues related to testing and disclosure of test results)
- Follow-up plan of care (including medical care, psychological support, and counseling about options for prevention or early detection guidelines)
Let's talk about changing the name to "GeneticsTeamService" or "GeneticsService" or "ServiceProvided" or anything which isn't plural, as this is yet another example of a multiply-occurring element, each occurrence of which represents a single service, not all of the services provided.
Also, we either need to make the element optional, or find values for the GPs which have no rows in the Exp_Provided table (about 1/10 of the records).
3.2.3.5 ProvidesServices
Mapped from tblMain.Provide_Service.
3.2.3.6 AcceptsCalls
Mapped from tblMain.Can_Call.
3.2.3.7 ServiceLimitations
Mapped from tblMain.Limition_Desc [sic].
3.2.3.8 HereditaryCancerSyndrome
Required (though this is optional in the current GP schema), multiply-occurring replacement for FamilialCancerSyndromes element. For one thing, "Syndromes" was inappropriate, because each occurrence represented a single syndrome. Another proposed change is that instead of linking directly to a Term document, I am recommending that this element link to a new document type with the following structure:
/HereditaryCancerSyndrome /SyndromeName [required] /SyndromeCancerTypeWithSites [required, multiple] /SyndromeCancerType [required, link to Term document] /SyndromeCancerSite [required, multiple, link to Term documents or plain text controlled by valid value list in the schema; see note below]
These documents will be generated at the time of conversion using the tblSyndroms tblSite_Syndrom, tblCencerSite [sic] and tblCancerType tables. We will either need to add some Term documents to the repository and/or come up with mapping to different cancer types, because six of the thirteen cancer types in the CIAT database don't have matching term documents:
- Digestive/Gastrointestinal [Cancer]
- Genitourinary [Cancer]
- Germ cell [Cancer]
- Gynecologic [Cancer]
- Hematologic [Cancer]
- Neurologic [Cancer]
In addition, we'll need to do some cleanup of the data in the tblCencerSite table, which - in spite of its name - has values for things other than cancer sites. Some of the values are indeed straightforward identifications of the location in the body where cancer can be found (bladder, thyroid, larynx, tongue, etc.). Others can easilty be made to represent an actual site by simply stripping the trailing word "cancer" from the value (e.g., "lung cancer" or "breast cancer"). Others are the result of a garbling of the concepts represented in the tables, identifying the nature of the cancer rather than the location (e.g. "Hodgkin's lymphoma" and "Non Hodgkin's lymphomas"). This all assumes that "site" doesn't mean something different to a genetic oncologist than it does to the rest of the world.
Once the data has been cleaned up, and/or we construct a mapping table which will convert the existing non-site values into values which really represent sites, we'll need to decide whether it makes more sense to fill in missing Term documents for values in the tblCencerSite table, or come up with additional mapping logic to replace those values with values which do match existing Term documents, or some combination of those two approaches, or not use links to term documents at all for the sites, but instead use a schema-controlled list of valid values. I would prefer to use links to term documents unless we run into problems with creating terms for the sites.
The original definition for this element, with its link directly to Term documents, would have required modifications to the Term schema in order to support representation of the relationships between syndromes, cancer types, and cancer sites. It might be possible to use that approach without running into the problem of syndromes, cancer types, and sites with the same preferred term name, but it wouldn't be possible to make that determination until we have untangled the mess described above with the muddled "site" values. I have therefore deferred the analysis required to determine the feasibility of the original approach of linking directly to Term documents, and even without having done that complete analysis I strongly suspect that the approach I am recommending will be cleaner, easier to maintain, and easier to comprehend.
We need to determine whether we will strip the word "Cancer" from the preferred name of the cancer type terms when we publish the GP documents, in order to preserve the values we are currently exporting. Similarly, we will need to determine whether we will switch to exporting the values using the capitalization of the preferred names for the linked terms, or to come up with some algorithm which could be implemented by the publishing XSL/T scripts and which would avoid capitalizing words which the current GP data does not capitalize ("cell" in "Germ cell" and "and" in "Head and Neck") and would recognize the slash character as a word separator ("Digestive/Gastrointestinal"). If the decision is made that we must preserve the current capitalization of the cancer type values, the simplest approach might be for the XSL/T publsihing scripts to have hard-coded mapping logic. My own preference would be to simply export the preferred term names as they are found in the linked documents.
The mapping for the links to the new HereditaryCancerSyndrome documents will be based on the tblMain_Syndrom table.
3.2.3.9 MemberOfGeneticsSociety
Mapped from tblMembership.Code -> lMemberHsip.Desc (after first deciding whether to add the trailing initials in parentheses (e.g., ACMG) to the valid value list in the schema or strip them from the mapped data).
3.2.3.10 LegacyGeneticsData
This block is for data we will preserve from the conversion, but which will not need to be maintained.
3.2.3.10.1 LegacyID
Mapped from tblMain.ID.
3.2.3.10.2 OracleID
Mapped from tblMain.OracleID.
3.2.3.10.3 OriginalCreationDate
Mapped from tblMain.DateEntry.
3.2.3.10.4 OriginalCreationUser
Mapped from tblMain.UserID_CR.
3.2.3.10.5 LastUpdateBeforeConversion
Mapped from tblMain.DateUpdated.
3.2.3.10.6 LastUpdateUserBeforeConversion
Mapped from tblMain.UserID_UP.
3.2.3.10.7 LegacyUserName
Mapped from tblMain.UserName.
3.2.3.10.8 LegacyPassword
Mapped from tblMain.Password.
3.2.3.11 Comment
Mapped from tblMain.Comment.
3.3 PersonLocations
Included for all converted documents (required by the schema).
3.3.1 Home
Preserved when merging GP information into an existing Person document; otherwise omitted.
3.3.2 PrivatePractice
Preserved when merging GP information into an existing Person document; otherwise omitted.
3.3.3 OtherPracticeLocation
For all documents, we will assemble contact information from the columns of tblMain which carry address or other contact data, as well as from the tblPractice table. We will then de-duplicate the contact blocks (the contact data from the tblMain table is sometimes duplicated in one of the rows of the tblPractice table for the GP). We will then eliminate any blocks which are already present in documents for existing Person documents into which GP information is being merged (this step will be performed by following links to organization contact information to determine whether the GP contact information is already represented). The blocks which remain will be inserted into the Person documents as OtherPracticeLocation blocks.
We need to decide whether to add new contact information blocks by inserting them into Organization documents with fragment links to the new block in the organization document, or by linking to the organization documents as a whole (without a fragment ID), and instead putting the contact information in the Specific… elements. It would even be possible, in theory, to come up with logic which would identify the closest match in an existing location block, and use the Specific… elements to surgically override only the minimum number of specific pieces of information, but the complexity required for this approach would overwhelmingly outweight the benefits.
We also need to decide how we will identify organizations which are already in the CDR, and which will need to be created for the conversion. Part of this step will involve figuring out what to do with such silly values as "Cancer Center" or "Breast Center" as organization names.
We need to decide whether the missing organizations will be created manually by CIAT, or by the conversion software. If by the software, we will need to come up with rules for which elements get populated with what information in the new Organization documents.
3.3.3.1 OrganizationLocation
Link to organization document, possibly with fragment ID for location block within the organization document (see notes above). Mapped from tblMain.Institution and tblPractice.LocationName.
3.3.3.2 PersonTitle
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.3.3 PersonRole
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.3.4 SpecificPostalAddress
Used if we decide not to insert new contact information in Organization documents.
3.3.3.4.1 Street
Mapped from tblMain.Address[1-3] and tblPractice.Address[1-3].
3.3.3.4.2 City
Mapped from tblMain.City and tblPractice.City. Most values are currently stored all uppercase; all but one can be converted programmatically by capping the first letter of each word and lowercasing the remaining letters in the words; the exception is Coeur d'Alene.
3.3.3.4.3 PoliticalSubUnit_State
Mapped from tblMain.State and tblPractice.State. All but 'XX' can be mapped from the two-letter standard abbreviation; the 'XX' values will have to be looked up manually for addresses (such as those in Canada) which should have a real value, as well as for those which should have no occurrence of the PoliticalSubUnit_State element.
3.3.3.4.4 Country
Mapped from tblMain.Country and tblPractice.Country.
3.3.3.4.5 PostalCode_ZIP
Mapped from tblMain.Zip5, tblMain.Zip4, tblPractice.Zip5, and tblPractice.Zip4.
3.3.3.4.6 @Public
Mapped from tblMain.PostToWeb (need to find out if that's what this column is really used for).
3.3.3.5 SpecificPhone
Mapped from tblMain.Phone and tblPractice.Phone.
3.3.3.6 SpecificTollFreePhone
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.3.7 SpecificFax
Mapped from tblMain.Fax.
3.3.3.8 SpecificEmail
Mapped from tblMain.EMail, with @Public mapped from tblMain.PostEMailToWeb.
3.3.3.9 SpecificWebSite
Mapped from tblMain.WebSite.
3.3.3.10 ComplexAffiliation
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.3.11 @cdr:id
Populated with a generated unique value; remembered for the tblMain block in new Person documents as the value to be inserted into CIPSContact below.
3.3.3.12 @PdqKey
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.3.13 @PreviousLocation
Omitted (optional in the schema, with no corresponding information in the GP tables).
3.3.4 CIPSContact
Preserved for existing Person documents. For new documents the value of the @cdr:id attribute generated for the location block from the tblMain row will be inserted.
3.3.5 UpdateMode
We already have PreferredContactMode; do we also need to populate the optional UpdateMode element for the GP documents? Do the GPs get Person mailers as well as GP mailers?
3.3.6 Comment
For new documents: "Converted from CIAT database on <date of conversion>."
3.4 ProfessionalType
Mapped from tblProf_Type.Code -> lProf_Type.Desc. Optional, multiply-occurring new element with the following value list:
- Clinical Psychologist (Ph.D., Psy.D.)
- Clinical Social Worker (M.S.W., D.S.W.)
- Genetic Counselor (M.S., M.Sc., M.A., C.G.C.)
- Geneticist (Ph.D.)
- Physician (M.D., D.O., or foreign equivalent)
- Nurse (R.N., B.S.N., M.S.N., M.S., M.A., Ph.D.)
- Other
3.5 Status
Required, singly-occurring element. Preserved for existing documents, inserted for new documents.
3.5.1 CurrentStatus
For new documents: 'Active' (see discussion above on Directory/Include element).
3.5.2 EnteredBy
For new documents we will use the GPImport account.
3.5.3 EntryDate
Date of conversion (for new documents).
3.5.4 Comment
For new documents: "Converted from CIAT database on <date of conversion>."
3.6 DatedAction
This block is optional in the Person schema, and there is no information in the CIAT database corresponding to it.
3.7 PdqKey
Since these documents are not being converted from the Oracle PDQ database, they have no legacy PDQ keys.
3.8 DateLastModified
Mapped from tblMain.DateUpdated.
4 Tables/Columns Not Used in the Conversion
4.1 lCountry
This is a lookup table for countries. The CDR already has this information in the Country documents.
4.2 lState
This is a lookup table for states. The CDR already has this information in the PoliticalSubUnit documents.
4.3 lEvents
This is the lookup table for event names. See question below about whether to preserve mailer history.
4.4 lZip
We've got our own tables for zip-code validation.
4.5 tblEmailVer
This appears to be a temporary working table used to hold ephemeral data during emailer processing. There's only one row in the table currently.
4.6 tblHistory
See question below about whether to preserve mailer history, as well as questions above (see Directory/Include) about status.
4.7 tblMain.Have_Limition [sic]
We don't need to do anything with this column, which always contains 0 when tblMain.Limition_Desc is empty, and 1 if Limition_Desc has a value.
4.8 tblOwner
Don't know what this table was used for, except that it had columns for name and address information. There are no rows currently in the table.
4.9 tblReportVer
This table may be needed for converting the mailer system, and possibly for generating historical mailer documents, but I don't think we'll want to store this information (about per-GP mailer schedules) in the newly converted Person documents.
4.10 tblRptQCReport
This looks like a working table for QC reports; there's just one row in the table.
4.11 tblRptSummary
Another working table used by the code to report on mailer activity.
4.12 tblRptVerificationReport
Working table for reports.
4.13 tblSystem
Used for capturing run-time state values (such as which GP record is currently being processed, and whether and where to log error messages).
4.14 tblUsers
There are only nine rows in this table. If the "When" column, which comes directly after the "LogIn" column indicates the date the user logged into the GP system most recently, then the table is obsolete, as the dates are either NULL or in the previous century. Let's have CIAT tell us if there are any users who aren't already in the CDR who need to be added, and which CDR users need authorization to work with Person documents.
5 Other Considerations
5.1 Valid Value Tables
We'll want to do a last-minute check to ensure that the lookup tables at the time of conversion are reflected accurately in the schema valid values lists.
- lBoard (GeneticsBoardName)
- lExp_Provided (GeneticsTeamServices)
- lMemberShip (MemberOfGeneticsSociety)
- lProf_Type (ProfessionalType)
5.2 Mailer History
We could generated Mailer documents from the rows in the tblHistory table (possibly using information from other tables as well), similar to what we did for the conversion from Oracle. Do we want to do this?
6 Action Items
- [ ] Decide whether to adopt approach of using existing Person doctype
- [ ] Decide whether to use existing Person documents
- [ ] Decide how to handle Directory/Include element
- [ ] Find out what "1" and "2" mean in tblMain.ContactBy
- [ ] Discuss possible modifications of GeneticsSpecialtyName element
- [ ] Answer questions about CertificationStatus
- [ ] Answer questions about EligibilityYear
- [ ] Answer questions about GeneticsTeamServices
- [ ] Decide whether to adopt proposed changes for HereditaryCancerSyndrome
- [ ] Decide whether to strip trailing abbreviations for societies
- [ ] Decide whether to put new contact blocks in the Organization docs
- [ ] Determine what tblMain.PostToWeb is really used for
- [ ] Identify organizations which need to be added to the CDR
- [ ] Decide how new Organization documents will be created
- [ ] Decide if UpdateMode element is needed
- [ ] Decide if we want to generate Mailer documents from tblHistory
- [ ] Decide whether to move StateCertification to LegacyGeneticsData block
- [ ] Add any new users to the CDR with appropriate permissions
Date: 2009-06-29 13:51:22
HTML generated by org-mode 6.28d in emacs 22