Issue Number | 5266 |
---|---|
Summary | [Drupal Import] Potential Improvements to the SVPC Imports from Drupal |
Created | 2023-08-01 18:45:30 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2023-09-05 15:02:15 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.354128 |
Potential Improvements to the SVPC Imports from Drupal
We’ve successfully processed the first batch of SVPC summaries imported from Drupal for the partner summaries, and we’d like to see whether some enhancements can be made to improve the imports and help with processing them in the CDR. It is okay if some of these enhancements cannot be done, especially if they would take a large level of effort (LOE), as we understand that there will always be some manual cleanup. While working with the summaries in the CDR, we came across the issues below, which at least need looking at:
1. External Ref elements have URLs that include Drupal node IDs in the address. Would it be possible to replace the node URLs with the actual public URLs? Example: https://www.cancer.gov/node/63867, as contained in 1246011. In some cases, the same node ID resolves to different URLs for English and Spanish.
2. Could you add the TranslationOf element to just the Spanish summaries and populate it with the title of the corresponding English summary?
3. In some cases, emphasis tags are included in Glossary Term Refs and External Refs. Could the emphasis tags be removed in such cases, since they cause validation errors? Example: 1246011
4. Please use “node/field_date_updated/value” for the Date Last Modified (instead of “node/field_date_posted/value”).
5. In some cases, there are empty Para tags. Would it be possible to remove them? Example: 1245996-es.xml
6. In at least one case, there was a broken-up External Ref. Can this be avoided? Example: 1245996-es.xml
Once again, many of these are just our observations and do not take very long to fix manually, so if any of them would take a long time to address, please ignore them. Also, in terms of timeline, it may take several months to get the next batch ready, so this can be assigned a low priority. Thank you!
I believe all of these are feasible except the last, which needs to be fixed by the authors of the Drupal content.
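For illustration only, here is a minimal sketch of how three of the requested cleanups (node URL replacement, flattening markup inside refs, and dropping empty Para elements) might be applied to an imported summary. It is not the actual import code. The element names `ExternalRef`, `GlossaryTermRef`, and `Para`, the un-namespaced `xref` attribute, and the `public_urls` mapping are all assumptions made for the example, and the sketch flattens any inline markup inside a ref, not only emphasis.

```python
import re
import xml.etree.ElementTree as ET

# Assumed element and attribute names; the real CDR schema may differ.
REF_TAGS = {"ExternalRef", "GlossaryTermRef"}
URL_ATTR = "xref"
NODE_URL = re.compile(r"https://www\.cancer\.gov/node/(\d+)")


def rewrite_node_urls(root, public_urls):
    """Swap Drupal node URLs for public URLs where a mapping is known.

    `public_urls` maps node ID strings to URLs for ONE language, because
    the same node can resolve to different English and Spanish pages.
    Unmapped nodes are left untouched.
    """
    for ref in root.iter("ExternalRef"):
        match = NODE_URL.fullmatch(ref.get(URL_ATTR, ""))
        if match and match.group(1) in public_urls:
            ref.set(URL_ATTR, public_urls[match.group(1)])


def flatten_refs(root):
    """Replace the mixed content of each ref with its plain text."""
    for ref in list(root.iter()):
        if ref.tag in REF_TAGS and len(ref):
            text = "".join(ref.itertext())
            for child in list(ref):
                ref.remove(child)
            ref.text = text


def drop_empty_paras(root):
    """Remove Para elements with no children and no non-blank text.

    Any tail text after a removed Para (normally just whitespace
    between block elements) is discarded along with it.
    """
    for parent in list(root.iter()):
        for child in list(parent):
            is_blank = not (child.text or "").strip()
            if child.tag == "Para" and len(child) == 0 and is_blank:
                parent.remove(child)


def clean_summary(root, public_urls):
    """Apply all three cleanups in place and return the root."""
    rewrite_node_urls(root, public_urls)
    flatten_refs(root)
    drop_empty_paras(root)
    return root
```

A caller would parse each exported summary, pass the mapping for that summary's language, and serialize the result with `ET.tostring`. A broken-up External Ref (item 6) is not handled here, since that has to be corrected in the Drupal content itself.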
I think we can use these node IDs to test: https://tracker.nci.nih.gov/secure/attachment/233637/233637_stomach-cancer-fields-2023061_updated_06092023.xlsx
I need actual node IDs. What you linked to was the mapping document we used to determine how the individual fields should be handled.
Linking isn't working for me so I have attached the file here.
Sorry about that. Please try this one.
JIRA appears to have discarded my previous comment. In case it was only pretending to discard that comment, you may end up with two comments saying basically the same thing.
Instead of giving me a list of node IDs, you gave me a table identifying only the English summaries. Can you explain your reasoning behind this decision (bearing in mind that one of the modifications requested by this ticket is only applicable for Spanish summaries)?
This is what I see in the attached spreadsheet. The first column appears to show the node IDs. Please let me know if that is not what you're looking for or if the IDs do not display for you. Also, my understanding is that Drupal uses the same node IDs for both English and Spanish.
Yes, but you didn't just give me the first column, did you? You gave me four more columns which narrow the list to the English summaries.
Transformed summary XML attached. Note that several of the summaries link to https://www.cancer.gov/node/63867, which does not exist.
Please ignore the other columns.
Done.
For future reference, if you need to provide a list of node IDs for a JIRA ticket, and all you have is a spreadsheet, one of whose columns has the IDs you want to use, with other columns containing extraneous/misleading data, all you have to do (especially if JIRA is making it difficult to attach/link files) is:
1. select the column with the node IDs
2. copy the IDs into the clipboard
3. paste them into a JIRA comment
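The same result can be scripted when the spreadsheet has been exported as CSV. The sketch below is only an illustration: it assumes the node IDs are plain integers in the first column, and the function name `node_ids` is hypothetical. The layout of the actual attached file is not known here.

```python
import csv
import io


def node_ids(csv_text, column=0):
    """Return the numeric values found in one column of CSV text.

    Rows whose value in that column is not all digits (for example,
    the header row or blank lines) are skipped.
    """
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > column and row[column].strip().isdigit():
            ids.append(row[column].strip())
    return ids
```

Joining the returned list with newlines gives a block of IDs that can be pasted directly into a comment.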
Verified on DEV. Thanks!
Marking this as QA verified. I do not think it is necessary to test this again on QA.
File Name | Posted | User |
---|---|---|
Broken Up External Link 1245996.PNG | 2023-08-01 18:40:25 | Osei-Poku, William (NIH/NCI) [C] |
cgov_19035 (4).csv | 2023-10-09 08:52:45 | Osei-Poku, William (NIH/NCI) [C] |
emphasis tags in gtrefs 1246011.PNG | 2023-08-01 18:37:54 | Osei-Poku, William (NIH/NCI) [C] |
Empty Para tags 1246006-es.PNG | 2023-08-01 18:39:02 | Osei-Poku, William (NIH/NCI) [C] |
stomach-cancer-xml-20231009.zip | 2023-10-09 11:01:03 | Kline, Bob (NIH/NCI) [C] |
Stomach Node IDs.PNG | 2023-10-09 10:33:20 | Osei-Poku, William (NIH/NCI) [C] |