CDR Tickets

Issue Number 5245
Summary Importing SVPC summaries from Drupal into the CDR
Created 2023-06-02 10:06:46
Issue Type New Feature
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2023-07-13 09:38:58
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.347667
Description

Starting with Stomach Cancer SVPC summaries, summaries were created in Drupal instead of the CDR. With this approach, there is no efficient way to update the partner summaries without either creating new SVPC summaries in the CDR or recreating the partner summaries from scratch by manually copying data from Drupal. Either option would be too time-consuming. We would want to be able to get programming help in importing relevant data from Drupal into the CDR for these summaries and subsequent summaries that are created in Drupal. Attached to this ticket is the spreadsheet with the mapping of the elements/fields/values. Please let me know if you have any questions.

stomach-cancer-fields-2023061.xlsx

Comment entered 2023-06-08 11:28:18 by Kline, Bob (NIH/NCI) [C]

Some of the mappings you have requested are puzzling. For example:

  • you've asked for node/default_langcode/value (the flag indicating whether this page is in the default language for the node) to be copied to the Summary/SummaryMetaData/SummaryLanguage element's text content; why would you want a "1" or a "0" in that element?

  • you've asked for node/field_browser_title/value to be copied to Summary/AltTitle but you don't want the node/field_card_title/value (even though the CancerTypeHomePage AltTitle is required)

I'll hold off on any further work on the scripting to give you a chance to go over your mappings carefully to ensure that they're asking for what you really want.
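
To make the concern about the language flag concrete, here is a minimal sketch of a field-to-element mapping of the kind the spreadsheet describes. It is not the actual import script. The node dictionary layout, the `FIELD_MAP` table, the `LANGUAGES` lookup, and the use of the node's `langcode` field are all assumptions made for illustration; only the Summary/SummaryMetaData/SummaryLanguage and Summary/AltTitle paths come from the mappings discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Drupal field names to CDR element names.
FIELD_MAP = {
    "title": "SummaryTitle",
    "field_browser_title": "AltTitle",
}

# Assumed lookup from Drupal language codes to CDR language names.
LANGUAGES = {"en": "English", "es": "Spanish"}


def first_value(node, field):
    """Return the first 'value' stored for a Drupal field, or None."""
    items = node.get(field) or []
    return items[0].get("value") if items else None


def build_summary(node):
    """Create a skeletal Summary element from a Drupal node dictionary."""
    root = ET.Element("Summary")
    meta = ET.SubElement(root, "SummaryMetaData")
    # Translate the language code instead of copying a raw 1/0 flag.
    code = first_value(node, "langcode")
    ET.SubElement(meta, "SummaryLanguage").text = LANGUAGES.get(code, code)
    for field, tag in FIELD_MAP.items():
        value = first_value(node, field)
        if value is not None:
            ET.SubElement(root, tag).text = str(value)
    return root
```

With a table like this, a questionable mapping shows up as a single entry that can be reviewed and corrected without touching the conversion logic.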

Comment entered 2023-06-08 12:28:01 by Kline, Bob (NIH/NCI) [C]

I've posted the XML generated by your current mappings (working just from the mappings on the nodes page, skipping over the other two tabs on the spreadsheet, since there's no guidance yet provided for what to do with the HTML markup), in the hope that it might be helpful as you refine your mappings.

Comment entered 2023-06-08 13:12:42 by Kline, Bob (NIH/NCI) [C]

I went ahead and added the sections to give you a better picture of the work that needs to be done mapping what needs to happen with the HTML markup.

Comment entered 2023-06-09 12:22:53 by Osei-Poku, William (NIH/NCI) [C]

I have attached the updated spreadsheet and added information under the HTML Markup tab. I also made some minor changes in the Nodes table.

I used only one of the generated XML files to complete the spreadsheet because I thought it was representative of all of them. It is likely that I may have to review the other ones if I did not provide you with the needed information.

In all cases, please do not display the HTML tags and attributes in the XML if possible.

Some of the information is repeated in the spreadsheet so please let me know if you need any clarification. 

Also, please let me know if this is what you expect or not and I will be happy to revise it. Thanks
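
For the request to keep HTML tags and attributes out of the XML, one possible approach is sketched below using only the Python standard library. The `TextExtractor` class and `strip_markup` function are hypothetical names, and the whitespace normalization is an assumption; the real conversion may need to preserve some structure (lists, emphasis) rather than flatten everything to text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content while discarding tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; entities are already decoded.
        self.chunks.append(data)


def strip_markup(html):
    """Return the text of an HTML fragment with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```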

Comment entered 2023-06-09 12:25:53 by Osei-Poku, William (NIH/NCI) [C]
Comment entered 2023-06-09 18:54:28 by Kline, Bob (NIH/NCI) [C]

New XML documents have been generated and attached.

Comment entered 2023-06-09 23:22:44 by Osei-Poku, William (NIH/NCI) [C]

Thanks, Bob! This is looking really great. 

  1. I have attached an updated spreadsheet with the yellow highlights under the Node tab.

  2. For 1245761.xml, it looks like part of the last Summary Section in the document with the Section Title "Environmental and occupational exposures". There is no data under the Summary Title in the XML and I couldn't figure out why.

  3. Could you please include https://www.cancer.gov/types/stomach/diagnosis in the next set of XML files? I think the node ID is 1245996. It contains a media document and I would like to know how that would come out in the XML.

Thanks

Comment entered 2023-06-09 23:27:48 by Osei-Poku, William (NIH/NCI) [C]

Attaching the right file.

Comment entered 2023-06-09 23:30:01 by Osei-Poku, William (NIH/NCI) [C]
Comment entered 2023-06-12 08:05:07 by Kline, Bob (NIH/NCI) [C]

Fresh set added. Perhaps your copies of the zip files are corrupted, but I've been including 1245996 all along. The CdrDocCtl block is not stored in the CDR but is instead created on the fly when a request for the document is received from XMetaL.

Comment entered 2023-06-12 08:51:57 by Osei-Poku, William (NIH/NCI) [C]

Thanks Bob! This looks good. It looks like we've gotten everything we need. The images don't have CDR IDs in Drupal, so we wouldn't be able to match them like the glossary documents. I believe this is the only thing missing from the XML.

How can we get this into the CDR? I would like some of the editors to review them in the CDR. I tried copying one into QA but I had to make a lot of modifications for it to work. That was why I asked you to add the CdrDocCtl.

Comment entered 2023-06-12 10:23:03 by Kline, Bob (NIH/NCI) [C]

[JIRA ate the previous comment.]

Installed on CDR DEV as

  • CDR807335

  • CDR807336

  • CDR807337

  • CDR807338

Comment entered 2023-06-12 12:48:34 by Kline, Bob (NIH/NCI) [C]

I tried copying one into QA but I had to make a lot of modifications for it to work. That was why I asked you to add the CdrDocCtl.

Ah, I don't see how I could have possibly deduced that from the requirements implied by the ticket's description.

We would want to be able to get programming help in importing relevant data from Drupal into the CDR for these summaries

That sounded very much like you wanted the documents to be imported programmatically. I have replaced the XML set.

Comment entered 2023-06-12 17:49:30 by Osei-Poku, William (NIH/NCI) [C]

In reviewing the documents on DEV, it looks like the Summary Key Words (SummaryKeywords/SummaryKeyword) block has a capitalization issue, with the "W" in keyword not capitalized, which prevents the document from validating. I was able to manually correct them. Other than that, everything else looks good and it looks like we get 99.9% of the summary in there. The only piece that needs to be created manually is the Media document. Thank you!!!

Comment entered 2023-06-13 14:46:56 by Osei-Poku, William (NIH/NCI) [C]

Please generate another set of XML files for me when the Summary Key Word element is fixed. Thanks!

Comment entered 2023-06-13 18:43:11 by Kline, Bob (NIH/NCI) [C]

Fresh set posted.

Comment entered 2023-06-13 20:59:07 by Osei-Poku, William (NIH/NCI) [C]

For the MainTopics block, please include the child element (Terms). Thanks!

Also, if possible, please add the new attribute created in OCECDR-5251:

Do-Not-Push-To-Drupal = "Yes"

Comment entered 2023-06-14 09:25:19 by Osei-Poku, William (NIH/NCI) [C]

Providing us with a well-formed set of XML files that we can import into the CDR to create a summary document should be good. Thanks!

Comment entered 2023-06-15 13:20:37 by Kline, Bob (NIH/NCI) [C]

Fresh set posted.

Comment entered 2023-06-21 15:28:27 by Osei-Poku, William (NIH/NCI) [C]

  1. Please correct the MainTopics child element to "Term" instead of "Terms", which I incorrectly provided earlier.

  2. Also, since we are no longer going to need to make any filter changes, please remove the Do-Not-Push-To-Drupal = "Yes" attribute from the XML.

  3. Please include the Module Only attribute in the XML as ModuleOnly = "Yes".

  4. Please generate a new set of test XML documents after the changes are completed.
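
The three changes requested above could be applied to an already generated document roughly as follows. This is a sketch under assumptions, not the script used for this ticket: the function name `apply_requested_changes` is hypothetical, and placing `ModuleOnly` on the root Summary element is assumed rather than confirmed by the schema.

```python
import xml.etree.ElementTree as ET


def apply_requested_changes(root):
    """Apply the requested adjustments to a Summary element in place."""
    # 1. Rename each MainTopics/Terms child to Term.
    for topics in root.iter("MainTopics"):
        for child in topics.findall("Terms"):
            child.tag = "Term"
    # 2. Drop the attribute that is no longer needed, wherever it appears.
    for element in root.iter():
        element.attrib.pop("Do-Not-Push-To-Drupal", None)
    # 3. Mark the summary as module-only (placement on the root is assumed).
    root.set("ModuleOnly", "Yes")
    return root
```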

Comment entered 2023-06-21 15:49:35 by Kline, Bob (NIH/NCI) [C]

Fresh set posted.

Comment entered 2023-06-22 19:23:24 by Osei-Poku, William (NIH/NCI) [C]

Thanks! Could you please add the XML declaration and the DTC reference

<?xml version="1.0"?>
<!DOCTYPE Summary SYSTEM "Summary.dtd">

I am also attaching a spreadsheet with all the Node IDs for the English documents. 

Please generate a fresh set of data when changes are completed. 

cgov_19035.csv
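
The XML declaration and DOCTYPE line requested in this comment can be prepended at serialization time. The sketch below assumes the documents are built with Python's `xml.etree.ElementTree`, which does not emit a DOCTYPE on its own; `PROLOG` and `serialize` are illustrative names.

```python
import xml.etree.ElementTree as ET

# The two lines requested in the comment, each followed by a newline.
PROLOG = '<?xml version="1.0"?>\n<!DOCTYPE Summary SYSTEM "Summary.dtd">\n'


def serialize(root):
    """Return the document as a string with the prolog in front."""
    # encoding="unicode" yields a str without a declaration of its own.
    return PROLOG + ET.tostring(root, encoding="unicode")
```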

Comment entered 2023-06-23 07:15:27 by Kline, Bob (NIH/NCI) [C]

Now that I have been given a larger set of node IDs to process, I see that the content authors aren't using the same type for each of the nodes. The ones I've been processing so far have been cgov article nodes. Now I'm seeing nodes which are mini landing pages, with different structures than the articles. So let's back up and see if we can nail down what the software should be able to expect. Have the content authors been given specific guidelines ("use only the following types in the following ways, honoring the following constraints")? If so, it would be helpful (by which I mean much less expensive) if such guidelines were provided to the developers, preferably as early in the project as possible. If instead the authors were told "here's Drupal; poke around and see what you can find; if you like it use it, in any way you see fit" then you should be aware that you're going to be facing a certain amount of frustration as the project evolves. It's going to be a little bit like trying to dance on quicksand.

Comment entered 2023-06-23 08:22:45 by Osei-Poku, William (NIH/NCI) [C]

The content types the authors use largely depend on the topic being worked on, and they are provided specific content types to use. They are usually not aware of which content type to use until they are ready to create the content in Drupal and are told which one to use. I can find out if we might use other content types, but I really doubt we would need any content type besides cgov_article and cgov_mini_landing imported into the CDR. These two content types follow the CDR summary structure more closely than the other content types.
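
If only cgov_article and cgov_mini_landing need to be supported, the import code can route each node by type and fail loudly on anything else, so that an unexpected structure is reported instead of being converted incorrectly. The sketch below is illustrative only: the converter functions are placeholders, and the `type` and `nid` keys are an assumed shape for the node data.

```python
def convert_article(node):
    """Placeholder converter for cgov_article nodes."""
    return {"source": "article", "nid": node["nid"]}


def convert_mini_landing(node):
    """Placeholder converter for cgov_mini_landing nodes."""
    return {"source": "mini_landing", "nid": node["nid"]}


# Content types named in this ticket, each with its own converter.
CONVERTERS = {
    "cgov_article": convert_article,
    "cgov_mini_landing": convert_mini_landing,
}


def convert(node):
    """Route a node to its converter, rejecting unexpected content types."""
    node_type = node.get("type")
    converter = CONVERTERS.get(node_type)
    if converter is None:
        raise ValueError(f"unsupported content type: {node_type!r}")
    return converter(node)
```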

Comment entered 2023-06-23 08:44:07 by Kline, Bob (NIH/NCI) [C]

No idea what "DTC" means but I've generated and attached another set.

Comment entered 2023-06-23 09:07:28 by Osei-Poku, William (NIH/NCI) [C]

Sorry for the typo. It should be "DTD".

Comment entered 2023-07-06 16:35:05 by Osei-Poku, William (NIH/NCI) [C]

Please add AvailableAsModule = "Yes" to the mapping and generate a new set for all the documents. Thanks!

Comment entered 2023-07-06 16:55:15 by Kline, Bob (NIH/NCI) [C]

Wouldn't that be redundant information? How could a summary marked as "module only" NOT be available as a module?

Comment entered 2023-07-06 17:50:10 by Osei-Poku, William (NIH/NCI) [C]

Yes, it seems redundant but I think that was the approach we took from the beginning. If you look at your comment in OCECDR-3644, that was one of the scenarios you gave, and it looks like that was what we decided to proceed with.

Comment entered 2023-07-07 11:22:22 by Kline, Bob (NIH/NCI) [C]

New set posted.

Comment entered 2023-07-13 09:38:50 by Osei-Poku, William (NIH/NCI) [C]

Thanks! The summaries have successfully been created on PROD. I will enter a new ticket for any future enhancements.

Attachments
File Name Posted User
cgov_19035.csv 2023-06-22 19:23:23 Osei-Poku, William (NIH/NCI) [C]
stomach-cancer-fields-2023061_updated_06092023_11PM.xlsx 2023-06-09 23:29:59 Osei-Poku, William (NIH/NCI) [C]
stomach-cancer-fields-2023061_updated_06092023.xlsx 2023-06-09 12:25:41 Osei-Poku, William (NIH/NCI) [C]
stomach-cancer-fields-2023061_updated_06092023-1.xlsx 2023-06-09 23:27:41 Osei-Poku, William (NIH/NCI) [C]
stomach-cancer-fields-2023061.xlsx 2023-06-02 10:06:38 Osei-Poku, William (NIH/NCI) [C]
stomach-cancer-xml-20230615.zip 2023-06-15 13:19:24 Kline, Bob (NIH/NCI) [C]
stomach-cancer-xml-20230621.zip 2023-06-21 15:49:19 Kline, Bob (NIH/NCI) [C]
stomach-cancer-xml-20230623.zip 2023-06-23 08:43:01 Kline, Bob (NIH/NCI) [C]
stomach-cancer-xml-20230707.zip 2023-07-07 11:22:01 Kline, Bob (NIH/NCI) [C]
