CDR Tickets

Issue Number 4048
Summary [Summaries] Global to populate PMID element in summaries
Created 2016-03-28 13:31:51
Issue Type Task
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2016-04-08 17:22:36
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.181370
Description

Now that we have the new PMID element available in the summary schema, we would like to globally populate this element with the PMID for each of the summaries available in PubMed (English HP & patient summaries). We will assemble an XML file from PubMed that has the PMIDs and CDRIDs for each summary and post it here when it's ready.

Comment entered 2016-03-30 15:35:29 by Juthe, Robin (NIH/NCI) [E]

Here's the XML file from PubMed for the 358 publishable English HP & patient summaries. We do not have PMIDs for the Spanish summaries at this point.

Comment entered 2016-04-01 16:56:48 by Kline, Bob (NIH/NCI) [C]

Global change implemented and run in test mode on DEV:

https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2016-04-01_16-04-31

You'll notice that all of the documents have a diff size of 47 bytes except for CDR0000062890. That's because that summary document already had the PMID populated, but the value had lots of whitespace on both sides of the PubMed ID string.
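
The whitespace cleanup described above can be illustrated with a minimal sketch. It is not the actual global change code: `normalize_pmid` is a hypothetical helper, and the padded value is invented for illustration, since the ticket does not show the original content of CDR0000062890.

```python
def normalize_pmid(raw):
    """Return the PMID with surrounding whitespace removed (hypothetical helper)."""
    return raw.strip()


# Illustrative value only; the real padded string is not shown in the ticket.
padded = "\n      12345678\n   "
cleaned = normalize_pmid(padded)

# A document whose PMID was already present but padded changes by a different
# number of bytes than a document that gains a brand new PMID element.
bytes_removed = len(padded) - len(cleaned)
```

This would explain why one document's diff size differs from the others: its change trims an existing value instead of adding a new element.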

Please review and let me know if everything looks OK.

Comment entered 2016-04-05 15:04:57 by Juthe, Robin (NIH/NCI) [E]

Sorry for the delay. I reviewed this on DEV last Friday and it looked good in test mode.

Comment entered 2016-04-06 10:25:55 by Kline, Bob (NIH/NCI) [C]

The global change has been run on DEV in live mode. Ready for review.

Comment entered 2016-04-06 14:04:19 by Juthe, Robin (NIH/NCI) [E]

Verified on DEV.

Comment entered 2016-04-07 13:41:00 by Kline, Bob (NIH/NCI) [C]

Request to run a query to find out if there are English summaries which don't have the new PMID element.

Comment entered 2016-04-08 16:45:19 by Kline, Bob (NIH/NCI) [C]

Quite a few have no PMID. The few I spot-checked weren't found in the XML document from NLM. One line in the attached report represents a malformed PMID. The rest are documents which haven't got the PMID at all.

Here's the code I used to find the documents to parse:

SELECT doc_id
  FROM query_term
 WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
   AND value = 'English'

The query found 652 documents. Of these, 295 had no PMID, and one had the malformed PMID (007-911). This is all on DEV (where the global change was run). Looking at the logs for the global change, I see that the global change script found 358 CDR IDs, and processed all but one of these (CDR0000062902, checked out to Volker). Doing the math, 652-295=357, so I think the results come out right (from that perspective).
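
As a rough sketch of the check behind these numbers, the following separates missing, malformed, and usable PMID values and restates the arithmetic from the comment. `classify_pmid` is a hypothetical helper, and the rule that a well-formed PMID is a string of ASCII digits is an assumption, not something the ticket states.

```python
def classify_pmid(value):
    """Label a PMID value as 'missing', 'malformed', or 'ok' (hypothetical helper)."""
    if value is None or not value.strip():
        return "missing"
    text = value.strip()
    # Assumption: a well-formed PMID consists only of ASCII digits.
    return "ok" if text.isascii() and text.isdigit() else "malformed"


# Counts quoted in the comment above.
english_total = 652        # documents found by the query
without_pmid = 295         # documents with no PMID
ids_in_pubmed_file = 358   # CDR IDs found by the global change script
skipped_checked_out = 1    # CDR0000062902, checked out at the time

remaining = english_total - without_pmid
processed = ids_in_pubmed_file - skipped_checked_out
```

Under these assumptions `classify_pmid("007-911")` is reported as malformed, and both `remaining` and `processed` come to 357, matching the figure in the comment.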

Comment entered 2016-04-08 16:56:58 by Juthe, Robin (NIH/NCI) [E]

Would it be possible to filter out blocked and unpublishable summaries from this list?

Comment entered 2016-04-08 17:22:36 by Kline, Bob (NIH/NCI) [C]

New query used for second version of report:

SELECT doc_id
  FROM query_term_pub
  JOIN active_doc
    ON id = doc_id
 WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
   AND value = 'English'
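
To show how the join narrows the result, here is a small self-contained sketch using an in-memory SQLite database as a stand-in for the CDR database. The table layouts and rows are invented for illustration, and it assumes that `query_term_pub` indexes only publishable versions and that `active_doc` lists only documents which are not blocked; the ticket does not spell out either point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy schema; the real CDR tables have more columns.
    CREATE TABLE query_term_pub (doc_id INTEGER, path TEXT, value TEXT);
    CREATE TABLE active_doc (id INTEGER PRIMARY KEY);

    -- Invented rows: documents 1 and 2 are English, document 3 is Spanish.
    INSERT INTO query_term_pub VALUES
        (1, '/Summary/SummaryMetaData/SummaryLanguage', 'English'),
        (2, '/Summary/SummaryMetaData/SummaryLanguage', 'English'),
        (3, '/Summary/SummaryMetaData/SummaryLanguage', 'Spanish');

    -- Document 2 is absent here, standing in for a blocked summary.
    INSERT INTO active_doc VALUES (1), (3);
""")

rows = conn.execute("""
    SELECT doc_id
      FROM query_term_pub
      JOIN active_doc
        ON id = doc_id
     WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
       AND value = 'English'
""").fetchall()

doc_ids = [row[0] for row in rows]
conn.close()
```

With these toy rows only document 1 survives: document 2 drops out at the join and document 3 fails the language test.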

Comment entered 2016-04-09 17:50:58 by Juthe, Robin (NIH/NCI) [E]

OK, great. This list looks much better.

Looks like we missed one - CDR688139 (Cannabis patient summary). The PMID is 26389314 (we can enter that manually).
CDR62902 had test data in it.

The others are all modules (only), so aren't in PubMed.

Thank you!

Comment entered 2016-04-18 17:59:51 by Osei-Poku, William (NIH/NCI) [C]

I reviewed several of the summaries on DEV and used the PubMed IDs to retrieve the summaries from PubMed and they all matched as expected.
Verified on DEV.

Comment entered 2016-04-20 09:10:38 by Osei-Poku, William (NIH/NCI) [C]

Hi Robin,
I am wondering why we are not adding the PMID(s) to the Spanish summaries. We added all the other PubMed info to the Spanish summaries, taken directly from the English summaries and not translated. It seems to me that the IDs should be added for consistency especially if they are not going to get their own IDs.

Comment entered 2016-04-20 09:16:50 by Juthe, Robin (NIH/NCI) [E]

Hi William - the Spanish summaries are not in PubMed at this time. They could be added to PubMed in the future - all of the data are sent to NLM - at which point they would have their own PMIDs. It wouldn't make sense to add the English PMIDs to the Spanish summaries.

Comment entered 2016-04-27 18:16:42 by Juthe, Robin (NIH/NCI) [E]

Verified on QA. Several summaries do not have PMIDs but I confirmed all but one - Cannabis, as mentioned above - were checked out when the global ran.

Reminder for myself: CDR688139 (Cannabis patient summary). The PMID is 26389314 (we can enter that manually).

Comment entered 2016-05-13 16:10:01 by Kline, Bob (NIH/NCI) [C]

Run on the production server as part of the Darwin deployment. Log attached.

Comment entered 2016-05-13 16:11:52 by Kline, Bob (NIH/NCI) [C]

From the log: these documents were locked:

2016-05-12 18:50:54: Document 62755: Unable to check out CWD for CDR0000062755: Document CDR0000062755 already checked out to user dyerv
2016-05-12 18:52:18: Document 62771: Unable to check out CWD for CDR0000062771: Document CDR0000062771 already checked out to user dyerv
2016-05-12 18:53:05: Document 62787: Unable to check out CWD for CDR0000062787: Document CDR0000062787 already checked out to user vshields
2016-05-12 18:58:45: Document 62829: Unable to check out CWD for CDR0000062829: Document CDR0000062829 already checked out to user vshields
2016-05-12 19:13:21: Document 62877: Unable to check out CWD for CDR0000062877: Document CDR0000062877 already checked out to user dyerv
2016-05-12 19:14:31: Document 62881: Unable to check out CWD for CDR0000062881: Document CDR0000062881 already checked out to user vshields
2016-05-12 19:19:37: Document 62910: Unable to check out CWD for CDR0000062910: Document CDR0000062910 already checked out to user vshields
2016-05-12 19:19:38: Document 62911: Unable to check out CWD for CDR0000062911: Document CDR0000062911 already checked out to user vshields
2016-05-12 19:21:18: Document 62921: Unable to check out CWD for CDR0000062921: Document CDR0000062921 already checked out to user vshields
2016-05-12 19:21:19: Document 62922: Unable to check out CWD for CDR0000062922: Document CDR0000062922 already checked out to user vshields
2016-05-12 19:22:36: Document 62924: Unable to check out CWD for CDR0000062924: Document CDR0000062924 already checked out to user vshields
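
For follow-up, log lines in this format could be grouped by the user holding the lock with something like the sketch below. The regular expression is inferred from the lines shown above and may not cover other messages in the log; `locked_by_user` is a hypothetical helper, not part of the CDR code.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines quoted in this comment.
LOCKED = re.compile(
    r"Unable to check out CWD for (CDR\d+): "
    r"Document CDR\d+ already checked out to user (\w+)"
)


def locked_by_user(lines):
    """Group locked document IDs by the user holding the lock."""
    by_user = defaultdict(list)
    for line in lines:
        match = LOCKED.search(line)
        if match:
            doc_id, user = match.groups()
            by_user[user].append(doc_id)
    return dict(by_user)


# Three lines copied from the log excerpt above.
sample = [
    "2016-05-12 18:50:54: Document 62755: Unable to check out CWD for "
    "CDR0000062755: Document CDR0000062755 already checked out to user dyerv",
    "2016-05-12 18:53:05: Document 62787: Unable to check out CWD for "
    "CDR0000062787: Document CDR0000062787 already checked out to user vshields",
    "2016-05-12 19:13:21: Document 62877: Unable to check out CWD for "
    "CDR0000062877: Document CDR0000062877 already checked out to user dyerv",
]
result = locked_by_user(sample)
```

Run against the full attached log, the same function would give one list of documents per user to update manually.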

Comment entered 2016-05-13 16:23:08 by Juthe, Robin (NIH/NCI) [E]

Thanks. I've sent these to Val and Victoria and we'll plan to update them manually.

Comment entered 2016-05-23 16:35:34 by Juthe, Robin (NIH/NCI) [E]

Verified on PROD.

Attachments
File Name Posted User
ocecdr-4048.log 2016-05-13 16:10:01 Kline, Bob (NIH/NCI) [C]
ocecdr-4048.txt 2016-04-08 16:45:19 Kline, Bob (NIH/NCI) [C]
PDQ_Summaries_PMID.xml 2016-03-30 15:35:29 Juthe, Robin (NIH/NCI) [E]
