Issue Number | 4048 |
---|---|
Summary | [Summaries] Global to populate PMID element in summaries |
Created | 2016-03-28 13:31:51 |
Issue Type | Task |
Submitted By | Juthe, Robin (NIH/NCI) [E] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2016-04-08 17:22:36 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.181370 |
Now that we have the new PMID element available in the summary schema, we would like to globally populate this element with the PMID for each of the summaries available in PubMed (English HP & patient summaries). We will assemble an XML file from PubMed that has the PMIDs and CDRIDs for each summary and post it here when it's ready.
Here's the XML file from PubMed for the 358 publishable English HP & patient summaries. We do not have PMIDs for the Spanish summaries at this point.
Global change implemented and run in test mode on DEV:
https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2016-04-01_16-04-31
You'll notice that all of the documents have a diff size of 47 bytes except for CDR0000062890. That's because that summary document already had the PMID populated, but the value had lots of whitespace on both sides of the PubMed ID string.
Please review and let me know if everything looks OK.
Sorry for the delay. I reviewed this on DEV last Friday and it looked good in test mode.
The global change has been run on DEV in live mode. Ready for review.
Verified on DEV.
Request to run a query to find out if there are English summaries which don't have the new PMID element.
Quite a few have no PMID. The few I spot-checked weren't found in the XML document from NLM. One line in the attached report represents a malformed PMID. The rest are documents which haven't got the PMID at all.
Here's the code I used to find the documents to parse:
SELECT doc_id
FROM query_term
WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
AND value = 'English'
The query found 652 documents. Of these, 295 had no PMID, and one had the malformed PMID (007-911). This is all on DEV (where the global change was run). Looking at the logs for the global change, I see that the global change script found 358 CDR IDs, and processed all but one of these (CDR0000062902, checked out to Volker). Doing the math, 652-295=357, so I think the results come out right (from that perspective).
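The report's three buckets (PMID present, PMID missing, PMID malformed) can be sketched as follows. This is only an illustration, not the script that produced the report: the function names are hypothetical, and it assumes a well-formed PMID is a string of digits once surrounding whitespace is stripped.

```python
import re

# Assumption: a well-formed PMID consists only of digits.
PMID_PATTERN = re.compile(r"\d+")


def classify_pmid(value):
    """Return 'missing', 'malformed', or 'valid' for a raw PMID value."""
    if value is None or not value.strip():
        return "missing"
    # Whitespace around the ID is tolerated here (compare CDR0000062890).
    if PMID_PATTERN.fullmatch(value.strip()):
        return "valid"
    return "malformed"


def tally(pmids_by_doc):
    """Count documents per bucket, given a {doc_id: raw PMID or None} map."""
    counts = {"valid": 0, "missing": 0, "malformed": 0}
    for value in pmids_by_doc.values():
        counts[classify_pmid(value)] += 1
    return counts


# Hypothetical sample; only 007-911 and 26389314 come from this ticket.
sample = {1: "  26389314\n", 2: None, 3: "007-911", 4: ""}
sample_counts = tally(sample)
```

Applied to the real set of English summaries, the three counts should add up to the number of documents the query returned.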
Would it be possible to filter out blocked and unpublishable summaries from this list?
New query used for second version of report:
SELECT doc_id
FROM query_term_pub
JOIN active_doc
ON id = doc_id
WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
AND value = 'English'
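To see how the join narrows the list, here is a small self-contained sketch that runs the same query against an in-memory SQLite database. The table layouts and sample rows are assumptions made for illustration: `query_term_pub` is taken to hold terms from publishable versions only, and `active_doc` is modeled as a plain table of non-blocked document IDs. The real CDR database may define these differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal layouts, for illustration only.
    CREATE TABLE query_term_pub (doc_id INTEGER, path TEXT, value TEXT);
    CREATE TABLE active_doc (id INTEGER PRIMARY KEY);

    -- Hypothetical documents: 1 and 2 are English, 3 is Spanish.
    INSERT INTO query_term_pub VALUES
        (1, '/Summary/SummaryMetaData/SummaryLanguage', 'English'),
        (2, '/Summary/SummaryMetaData/SummaryLanguage', 'English'),
        (3, '/Summary/SummaryMetaData/SummaryLanguage', 'Spanish');

    -- Document 2 is treated as blocked, so it is absent from active_doc.
    INSERT INTO active_doc VALUES (1), (3);
""")

rows = conn.execute("""
    SELECT doc_id
      FROM query_term_pub
      JOIN active_doc
        ON id = doc_id
     WHERE path = '/Summary/SummaryMetaData/SummaryLanguage'
       AND value = 'English'
     ORDER BY doc_id
""").fetchall()

english_active_ids = [row[0] for row in rows]
conn.close()
```

In this sample only document 1 survives: document 2 is dropped by the join and document 3 by the language filter.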
OK, great. This list looks much better.
Looks like we missed one - CDR688139 (Cannabis patient summary). The
PMID is 26389314 (we can enter that manually).
CDR62902 had test data in it.
The others are all modules (only), so aren't in PubMed.
Thank you!
I reviewed several of the summaries on DEV and used the PubMed IDs to
retrieve the summaries from PubMed and they all matched as
expected.
Verified on DEV.
Hi Robin,
I am wondering why we are not adding the PMID(s) to the Spanish
summaries. We added all the other PubMed info to the Spanish summaries,
taken directly from the English summaries and not translated. It seems
to me that the IDs should be added for consistency especially if they
are not going to get their own IDs.
Hi William - the Spanish summaries are not in PubMed at this time. They could be added to PubMed in the future - all of the data are sent to NLM - at which point they would have their own PMIDs. It wouldn't make sense to add the English PMIDs to the Spanish summaries.
Verified on QA. Several summaries do not have PMIDs, but I confirmed that all but one (Cannabis, as mentioned above) were checked out when the global change ran.
Reminder for myself: CDR688139 (Cannabis patient summary). The PMID is 26389314 (we can enter that manually).
Run on the production server as part of the Darwin deployment. Log attached.
From the log: these documents were locked:
2016-05-12 18:50:54: Document 62755: Unable to check out CWD for CDR0000062755: Document CDR0000062755 already checked out to user dyerv
2016-05-12 18:52:18: Document 62771: Unable to check out CWD for CDR0000062771: Document CDR0000062771 already checked out to user dyerv
2016-05-12 18:53:05: Document 62787: Unable to check out CWD for CDR0000062787: Document CDR0000062787 already checked out to user vshields
2016-05-12 18:58:45: Document 62829: Unable to check out CWD for CDR0000062829: Document CDR0000062829 already checked out to user vshields
2016-05-12 19:13:21: Document 62877: Unable to check out CWD for CDR0000062877: Document CDR0000062877 already checked out to user dyerv
2016-05-12 19:14:31: Document 62881: Unable to check out CWD for CDR0000062881: Document CDR0000062881 already checked out to user vshields
2016-05-12 19:19:37: Document 62910: Unable to check out CWD for CDR0000062910: Document CDR0000062910 already checked out to user vshields
2016-05-12 19:19:38: Document 62911: Unable to check out CWD for CDR0000062911: Document CDR0000062911 already checked out to user vshields
2016-05-12 19:21:18: Document 62921: Unable to check out CWD for CDR0000062921: Document CDR0000062921 already checked out to user vshields
2016-05-12 19:21:19: Document 62922: Unable to check out CWD for CDR0000062922: Document CDR0000062922 already checked out to user vshields
2016-05-12 19:22:36: Document 62924: Unable to check out CWD for CDR0000062924: Document CDR0000062924 already checked out to user vshields
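For the manual follow-up, the locked documents can be grouped by the user holding the lock. The sketch below is one possible way to do that and is not part of the global change job: the function name is hypothetical, and the pattern is based only on the message wording in the excerpt above. It tolerates line wrapping inside an entry.

```python
import re
from collections import defaultdict

# Matches "Document CDR0000062755 already checked out to user dyerv",
# allowing any whitespace (including line breaks) between the words.
LOCKED = re.compile(
    r"Document\s+(CDR\d{10})\s+already\s+checked\s+out\s+to\s+user\s+(\w+)"
)


def locked_by_user(log_text):
    """Return {user: [CDR IDs]} for every lock failure found in log_text."""
    by_user = defaultdict(list)
    for cdr_id, user in LOCKED.findall(log_text):
        by_user[user].append(cdr_id)
    return dict(by_user)


# Two entries copied from the log excerpt, kept in their wrapped form.
sample_log = """2016-05-12 18:50:54: Document 62755: Unable to check out CWD for
CDR0000062755: Document CDR0000062755 already checked out to user
dyerv
2016-05-12 18:53:05: Document 62787: Unable to check out CWD for
CDR0000062787: Document CDR0000062787 already checked out to user
vshields
"""

locks = locked_by_user(sample_log)
```

Passing the full text of the attached log to the same function would give one list of CDR IDs per user, ready to hand to whoever holds each lock.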
Thanks. I've sent these to Val and Victoria and we'll plan to update them manually.
Verified on PROD.
File Name | Posted | User |
---|---|---|
ocecdr-4048.log | 2016-05-13 16:10:01 | Kline, Bob (NIH/NCI) [C] |
ocecdr-4048.txt | 2016-04-08 16:45:19 | Kline, Bob (NIH/NCI) [C] |
PDQ_Summaries_PMID.xml | 2016-03-30 15:35:29 | Juthe, Robin (NIH/NCI) [E] |