Issue Number | 4127 |
---|---|
Summary | React to NCBI switch to HTTPS |
Created | 2016-06-22 09:51:22 |
Issue Type | Improvement |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2016-08-08 17:35:23 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.186639 |
NLM has announced that they will be switching to HTTPS on September 1, and that when they do, POST requests will be broken. From their announcement:
Starting on September 1st, when you visit NCBI pages, you'll see a green lock and https:// in the address bar instead of http://. This lets you know that you are really on an NCBI page - that our server identity is confirmed - and that your communication with our server is encrypted and private.
Here's what to expect if you're a general user or a scripter:
For general users
You will see the changes mentioned above - https:// and a green lock in the address bar - but you don't have to update or change anything. You don't need to clear your cache or update any links to NCBI pages that you've put on your own webpages or shared with people. We will redirect all our pages to https://.
For scripters
To keep calls from failing, use https:, not http:. Scripts that use HTTP POST to send data will not work once we transition from HTTP to HTTPS on September 1st.
If you'd like to know more about this change to HTTPS, please read The HTTPS-Only Standard https://https.cio.gov/ from the Federal Chief Information Officers website.
We currently use POST requests to retrieve articles from PubMed. According to the announcement, if we don't change the requests to use the GET verb our software will no longer work.
Got another email message from NLM this morning:
Please note, starting on October 1, 2016 – we will no longer be supporting HTTP.
Thank you,
PRS TEAM
ClinicalTrials.gov
If you did not get an adequate answer to your question or your problem has not been resolved, please email us back at Register@ClinicalTrials.gov.
Investigator's Login Page: https://register.clinicaltrials.gov
Study Record Managers’ Information: https://www.clinicaltrials.gov/ct2/help/for-manager
Protocol Detailed Review Items: https://prsinfo.clinicaltrials.gov/ProtocolDetailedReviewItems.pdf
This may mean that they've pushed back the deadline we're dealing with. Or (more likely?) the October 1 deadline refers to the service for registering trials, and we still need to hit the September 1 deadline for modifying the software that deals with the retrieval service.
Added ~henryec as a watcher so this is on her radar.
~BKline,
I believe they are saying that a POST request to http:// will not work because the POST body cannot be transferred to https. This is a common issue when setting up redirects.
That being said, a POST request to https should not have an issue. So it sounds like you just need to update your URLs to HTTPS.
I do have two questions:
Don't they currently have HTTPS version of the URLs we use that we can test against now?
What are we POSTing to NCBI anyway?
Thanks,
Bryan
Answers:
Yes, they do (and yes, POST works against HTTPS – as you predicted it would).
The query parameters, the length of which is unpredictable (there are size limits on GET requests which are not imposed on POST requests).
So the work for this task should be pretty easy (just change the protocol string).
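For reference, a minimal sketch of what the change amounts to, written for Python 3 rather than copied from the CDR scripts. The E-utilities `efetch.fcgi` endpoint, the parameter names, and the helper `build_pubmed_request` are assumptions for illustration; the ticket does not say which URL or parameters the scripts actually use. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: the ticket does not name the exact URL the CDR scripts call.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pubmed_request(pmids, base_url=EFETCH_URL):
    """Build (but do not send) a POST request for a list of PubMed IDs.

    Hypothetical helper for illustration only.
    """
    if not base_url.startswith("https://"):
        raise ValueError("expected an https:// URL, got %r" % base_url)
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
    }
    # Supplying data= makes urllib use POST, so the parameters travel in the
    # request body and are not subject to URL length limits.
    body = urlencode(params).encode("ascii")
    return Request(base_url, data=body)


request = build_pubmed_request([20159818, 27557410])
print(request.get_method(), request.full_url)
```

Because the request goes straight to the https:// address, no redirect is involved and the POST body is never dropped.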
Bumping up the priority, since this has to get taken care of within the next few weeks.
Here's the list of places I've found where changes have to be made:
./DevTools/GlobalChange/GetReplacementCitations.py
./Filters/CDR0000000105.xml
./Filters/CDR0000000124.xml
./Inetpub/wwwroot/cgi-bin/cdr/CiteSearch.py
./Inetpub/wwwroot/cgi-bin/cdr/NewCitations.py
./Inetpub/wwwroot/cgi-bin/cdr/SummaryCitations.py
./Inetpub/wwwroot/cgi-bin/cdr/UpdatePreMedlineCitations.py
./XMetaL/Macros/Cdr.mcr
I have created branch ocecdr-4127 for the work on this ticket.
~henryec: I believe you had said you wanted me to work on this when I returned from vacation. Please let me know if that's not right.
Looks like we might also need to have some URLs cleaned up in some of the summaries:
256666.xml
256685.xml
256697.xml
256716.xml
256757.xml
256758.xml
269596.xml
299612.xml
453795.xml
517309.xml
552637.xml
574548.xml
62675.xml
62756.xml
62779.xml
62789.xml
62824.xml
62855.xml
62856.xml
62863.xml
62872.xml
62876.xml
62879.xml
62881.xml
62890.xml
62910.xml
681246.xml
687776.xml
688139.xml
719335.xml
733624.xml
736855.xml
744468.xml
761538.xml
765469.xml
765470.xml
770386.xml
770611.xml
773656.xml
773657.xml
773658.xml
774921.xml
775868.xml
778123.xml
778124.xml
I have made all the necessary code changes in the branch. I have installed everything except the global change (which is in DevTools, so nothing to install) and the two filters (need to check with ~volker first to make sure this won't mess up anything he's working on in the filters). Ready for user testing:
CIAT/CIPS Staff > Advanced Search > Citation
CIAT/CIPS Staff > Reports > Citations > New Citations Report
CIAT/CIPS Staff > Reports > Summaries and Miscellaneous Documents > Summaries Citations
CIAT/CIPS Staff > Update Pre-Medline Citations
"Search PubMed" on XMetaL Citation Toolbar
I'll work with Volker on getting the filter changes tested, and I'm beginning to suspect the global change script is obsolete (so testing won't be appropriate).
Just realized the users weren't watchers on this ticket. I have installed the changes on QA so you can test with reasonably fresh data.
~volker: I installed the two filters (105 and 124). Do you need to run your tests? Or can these be tested by the users since the changes are in QC report filters?
Are you referring to the diff reports I often run? These are run with a publishing job run before and after the filter update.
If all you did was to change the protocol from http to https, I'm not sure we need to run diff reports, but it's easy enough to do.
I agree that the protocol change doesn't seem to call for full publishing job runs for testing (though I think I remember that you have publishing jobs set up for the QC reports). Perhaps better to have the users test by using the QC reports and clicking on the links that got changed to see if they still work.
~JutheR and ~oseipokuw: will CIAT take care of the cleanup of the URLs in the summaries?
Robin Juthe and William Osei-Poku: will CIAT take care of the cleanup of the URLs in the summaries?
Yes, we will do a cleanup of the URLs in the summaries. I assume we don't have to clean up all the URLs on QA, just enough to test and clean all of them up on PROD. Is that okay?
Yes, that's fine.
Bob, could you please clarify what URL changes you're asking be handled manually?
For example,
{code:xml}
<ExternalRef cdr:xref="http://www.ncbi.nlm.nih.gov/pubmed/20159818">PUBMED Abstract</ExternalRef>
{code}
needs to become <ExternalRef cdr:xref="https://www.ncbi.nlm.nih.gov/pubmed/20159818">PUBMED Abstract</ExternalRef> (CDR256666).
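If this edit were ever scripted instead of done by hand, it could look roughly like the sketch below. This is an illustration under assumptions, not the CDR global change framework: `upgrade_xrefs` and `UPGRADE_HOSTS` are hypothetical names, and running a regular expression over serialized XML is a shortcut that a real script would probably replace with proper XML parsing.

```python
import re

# Hosts to upgrade. Assumption: only the NCBI host from the example above;
# the team would extend this once it agrees on which sites are affected.
UPGRADE_HOSTS = ("www.ncbi.nlm.nih.gov",)

# Captures the attribute prefix and the host of a plain-HTTP cdr:xref value.
XREF = re.compile(r'(cdr:xref=")http://([^/"]+)')


def upgrade_xrefs(xml_text):
    """Switch cdr:xref URLs on listed hosts from http:// to https://."""
    def swap(match):
        host = match.group(2)
        if host.lower() in UPGRADE_HOSTS:
            return match.group(1) + "https://" + host
        # Leave URLs on other hosts untouched.
        return match.group(0)
    return XREF.sub(swap, xml_text)


before = ('<ExternalRef cdr:xref="http://www.ncbi.nlm.nih.gov/pubmed/20159818">'
          'PUBMED Abstract</ExternalRef>')
print(upgrade_xrefs(before))
```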
I made changes in 256666 and 256697 and they seem okay. I didn't see any ncbi URL in 256685. Can you point me to the exact location of the URL in 256685? I searched all external refs and also did a quick search in text view but still didn't see it.
It's been a while since that script was run. I'll get you a fresh set later today.
I verified that 256685 had such links (four of them) back on the 8th of August when the script was previously run, but has none of them any more. Here's a fresh list:
256666.xml
256685.xml
256694.xml
256697.xml
256716.xml
256757.xml
256758.xml
269596.xml
299612.xml
453795.xml
517309.xml
552637.xml
574548.xml
62675.xml
62756.xml
62779.xml
62789.xml
62824.xml
62855.xml
62856.xml
62863.xml
62872.xml
62876.xml
62879.xml
62881.xml
62890.xml
62910.xml
681246.xml
687776.xml
688139.xml
695927.xml
719335.xml
733624.xml
736855.xml
744468.xml
758151.xml
761538.xml
765469.xml
765470.xml
770386.xml
770611.xml
773656.xml
773657.xml
773658.xml
774921.xml
775868.xml
778123.xml
778124.xml
I've imported a citation on QA and linked it in a summary and that all worked fine. I noticed that the PubMed abstract links in the summary reference lists are still pointing to "http://", but I assume this will be handled in the filter changes you alluded to earlier, Bob. Is that right?
I also tested the Summaries Citations report on QA and the new links are working well.
William, are Cynthia and/or Minaxi testing the other citation reports?
Action items for Bob:
find out where the http:// link Robin describes in the previous comment came from
create a fresh list of documents needing manual changes (analyzing CDR docs, not exported XML; change the search to pick up URLs with either "ncbi.nlm" or "clinicaltrials.gov")
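The widened search in the second action item can be sketched as follows. This is not the script that was actually run (the ticket does not show it); `find_http_urls` and `HOST_FRAGMENTS` are hypothetical names, and the sketch works on plain text, so fetching the documents from the CDR is left out.

```python
import re

# Host fragments named in the action item; matching is by substring.
HOST_FRAGMENTS = ("ncbi.nlm", "clinicaltrials.gov")

# A plain-HTTP URL: the host, then everything up to whitespace, a quote,
# or an angle bracket.
HTTP_URL = re.compile(r"http://([^/\s\"'<>]+)[^\s\"'<>]*")


def find_http_urls(text):
    """Return plain-HTTP URLs in text whose host contains a listed fragment."""
    hits = []
    for match in HTTP_URL.finditer(text):
        host = match.group(1).lower()
        if any(fragment in host for fragment in HOST_FRAGMENTS):
            hits.append(match.group(0))
    return hits


sample = (
    '<ExternalRef cdr:xref="http://www.ncbi.nlm.nih.gov/pubmed/20159818">'
    "PUBMED Abstract</ExternalRef> "
    "http://clinicaltrials.gov/show/NCT00898079 "
    "https://ncit.nci.nih.gov/ncitbrowser/"
)
print(find_http_urls(sample))
```

URLs that already use https:// are skipped, since only the plain-HTTP ones are candidates for cleanup.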
Well, the good news is that widening the net was successful. The bad news is that it was too successful. See attached spreadsheet.
~robin: could you give me the CDR ID(s) for the document(s) involved in your previous comment?
Thanks.
Did you mean to send the question to Robin Baldwin instead of Robin Juthe?
No, it was meant for RJ. I typed the at sign followed by "Robin" and both names showed up in a picklist, with RB first (and selected) and RJ second, so I pressed the down arrow (and watched the selection move down to RJ) and pressed the Enter key. The entries must have switched positions somewhere in that split-second window. I love JIRA so much. :-)
Got it. Bob, the CDR ID is 517309. (Cancer Genetics Overview summary). I added the newly imported citation to the end of the first paragraph in that document.
That is quite a list. Almost 11,000 rows... A few comments/questions upon my quick review (by DocType):
1. Citations: We can ignore the Citation ones that fall within the AbstractText, as that is taken directly from PubMed (we only need to worry about the External Refs that we have control over). By my count, 19 links would require updating.
2. CTgov protocols: Can we ignore the URLs that are within CTgov protocols? Do these go to Cancer.gov? Those represent the overwhelming majority of hits - in the neighborhood of 9,700.
3. Glossary: We can ignore the GlossaryTermConcept hits (at least for now) since those are all within the DefinitionResource field, which is only used internally. It would be nice to fix them at some point so they aren't broken links if we try to use them, but I would consider this a low priority (165 links).
4. Summaries: We need to fix the Summary ExternalRefs, but can ignore the ones in Comment fields. Unfortunately, this does include a lot of OMIM links, as we suspected. (315 links)
5. Terms: These all look like they are internal, but would need Mary/William to confirm as I'm not too familiar with Term docs. (646 links)
Can we ignore the URLs that are within CTgov protocols?
I would think so for the ones which are mapped directly from values directly imported from the CTRP documents (which would be almost all of them).
I would think so for the ones which are mapped directly from values directly imported from the CTRP documents (which would be almost all of them).
The CTGov protocol URLs are the same URLs that appear at the end of the trials on Cancer.gov and point directly to the corresponding trials on clinicaltrials.gov. I am not sure how they are connected, that is, whether what is stored in the CDR is used to create what is on Cancer.gov. If the URL stored in the CTGov document is not used to create the ones on Cancer.gov, then it looks like we can ignore them.
Also, it seems like the URLs for the trials on clinicaltrials.gov currently are using https.
https://clinicaltrials.gov/show/NCT00898079
- clinicaltrials.gov
http://clinicaltrials.gov/show/NCT00898079
- CDR document
5. Terms: These all look like they are internal, but would need Mary/William to confirm as I'm not too familiar with Term docs. (646 links)
That is correct. The majority of the URLs under the Term docs come from the Comment element, so they can be ignored. The remaining URLs were pulled from the ReferenceSource element, which I believe is for internal purposes only. So, they can be ignored as well.
I noticed that the links from the drug dictionary on Cancer.gov to the thesaurus are being redirected to the https address:
http://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=C48398
- link on Cancer.gov
https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=C48398
- thesaurus site
We said in the review meeting that the terms may not be affected but it looks like we probably need to check with EVS to be sure they won't be affected.
Before we get too far into the details with these URL updates, I think we should split off the change related to importing citations and promote that to PROD ahead of the Sept 1 deadline.
As for the URLs, we need to get some clarification on exactly which websites are affected and when the current redirects will expire (if they will expire). The messages from NLM that Bob pasted above are really unclear. For example, as we discussed yesterday, what about medlineplus.gov URLs (which we link to from DIS)? Will the current redirects expire Sept 1 (the date mentioned in the NCBI memo) or Oct 1 (the date mentioned in the ClinicalTrials.gov memo), or at another time? Would it be possible to reach back out to whomever these messages came from to get some clarification? Or, ~henryec, would you happen to know who we should ask about this? Given all that's going on right now with clinical trials, in particular, I think we need to determine whether we need to worry about those ~9,700 trial URL updates.
William also raises the point about NCIT links currently redirecting to https. It seems like we will continue to identify URLs that are redirecting and may need updating given the HHS requirement to move our websites over to https. Is there also an HHS mandate to redirect the URLs until a certain date? I think we need to consider everything that's going to be affected by this broader HHS change and then determine if we can handle all of the updates at one time or if it makes sense to strategically split them up in some way. Doing them piecemeal as we receive these announcements seems inefficient (and we're more likely to miss something).
William, are Cynthia and/or Minaxi testing the other citation reports?
We've tested the new citations report as well as importing and adding new citations to summary documents. We will continue testing other citations reports on Monday morning.
I can create a global change script to do some of these (as long as we can nail down a consensus on which ones need to change) if that would be helpful. Wouldn't need to happen before the deadline.
I've imported a citation on QA and linked it in a summary and that all worked fine. I noticed that the PubMed abstract links in the summary reference lists are still pointing to "http://", but I assume this will be handled in the filter changes you alluded to earlier, Bob. Is that right?
I ran the modified summary through the Vendor Summary Set, and as far as I can tell, this is what we send to GateKeeper:
<ReferenceSection>
<Citation idx="1" PMID="27557410">Zhang M, Lin O: Molecular Testing of Thyroid Nodules: A Review of Current Available Tests for Fine-Needle Aspiration Specimens. Arch Pathol Lab Med : , 2016.</Citation>
<Citation idx="2" PMID="18559331">Lindor NM, McMaster ML, Lindor CJ, et al.: Concise handbook of familial cancer susceptibility syndromes - second edition. J Natl Cancer Inst Monogr (38): 1-93, 2008.</Citation>
</ReferenceSection>
So I'm going to guess that the URLs are constructed further downstream from us, and we'd have to get the WCMS team to modify their software to use the HTTPS protocol. Can you verify this, ~volker?
William, are Cynthia and/or Minaxi testing the other citation reports?
Yes, new citations were imported and were added to summaries as well as running citations reports. They have all been tested without noticing any problems.
I believe the only outstanding question is the one I posed in my previous comment to ~volker. As soon as I get confirmation from him that the HTTP links are under the WCMS team's control, and not ours, I'll submit the ticket to CBIIT to deploy the changes for this ticket, unless someone still wants to do more testing.
Can you verify this, Volker Englisch?
Yes and yes:
Yes, I can verify this.
Yes, the URL is created in the Gatekeeper rendering filter.
Does this mean we need to have a GK ticket to modify the GK filter?
Yes. Would you add one, please?
A GK ticket has been created: WCMSGK-53
https://tracker.nci.nih.gov/browse/WEBTEAM-9060 has been submitted for the deployment to production (tested first on STAGE).
The patch has been deployed to STAGE and PROD. We will need to keep this ticket open (and I will keep the branch open) until we have completed the updates needed for existing documents. We'll need some input from ~henryec to guide some of that work (see Robin's questions posted above in her comment from the 26th).
From the Review meeting, we don't have to worry about the static URLs in the data. The translation to https will be handled by the browser.
From the Review meeting, we don't have to worry about the static URLs in the data. The translation to https will be handled by the browser
Does that mean this ticket can be closed?
I guess so. We didn't close it on that day because we decided to wait for Robin to confirm and close it.
Robin's going to do a little more research.
Closed in the status meeting. URLs should continue to be rerouted to the new https URL.
File Name | Posted | User |
---|---|---|
ocecdr-4127.xlsx | 2016-08-25 17:47:43 | Kline, Bob (NIH/NCI) [C] |