CDR Tickets

Issue Number 4222
Summary Global to update URLs from http to https
Created 2017-01-23 16:22:33
Issue Type Improvement
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2017-02-07 20:18:52
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.201840
Description

Many URLs stored in the CDR (summary URLs, drug summary URLs, external refs, etc.) begin with "http" and are being redirected to "https" automatically. This raises two issues:

1) We don't know how long these redirects will be supported.
2) This makes the URL check report for CIAT very long, because each of these redirected URLs is flagged as an error.

We'd like to programmatically update these URLs. As we discussed, we can likely use the URL check report to identify which URLs need updating.

I'm linking this to a related issue since it has some relevant comments about the http to https switch for government websites.

Comment entered 2017-01-24 14:16:40 by Kline, Bob (NIH/NCI) [C]

First question – do we want to drop from consideration some of the document types which are less interesting than they used to be, both for the URL check report and for this global change? The URL check report currently handles these document types:

  • GlossaryTermConcept

  • Summary

  • InScopeProtocol

  • CTGovProtocol

  • DrugInformationSummary

  • Person

  • ClinicalTrialSearchString

  • Citation

  • MiscellaneousDocument

  • Organization

Why (for example) would we care about dead links in InScopeProtocols?

Comment entered 2017-01-27 06:48:41 by Kline, Bob (NIH/NCI) [C]

Would I be right in assuming that we should use the same set of document types we adopted for OCECDR-4219 in performing this global change?

  • Citation

  • Clinical Trial Search String

  • Drug Information Summary

  • Glossary Term Concept

  • Miscellaneous Doc

  • Summary

Comment entered 2017-01-27 08:52:49 by Juthe, Robin (NIH/NCI) [E]

Yes, same document types. Thanks, Bob.

Comment entered 2017-01-27 11:09:02 by Kline, Bob (NIH/NCI) [C]

I have attached a spreadsheet showing the old and new URLs for all of the redirects from "http" to "https" for external refs found in the CDR documents on PROD for the types we decided on. Please review the list and confirm that these are the mappings you want the global change to use.
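
A mapping like the one in the spreadsheet could be collected roughly as sketched below. This is only an illustration, not the script used for this ticket: find_https_redirect, build_mapping, and the injected fetch callable are hypothetical names, and the rule that any redirect to an "https" target counts as a usable mapping is an assumption.

```python
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def find_https_redirect(url, fetch):
    """Return the https URL that `url` redirects to, or None.

    `fetch` is a hypothetical callable which takes a URL and returns a
    (status_code, location_header) tuple without following redirects.
    """
    if urlsplit(url).scheme.lower() != "http":
        return None
    status, location = fetch(url)
    if status in REDIRECT_CODES and location:
        if urlsplit(location).scheme.lower() == "https":
            return location
    return None


def build_mapping(urls, fetch):
    """Map each unique http URL to its https redirect target."""
    mapping = {}
    for url in sorted(set(urls)):
        target = find_https_redirect(url, fetch)
        if target is not None:
            mapping[url] = target
    return mapping
```

Passing the fetch function in as an argument keeps the mapping logic separate from the network code, so it can be exercised without contacting any server.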

Comment entered 2017-01-31 16:33:40 by Juthe, Robin (NIH/NCI) [E]

I've spot checked several of these and they look reasonable. I think we can proceed. Thanks, Bob.

Comment entered 2017-01-31 17:32:52 by Osei-Poku, William (NIH/NCI) [C]

I also reviewed some of these and they appear to be fine. There are several links to the mobile site in patient summaries that we may have to clean up manually, since we stopped storing mobile links after NVCG. Examples:

http://m.cancer.gov/topics/treatment/bycancer/esophageal/Patient
http://m.cancer.gov/es/cancer/tratamiento/porcancer/esofago/Patient

Comment entered 2017-01-31 19:21:32 by Osei-Poku, William (NIH/NCI) [C]

If it is not too much work, could you please include the CDR IDs of the documents in the spreadsheet?

Comment entered 2017-01-31 20:56:12 by Kline, Bob (NIH/NCI) [C]

You understand that that would be a significantly bigger spreadsheet, right? Because the one I gave you only showed each unique URL once. So, for example, that spreadsheet had one row for http://www.cancer.gov/about-cancer/treatment/clinical-trials, whereas the one you're asking for would have as many as 249 rows. Are you sure that will be as helpful as you think it will?

Comment entered 2017-02-01 16:56:50 by Osei-Poku, William (NIH/NCI) [C]

Does it mean you can't show me just one occurrence of each unique URL in the spreadsheet? How about if I identify all the mobile URLs, can you provide me with the CDR IDs for just those?

Comment entered 2017-02-02 12:19:50 by Kline, Bob (NIH/NCI) [C]

How's this? I flattened out the rows for the CDR IDs to squash them all into a single cell per URL. New workbook attached. Note that sometimes the same URL could appear more than once in a given document.

Will this meet your needs?
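
The flattening described above can be sketched as follows. The function name flatten_doc_ids, the (url, cdr_id) input shape, and the "CDR" prefix with comma separator in the output cell are all assumptions made for illustration; the workbook's actual layout may differ.

```python
from collections import defaultdict


def flatten_doc_ids(occurrences):
    """Collapse (url, cdr_id) occurrences into one row per unique URL.

    A URL which appears more than once in the same document contributes
    that document's ID only once, because the IDs are kept in a set.
    """
    ids_by_url = defaultdict(set)
    for url, cdr_id in occurrences:
        ids_by_url[url].add(cdr_id)
    return [
        (url, ", ".join(f"CDR{cdr_id}" for cdr_id in sorted(ids)))
        for url, ids in sorted(ids_by_url.items())
    ]
```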

Comment entered 2017-02-02 14:50:49 by Juthe, Robin (NIH/NCI) [E]

Please remove all mobile URL elements. We no longer need to store these in the CDR so we don't need to include them in the global.

Comment entered 2017-02-07 20:18:52 by Kline, Bob (NIH/NCI) [C]

Implemented and run in test mode on DEV. Ready for user review.

https://cdr-dev.cancer.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2017-02-07_18-16-10

Comment entered 2017-02-09 11:50:27 by Juthe, Robin (NIH/NCI) [E]

I checked several docs and this looks good to me. Mobile URLs are being removed and the URLs are being updated correctly. Could you please run the global in live mode on DEV?

Comment entered 2017-02-09 12:29:34 by Kline, Bob (NIH/NCI) [C]

Running now. Will let you know when it's done.

Comment entered 2017-02-10 14:39:47 by Kline, Bob (NIH/NCI) [C]

The live global job is finished. It seems most likely that the failure of the job yesterday had to do with transitory problems on the database server, as the same document was updated without any problems when I re-started the job. I have added more diagnostic support to the ModifyDocs module to assist in tracking down any similar failures in the future.

Please spot check the results of the global change.

Comment entered 2017-02-16 17:56:11 by Juthe, Robin (NIH/NCI) [E]

I'm still reviewing, but I have found a couple of documents that appear to only have been partially updated:

CDR470863 (Merkel Cell patient summary - Spanish)
The Mobile URL element was removed and several URLs were updated but the Summary URL was not updated to https. It is on the spreadsheet (row 1685 in the version with CDR IDs).

CDR441548 (Merkel Cell patient summary - English)
The Mobile URL element was removed but the Summary URL and a few other URLs were not updated to https. (rows 616, 668, 2198, 2208 on the spreadsheet with CDR IDs)

Comment entered 2017-02-17 10:32:34 by Kline, Bob (NIH/NCI) [C]

I see the problem. I wasn't correctly handling documents in which the same URL was entered inconsistently (with regard to case) in the same document. I fixed the problem and ran a fresh job. There were a couple of locked documents, but most of the URLs that could be mapped were replaced. There are still 3,408 external refs using the HTTP scheme, but these are plain old dead links, which need to be dropped or replaced by hand (e.g., http://cancer.gov/cancerinfo/pdq/screening). They'll show up on the URL check report.
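
One way to make the replacement insensitive to case is sketched below. This is not the actual fix, which is not shown in this ticket: build_lookup and replace_urls are hypothetical names, and keying the lookup on the lowercased URL is an assumption. Note that it treats URLs differing only in case as the same URL, which is not true of every web server.

```python
def build_lookup(mapping):
    """Index the old-to-new URL mapping by the lowercased old URL."""
    return {old.lower(): new for old, new in mapping.items()}


def replace_urls(urls_in_doc, lookup):
    """Return the document's URLs with every mapped URL replaced.

    Matching ignores case, so two spellings of the same URL within a
    single document are both replaced. Unmapped URLs pass through
    unchanged.
    """
    return [lookup.get(url.lower(), url) for url in urls_in_doc]
```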

Comment entered 2017-02-17 10:38:04 by Kline, Bob (NIH/NCI) [C]

Before you do any further review, let me examine the documents myself. I'm still not convinced everything's working correctly.

Comment entered 2017-02-17 14:25:01 by Kline, Bob (NIH/NCI) [C]

OK, now I think we may really be ready for you to review the results of the live global change on DEV. I discovered that the web server was inappropriately closing the connection when we asked for the redirect of http://www.cancer.gov/cam instead of returning the 404 ("not found") HTTP code. This caused the script to record "www.cancer.gov" as a host which was offline. The script had been optimized to avoid trying to contact a server which had been determined to be unavailable, causing all subsequent URLs for that host to be skipped. I have removed the optimization and now the script is trying each unique URL, even those for which the host has misbehaved.

Please take a look at the results.
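
The behavior after removing the optimization can be sketched like this. It is a simplified illustration under stated assumptions, not the script itself: check_urls and the injected probe callable are hypothetical names, and a dropped connection is assumed to surface as an OSError.

```python
def check_urls(urls, probe):
    """Probe every unique URL once, whatever happened earlier on its host.

    `probe` is a hypothetical callable which returns a result for a URL
    or raises OSError when the connection fails.
    """
    results = {}
    for url in urls:
        if url in results:
            continue  # each unique URL is probed only once
        try:
            results[url] = probe(url)
        except OSError as exc:
            # Record the failure for this URL only; the host is not
            # marked as unavailable, so its other URLs are still tried.
            results[url] = f"error: {exc}"
    return results
```

The trade-off is speed: a host which really is down costs one failed connection per URL instead of one per host.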

Comment entered 2017-02-17 16:40:34 by Juthe, Robin (NIH/NCI) [E]

Is CDR517309 on the list of locked documents? I had that checked out until about 12:30pm today. If it is, then I'm comfortable saying that everything looks good!

Comment entered 2017-02-17 17:03:14 by Kline, Bob (NIH/NCI) [C]

Yes.

2017-02-17 11:30:52: Document 258011: Unable to check out CWD for CDR0000258011: Document CDR0000258011 already checked out to user oseipoku
2017-02-17 11:33:45: Document 258038: Unable to check out CWD for CDR0000258038: Document CDR0000258038 already checked out to user oseipoku
2017-02-17 11:39:03: Document 350260: Unable to check out CWD for CDR0000350260: Document CDR0000350260 already checked out to user volker
2017-02-17 11:47:25: Document 487564: Unable to check out CWD for CDR0000487564: Document CDR0000487564 already checked out to user volker
2017-02-17 11:49:31: Document 517309: Unable to check out CWD for CDR0000517309: Document CDR0000517309 already checked out to user juther
2017-02-17 11:50:50: Document 538354: Unable to check out CWD for CDR0000538354: Document CDR0000538354 already checked out to user lgrama
2017-02-17 12:07:03: Document 622340: Unable to check out CWD for CDR0000622340: Document CDR0000622340 already checked out to user volker
2017-02-17 12:10:39: Document 632567: Unable to check out CWD for CDR0000632567: Document CDR0000632567 already checked out to user oseipoku
2017-02-17 12:28:30: Document 747922: Unable to check out CWD for CDR0000747922: Document CDR0000747922 already checked out to user volker
2017-02-17 12:33:59: Document 760794: Unable to check out CWD for CDR0000760794: Document CDR0000760794 already checked out to user volker
2017-02-17 12:34:00: Document 760795: Unable to check out CWD for CDR0000760795: Document CDR0000760795 already checked out to user volker
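
A question such as "is this document on the locked list?" can be answered from log lines like those above with a short script. This is a sketch: the function name locked_documents is hypothetical, and the regular expression assumes every line follows exactly the format shown here.

```python
import re

LOCKED = re.compile(
    r"Document (?P<id>\d+): Unable to check out CWD for CDR\d+: "
    r"Document CDR\d+ already checked out to user (?P<user>\S+)"
)


def locked_documents(log_lines):
    """Return {doc_id: user} for each 'already checked out' log line."""
    locked = {}
    for line in log_lines:
        match = LOCKED.search(line)
        if match:
            locked[int(match.group("id"))] = match.group("user")
    return locked
```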

Comment entered 2017-02-17 17:05:17 by Juthe, Robin (NIH/NCI) [E]

Great, thanks! This is verified on DEV then.

Comment entered 2017-02-24 09:41:34 by Juthe, Robin (NIH/NCI) [E]

What's the status of the test global run on QA?

Comment entered 2017-02-24 09:45:26 by Kline, Bob (NIH/NCI) [C]
Comment entered 2017-02-28 10:27:53 by Juthe, Robin (NIH/NCI) [E]

The test results look good to me. Please do the live run. Thanks.

Comment entered 2017-03-01 15:26:39 by Juthe, Robin (NIH/NCI) [E]

Hi Bob,

Just a quick reminder to please do the live run when you get a chance. Thanks! 🙂

Comment entered 2017-03-01 15:35:36 by Kline, Bob (NIH/NCI) [C]

I tried the run yesterday, but it broke (looks like the database wandered off for a bit), so I had to start it again. Still running.

Comment entered 2017-03-02 09:20:53 by Kline, Bob (NIH/NCI) [C]

Failed again, restarted again. The program doesn't have to start all over again, and the document at which the previous failure occurred succeeds on restart, which I take to mean that there isn't something in the code or the documents which is causing the failures. Will keep you posted.
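
The ticket does not say how the job avoids starting over, so the following is only one common pattern, with hypothetical names (run_job, update, done): keep a record of the documents already processed and skip them on restart.

```python
def run_job(doc_ids, update, done):
    """Update each document not already in `done`, recording progress.

    `update` is a hypothetical callable which modifies one document and
    may raise if the database is unavailable. `done` is a set of the IDs
    completed by earlier attempts; it is modified in place, so progress
    survives an exception.
    """
    for doc_id in doc_ids:
        if doc_id in done:
            continue
        update(doc_id)
        done.add(doc_id)
    return done
```

With this shape, a document on which a transient failure occurred is simply retried on the next run.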

Comment entered 2017-03-02 09:21:54 by Juthe, Robin (NIH/NCI) [E]

Thanks for the update!

Comment entered 2017-03-02 15:50:03 by Kline, Bob (NIH/NCI) [C]

Finally got a clean run. Ready for you to take a look.

Comment entered 2017-03-03 13:36:27 by Juthe, Robin (NIH/NCI) [E]

Verified on QA.

Comment entered 2017-03-17 23:04:52 by Kline, Bob (NIH/NCI) [C]

The live global change has completed on PROD. No documents were skipped because they were checked out by a user.

Attachments
File Name Posted User
ocecdr-4222.xlsx 2017-01-27 11:11:50 Kline, Bob (NIH/NCI) [C]
ocecdr-4222-with-doc-ids.xlsx 2017-02-02 12:18:05 Kline, Bob (NIH/NCI) [C]
