CDR Tickets

Issue Number 4598
Summary [Summary] Links to Clinical Trials Report
Created 2019-03-26 15:15:18
Issue Type New Feature
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2019-08-12 12:25:55
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.241966
Description

We talked a while ago about getting a report that would let us know the number of links in summaries (Protocol Refs) that are linking to trials in Cancer.gov vs trials in Clinicaltrials.gov.  We probably want to start with getting an ad hoc query listing the summaries with protocol ref links and whether they link to Cancer.gov trials or Clinicaltrials.gov trials. We also want to know the count of the links as well. If there is a future need for a real CDR report, I would put in a ticket with specs for that.

Comment entered 2019-03-27 11:20:58 by Englisch, Volker (NIH/NCI) [C]

We probably want to start with getting an ad hoc query listing the summaries with protocol ref links and whether they link to Cancer.gov trials or Clinicaltrials.gov trials.

Yes, we can create an ad-hoc query listing the summaries with protocol ref links but it's impossible to identify whether these refs link to Cancer.gov or Clinicaltrials.gov because that information is not stored anywhere in our database.  As you remember we modified all ExternalRefs pointing to clinical trials and the related filter so that all ProtocolRef elements are now pointing to 

https://www.cancer.gov/clinicaltrials/<NCT-ID>

It is Cancer.gov who identifies where the protocol lives and redirects the link accordingly.

Comment entered 2019-06-12 19:18:44 by Englisch, Volker (NIH/NCI) [C]

We also want to know the count of the links as well.

Is that the count per summary or the total count?

let us know the number of links in summaries (Protocol Refs) that are linking to trials in Cancer.gov vs trials in Clinicaltrials.gov.

There are actually three options available:

  • links pointing to Cancer.gov

  • links pointing to ClinicalTrials.gov and

  • links pointing nowhere.  These are links that were removed by our filters due to blocked documents but are still listed as ProtocolRefs

How would you like to have the third option handled?
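
A minimal sketch of how the three outcomes above could be told apart, assuming the final URL reached after Cancer.gov's redirect is already known (for example from the `url` attribute of a `requests` response). The function names, the host-matching rule, and the treatment of a missing URL as "None" are illustrative assumptions, not the code used by the report.

```python
from urllib.parse import urljoin, urlsplit

# URL prefix that all ProtocolRef elements point to (from the comment above).
BASE = "https://www.cancer.gov/clinicaltrials/"


def protocol_ref_url(nct_id):
    """Build the Cancer.gov URL for an NCT ID (hypothetical helper)."""
    return BASE + nct_id.strip()


def classify(final_url):
    """Map the URL reached after redirects to one of the three categories."""
    if not final_url:
        # Assumption: no reachable target means the link points nowhere.
        return "None"
    # Assumption: a relative URL is local to Cancer.gov, so resolve it there.
    host = (urlsplit(urljoin(BASE, final_url)).hostname or "").lower()
    if host == "cancer.gov" or host.endswith(".cancer.gov"):
        return "Cancer.gov"
    if host == "clinicaltrials.gov" or host.endswith(".clinicaltrials.gov"):
        return "ClinicalTrials.gov"
    return "None"
```

Keeping the classification separate from the network request means the rule can be checked without contacting either site.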

Comment entered 2019-06-13 10:27:30 by Osei-Poku, William (NIH/NCI) [C]

These links may need to be cleaned up from the summaries. Please display that information in a separate column.

Comment entered 2019-06-13 10:55:27 by Englisch, Volker (NIH/NCI) [C]

I'm attaching a screenshot of the report I've created so far.  I'm not certain what extra information you're gaining by creating an extra column for the ProtocolRefs that aren't linking anywhere on Cancer.gov.  My question was more geared towards "What would you like to display if not None"?

Comment entered 2019-06-13 11:56:56 by Osei-Poku, William (NIH/NCI) [C]

I think None is fine. Thanks!

Comment entered 2019-06-13 14:22:59 by Englisch, Volker (NIH/NCI) [C]

For myself to remember:

I will need to re-index the summaries after adding the attribute nct_id to the query_term table.

Comment entered 2019-06-13 17:47:58 by Englisch, Volker (NIH/NCI) [C]

I've updated/created the following files:

  • SummaryAndMiscReports.py

  • SummaryProtocolRefLinks.py

For the moment I'm limiting the report to only display results for 10 summaries.  The full report takes several minutes to complete.

I created the new Summary Report option Summaries with ProtocolRef Links

Please have a look and let me know what you think.

Comment entered 2019-06-18 14:56:18 by Osei-Poku, William (NIH/NCI) [C]

I am getting a "Session not active" error message when I click on the link.

Comment entered 2019-06-18 15:15:41 by Englisch, Volker (NIH/NCI) [C]

Please go through the Admin Reports interface instead of using the link from the above comment.  I copy/pasted the name from the menu page but didn't mean for JIRA to make this a link.

Comment entered 2019-06-18 15:20:42 by Osei-Poku, William (NIH/NCI) [C]

Sure. What is the name of the report on the menu? I didn't know it was on the menu.

Comment entered 2019-06-18 15:24:17 by Englisch, Volker (NIH/NCI) [C]

It's "Summaries with ProtocolRef Links"

Comment entered 2019-06-18 15:26:02 by Osei-Poku, William (NIH/NCI) [C]

Found it thanks!

Comment entered 2019-06-18 15:36:16 by Osei-Poku, William (NIH/NCI) [C]

The report looks good to me. I see that for the cancer.gov links you are including only the relative links/URLs instead of the FQDN. Is it because of space?

Comment entered 2019-06-18 15:39:37 by Englisch, Volker (NIH/NCI) [C]

No, that's the output that the tool I'm using returns.  I'm guessing that the tool drops the hostname for local URLs.  I can add that to the report if you'd prefer.

Comment entered 2019-06-18 15:41:23 by Osei-Poku, William (NIH/NCI) [C]

Yes, please! Add it. Thanks!

Comment entered 2019-06-18 17:31:08 by Englisch, Volker (NIH/NCI) [C]

I've modified the report to display the full URL and I removed the restriction for only 10 documents.  

Please note that this report will run for a long time (around 7 minutes).  I've added a note to that effect to the menu option.  Obviously, PROD is a bit faster than DEV but I'm guessing the report will still run around 4-5 minutes.

Comment entered 2019-07-05 06:13:12 by Osei-Poku, William (NIH/NCI) [C]

I cannot see the report on the admin menu anymore.

Comment entered 2019-07-08 11:41:39 by Englisch, Volker (NIH/NCI) [C]

Please remember that Bob updated DEV with the Joule release last week.  This change is not merged into Joule yet.  I'll have to manually restore the files, and once you have approved the report I'll move it into Joule.

Give me a few minutes to make the change before testing again.

Comment entered 2019-07-08 12:09:27 by Englisch, Volker (NIH/NCI) [C]

The report is available on DEV again.

Comment entered 2019-07-08 15:02:19 by Osei-Poku, William (NIH/NCI) [C]

This looks good on DEV. Would you be able to provide stats for the report? That is a count for each of the categories (Cancer.gov, Clinicaltrials.gov, and None).

Comment entered 2019-07-08 17:32:12 by Englisch, Volker (NIH/NCI) [C]

Do you want those stats as part of the report or just a general count right now?

If you want the counts listed as part of the report where would you like them displayed (bottom/top)?

Comment entered 2019-07-08 17:43:23 by Osei-Poku, William (NIH/NCI) [C]

Please include it as part of the report and at the top left corner of the report.  Thanks!
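
A small sketch of the requested per-category counts, assuming each report row is a `(summary_id, url, category)` tuple. That row layout and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Fixed order so the stats table always lists the same three rows.
CATEGORIES = ("Cancer.gov", "ClinicalTrials.gov", "None")


def count_by_category(rows):
    """Return (category, count) pairs for the stats table at the top."""
    counts = Counter(category for _summary_id, _url, category in rows)
    # Categories with no links are still listed, with a count of zero.
    return [(name, counts.get(name, 0)) for name in CATEGORIES]
```

Listing all three categories even when one is empty keeps the stats table the same shape from run to run.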

Comment entered 2019-07-15 16:20:55 by Englisch, Volker (NIH/NCI) [C]

I've not been able to add the table with the counts to the left but it displays at the top.  I will have to check with Bob if it's possible to position the tables in the HTML format.

  • SummaryProtocolRefLinks.py
    [cdr4598-prot-links 3b047f0f]

This change is ready for review on DEV.

Comment entered 2019-07-25 17:04:10 by Osei-Poku, William (NIH/NCI) [C]

Looks good. Thanks! There is no need to align the stats table to the left of the page. 

Verified on DEV.

Comment entered 2019-07-29 17:02:47 by Englisch, Volker (NIH/NCI) [C]

The changes have been merged into our release branch Joule.

Comment entered 2019-08-05 16:05:32 by Osei-Poku, William (NIH/NCI) [C]

The report times out on QA (and now on DEV, even though it worked on DEV before).

Comment entered 2019-08-05 17:53:07 by Englisch, Volker (NIH/NCI) [C]

Unfortunately, the QA server is our slowest machine and the timeout is right around the report completion time. The STAGE and PROD servers are somewhat faster but it's still possible we're running into the timeout on occasion. I'll check if there are ways to make the report faster. Otherwise, we'll have to rewrite the report to be run as a batch job.

Comment entered 2019-08-06 18:07:06 by Englisch, Volker (NIH/NCI) [C]

I tried several different approaches to speed up the report.  The first approach did not improve much.  My second attempt is removing duplicates from the list of ProtocolRefs so that we're only testing the URL once for each summary.  This seems to decrease processing time from about 15 minutes to about 11 minutes - about 25% improvement.  If this isn't working we'll have to rewrite the report to make it a batch report.

Please go ahead and test this version on DEV. If this report will work better I'll have to check with Bob if it's possible to modify the build for Joule or if we need to push the ticket to Kepler instead.

The URL for the modified report is
https://cdr-dev.cancer.gov/cgi-bin/cdr/SPR2.py?Session=guest
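
The de-duplication described above can be sketched as follows, assuming the ProtocolRefs are available as a mapping from summary ID to a list of NCT IDs. The data layout and the function name are assumptions and are not taken from SPR2.py.

```python
def unique_links(refs_by_summary):
    """Drop repeated NCT IDs within each summary, keeping first-seen order."""
    # dict.fromkeys keeps one key per distinct value, in insertion order.
    return {
        summary_id: list(dict.fromkeys(nct_ids))
        for summary_id, nct_ids in refs_by_summary.items()
    }
```

Duplicates are only removed within a summary, so an ID that appears in two different summaries is still tested once for each of them.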

Comment entered 2019-08-06 21:34:28 by Osei-Poku, William (NIH/NCI) [C]

The report did run successfully this time around. Thanks!

Comment entered 2019-08-07 14:27:26 by Englisch, Volker (NIH/NCI) [C]

I've copied the report to QA for testing:

https://cdr-qa.cancer.gov/cgi-bin/cdr/SPR2.py?Session=guest

If you're able to run it here as well, I will replace the original report with this version.

Comment entered 2019-08-08 05:49:05 by Osei-Poku, William (NIH/NCI) [C]

The report is failing on QA (and DEV, even though it worked on DEV a couple of days ago).

Comment entered 2019-08-08 11:31:08 by Englisch, Volker (NIH/NCI) [C]

You were running the report SPR2.py and not the original SummaryProtocolRefLinks.py, right?

It is too bad that the report still times out, but the run time still depends on how fast the requests to CT.gov and Cancer.gov come back.  The new report has only about 1,500 links to test while the old version of the report was testing about 2,200.  I was hoping this change was good enough to keep the report execution within the 15 minute limit.

I'd suggest to submit a new ticket for Kepler to convert the report into a batch report.  Since the report execution time is largely dependent on the network speed I don't think it would complete on the faster PROD server.  The report will still finish on occasion - it hasn't failed for me so far - but it's not reliable.
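
A rough back-of-the-envelope check of the figures mentioned in this ticket, assuming the links are tested one after another and that nearly all of the run time is spent waiting for responses. Both assumptions are simplifications.

```python
def seconds_per_link(total_minutes, link_count):
    """Average time available (or spent) per link for a sequential run."""
    return total_minutes * 60 / link_count


# Figures quoted in the comments: about 15 minutes for about 2,200 links
# before de-duplication, about 11 minutes for about 1,500 links after it.
old_rate = seconds_per_link(15, 2200)
new_rate = seconds_per_link(11, 1500)

# Per-link budget for about 1,500 links under the 15 minute limit.
budget = seconds_per_link(15, 1500)
```

Under these assumptions the average cost per link stays at roughly 0.4 seconds in both versions, against a budget of 0.6 seconds per link, so a modest slowdown in network response time is enough to push the run past the limit.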

Comment entered 2019-08-08 16:00:53 by Osei-Poku, William (NIH/NCI) [C]

Yes, I used the links you provided above. I just tried it again and it timed out again.

I'd suggest to submit a new ticket for Kepler to convert the report into a batch report. Since the report execution time is largely dependent on the network speed I don't think it would complete on the faster PROD server. The report will still finish on occasion - it hasn't failed for me so far - but it's not reliable.

Will do. Thanks!

Comment entered 2019-08-09 14:04:45 by Englisch, Volker (NIH/NCI) [C]

Based on William's experience that the report is still timing out for him, we decided to revert my first approach (using the Python module grequests in place of the module requests) and go back to the original module, requests.  I am still removing duplicate links within each summary.

The changes have been tested on DEV and the updated report is creating the same results.  Changes have been copied to the Joule release.

Comment entered 2019-08-12 12:28:25 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA!

Comment entered 2019-09-09 09:58:08 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Attachments
File Name Posted User
Screen Shot 2019-06-13 at 10.52.08.png 2019-06-13 10:55:12 Englisch, Volker (NIH/NCI) [C]
