CDR Tickets

Issue Number 5264
Summary NLM's clinicaltrials.gov API broken
Created 2023-07-24 17:12:07
Issue Type Inquiry
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2023-08-02 15:15:10
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.353106
Description

Find out why the nightly job to retrieve and store recent clinical trials stopped working late last month and figure out how to get it working again.

Comment entered 2023-07-24 17:13:23 by Kline, Bob (NIH/NCI) [C]

Added and as watchers.

Comment entered 2023-07-31 18:12:27 by Kline, Bob (NIH/NCI) [C]

I filed a ticket with NLM to report the failure. The reply I got back contained only a link to the description of a new API which is slated to replace the existing API which we have been using. According to that page, clients using the old API were supposed to be redirected to the new URL for that "classic" API, but that didn't happen, and we instead were getting an HTML error message complaining that our query could not be parsed.

So I wrote back with three questions.

  1. Why did the redirect not work?

  2. How long will we be able to use the "classic" API?

  3. How do we map the query we use for that API to the corresponding syntax for the new API?

The support staff answered the last question, but they are unable or unwilling to answer the other two. So I have rewritten the scheduled job to use the new API. I have run the job by hand with a very recent cutoff date, to make the results of the first pass easier to review. Only two trials were pulled in this way. If you run the Clinical Trials Drug Analysis report with the default parameter, these two trials are what should show up. After you have looked at them and confirmed that they look OK, I will run the scheduled script with a slightly wider net, and we will move the threshold further and further back until all the missing trials are fetched. This is all on DEV, of course.

Some things to note:

  • the values for phase now come back as a sequence of token values (so, for example, "Phase 2/Phase 3" is now represented as two separate values: PHASE2 and PHASE3); I am mapping these tokens back to the values we have been getting, and concatenating the multiple-phase values into the same slash-delimited string we had been getting

  • NLM's documentation says that the token "NA" maps to "Not Applicable" for the old API, but the value we have in the database is "N/A" so that's what I'm mapping it to; please let me know if you want me to use "Not Applicable" instead

  • with the old API, we used the parameter "rcv_s=..." to narrow the retrieved set of trials to those received by NLM on or after the cutoff date we specified, and we used the value from the "study_first_submitted" element of the returned XML for the value we stored in our "first_received" column of the ctgov_trial table; however, the new API does not expose any dates identifying when NLM received the trial; when I asked the support desk if mapping our rcv_s parameter from the old query to the new "studyFirstSubmitDate" field would be appropriate I was told that was wrong, and the correct equivalent was the value of the "studyFirstPostDate" field; my analysis of the actual dates we get from NLM for the various fields leads me to believe that their "studyFirstSubmitDate" field matches the values we have stored in our "first_received" column, so that's what I'm using (both in the query and in the dates I'm storing)
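
The phase mapping described in the first two bullets can be sketched roughly as follows. This is only an illustration, not the code in the scheduled job: the names PHASE_LABELS and legacy_phase are hypothetical, and only the PHASE2, PHASE3, and NA tokens are named in this ticket; the other tokens and labels in the table are assumptions about what the new API returns.

```python
# Hypothetical token-to-label table. Only PHASE2, PHASE3 and NA are
# mentioned in this ticket; the remaining entries are assumptions.
PHASE_LABELS = {
    "EARLY_PHASE1": "Early Phase 1",  # assumed
    "PHASE1": "Phase 1",              # assumed
    "PHASE2": "Phase 2",
    "PHASE3": "Phase 3",
    "PHASE4": "Phase 4",              # assumed
    "NA": "N/A",  # matches the value already in the database
}


def legacy_phase(tokens):
    """Join phase tokens into the slash-delimited string stored previously.

    Tokens missing from the table are passed through unchanged so that
    an unexpected value stays visible instead of being dropped.
    """
    return "/".join(PHASE_LABELS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(legacy_phase(["PHASE2", "PHASE3"]))
    print(legacy_phase(["NA"]))
```

Switching "NA" to "Not Applicable" would be a one-line change in the table if that is preferred.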

Please let me know if any of those points raise red flags, or if you have any questions about them.
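
For reference, here is a rough sketch of how the cutoff date might be expressed against the new API using the "studyFirstSubmitDate" field. Everything specific in it is an assumption rather than something confirmed in this ticket: the endpoint URL, the "query.term" and "pageSize" parameter names, the AREA[...]RANGE[...] expression syntax, and the function name build_query_url. It only builds the URL and does not contact NLM.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed endpoint for the new API; not taken from this ticket.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_query_url(cutoff, page_size=100):
    """Build a URL selecting trials first submitted on or after cutoff.

    The field name and range syntax are assumptions about the new API's
    search expression language.
    """
    term = f"AREA[StudyFirstSubmitDate]RANGE[{cutoff.isoformat()},MAX]"
    params = {"query.term": term, "pageSize": page_size}
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example cutoff only; the real job derives its own cutoff date.
    print(build_query_url(date(2023, 7, 1)))
```

Widening the net, as described above, would then just be a matter of passing an earlier cutoff date.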

Comment entered 2023-08-02 11:00:10 by Osei-Poku, William (NIH/NCI) [C]

Mary and I have reviewed the two trials and they look good in the report. Please run one more job with the slightly wider range. Thanks!

Comment entered 2023-08-02 11:30:44 by Kline, Bob (NIH/NCI) [C]

I backed up one day, which brought in a few dozen more trials.

Comment entered 2023-08-02 12:59:00 by Osei-Poku, William (NIH/NCI) [C]

They look good, but is there a reason why the NCT IDs are not displayed in the column? There appear to be hidden links in the column. Clicking on them gives you a page error though.

Comment entered 2023-08-02 13:26:17 by Kline, Bob (NIH/NCI) [C]

... is there a reason why the NCT IDs are not displayed in the column?

I can't reproduce that problem.

Clicking on them gives you a page error though.

Looks like NLM has broken more than their API. That's a problem with the report, not the nightly job. Please open a new ticket and add it to Quinn.

Comment entered 2023-08-02 13:53:03 by Osei-Poku, William (NIH/NCI) [C]

Sure. I can see the NCT IDs now. I had to enable editing in Excel. I will open a new ticket for the links. Thanks!

Comment entered 2023-08-02 14:23:29 by Osei-Poku, William (NIH/NCI) [C]

Everything else looks good. I think we can proceed with the next/final steps.

Comment entered 2023-08-02 15:11:50 by Kline, Bob (NIH/NCI) [C]

DEV has all the trials now.

Comment entered 2023-08-02 17:05:41 by Kline, Bob (NIH/NCI) [C]
Comment entered 2023-08-03 13:10:06 by Osei-Poku, William (NIH/NCI) [C]

Verified on DEV. Thanks!

Comment entered 2023-10-19 09:27:05 by Osei-Poku, William (NIH/NCI) [C]

Verified on QA. Thanks!

Comment entered 2023-11-09 09:59:02 by Osei-Poku, William (NIH/NCI) [C]

Verified on PROD. Thanks!

Attachments
File Name Posted User
nct-ids.png 2023-08-02 13:19:08 Kline, Bob (NIH/NCI) [C]
