Issue Number | 5264 |
---|---|
Summary | NLM's clinicaltrials.gov API broken |
Created | 2023-07-24 17:12:07 |
Issue Type | Inquiry |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2023-08-02 15:15:10 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.353106 |
Find out why the nightly job to retrieve and store recent clinical trials stopped working late last month and figure out how to get it working again.
Added ~duganal and ~oseipokuw as watchers.
I filed a ticket with NLM to report the failure. The reply I got back contained only a link to the description of a new API which is slated to replace the API we have been using. According to that page, clients using the old API were supposed to be redirected to the new URL for that "classic" API, but that didn't happen; instead we were getting an HTML error message complaining that our query could not be parsed.
So I wrote back with three questions.
Why did the redirect not work?
How long will we be able to use the "classic" API?
How do we map the query we use for that API to the corresponding syntax for the new API?
The support staff answered the last question, but they are unable or unwilling to answer the other two. So I have rewritten the scheduled job to use the new API. I have run the job by hand with a very recent cutoff date, so that the results of the first pass are easier to review. Only two trials were pulled in this way. If you run the Clinical Trials Drug Analysis report with the default parameter, these two trials are what should show up. After you have looked at them and confirmed that they look OK, ~oseipokuw, I will run the scheduled script with a slightly wider net, and we will move the threshold further and further back until all the missing trials are fetched. This is all on DEV, of course.
Some things to note:
the values for phase now come back as a sequence of token values (so, for example, "Phase 2/Phase 3" is now represented as two separate values: PHASE2 and PHASE3); I am mapping these tokens back to the values we have been getting, and concatenating the multiple-phase values into the same slash-delimited string we had been getting
NLM's documentation says that the token "NA" maps to "Not Applicable" for the old API, but the value we have in the database is "N/A" so that's what I'm mapping it to; please let me know if you want me to use "Not Applicable" instead
with the old API, we used the parameter "rcv_s=..." to narrow the retrieved set of trials to those received by NLM on or after the cutoff date we specified, and we used the value from the "study_first_submitted" element of the returned XML for the value we stored in our "first_received" column of the ctgov_trial table; however, the new API does not expose any dates identifying when NLM received the trial; when I asked the support desk if mapping our rcv_s parameter from the old query to the new "studyFirstSubmitDate" field would be appropriate I was told that was wrong, and the correct equivalent was the value of the "studyFirstPostDate" field; my analysis of the actual dates we get from NLM for the various fields leads me to believe that their "studyFirstSubmitDate" field matches the values we have stored in our "first_received" column, so that's what I'm using (both in the query and in the dates I'm storing)
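The date comparison described in the last point could be reproduced along these lines. This is only a sketch: the function name, the flat dictionary shapes, and the sample NCT IDs and dates are invented for illustration and are not taken from the actual job or from NLM's response format.

```python
def count_matches(stored, fetched,
                  candidates=("studyFirstSubmitDate", "studyFirstPostDate")):
    """Count, per candidate field, how many trials match our stored date.

    stored:  {nct_id: "YYYY-MM-DD"} taken from the first_received column
    fetched: {nct_id: {field_name: "YYYY-MM-DD"}} taken from the new API
    """
    matches = {name: 0 for name in candidates}
    for nct_id, first_received in stored.items():
        dates = fetched.get(nct_id, {})
        for name in candidates:
            if dates.get(name) == first_received:
                matches[name] += 1
    return matches


# Made-up sample values, for illustration only.
stored = {"NCT00000001": "2023-07-20", "NCT00000002": "2023-07-21"}
fetched = {
    "NCT00000001": {"studyFirstSubmitDate": "2023-07-20",
                    "studyFirstPostDate": "2023-07-25"},
    "NCT00000002": {"studyFirstSubmitDate": "2023-07-21",
                    "studyFirstPostDate": "2023-07-27"},
}
print(count_matches(stored, fetched))
```

Whichever field matches the stored dates for most trials is the better candidate for both the query and the value stored in "first_received".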
Please let me know if any of those points raise red flags, or if you have any questions about them.
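For reference, the phase handling described in the notes above amounts to something like the following sketch. Only the PHASE2, PHASE3, and NA tokens are mentioned in this ticket; the remaining table entries and the names `PHASE_LABELS` and `legacy_phase` are assumptions made for illustration, not the code in the scheduled job.

```python
# Token-to-label table. Only PHASE2, PHASE3, and NA come from this ticket;
# the other entries are assumed for illustration.
PHASE_LABELS = {
    "NA": "N/A",  # NLM documents "Not Applicable"; our database stores "N/A"
    "EARLY_PHASE1": "Early Phase 1",
    "PHASE1": "Phase 1",
    "PHASE2": "Phase 2",
    "PHASE3": "Phase 3",
    "PHASE4": "Phase 4",
}


def legacy_phase(tokens):
    """Join the new API's phase tokens into the old slash-delimited string."""
    # Unrecognized tokens pass through unchanged so they remain visible.
    return "/".join(PHASE_LABELS.get(token, token) for token in tokens)


print(legacy_phase(["PHASE2", "PHASE3"]))  # Phase 2/Phase 3
print(legacy_phase(["NA"]))                # N/A
```

Switching to "Not Applicable" would only require changing the single "NA" entry in the table.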
Mary and I have reviewed the two trials and they look good in the report. Please run one more job with the slightly wider range. Thanks!
I backed up one day, which brought in a few dozen more trials.
They look good, but is there a reason why the NCT IDs are not displayed in the column? There appear to be hidden links in the column. Clicking on them gives you a page error, though.
... is there a reason why the NCT IDs are not displayed in the column?
I can't reproduce that problem.
Clicking on them gives you a page error though.
Looks like NLM has broken more than their API. That's a problem with the report, not the nightly job. Please open a new ticket and add it to Quinn.
Sure. I can see the NCT IDs now. I had to enable editing in Excel. I will open a new ticket for the links. Thanks!
Everything else looks good. I think we can proceed with the next/final steps.
DEV has all the trials now.
Verified on DEV. Thanks!
Verified on QA. Thanks!
Verified on PROD. Thanks!
File Name | Posted | User |
---|---|---|
nct-ids.png | 2023-08-02 13:19:08 | Kline, Bob (NIH/NCI) [C] |