Issue Number | 3135 |
---|---|
Summary | [CTGov import] Explore modifications to ctgov import program to include terms in keywords |
Created | 2010-04-29 17:00:57 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2010-06-16 13:14:16 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107463 |
BZISSUE::4817
BZDATETIME::2010-04-29 17:00:57
BZCREATOR::William Osei-Poku
BZASSIGNEE::Bob Kline
BZQACONTACT::William Osei-Poku
In the CDR status meeting today, we discussed the need to explore (on Mahler) modifications to the ctgov import program to retrieve trials that have 'cancer' and 'neoplasm' as part of their keywords.
BZDATETIME::2010-05-04 15:13:12
BZCOMMENTOR::Bob Kline
BZCOMMENT::1
I looked for evidence that "KEYWORD" is used as part of the explicit syntax for specifying CT.gov searches (in the same way that "SYNDROME" is used), but didn't see any such evidence. That's not conclusive, as the online documentation I am able to find is not comprehensive, but unless someone on this issue can point me to documentation to the contrary, I'm going to assume that "KEYWORD" in this context means "appears anywhere" instead of "matches an explicitly assigned term used to identify this document." Here's the output of the test program I used to do the investigation to determine the difference between the two queries:
term=(cancer+OR+neoplasm)+%5BALL-FIELDS%5D&studyxml=true
25885 files in result from new query
term=(cancer)%5BCONDITION%5D+OR(lymphedema)%5BCONDITION%5D+OR(myelodysplastic+syndromes)%5BCONDITION%5D+OR(neutropenia)%5BCONDITION%5D+OR(aspergillosis)%5BCONDITION%5D+OR(mucositis)+%5BCONDITION%5D&studyxml=true
24245 files in result from original query
124 files in old set but not in new
1764 files in new set but not in old
26009 files in combined set
I will attach (and identify) the sets I captured during this exercise.
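The comparison reported above boils down to plain set arithmetic on the document IDs in each result. A minimal sketch follows, assuming each result has already been reduced to a list of ID strings; the function name and the placeholder IDs are hypothetical and are not taken from the actual test program.

```python
def compare_result_sets(old_ids, new_ids):
    """Summarize how two collections of document IDs differ."""
    old, new = set(old_ids), set(new_ids)
    return {
        "old_not_new": old - new,   # retrieved only by the original query
        "new_not_old": new - old,   # retrieved only by the new query
        "combined": old | new,      # union, with duplicates removed
    }


# Placeholder IDs, used only to show the shape of the report.
old_ids = ["ID-1", "ID-2", "ID-3"]
new_ids = ["ID-2", "ID-3", "ID-4", "ID-5"]
report = compare_result_sets(old_ids, new_ids)

print(f"{len(report['old_not_new'])} files in old set but not in new")
print(f"{len(report['new_not_old'])} files in new set but not in old")
print(f"{len(report['combined'])} files in combined set")
```

The same arithmetic gives a quick consistency check on the reported figures: 25885 + 124 and 24245 + 1764 both equal the 26009 documents in the combined set.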
BZDATETIME::2010-05-04 15:15:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::2
Attachment old-ctgov-query.set has been added with description: Documents retrieved by original query
BZDATETIME::2010-05-04 15:15:41
BZCOMMENTOR::Bob Kline
BZCOMMENT::3
Attachment new-ctgov-query.set has been added with description: Documents retrieved by new query
BZDATETIME::2010-05-04 15:16:35
BZCOMMENTOR::Bob Kline
BZCOMMENT::4
Attachment old-not-new.set has been added with description: Documents retrieved by original query but not the new query
BZDATETIME::2010-05-04 15:17:03
BZCOMMENTOR::Bob Kline
BZCOMMENT::5
Attachment new-not-old.set has been added with description: Documents retrieved by new query but not the original query
BZDATETIME::2010-05-04 15:17:56
BZCOMMENTOR::Bob Kline
BZCOMMENT::6
Attachment combined.set has been added with description: Union of sets retrieved by original and new queries
BZDATETIME::2010-05-04 15:55:55
BZCOMMENTOR::Bob Kline
BZCOMMENT::7
I did a little more experimenting to see if I could reverse-engineer a little more information about the syntax of the query language and possibly create a single query which would do the combination of the two approaches in one submission, and I believe I have succeeded for both goals. Here's the URL which worked:
... and I confirmed that it retrieves exactly the same 26,009 documents which the two separate queries did together (after de-duplication).
BZDATETIME::2010-05-04 16:05:26
BZCOMMENTOR::Bob Kline
BZCOMMENT::8
I did one more check and found that even with the new query there are still 80 of the 219 trials marked for forced download in the ctgov_import table which aren't included as part of the query's result set, so we'd still need to pull those down separately (at least, those that are still retrievable at all).
BZDATETIME::2010-05-05 13:30:46
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::9
(In reply to comment #1)
> I looked for evidence that "KEYWORD" is used as part of the explicit syntax for
> specifying CT.gov searches (in the same way that "SYNDROME" is used), but
> didn't see any such evidence. That's not conclusive, as the online
> documentation I am able to find is not comprehensive, but unless someone on
> this issue can point me to documentation to the contrary, I'm going to assume
> that "KEYWORD" in this context means "appears anywhere" instead of "matches an
> explicitly assigned term used to identify this document." Here's the output
Mary, Ning and I looked at the results. We do have the following suggestion:
On the ctgov public site, there is a "Keywords provided by [Name of responsible party or institution]" section at the end of the page for some of the trials. Can the search be done this way? That is, limit the search to only the keywords provided by the responsible party instead of looking at the entire document?
For example:
http://www.clinicaltrials.gov/ct2/show/NCT00000124?term=NCT00000124&rank=1
In the above trial, the Keywords section states:
Keywords provided by National Eye Institute (NEI):
Choroidal Melanoma
Ocular Melanoma
BZDATETIME::2010-05-05 14:45:22
BZCOMMENTOR::Bob Kline
BZCOMMENT::10
I have asked NLM to provide us with documentation on the supported URL query syntax. The online help page for advanced searching makes no mention of support for searching using these explicit "keywords" (indeed, that help page does not appear to be related to anything but the user interface for human searchers).
Here's the message I sent to John Gillen and Nick Ide:
============================================================================
Nick/John:
Could you point me in the right direction for the documentation of the URL query syntax accepted by clinicaltrials.gov? I recall that you mentioned that your web site (and the URL syntax used by the system) had been changed back in 2008, and I'm pretty sure I've never found any documentation of exactly what's supported by that syntax. Our users have requested that we augment the logic we currently use so that in addition to using condition terms in the query we also pick up trials which don't index any of the terms we're looking for as conditions, but have 'cancer' or 'neoplasm' as a keyword.
I'm aware that some systems use the term 'keyword' to refer vaguely to the ability to search for documents containing specific words or phrases appearing anywhere in the document, supported by full-text indexing, whereas other systems explicitly identify "key words" used to identify documents. At first I assumed our users had the first usage of 'keyword' in mind, because I was unable to find any mention of support for the second kind of usage in the online help for advanced searching in clinicaltrials.gov. But the users came back and pointed out the explicit keywords enumerated in the CT.gov documents, and asked if there were any way we could formulate the search to look for documents using those explicitly provided keywords. See, for example, "Keywords provided by Case Comprehensive Cancer Center" near the bottom of [1].
Such documentation on the supported syntax would be generally helpful for our work, not just for this specific enhancement request.
Thanks!
[1] http://www.clinicaltrials.gov/ct2/show/NCT00899132
–
Bob Kline
Contractor
Communications Technology Branch
Office of Communications and Education
National Cancer Institute
National Institutes of Health
http://www.rksystems.com
mailto:bkline@rksystems.com
BZDATETIME::2010-05-05 18:31:45
BZCOMMENTOR::Bob Kline
BZCOMMENT::11
Reply from Nick:
=============================================================================
We don't currently have a great document that explains all of
this.
The "Help" link next to the SEARCH button mentions a few things.
The other general tip is to use "Advanced Search" using the fields and
then click on "Refine Search" and then click on "Expert Search" (right
hand side) – which will show you the field names that go along with the
boxes from advanced search.
The specific answer to the keyword question is that a [DISEASE]
search includes a search of tagged areas:
condition, brief_title, official_title, mesh_terms, and keyword
=============================================================================
... to which I wrote back:
=============================================================================
So, [KEYWORD] won't do what we want (a little more surgically)?
=============================================================================
... and he responded:
=============================================================================
We do have a KEYWORDS area which requires an "exact match" – so "cancer [KEYWORDS]" won't find <keyword>fallopian cancer</keyword>
I really don't recommend this.
If you get extremely specific about how/where you look, you will miss
things.
I think [DISEASE] does what you want.
Do you feel like it is returning too much stuff? If so, what is an
example of something it returns that you wish it wouldn't return ?
=============================================================================
I told him I would bounce his questions back to CIAT and let him know what you have to say.
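The difference Nick describes between the two search modes can be illustrated with a toy matcher. This is only a sketch of the distinction, not NLM's implementation: the record layout, the case-insensitive comparison, and the whole-word matching rule are all assumptions made for the example. Only the list of tagged areas comes from his reply.

```python
# Tagged areas that, per the reply above, a [DISEASE] search covers.
DISEASE_AREAS = ("condition", "brief_title", "official_title",
                 "mesh_terms", "keyword")


def keywords_match(trial, term):
    """Exact match: the term must equal an entire keyword value."""
    return any(term.lower() == kw.lower() for kw in trial.get("keyword", []))


def disease_match(trial, term):
    """Looser match: the term appears as a word in any tagged area."""
    for area in DISEASE_AREAS:
        for value in trial.get(area, []):
            if term.lower() in value.lower().split():
                return True
    return False


# Hypothetical record using the keyword from Nick's example.
trial = {"keyword": ["fallopian cancer"]}
print(keywords_match(trial, "cancer"))   # exact match on the keyword fails
print(disease_match(trial, "cancer"))    # word found inside a tagged area
```

Under these assumptions the exact-match search misses the trial while the broader search finds it, which is the trade-off behind the recommendation to use [DISEASE].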
BZDATETIME::2010-05-06 10:28:14
BZCOMMENTOR::Bob Kline
BZCOMMENT::12
I tested using [DISEASE] instead of [ALL-FIELDS] so CIAT and Lakshmi can determine whether (1) the new query is sufficiently inclusive (picks up all or at least enough of the trials missed by the original query) and (2) has a sufficiently low number of hits for trials we don't really want. The list to be reviewed to make this determination is attached. Here's the output of the program I used for the testing (note that the revised syntax I was able to use, based on reverse engineering of their query syntax, resulted in a query which takes about half as long to process as the original query):
term=(cancer)%5BCONDITION%5D+OR(lymphedema)%5BCONDITION%5D+OR(myelodysplastic+syndromes)%5BCONDITION%5D+OR(neutropenia)%5BCONDITION%5D+OR(aspergillosis)%5BCONDITION%5D+OR(mucositis)+%5BCONDITION%5D&studyxml=true
elapsed: 306.464000 seconds
24285 files in result from original query
term=(cancer+OR+neoplasm)+%5BDISEASE%5D&studyxml=true
elapsed: 221.575000 seconds
24289 files in result from new query
166 files in old set but not in new
170 files in new set but not in old
24455 files in combined set
term=(cancer+OR+neoplasm)[DISEASE]OR(lymphedema+OR+myelodysplastic+syndromesOR+neutropenia+OR+aspergillosis+OR+mucositis)+[CONDITION]&studyxml=true
elapsed: 158.873000 seconds
24455 files in result from combo query
two approaches to fetching combined lists match
Attachment new-not-old.set has been added with description: Trials picked up by new query but not by original query
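For reference, a query string of the form logged above can be assembled with the standard library's URL-encoding helper. This is a rough sketch, not the code in the test program: the `clause` and `build_query` helpers are hypothetical names, and the spacing around `OR` between clauses differs slightly from some of the logged queries, so whether the service treats the two spellings identically is an assumption that would need checking.

```python
from urllib.parse import quote_plus


def clause(terms, field):
    """Return '(a OR b) [FIELD]' for one group of search terms."""
    return "(" + " OR ".join(terms) + ") [" + field + "]"


def build_query(clauses):
    """URL-encode OR-joined clauses into a 'term=...&studyxml=true' string."""
    expression = " OR ".join(clauses)
    # Keep parentheses unencoded so the result resembles the logged queries;
    # spaces become '+' and square brackets become %5B / %5D.
    return "term=" + quote_plus(expression, safe="()") + "&studyxml=true"


query = build_query([clause(["cancer", "neoplasm"], "DISEASE")])
print(query)  # term=(cancer+OR+neoplasm)+%5BDISEASE%5D&studyxml=true
```

Passing a second clause, for example `clause(["lymphedema", "mucositis"], "CONDITION")`, yields a single combined query in the spirit of the combo query above.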
BZDATETIME::2010-05-11 13:50:49
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::13
(In reply to comment #12)
> Created attachment 1915 [details]
> Trials picked up by new query but not by original query
>
> I tested using [DISEASE] instead of [ALL-FIELDS] so CIAT and Lakshmi can
> determine whether (1) the new query is sufficiently inclusive (picks up all or
> at least enough of the trials missed by the original query) and (2) has a
> sufficiently low number of hits for trials we don't really want. The list to
> be reviewed to make this determination is attached. Here's the output of the
> program I used for the testing (note that the revised syntax I was able to use,
> based on reverse engineering of their query syntax, resulted in a query which
> takes about half as long to process as the original query):
Here is a breakdown of what we found after reviewing about half of the trials contained in the attachment.
77 - Total trials reviewed:
11 - Missed; we would like to import these
36 - Missed but now completed; we don't want to import these
4 - Missed but currently terminated; we don't want to import these
21 - Out of scope and should not be imported (includes trials for AIDS, rheumatoid arthritis, arthritis, Crohn's)
5 - Missed in the original download due to incorrect indexing; now corrected, and will come through the existing import
Of the 77 trials we reviewed, we found 11 that are useful and that we would like to import. If completed and terminated trials can be excluded, for example, the list should be manageable.
BZDATETIME::2010-05-11 16:09:01
BZCOMMENTOR::Bob Kline
BZCOMMENT::14
Completed and terminated trials are automatically skipped, except for those which are marked with the 'force' flag. Lakshmi said she will review your findings and make a determination about whether to proceed with modifying the query we submit to NLM.
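The skip rule described here can be sketched as a small predicate. The status strings and the function name are hypothetical; the actual import program's field names and status values are not shown in this issue.

```python
# Statuses that are skipped automatically (values assumed for illustration).
SKIPPED_STATUSES = {"Completed", "Terminated"}


def should_import(status, forced=False):
    """Import unless the trial is completed/terminated and not forced."""
    return forced or status not in SKIPPED_STATUSES


print(should_import("Recruiting"))              # active trial is imported
print(should_import("Completed"))               # skipped automatically
print(should_import("Completed", forced=True))  # 'force' flag overrides the skip
```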
BZDATETIME::2010-05-13 12:50:47
BZCOMMENTOR::Bob Kline
BZCOMMENT::15
Test modification of the query to add "National Cancer Institute" as sponsor.
BZDATETIME::2010-05-19 16:03:17
BZCOMMENTOR::Bob Kline
BZCOMMENT::16
We now have three possible components to the query we submit to NLM:
CONDITION=cancer OR lymphedema OR ... [what we've been using in production]
DISEASE=cancer OR neoplasm [what this issue asked that we add]
SPONSOR=National Cancer Institute [the latest suggestion]
I've done a number of queries combining these components in various ways to see if we can determine what the SPONSOR component brings to the table.
If we use just the original CONDITION component, we get 24,406 hits. When we add the DISEASE component, we get another 168 trials. This is roughly in line with what we saw in comment #12 (the numbers have changed since several days have elapsed and NLM's database has changed). If we add the SPONSOR component, we pick up another 236 trials not caught by the combination of the CONDITION and DISEASE pieces. The list of those 236 trials is attached.
I also compared the CONDITION+SPONSOR components against the original (CONDITION only) version of the query (ignoring the DISEASE component altogether): adding the SPONSOR component picks up 257 trials not hit by the original query. Finally, I checked to see what we would miss by leaving out the DISEASE component by comparing the version which uses all three components against the CONDITION+SPONSOR version: the CONDITION+SPONSOR version picked up 147 fewer trials.
I can post the delta lists for those additional comparisons if there is interest in reviewing them, but the thrust of the latest request from Lakshmi was to see what adding the SPONSOR component picked up that's not in the CONDITION+DISEASE version.
Here's the output of the script which did the searches and comparisons:
term=(lymphedema+OR+myelodysplastic+syndromes+OR+neutropenia+OR+aspergillosis+OR+mucositis+OR+cancer)+[CONDITION]&studyxml=true
condition (24406 hits): 207.357000 seconds
term=(lymphedema+OR+myelodysplastic+syndromes+OR+neutropenia+OR+aspergillosis+OR+mucositis+OR+cancer)[CONDITION]OR(cancer+OR+neoplasm)[DISEASE]&studyxml=true
condition+disease (24574 hits): 299.386000 seconds
term=(lymphedema+OR+myelodysplastic+syndromes+OR+neutropenia+OR+aspergillosis+OR+mucositis+OR+cancer)[CONDITION]OR(cancer+OR+neoplasm)[DISEASE]OR(National+Cancer+Institute)+[SPONSOR]&studyxml=true
condition+disease+sponsor (24810 hits): 175.919000 seconds
term=(lymphedema+OR+myelodysplastic+syndromes+OR+neutropenia+OR+aspergillosis+OR+mucositis+OR+cancer)[CONDITION]OR(National+Cancer+Institute)[SPONSOR]&studyxml=true
condition+sponsor (24663 hits): 170.294000 seconds
257 in sponsor-not-condition
236 in sponsor-not-condition-or-disease
147 in disease-not-condition-or-sponsor
Attachment sponsor-not-condition-or-disease.set has been added with description: Added to results by picking up trials for which the sponsor is National Cancer Institute
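The delta figures in this comment can be cross-checked from the hit counts alone. The sketch below assumes that each larger query returns a superset of the smaller one (it only ORs in an extra component) and that NLM's data did not change between the runs; under those assumptions a difference in hit counts equals the number of trials added.

```python
# Hit counts copied from the script output above.
hits = {
    "condition": 24406,
    "condition+disease": 24574,
    "condition+sponsor": 24663,
    "condition+disease+sponsor": 24810,
}


def added(larger, smaller):
    """Trials gained by moving from the smaller query to the larger one."""
    return hits[larger] - hits[smaller]


print(added("condition+disease", "condition"))                  # 168
print(added("condition+disease+sponsor", "condition+disease"))  # 236
print(added("condition+sponsor", "condition"))                  # 257
print(added("condition+disease+sponsor", "condition+sponsor"))  # 147
```

The four differences agree with the 168, 236, 257, and 147 trials reported in the comparisons.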
BZDATETIME::2010-05-24 11:38:37
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::17
Here is a breakdown of our review. Please let me know if you have any questions. Also, the inclusion of “National Cancer Institute” as a collaborator or sponsor appears to have brought in NCT00987597, which is a trial sponsored by "National Cancer Institute, France". I am not sure if there are others like that, but this is the only one we came across, so there might not be too many like it.
Import/CANCER = 59
Closed/CANCER = 52
Active with NCI Sponsor/collaborator but not cancer = 13
Suspended with NCI as collaborator = 1
Completed/CANCER = 31
Completed but NCI collaborator/Sponsor = 68
DO NOT IMPORT = 13 (Withdrawn, non-cancer, completed)
BZDATETIME::2010-06-04 15:56:18
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::18
(In reply to comment #17)
> Also, the inclusion of “National Cancer Institute” as a collaborator or
> sponsor appears to have brought in NCT00987597, which is a trial sponsored by
> "National Cancer Institute, France". I am not sure if there are others like
> that, but this is the only one we came across, so there might not be too many
> like it.
I don't think it is problematic to bring in the French trial - it would still be in scope.
Are we ready to implement these changes?
BZDATETIME::2010-06-04 16:07:11
BZCOMMENTOR::Bob Kline
BZCOMMENT::19
(In reply to comment #18)
> Are we ready to implement these changes?
As soon as I get the green light I can plug them in.
BZDATETIME::2010-06-04 16:31:45
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::20
Looks good to me. If William and Ning are OK we can promote.
BZDATETIME::2010-06-04 16:32:27
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::21
Do we need to test the promotion on Mahler or Franck first?
BZDATETIME::2010-06-04 16:35:57
BZCOMMENTOR::Bob Kline
BZCOMMENT::22
Not a bad idea to test on Franck.
BZDATETIME::2010-06-04 16:53:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::23
Yes. Testing on Franck before promoting to Bach will be good, and yes we are ready.
BZDATETIME::2010-06-04 20:03:30
BZCOMMENTOR::Bob Kline
BZCOMMENT::24
I'm testing on Mahler first, then on Franck.
BZDATETIME::2010-06-08 14:21:48
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::25
(In reply to comment #24)
> I'm testing on Mahler first, then on Franck.
I see the trials are on the review page (Mahler). Should we be testing too, or should we wait and test on Franck?
BZDATETIME::2010-06-08 14:24:24
BZCOMMENTOR::Bob Kline
BZCOMMENT::26
You can look at them if you like, but you probably only need to review the results on Franck. I tested on Mahler just to make one last check to make sure nothing blew up after I folded the new query logic into the download program (pulled in from the separate program built to perform the analysis of the differences in results).
BZDATETIME::2010-06-08 14:26:24
BZCOMMENTOR::Bob Kline
BZCOMMENT::27
(In reply to comment #26)
> ... but you probably only need to review the results on Franck.
Just so there's no confusion: the tests on Franck are still running, so nothing to review quite yet. Will let you know.
BZDATETIME::2010-06-09 12:09:55
BZCOMMENTOR::Bob Kline
BZCOMMENT::28
Test download/import jobs have completed on Franck. Please check to make sure that (a) new trials you were hoping to pick up are included and (b) the number of unwanted trials on the review list is not overwhelming.
BZDATETIME::2010-06-14 12:17:33
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::29
(In reply to comment #28)
> Test download/import jobs have completed on Franck. Please check to make sure
> that (a) new trials you were hoping to pick up are included and (b) the number
> of unwanted trials on the review list is not overwhelming.
We have reviewed the trials on Franck and our findings are consistent with our review of the query results above. Please promote this to Bach.
BZDATETIME::2010-06-14 13:42:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::30
Promoted to Bach; please check (and close if OK).
BZDATETIME::2010-06-16 13:13:58
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::31
(In reply to comment #30)
> Promoted to Bach; please check (and close if OK).
Verified on Bach. Issue closed. Thanks!
BZDATETIME::2010-06-16 13:14:16
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::32
Now closed.
File Name | Posted | User |
---|---|---|
combined.set | 2010-05-04 15:17:56 | |
new-ctgov-query.set | 2010-05-04 15:15:41 | |
new-not-old.set | 2010-05-06 10:28:14 | |
new-not-old.set | 2010-05-04 15:17:03 | |
old-ctgov-query.set | 2010-05-04 15:15:07 | |
old-not-new.set | 2010-05-04 15:16:35 | |
sponsor-not-condition-or-disease.set | 2010-05-19 16:03:17 | |