CDR Tickets

Issue Number 1782
Summary Projected Accrual to Expected Enrollment
Created 2005-12-13 08:21:43
Issue Type Improvement
Submitted By Grama, Lakshmi (NIH/NCI) [E]
Assigned To alan
Status Closed
Resolved 2006-04-25 16:00:32
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.106110
Description

BZISSUE::1931
BZDATETIME::2005-12-13 08:21:43
BZCREATOR::Lakshmi Grama
BZASSIGNEE::Alan Meyer
BZQACONTACT::Sheri Khanna

Volker created a spreadsheet identifying the values that occur in the accrual field within <Para> tags. We also need to identify values that are in this element but within other tags, possibly <TT>.

We would like to programmatically convert the total number of patients that will be recruited for the trial into a numerical value. It seems that there may be ways in which we can parse the text to obtain the value. Could you propose some solutions for review?

Comment entered 2005-12-13 08:24:09 by Grama, Lakshmi (NIH/NCI) [E]

BZDATETIME::2005-12-13 08:24:09
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::1

The attachment contains only text in the <Para> element. Need to ask Volker to include other values also.

Comment entered 2005-12-13 10:44:15 by alan

BZDATETIME::2005-12-13 10:44:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

From looking at Volker's results, it appears to me that it will
be impossible to get 100% reliable output from any program that
tries to convert the text field into an integer. Even a human
would have trouble with some of the data, for example

"14 patients will be entered to receive fotemustine as
first-line therapy and/or 19 to receive the drug as second-line
therapy."

That sentence may mean that there will be somewhere between 19
and 33 (14+19) patients total, but I'm not certain of that
interpretation.

Some possible issues are:

1. Should the program attempt to determine a confidence level for
any decision it makes?

2. Should only high confidence decisions be propagated to the
ExpectedEnrollment element? How high is high for this
purpose?

3. How large a percentage of high confidence results are required
to make this worthwhile? How much effort is justified to
achieve that?

and perhaps separately,

4. Should ExpectedEnrollment be a two-part field, e.g.,
MinExpectedEnrollment and MaxExpectedEnrollment?

Let's discuss this at the meeting.

Comment entered 2005-12-13 11:54:37 by alan

BZDATETIME::2005-12-13 11:54:37
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

ANALYSIS
--------

After discussing this further with Lakshmi, I have the following
understanding of what we will try to do:

1. The goal of the program is to retrospectively assign
ExpectedEnrollment values to InScopeProtocol documents. It
will only be used for retrospective assignment of these values
since current new documents have the ExpectedEnrollment
assigned as part of the document creation process.

2. The program will attempt to isolate total expected numbers of
patients from the target documents. Where a range is found,
e.g., "10-20", the higher number will be used.

3. The initial targets of the assignments are the more important
protocols and the ones that are the most likely to have
recognizable text patterns containing expected enrollment
numbers. Initially, this will be all published trials with:

/InScopeProtocol/ProtocolIDs/OtherID/IDType =
'ClinicalTrials.gov ID'

4. Whether we go beyond those will depend on how successful we
are with the initial group.
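
A minimal sketch of the "use the higher number" rule from item 2,
assuming Python regular expressions. The patterns, the pattern
names and the function guess_enrollment are hypothetical
illustrations, not the pattern set the real program used.

```python
import re

# Hypothetical patterns for illustration only.
NUM = r"(?:\d{1,3}(?:,\d{3})+|\d+)"   # "1,000" or "1000"
NUMBER_RE = re.compile(NUM)
RANGE_RE = re.compile(r"(" + NUM + r")\s*(?:-|to)\s*(" + NUM + r")")


def to_int(text):
    """Convert a matched number such as "1,000" to an int."""
    return int(text.replace(",", ""))


def guess_enrollment(text):
    """Return (number, pattern_name) for a ProjectedAccrual string.

    The pattern name records which rule fired so that a reviewer
    can see what the program thought it was seeing.
    """
    match = RANGE_RE.search(text)
    if match:
        # A range such as "10-20": take the higher end.
        return max(to_int(match.group(1)), to_int(match.group(2))), "range"
    numbers = [to_int(n) for n in NUMBER_RE.findall(text)]
    if not numbers:
        return None, "none"
    if len(numbers) == 1:
        return numbers[0], "single"
    # Several unrelated numbers: fall back to the highest one.
    return max(numbers), "multiple"
```

A sketch like this cannot tell an enrollment count from any other
number in the text (an age limit, for example), which is one
reason the results need human review.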

IMPLEMENTATION
--------------

In order to implement this, I propose to do the following:

1. Re-run Volker's program, limiting it to documents with the
above IDType value.

That will show us just those patterns of greatest interest for
the initial target. These are expected to be much more
regular than the complete set of protocol documents.

2. Identify generalized text patterns found in Volker's output.

3. Write a program to search for these patterns and derive the
ExpectedEnrollment values.

4. Output a report containing the following data:

Document ID

Text value of the Para element from which enrollment
numbers were taken.

An identifier for the pattern that was recognized so that
the reader of the report can see what the program thought
it was seeing.

Possibly a confidence number indicating how confident the
program was in its output. I'm not sure this will actually
work, but I'll at least look into it.

User override input field (see NOTES below.)

I'll try to find a useful way to sort this report. For
example, if I have confidence values, I'll make that the first
sort key. I might then use the specific recognized pattern so
that we can see how successful any particular pattern group is
as a whole, and then the document ID.

Hopefully, we will be able to spot problems via this report,
fix them, and re-run iteratively until we get it right.

5. If and when QA determines that we are getting good quality
output, I will add another step to update the actual
documents.

If QA determines that only some of the patterns are working
reliably, we may use just those patterns and report which
documents must still be checked by hand.

There may also be some documents that fall outside any
recognized patterns and must also be reported for hand
checking and updating.
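
The report ordering described in step 4 could be sketched as
follows. The ReportRow fields and the choice to list the lowest
confidence rows first are assumptions made for illustration; the
ticket does not say which direction the confidence sort runs.

```python
from collections import namedtuple

# Hypothetical row layout mirroring the report columns in step 4.
ReportRow = namedtuple(
    "ReportRow", "doc_id para_text pattern_id confidence guess"
)


def sort_report(rows):
    """Sort by confidence, then recognized pattern, then document ID.

    Assumption: least confident rows come first so that a reviewer
    sees the doubtful guesses before the safe ones.
    """
    return sorted(rows, key=lambda r: (r.confidence, r.pattern_id, r.doc_id))
```

Grouping by pattern inside each confidence level keeps all rows
produced by one pattern together, so a weak pattern shows up as a
visible cluster of bad guesses.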

NOTES
-----

It might be that the easiest way for a human user to correct the
program output would be to output the report as an HTML form with
one extra column containing an input field that a user can use to
input corrected data. I could then read the report back in and
do the following:

For each document:

If the user has input a number into the input field,
overriding whatever the program produced, update the
corresponding document with that number.

Else update the document with the program generated
number.

If there are more than a few documents that the program failed to
analyze successfully, that should be a lot faster than requiring
the user to call up each document individually in XMetal.

If we take this approach, we may want to do the final run in
batches of, say, 100 documents at a time. That might make it
easier for the user to handle updates without being overwhelmed
by a report on thousands of documents which must all be checked
before submitting the changes.
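
As a rough sketch of the override rule and the batching idea
above, assuming the form returns each override as a string (empty
when the user typed nothing). The names resolve_enrollment and
batches are hypothetical.

```python
def resolve_enrollment(guess, override):
    """Pick the number to store for one document.

    guess is the program generated number (or None);
    override is the raw text from the user's input field.
    """
    text = (override or "").strip()
    if not text:
        return guess            # no override: keep the program's number
    if not text.isdigit():
        raise ValueError("override must be a whole number: %r" % text)
    return int(text)            # user input wins over the guess


def batches(doc_ids, size=100):
    """Yield successive lists of at most `size` document IDs."""
    for start in range(0, len(doc_ids), size):
        yield doc_ids[start:start + size]
```

A result of None (no guess and no override) would mean the
document still has to be handled by hand.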

Comment entered 2005-12-22 16:30:02 by alan

BZDATETIME::2005-12-22 16:30:02
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I plan to implement the program with the following method of
batching updates:

The user who initiates the program will be prompted for some
number of documents to process. He then enters a number, for
example, 100.

The program will run the algorithm on the first 100 documents it
finds that do not already have an ExpectedEnrollment element. It
will compute an expected enrollment for each document and add the
document to a report with the data it found, the algorithm it
selected, the number it computed, and an input box allowing a
user to override the number.

At the bottom of the report will be a submit button. A user may
change any expected enrollment numbers he wishes and click
Submit. The program will read the submitted data and update the
100 documents.

At that point a user can run again for another 100 (or whatever
number of documents desired) and continue until all documents
have ExpectedEnrollment fields.

For the first few runs of the report, a user might run several
thousand documents and not fix anything and not click Submit -
just evaluate whether the program is doing a reasonable job and
whether any changes are needed to the algorithm. I will apply
any such changes to the logic until users think the program is
doing as well as we can expect. Then a user can run real batches
through the program to perform the actual updates.
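
The selection step ("the first N documents that do not already
have an ExpectedEnrollment element") might look roughly like
this. It assumes the documents are available as (doc_id,
xml_text) pairs, which is a simplification for illustration; the
real program presumably selects them from the CDR database.

```python
import itertools
import xml.etree.ElementTree as ET


def needs_enrollment(xml_text):
    """True if the document has no ExpectedEnrollment element yet."""
    root = ET.fromstring(xml_text)
    return root.find(".//ExpectedEnrollment") is None


def first_batch(docs, count):
    """Return the IDs of the first `count` documents still to be done.

    docs is an iterable of (doc_id, xml_text) pairs.
    """
    pending = (doc_id for doc_id, xml_text in docs
               if needs_enrollment(xml_text))
    return list(itertools.islice(pending, count))
```

Because updated documents drop out of the selection, running the
same request again naturally picks up the next batch.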

Comment entered 2005-12-22 17:02:04 by Grama, Lakshmi (NIH/NCI) [E]

BZDATETIME::2005-12-22 17:02:04
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::5

Looks good to me.

Comment entered 2005-12-27 23:10:20 by alan

BZDATETIME::2005-12-27 23:10:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

I've started writing the code for this. It's complicated and
takes advantage of tricky Python features that I wouldn't
ordinarily like to use, but which seem to me to significantly
simplify the code in this unusual case.

I plan to use the ModifyDocs module (the "one-off global change
harness") and XSLT to do the actual addition of
ExpectedEnrollment elements to the documents. I'm using that
module because it does all of the things we normally want any
global change to do, namely:

Update both the CWD and the last publishable version.

Create a new version of the current CWD before modifying it,
if the new CWD is not the same as the last version.

Provide some testing and logging features.

I assume we want to do those things.

Comment entered 2006-01-03 22:40:29 by alan

BZDATETIME::2006-01-03 22:40:29
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

I think it is now working.

It worked much better than I thought it would. I tried 1,000
documents on Mahler and only hit 25 for which I could not
generate an ExpectedEnrollment number. I think almost all of my
guesstimated numbers are accurate - assuming we are right in
using the highest number available when multiple numbers are
presented.

To use the program, call up the CDR Admin menu for
Developers/System Administrators on Mahler, and choose the next
to last menu entry "Guess ExpectedEnrollment from
ProjectedAccrual". Then follow the on-screen directions.

Like all global change programs, this one has a "test mode". If
the test mode check box is checked (the default), no documents
will be changed. Output is generated into diff directories in
the usual place (d:\cdr\GlobalChange...). If test mode is
unchecked, the real documents will be updated, including
publishable and current working versions.

To just test the pattern recognition techniques, request a large
number of test records (I generated 1,000 in about two minutes)
and look at the output without submitting anything.

To make this work on Bach, I'll need to re-index the protocols
there in order to index the ExpectedEnrollment element. I'll do
that if and when I'm told to put this into production.

I have not put the new Admin Developer/SA menu under CVS. I
presume we only want to use this software for a short period, and
then delete it from the menu.

Notes to myself for promoting the software:

Update query_term_def table on Bach.
Re-index InScopeProtocols.
Copy cgi/DevSA.py
Copy cgi/Request1391.py

Comment entered 2006-01-18 13:21:56 by priced

BZDATETIME::2006-01-18 13:21:56
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::8

I generated a report for 1,000 test records in Mahler and most of the output is correct. Of the records that I looked at, a small number (around 10-15) are ones where it is unlikely that a program could calculate the Expected Enrollment correctly because there is more than one phase of the study with different numbers of patients that need to be added together, etc., as Alan pointed out in comment #2. There are others that don't follow the set rules and will need to be looked at manually.

I also ran a few in live mode in Mahler and the global worked correctly.

I think we can proceed with moving this to Bach.

Comment entered 2006-01-19 19:09:35 by alan

BZDATETIME::2006-01-19 19:09:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

In moving this to Bach I ran into a small problem. Without
realizing it, I had used a Python capability in the program
that is only supported under Python 2.4, not the 2.3.4 version
running on Bach.

I modified the code on Mahler (it was a very small change),
tested there, then re-promoted the software to Bach and
Franck. I tested on Bach by running 10 documents in test
mode and everything looks fine.

I noticed while doing it that our ModifyDocs module (the
"one-off global change harness") that is used by this and
other programs is writing debugging statements to stderr
that wind up on the user's screen. We probably want to
eliminate that, but I left it alone for now. When running
the program, harmless messages like:

"2006-01-19 18:58:09: Processing CDR0000063244 "

may appear on the screen after submitting the request for
updates.

Comment entered 2006-02-01 18:25:58 by priced

BZDATETIME::2006-02-01 18:25:58
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::10

Verified in Bach.

Comment entered 2006-02-16 11:29:08 by alan

BZDATETIME::2006-02-16 11:29:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I've modified this program on Mahler to sort the retrieved
documents by CurrentProtocolStatus, then by document ID.
This should put all the "Active" protocols at the top, then
"Closed", "Completed", "Temporarily closed", etc.

On Mahler, there are actually a few that have no status at
all that appear at the top. I don't know if that will happen
on Bach.
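
For illustration, a plain two-key sort reproduces this ordering,
including the documents with no status landing at the top. The
(status, doc_id) tuple layout is an assumption made for the
sketch.

```python
def sort_by_status(docs):
    """Sort (status, doc_id) pairs by status, then by document ID.

    A missing status is treated as an empty string, which sorts
    ahead of every real status value.
    """
    return sorted(docs, key=lambda d: (d[0] or "", d[1]))
```

The status order is simply alphabetical, which is why "Active"
comes first and "Temporarily closed" comes late.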

Comment entered 2006-02-16 12:31:40 by priced

BZDATETIME::2006-02-16 12:31:40
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::12

I've looked at the first 50 in Mahler and they all are Active, so it appears to be sorted okay. There is a small formatting problem with the Admin screen now. A few titles on the first page of the Admin screen are not displaying. I've attached a view so you will understand what I am talking about.

Comment entered 2006-02-16 15:14:53 by alan

BZDATETIME::2006-02-16 15:14:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

(In reply to comment #12)
> .... There is a small formatting problem with Admin screen now. A
> few titles on the first page of the Admin screen are not displaying. I've
> attached a view so you will understand what I am talking about.

The problem is in the data rather than the program. These
few records are exotic documents that have no current status
(they aren't "Active" or anything else) and have empty
ProjectedAccrual elements. Some are probably copied versions
from Bach that were saved before editing was complete, and one
is a strange exceptional case.

They may not be on Bach at all. If they are, they'll need much
more work than just entering an estimated enrollment. I don't
think we need to worry about them for this program.

Comment entered 2006-03-21 19:07:17 by alan

BZDATETIME::2006-03-21 19:07:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

Sheri discovered that the program was behaving incorrectly
when there is more than one ProtocolPhase. It created
multiple ExpectedEnrollment elements. I hadn't noticed
when I first wrote the filter that there can be more than
one ProtocolPhase. (Oops)

I have modified it to only insert an ExpectedEnrollment
element after the last ProtocolPhase.
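
The corrected rule can be sketched in Python with ElementTree.
This is not the XSLT filter itself, just an illustration of the
same idea, and it assumes all ProtocolPhase elements are siblings
under one parent. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def add_expected_enrollment(xml_text, value):
    """Insert one ExpectedEnrollment after the last ProtocolPhase."""
    root = ET.fromstring(xml_text)
    if root.find(".//ExpectedEnrollment") is not None:
        return xml_text                      # already present: leave as is
    for parent in root.iter():
        children = list(parent)
        positions = [i for i, child in enumerate(children)
                     if child.tag == "ProtocolPhase"]
        if positions:
            last = positions[-1]
            new = ET.Element("ExpectedEnrollment")
            new.text = str(value)
            new.tail = children[last].tail   # keep surrounding whitespace
            parent.insert(last + 1, new)     # exactly one insertion
            break
    return ET.tostring(root, encoding="unicode")
```

Inserting once per parent rather than once per ProtocolPhase is
what avoids the duplicate elements.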

I have erased the old program on Bach. We know it was
damaging data and should not be run. When Sheri says the
new version is right, I'll promote it to Bach.

There are 1477 InScopeProtocols on Bach with more than one
ProtocolPhase. I don't know how many of them have more than
one ExpectedEnrollment. I could create a new index and
re-index to find out. Do we need to do this, and perhaps
write a one-off global change to fix the fixes, i.e., to
erase any duplicate ExpectedEnrollment elements that were
created?

I should probably do that unless Sheri already had the
problem under control.

We're also having a performance problem on Bach. I was
able to transform 100 documents on Mahler in 2 minutes
and 19 seconds. But when Sheri did it on Bach, it timed
out.

Bob and I have asked the sysadmins to turn off "on access"
virus scanning on Bach. It doesn't protect us from
anything and we suspect it dramatically slows the machine.

Hopefully this can be done and will solve the problem. If
it doesn't, I can make this program perform the updates in
the background as a batch process, but I'm hoping it won't
come to that.

Comment entered 2006-03-22 18:13:02 by priced

BZDATETIME::2006-03-22 18:13:02
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::15

Modification to global (for more than one ProtocolPhase) has been verified in Mahler. Please promote to Bach.

As for the documents with multiple ProtocolPhase elements in Bach, I just need a list of the 1477 documents that have them. I can compare the Doc IDs to the ones that have been changed by the global to make sure that I have caught them all. I haven't changed that many documents yet, so this should be fairly easy.
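
For what it's worth, the comparison described here amounts to a
set intersection. A minimal sketch, assuming both lists hold
plain integer document IDs; docs_to_recheck is a hypothetical
name.

```python
def docs_to_recheck(multi_phase_ids, changed_ids):
    """Return IDs that have several phases AND were already changed.

    These are the documents that may contain duplicate
    ExpectedEnrollment elements.
    """
    return sorted(set(multi_phase_ids) & set(changed_ids))
```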

Comment entered 2006-03-23 10:31:39 by alan

BZDATETIME::2006-03-23 10:31:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

I've attached a list of 1477 IDs, in doc id order.

I've also moved the corrected program to Bach.

Comment entered 2006-04-11 10:28:25 by alan

BZDATETIME::2006-04-11 10:28:25
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

The program is working. It runs slower than expected on Bach,
but we have taken steps to fix some memory problems on Bach that
may solve the problem.

Comment entered 2006-04-11 11:49:54 by priced

BZDATETIME::2006-04-11 11:49:54
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::18

I aimed high and tried to run the global for 100 docs, but it timed out.
It worked okay for 50 docs, though.

Comment entered 2006-04-25 15:59:49 by priced

BZDATETIME::2006-04-25 15:59:49
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::19

Closing issue for now. I'm proceeding to process the globals for 50 docs at a time.

Comment entered 2006-04-25 16:00:32 by priced

BZDATETIME::2006-04-25 16:00:32
BZCOMMENTOR::Sheri Khanna
BZCOMMENT::20

Issue closed.

Comment entered 2013-06-28 12:58:24 by Kline, Bob (NIH/NCI) [C]

"Estimate ExpectedEnrollment.htm" was uploaded by Sheri Khanna 2006-02-16 12:31 EST as "UserAdminScreen"; "x.txt" was uploaded by Alan Meyer 2006-03-23 10:31 EST as "Doc IDs for docs on Bach with more than one ProtocolPhase"; a third attachment was uploaded by Lakshmi Grama 2005-12-13 08:24 EST as "Entries in the Projected Accrual element" but this attachment was subsequently deleted. For some reason the conversion of the Bugzilla database to JIRA failed to include any attachments for this issue, so I am attaching them by hand at the request of Calla Pearce (CBIIT). Bob Kline (2013-06-28 12:58 PM).

Attachments
File Name Posted User
Estimate ExpectedEnrollment.htm 2013-06-28 12:58:24
x.txt 2013-06-28 12:58:24
