Issue Number | 3478 |
---|---|
Summary | ICRDB Stats Report |
Created | 2012-02-07 15:57:13 |
Issue Type | Improvement |
Submitted By | Beckwith, Margaret (NIH/NCI) [E] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2012-04-19 14:17:46 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107806 |
BZISSUE::5173
BZDATETIME::2012-02-07 15:57:13
BZCREATOR::Margaret Beckwith
BZASSIGNEE::Volker Englisch
BZQACONTACT::William Osei-Poku
Rick has asked me to provide a monthly report capturing various ICRDB production statistics. I am attaching the report that I have been doing manually twice a year as a template, but I would like to figure out a way to generate the report more automatically if possible. (I only need to provide the table of numbers, not the additional pages listing new and revised documents.)
Attachment ICRDBProdNumbers 2011.doc has been added with description: ICRDB Production Statistics
BZDATETIME::2012-02-08 11:13:38
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1
Since you are passing this on to Rick/Lenora do you have a preference
on the output format?
a) An HTML email report
b) Excel spreadsheet
c) Plain ASCII or CSV format
If you need to modify the report before passing it on options (b) is probably the easiest to modify.
BZDATETIME::2012-02-09 17:36:24
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2
These are the numbers for January based on the ad-hoc reports.
ICRDB Pub Numbers - Summaries reformatted
CDR-ID Title Audience Creation Date
---— -------------------------------— -----— -------------------
717154 Descripción del tratamiento de los Patients 2012-01-06
11:19:13
tumores del cerebro y la médula
espinal en niños*BLOCKED*;
Treatment;Patients
698070 Late Effects of Treatment for Patients 2012-01-10 12:22:34
Childhood Cancer BLOCKED;
Treatment;Patients
721528 Náusea y vómito*BLOCKED*; Patients 2012-01-24
15:18:15
Supportive care;Patients
ICRDB Pub Numbers - Meeting recordings created.
CDR-ID Title Creation Date
---— ------------------------------------------—
--------------------
723667 Adult Treatment Meeting Recording 01-17-12; 2012-01-18
18:26:55
meeting recording;MP3
724178 Genetics Meeting Recording 01-24-12; 2012-01-26 16:48:17
meeting recording;MP3
724179 Screening and Prevention Meeting Recording 2012-01-26
17:02:27
01-18-12;meeting recording;MP3
724492 Pediatrics Meeting Recording 01-27-12; 2012-01-31 13:00:10
meeting recording;MP3
ICRDB Pub Numbers - Images published this year
CDR-ID Title Publishing Date
---— -------------------------------------------—
-------------------
722030 breast-sparing surgery;test or procedure;JPEG 2012-01-31
18:00:04
ICRDB Pub Numbers - Drug Terms Published
CDR-ID Title First_pub
---— ------------------------------------------—
--------------------
721951 cisatracurium besylate;Index term;Drug/agent 2012-01-04
18:00:04
721983 alpha 1 antitrypsin;Index term;Drug/agent 2012-01-05
18:00:03
721984 sodium iodide I-131;Index term;Drug/agent 2012-01-05
18:00:03
721986 capecitabine rapidly disintegrating tablet; 2012-01-05
18:00:03
Index term;Drug/agent
722029 iodine I 124 monoclonal antibody 8H9; 2012-01-06 18:00:04
Index term;Drug/agent
722041 rhubarb/licorice herbal supplement; 2012-01-06 18:00:04
Index term;Drug/agent
722307 iodine I 124 NM404;Index term;Drug/agent 2012-01-09
18:00:04
722308 iodine I 131 NM404;Index term;Drug/agent 2012-01-09
18:00:04
722324 pbi-shRNA STMN1 lipoplex;Index term;Drug/agent 2012-01-10
18:00:04
722359 anti-myostatin monoclonal antibody LY2495655; 2012-01-10
18:00:04
Index term;Drug/agent
723999 radotinib hydrochloride;Index term;Drug/agent 2012-01-24
18:00:04
ICRDB Pub Numbers - Audio published = 768
(I spare you with the list)
BZDATETIME::2012-02-14 10:51:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3
The mystery of the duplicated records for the Published Media
Documents report has been fixed. Because we're excluding media documents
from this report that are listed with a category of 'pronunciation' or
'meeting recording' selection of the category field had to be included
in the SQL query. This resulted in image files with multiple category
values in the document to be listed once for each category.
This has been fixed and - since it was a very minor change - copied to
BACH.
As mentioned yesterday, I also realized that I hadn't been careful
enough with some of my SQL queries in terms of using the correct date
range. This caused the results of our numbers in some instances to
include or exclude one more day depending which day had been entered as
the last day of the month for which query.
I will make sure to handle this issue correctly for this report.
BZDATETIME::2012-02-14 19:49:35
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4
Would you like to include or have the option to include the individual document IDs and Titles?
BZDATETIME::2012-02-15 09:30:09
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::5
This is fantastic! I think it would be great to be able to see the ID's and Titles for the report (especially the titles).
BZDATETIME::2012-02-15 13:12:32
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6
If there are now documents for a given "row" (like 'New Media', 'New Summaries', etc.) are we going to report the record anyway with a count of zero or are we going to drop that row from the report?
Same question for the report including the option to display all records. If there are no summaries, for instance, for a given time period do we want to include a message like 'No Summaries found' or just drop this category?
I'm thinking I will also create options to individually display categories. For one thing it will help me with testing since I can selectively run a section/document type I'm working on and secondly it may be nice being able to run the report for just Summaries and Media or just GlossaryTerms and DIS if need be.
BZDATETIME::2012-02-15 13:29:54
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::7
We want the category to show with 0 (at least for the monthly report I am going to be sending).
BZDATETIME::2012-02-17 10:05:39
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8
On the report you are showing multiple rows for the summaries:
Total Summaries
HP Summaries (EN)
HP Summaries (ES)
Pat Summaries (EN)
Pat Summaries (ES)
If you're running this report with the option to show the individual documents would you like to have those documents split into the different "groups" as well or would it be OK to just list all summaries with a column for the audience and language?
BZDATETIME::2012-02-22 13:48:13
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9
I think I've done as much as I could without additional guidance from
Margaret. The report currently includes all of the new published
documents.
We will still need to work on the reports for the revised documents.
BZDATETIME::2012-03-01 11:57:15
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::10
Is there a way to look at the report? I think the answer to your question aboutt he list of summaries is that I would prefer to have it at least separated by language, and then indicate audience. I also want to have the lists of doucments at the end so I can decide to just send the table with numbers if I want to. Does this make sense?
BZDATETIME::2012-03-09 11:54:06
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11
I finished❓ the part of this report to get the numbers for Genetics Dictionary terms and before I forget I want to summarize how I'm getting the numbers and how this may or may not be what we're looking for.
a) Getting the number for revised terms.
We decided to use the DateLastModified (DLM) associated with the
individual
TermDefinition to identify updated terms. The DLM is not (or should not
be)
created when a new TermDefinition gets created, only when it's
updated.
b1) Getting the number for new terms from the GTN document
We're counting a new Genetics dictionary term only when it's created
along
with a brand new GlossaryTermName document (and we're counting the
first_pub date within our time frame). From the data that I have looked
at
this is often then case but we are missing any Genetics dictionary
term
definition that is being added to an already existing concept.
b2) Getting the number for new terms from the GTC document
Robin said we could try using the StatusDate of the Genetics
TermDefinition
which appears to be a good approximation. I am counting those
Genetics
terms for which the StatusDate falls within our time frame provided
the
DLM does not exist. Unfortunately, the DLM often does exist with a
date
that's identical to the status date.
Assume the following situation:
StatusDate: 2011-07-01
DateLastModified: 2011-07-01
First_Pub: 2009-01-01
Time Frame: CY2011
This Genetics Termdefinition has been added to an existing concept.
We
won't count this term as a new term because the DLM exists even though
my
guess is that the DLM had been incorrectly added when the
TermDefinition
block had been created. We do count this term as an updated term
though.
We should probably decide if we rather want to err on the side of the
new
term definitions or the revised ones.
At the moment I am creating the numbers for (b1) and (b2) so that we can take a look and decide which ones we want to use for the monthly report.
BZDATETIME::2012-03-09 17:53:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12
Questions regarding the Genetics Professionals:
a) We are only interested in new persons, right? No updates.
b) We are counting new Person documents with a
GeneticsProfessionalDetails
block.
Do we take into account the Include or CurrentStatus elements?
BZDATETIME::2012-03-14 15:08:10
BZCOMMENTOR::Volker Englisch
BZCOMMENT::13
I believe I've finished all of the queries for the ICRDB Stats report at this point and we need to talk about fine tuning. The report (on MAHLER) comes out as follows:
Document From 2011-01-02
to 2012-02-29
Total Summaries - New 4
HP Summaries (ES) 0
Pat. Summaries (EN) 2
HP Summaries (EN) 2
Pat. Summaries (ES) 0
Total Summaries - Revised 135
Pat. Summaries (EN) 27
Pat. Summaries (ES) 29
HP Summaries (EN) 46
HP Summaries (ES) 33
Reformatted Summaries 8
Dictionary (ES) - New 73
Dictionary (EN) - New 75
Dictionary - Revised 26
Dictionary (ES) 21
Dictionary (EN) 19
Dictionary w/ pronunciation 104
Genetics Dictionary - New (GTN) 1
Genetics Dictionary - New (GTC) 2
Genetics Dictionary - Revised 0
Genetics Professionals - New 2
NCI Drug Dictionary 15
Drug Information Summaries 19
New DIS 5
Revised DIS 14
PDQ Board Members 26
Advisory 20
Editorial 6
PDQ Board Meetings 38
including WebEx 9
PDQ Images 19
New Images 18
Revised Images 2
I'd like to summarize where these numbers are coming from:
Total Summaries - New 4
This number is the sum of the following four individual numbers - HP-EN,
HP-ES,
Pat-EN, Pat-ES. The numbers are based on the first_pub date of the CDR
document table which is set the moment a document has been published for
the first time.
HP Summaries (ES) 0
Pat. Summaries (EN) 2
HP Summaries (EN) 2
Pat. Summaries (ES) 0
Same as above but numbers are counted by audience and language.
Total Summaries - Revised 135
Pat. Summaries (EN) 27
Pat. Summaries (ES) 29
HP Summaries (EN) 46
HP Summaries (ES) 33
Same as above but looking at the DLM instead of the first_pub
date.
Reformatted Summaries 8
We had discussed not to include these anymore and I just forgot to
remove them. This was calculated by checking for an existing WillReplace
element of an inactivated document with the validation date within the
given time frame.
Dictionary (ES) - New 73
Dictionary (EN) - New 75
These two numbers are counting the first_pub date of the
GlossaryTermName documents.
I didn't display a 'Total' here since we don't have Spanish terms
without an English term.
Dictionary - Revised 26
Dictionary (ES) 21
Dictionary (EN) 19
These numbers are counted from the DateLastModified elements of the
TermDefinition or TranslatedTermDefinition blocks in the
GlossaryTermConcept documents. The 'Dictionary - Revised' is not a sum
of the EN and ES numbers since both definitions may be revised within
one concept.
Note that the data is coming from the latest publishable version (not
the CWD).
Dictionary w/ pronunciation 104
This number counts the new media files published by counting the element
first_pub. The media document must be of the category of
'pronunciation'.
Note that the count does not directly link to a newly published English
dictionary term.
Genetics Dictionary - New (GTN) 1
Genetics Dictionary - New (GTC) 2
Genetics Dictionary - Revised 0
Please see comment #11 for further details. This is counting the number
of Genetics dictionary terms only if it's created as a brand new GTN or
based on the status date of the GTC document provided the DLM doesn't
exist. The number of revised terms is based on the DLM value for the
dictionary term.
Genetics Professionals - New 2
This counts the number of first_pub for Person documents published that
include a GeneticsProfessionalDetails block.
NCI Drug Dictionary 15
This number includes all terminology documents with a SemanticType of
'Drug/Agent' for which a DefinitionText exists, counting the first_pub
element.
Note that we're looking at CWD here. Do we limit the number to documents
'published to Cancer.gov' or 'publishable version available'?
Drug Information Summaries 19
New DIS 5
Revised DIS 14
Looking at first_pub for new DIS and DLM for revised documents.
PDQ Board Members 26
Advisory 20
Editorial 6
The number is counting the TermStartDate of a board member record's
MembershipDetails block with an 'Accepted' value for the
ResponseToInvitation.
PDQ Board Meetings 38
including WebEx 9
This number is counting the BoardMeetingDate in the BoardMeetings block
of the Organization document.
PDQ Images 19
New Images 18
Revised Images 2
These numbers count the Media documents with an ImageData block. The new
documents are counted by their first_pub date, the revised documents by
their DLM value.
BZDATETIME::2012-03-20 17:06:27
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14
There are most likely minor adjustments but the report is now
finished and installed on all three servers.
ICRDBStatsReport.py - R10362
The report is located in the directory
D:\cdr\publishing
It will be submitted as a scheduled job on BACH running once a month
(we need to identify on which day) and will submit an email to all
members of the distribution list
ICRDB Statistics Notification
when run in live mode or to the members of
Test Publishing Notification
when run in test mode.
The output of the report is written to the file
d:\cdr\reports\ICRDBStats_YYYY-mm-DDTHHMMSS.html
There are many options available for this report which are:
--testmode, --livemode
The report is run either in testmode or in livemode. One of these
options
must be specified. When choosing TEST mode the output file name changes
the
extension from *.html to *.test.html and the email distribution list
will
change.
--email, --noemail
By default the report runs in the EMAIL mode resulting in the report to
be
submitted by email. Specify the NOEMAIL option to suppress sending
the
report by email.
--startdate, --enddate
By default the report is counting (published) documents of the
previous
month. For instance, today the report will run for the time period
of
Feb-01 to Feb-29. When specifying either the start date or the end
date
the default dates will be overwritten.
Please note the dates specified are inclusive.
Specifying
--startdate=2012-02-01 --enddate=2012-03-01
will count the documents for February plus the documents for March
1st.
--include
By default the report will only list a count of all documents. With
this
option the user will be able to also receive a list of the counted
documents.
--cdrids
By default the report - when run with the INCLUDE option - will not
display
the CDR-IDs of the documents. With this option the CDR-IDs will be
listed
as well.
Please note: When this option is specified the list of documents is
sorted
by CDR-ID instead of document title.
--rowmax=N
By default the report - when run with the INCLUDE option - will display
all
documents counted. This could result in hundreds or thousands of
documents.
With this option the user can specify a maximum number of records to
be
displayed. For document types with fewer rows than specified by the
ROWMAX
option the option has no effect.
--debug
By default the report runs without the DEBUG option. This option
displays
additional print statements, in particular the SQL query used for
each
document type.
The following options allow the user to run the report for
particular
sections. If one of these options is specified the report will not be
saved
to disc and is only submitted by email.
--summary - display the count of new/revised summaries
--dis - display the count of new/revised drug information
summaries
--audio - display the count of audio pronunciations
--images - display the count of new/revised images
--glossary - display the count of new/revised glossary terms and
audio
--genetics - display the count of new/revised genetics dict. terms
--drugterms - display the count of new drug dictionary terms
--boardmember - display the count of new board members
--boardmeetings - display the count of board meetings and WebEx
meetings
Typically one would run the report with only two options:
ICRDBStatsReport.py --livemode --include
BZDATETIME::2012-03-20 17:08:44
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15
Sample report run on BACH limited to 50 rows per document type.
Attachment ICRDBStats_2012-03-20T170603.test.html has been added with description: ICRDB Stats Report for Feb. 2012
BZDATETIME::2012-03-29 12:19:39
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::16
This looks fantastic! I think we can close the issue. Will this be something that gets generated automatically on the first of the month and emailed to me?
BZDATETIME::2012-04-19 14:17:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17
Closing issue per discussion at status meeting.
File Name | Posted | User |
---|---|---|
ICRDBProdNumbers 2011.doc | 2012-02-07 15:57:13 | |
ICRDBStats_2012-03-20T170603.test.html | 2012-03-20 17:08:44 | Englisch, Volker (NIH/NCI) [C] |
Elapsed: 0:00:00.000607