CDR Tickets

Issue Number 3478
Summary ICRDB Stats Report
Created 2012-02-07 15:57:13
Issue Type Improvement
Submitted By Beckwith, Margaret (NIH/NCI) [E]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2012-04-19 14:17:46
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107806
Description

BZISSUE::5173
BZDATETIME::2012-02-07 15:57:13
BZCREATOR::Margaret Beckwith
BZASSIGNEE::Volker Englisch
BZQACONTACT::William Osei-Poku

Rick has asked me to provide a monthly report capturing various ICRDB production statistics. I am attaching the report that I have been doing manually twice a year as a template, but I would like to figure out a way to generate the report more automatically if possible. (I only need to provide the table of numbers, not the additional pages listing new and revised documents.)

Comment entered 2012-02-07 15:57:13 by Beckwith, Margaret (NIH/NCI) [E]

Attachment ICRDBProdNumbers 2011.doc has been added with description: ICRDB Production Statistics

Comment entered 2012-02-08 11:13:38 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-08 11:13:38
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

Since you are passing this on to Rick/Lenora do you have a preference on the output format?
a) An HTML email report
b) Excel spreadsheet
c) Plain ASCII or CSV format

If you need to modify the report before passing it on options (b) is probably the easiest to modify.

Comment entered 2012-02-09 17:36:24 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-09 17:36:24
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

These are the numbers for January based on the ad-hoc reports.

ICRDB Pub Numbers - Summaries reformatted
CDR-ID Title Audience Creation Date
---— -------------------------------— -----— -------------------
717154 Descripción del tratamiento de los Patients 2012-01-06 11:19:13
tumores del cerebro y la médula
espinal en niños*BLOCKED*;
Treatment;Patients
698070 Late Effects of Treatment for Patients 2012-01-10 12:22:34
Childhood Cancer BLOCKED;
Treatment;Patients
721528 Náusea y vómito*BLOCKED*; Patients 2012-01-24 15:18:15
Supportive care;Patients

ICRDB Pub Numbers - Meeting recordings created.
CDR-ID Title Creation Date
---— ------------------------------------------— --------------------
723667 Adult Treatment Meeting Recording 01-17-12; 2012-01-18 18:26:55
meeting recording;MP3
724178 Genetics Meeting Recording 01-24-12; 2012-01-26 16:48:17
meeting recording;MP3
724179 Screening and Prevention Meeting Recording 2012-01-26 17:02:27
01-18-12;meeting recording;MP3
724492 Pediatrics Meeting Recording 01-27-12; 2012-01-31 13:00:10
meeting recording;MP3

ICRDB Pub Numbers - Images published this year
CDR-ID Title Publishing Date
---— -------------------------------------------— -------------------
722030 breast-sparing surgery;test or procedure;JPEG 2012-01-31 18:00:04

ICRDB Pub Numbers - Drug Terms Published
CDR-ID Title First_pub
---— ------------------------------------------— --------------------
721951 cisatracurium besylate;Index term;Drug/agent 2012-01-04 18:00:04
721983 alpha 1 antitrypsin;Index term;Drug/agent 2012-01-05 18:00:03
721984 sodium iodide I-131;Index term;Drug/agent 2012-01-05 18:00:03
721986 capecitabine rapidly disintegrating tablet; 2012-01-05 18:00:03
Index term;Drug/agent
722029 iodine I 124 monoclonal antibody 8H9; 2012-01-06 18:00:04
Index term;Drug/agent
722041 rhubarb/licorice herbal supplement; 2012-01-06 18:00:04
Index term;Drug/agent
722307 iodine I 124 NM404;Index term;Drug/agent 2012-01-09 18:00:04
722308 iodine I 131 NM404;Index term;Drug/agent 2012-01-09 18:00:04
722324 pbi-shRNA STMN1 lipoplex;Index term;Drug/agent 2012-01-10 18:00:04
722359 anti-myostatin monoclonal antibody LY2495655; 2012-01-10 18:00:04
Index term;Drug/agent
723999 radotinib hydrochloride;Index term;Drug/agent 2012-01-24 18:00:04

ICRDB Pub Numbers - Audio published = 768
(I spare you with the list)

Comment entered 2012-02-14 10:51:15 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-14 10:51:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

The mystery of the duplicated records for the Published Media Documents report has been fixed. Because we're excluding media documents from this report that are listed with a category of 'pronunciation' or 'meeting recording' selection of the category field had to be included in the SQL query. This resulted in image files with multiple category values in the document to be listed once for each category.
This has been fixed and - since it was a very minor change - copied to BACH.

As mentioned yesterday, I also realized that I hadn't been careful enough with some of my SQL queries in terms of using the correct date range. This caused the results of our numbers in some instances to include or exclude one more day depending which day had been entered as the last day of the month for which query.
I will make sure to handle this issue correctly for this report.

Comment entered 2012-02-14 19:49:35 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-14 19:49:35
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

Would you like to include or have the option to include the individual document IDs and Titles?

Comment entered 2012-02-15 09:30:09 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2012-02-15 09:30:09
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::5

This is fantastic! I think it would be great to be able to see the ID's and Titles for the report (especially the titles).

Comment entered 2012-02-15 13:12:32 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-15 13:12:32
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

If there are now documents for a given "row" (like 'New Media', 'New Summaries', etc.) are we going to report the record anyway with a count of zero or are we going to drop that row from the report?

Same question for the report including the option to display all records. If there are no summaries, for instance, for a given time period do we want to include a message like 'No Summaries found' or just drop this category?

I'm thinking I will also create options to individually display categories. For one thing it will help me with testing since I can selectively run a section/document type I'm working on and secondly it may be nice being able to run the report for just Summaries and Media or just GlossaryTerms and DIS if need be.

Comment entered 2012-02-15 13:29:54 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2012-02-15 13:29:54
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::7

We want the category to show with 0 (at least for the monthly report I am going to be sending).

Comment entered 2012-02-17 10:05:39 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-17 10:05:39
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8

On the report you are showing multiple rows for the summaries:
Total Summaries
HP Summaries (EN)
HP Summaries (ES)
Pat Summaries (EN)
Pat Summaries (ES)

If you're running this report with the option to show the individual documents would you like to have those documents split into the different "groups" as well or would it be OK to just list all summaries with a column for the audience and language?

Comment entered 2012-02-22 13:48:13 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-22 13:48:13
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

I think I've done as much as I could without additional guidance from Margaret. The report currently includes all of the new published documents.
We will still need to work on the reports for the revised documents.

Comment entered 2012-03-01 11:57:15 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2012-03-01 11:57:15
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::10

Is there a way to look at the report? I think the answer to your question aboutt he list of summaries is that I would prefer to have it at least separated by language, and then indicate audience. I also want to have the lists of doucments at the end so I can decide to just send the table with numbers if I want to. Does this make sense?

Comment entered 2012-03-09 11:54:06 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-09 11:54:06
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11

I finished❓ the part of this report to get the numbers for Genetics Dictionary terms and before I forget I want to summarize how I'm getting the numbers and how this may or may not be what we're looking for.

a) Getting the number for revised terms.
We decided to use the DateLastModified (DLM) associated with the individual
TermDefinition to identify updated terms. The DLM is not (or should not be)
created when a new TermDefinition gets created, only when it's updated.

b1) Getting the number for new terms from the GTN document
We're counting a new Genetics dictionary term only when it's created along
with a brand new GlossaryTermName document (and we're counting the
first_pub date within our time frame). From the data that I have looked at
this is often then case but we are missing any Genetics dictionary term
definition that is being added to an already existing concept.

b2) Getting the number for new terms from the GTC document
Robin said we could try using the StatusDate of the Genetics TermDefinition
which appears to be a good approximation. I am counting those Genetics
terms for which the StatusDate falls within our time frame provided the
DLM does not exist. Unfortunately, the DLM often does exist with a date
that's identical to the status date.
Assume the following situation:
StatusDate: 2011-07-01
DateLastModified: 2011-07-01
First_Pub: 2009-01-01
Time Frame: CY2011
This Genetics Termdefinition has been added to an existing concept. We
won't count this term as a new term because the DLM exists even though my
guess is that the DLM had been incorrectly added when the TermDefinition
block had been created. We do count this term as an updated term though.
We should probably decide if we rather want to err on the side of the new
term definitions or the revised ones.

At the moment I am creating the numbers for (b1) and (b2) so that we can take a look and decide which ones we want to use for the monthly report.

Comment entered 2012-03-09 17:53:40 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-09 17:53:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12

Questions regarding the Genetics Professionals:
a) We are only interested in new persons, right? No updates.
b) We are counting new Person documents with a GeneticsProfessionalDetails
block.
Do we take into account the Include or CurrentStatus elements?

Comment entered 2012-03-14 15:08:10 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-14 15:08:10
BZCOMMENTOR::Volker Englisch
BZCOMMENT::13

I believe I've finished all of the queries for the ICRDB Stats report at this point and we need to talk about fine tuning. The report (on MAHLER) comes out as follows:

Document From 2011-01-02
to 2012-02-29
Total Summaries - New 4

  • HP Summaries (ES) 0

  • Pat. Summaries (EN) 2

  • HP Summaries (EN) 2

  • Pat. Summaries (ES) 0

Total Summaries - Revised 135

  • Pat. Summaries (EN) 27

  • Pat. Summaries (ES) 29

  • HP Summaries (EN) 46

  • HP Summaries (ES) 33

Reformatted Summaries 8

Dictionary (ES) - New 73
Dictionary (EN) - New 75
Dictionary - Revised 26

  • Dictionary (ES) 21

  • Dictionary (EN) 19
    Dictionary w/ pronunciation 104

Genetics Dictionary - New (GTN) 1
Genetics Dictionary - New (GTC) 2
Genetics Dictionary - Revised 0

Genetics Professionals - New 2
NCI Drug Dictionary 15
Drug Information Summaries 19

  • New DIS 5

  • Revised DIS 14
    PDQ Board Members 26

  • Advisory 20

  • Editorial 6
    PDQ Board Meetings 38

  • including WebEx 9

PDQ Images 19

  • New Images 18

  • Revised Images 2

I'd like to summarize where these numbers are coming from:

Total Summaries - New 4
This number is the sum of the following four individual numbers - HP-EN, HP-ES,
Pat-EN, Pat-ES. The numbers are based on the first_pub date of the CDR document table which is set the moment a document has been published for the first time.

  • HP Summaries (ES) 0

  • Pat. Summaries (EN) 2

  • HP Summaries (EN) 2

  • Pat. Summaries (ES) 0
    Same as above but numbers are counted by audience and language.

Total Summaries - Revised 135

  • Pat. Summaries (EN) 27

  • Pat. Summaries (ES) 29

  • HP Summaries (EN) 46

  • HP Summaries (ES) 33
    Same as above but looking at the DLM instead of the first_pub date.

Reformatted Summaries 8
We had discussed not to include these anymore and I just forgot to remove them. This was calculated by checking for an existing WillReplace element of an inactivated document with the validation date within the given time frame.

Dictionary (ES) - New 73
Dictionary (EN) - New 75
These two numbers are counting the first_pub date of the GlossaryTermName documents.
I didn't display a 'Total' here since we don't have Spanish terms without an English term.

Dictionary - Revised 26

  • Dictionary (ES) 21

  • Dictionary (EN) 19
    These numbers are counted from the DateLastModified elements of the TermDefinition or TranslatedTermDefinition blocks in the GlossaryTermConcept documents. The 'Dictionary - Revised' is not a sum of the EN and ES numbers since both definitions may be revised within one concept.
    Note that the data is coming from the latest publishable version (not the CWD).

Dictionary w/ pronunciation 104
This number counts the new media files published by counting the element first_pub. The media document must be of the category of 'pronunciation'.
Note that the count does not directly link to a newly published English dictionary term.

Genetics Dictionary - New (GTN) 1
Genetics Dictionary - New (GTC) 2
Genetics Dictionary - Revised 0
Please see comment #11 for further details. This is counting the number of Genetics dictionary terms only if it's created as a brand new GTN or based on the status date of the GTC document provided the DLM doesn't exist. The number of revised terms is based on the DLM value for the dictionary term.

Genetics Professionals - New 2
This counts the number of first_pub for Person documents published that include a GeneticsProfessionalDetails block.

NCI Drug Dictionary 15
This number includes all terminology documents with a SemanticType of 'Drug/Agent' for which a DefinitionText exists, counting the first_pub element.
Note that we're looking at CWD here. Do we limit the number to documents 'published to Cancer.gov' or 'publishable version available'?

Drug Information Summaries 19

  • New DIS 5

  • Revised DIS 14
    Looking at first_pub for new DIS and DLM for revised documents.

PDQ Board Members 26

  • Advisory 20

  • Editorial 6
    The number is counting the TermStartDate of a board member record's MembershipDetails block with an 'Accepted' value for the ResponseToInvitation.

PDQ Board Meetings 38

  • including WebEx 9
    This number is counting the BoardMeetingDate in the BoardMeetings block of the Organization document.

PDQ Images 19

  • New Images 18

  • Revised Images 2
    These numbers count the Media documents with an ImageData block. The new documents are counted by their first_pub date, the revised documents by their DLM value.

Comment entered 2012-03-20 17:06:27 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-20 17:06:27
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14

There are most likely minor adjustments but the report is now finished and installed on all three servers.
ICRDBStatsReport.py - R10362

The report is located in the directory
D:\cdr\publishing

It will be submitted as a scheduled job on BACH running once a month (we need to identify on which day) and will submit an email to all members of the distribution list
ICRDB Statistics Notification
when run in live mode or to the members of
Test Publishing Notification
when run in test mode.

The output of the report is written to the file
d:\cdr\reports\ICRDBStats_YYYY-mm-DDTHHMMSS.html

There are many options available for this report which are:
--testmode, --livemode
The report is run either in testmode or in livemode. One of these options
must be specified. When choosing TEST mode the output file name changes the
extension from *.html to *.test.html and the email distribution list will
change.

--email, --noemail
By default the report runs in the EMAIL mode resulting in the report to be
submitted by email. Specify the NOEMAIL option to suppress sending the
report by email.

--startdate, --enddate
By default the report is counting (published) documents of the previous
month. For instance, today the report will run for the time period of
Feb-01 to Feb-29. When specifying either the start date or the end date
the default dates will be overwritten.
Please note the dates specified are inclusive. Specifying
--startdate=2012-02-01 --enddate=2012-03-01
will count the documents for February plus the documents for March 1st.

--include
By default the report will only list a count of all documents. With this
option the user will be able to also receive a list of the counted documents.

--cdrids
By default the report - when run with the INCLUDE option - will not display
the CDR-IDs of the documents. With this option the CDR-IDs will be listed
as well.
Please note: When this option is specified the list of documents is sorted
by CDR-ID instead of document title.

--rowmax=N
By default the report - when run with the INCLUDE option - will display all
documents counted. This could result in hundreds or thousands of documents.
With this option the user can specify a maximum number of records to be
displayed. For document types with fewer rows than specified by the ROWMAX
option the option has no effect.

--debug
By default the report runs without the DEBUG option. This option displays
additional print statements, in particular the SQL query used for each
document type.

The following options allow the user to run the report for particular
sections. If one of these options is specified the report will not be saved
to disc and is only submitted by email.
--summary - display the count of new/revised summaries
--dis - display the count of new/revised drug information summaries
--audio - display the count of audio pronunciations
--images - display the count of new/revised images
--glossary - display the count of new/revised glossary terms and audio
--genetics - display the count of new/revised genetics dict. terms
--drugterms - display the count of new drug dictionary terms
--boardmember - display the count of new board members
--boardmeetings - display the count of board meetings and WebEx meetings

Typically one would run the report with only two options:
ICRDBStatsReport.py --livemode --include

Comment entered 2012-03-20 17:08:44 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-20 17:08:44
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15

Sample report run on BACH limited to 50 rows per document type.

Comment entered 2012-03-20 17:08:44 by Englisch, Volker (NIH/NCI) [C]

Attachment ICRDBStats_2012-03-20T170603.test.html has been added with description: ICRDB Stats Report for Feb. 2012

Comment entered 2012-03-29 12:19:39 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2012-03-29 12:19:39
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::16

This looks fantastic! I think we can close the issue. Will this be something that gets generated automatically on the first of the month and emailed to me?

Comment entered 2012-04-19 14:17:46 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-04-19 14:17:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17

Closing issue per discussion at status meeting.

Attachments
File Name Posted User
ICRDBProdNumbers 2011.doc 2012-02-07 15:57:13
ICRDBStats_2012-03-20T170603.test.html 2012-03-20 17:08:44 Englisch, Volker (NIH/NCI) [C]

Elapsed: 0:00:00.000607