CDR Tickets

Issue Number 2715
Summary Bring NCI Fact Sheets into the CDR
Created 2008-11-21 11:02:51
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2012-01-03 10:32:45
Resolution Won't Fix
Path /home/bkline/backups/jira/ocecdr/issue.107043
Description

BZISSUE::4389
BZDATETIME::2008-11-21 11:02:51
BZCREATOR::Bob Kline
BZASSIGNEE::Bob Kline
BZQACONTACT::William Osei-Poku

Lakshmi has asked us to investigate options for maintaining and publishing the NCI Fact Sheets in/from the CDR. Currently they are maintained as Microsoft Word documents and converted by hand to HTML for display on Cancer.gov. Some of them are also converted to PDF files.

The documents will be exported to Cancer.gov as denormalized XML files and Cancer.gov will convert the documents to HTML for display on the web site. PDF files will also be generated and exported from the CDR, and Lakshmi would like the ability to generate a form of the documents which can be sent out for review and edited directly. One possibility for this last requirement would be RTF files, such as we generated for the verification emailers. Another possibility that was mentioned in yesterday's status meeting was editable PDF files.

I noticed that on Cancer.gov only some of the Fact Sheets have PDFs. Will they all have PDFs eventually? Some of the features I saw were:

  • simple tables

  • simple lists (no nesting, as far as I could tell)

  • boxed sections

  • images with captions

  • footers with page numbering, version, and date

  • bar code on the first page

Will all of these features (including the bar codes) need to be supported from the CDR? What flavor ("symbology" in the bar code world) of bar code is used? Are there additional features that will need to be supported?

A meeting will be scheduled to discuss these and other questions and to show us how their current operation works. Alan and I decided we will both attend this meeting (and possibly include Volker) and decide later which of us will take the lead on this project based on the scope and current workloads.

Here's a preliminary list of options for generating the PDFs:

  • generate HTML, then convert with html2pdf

This might allow the users to do a conversion to Microsoft Word from
within Internet Explorer, to which one of the users referred yesterday,
but we might not have as much control over layout using the html->pdf
route; we'll have to experiment.

  • generate LaTeX, which supports conversion to PDF

This lets us build on the existing tools we have for generating
mailers, but doesn't directly provide a way to get something which
can be edited. We don't know enough about how to make editable
PDFs at this point, so more investigation is needed. Also, LaTeX
does not support some of the really complicated things you can do
in HTML (for example, odd-shaped table cells). Don't think
this would be a problem, though, as the Fact Sheets I saw presented
much less complexity than the summaries.

  • generate XSL-FO, which would in turn create PDF and possibly also RTF

If Apache's open source XSL-FO package (or some other open source or
commercial XSL-FO package) is capable of supporting all of the
layout requirements for the task, and can generate both PDF and
RTF, then this may be a good candidate solution. One drawback to
this approach is that it would introduce another technology/tool
set to the CDR project.

  • generate RTF directly, and then convert the RTF to PDF

Don't know if this is feasible, but it seems likely that there
would be some tools available which could do this. It would
have the advantage of building on software we already have
for some of the mailers.

  • generate Microsoft Word documents directly, then convert to PDF

Similar to the previous option, possibly supporting more
esoteric formatting capabilities, but requiring the presence of
Microsoft Office client software running on the server, and
the use of COM automation. Haven't had recent experience with
this combination, but in the past this has been a somewhat
buggy, fragile, resource-intensive, and slow environment in
which to develop systems. Furthermore, Microsoft Office
applications have historically been a favorite front door for
hackers on Windows.
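
To make the LaTeX option above a little more concrete, here is a minimal sketch in Python of what the generation step might look like. It is not taken from the existing mailer code. The dictionary layout (title, key_points, questions) and the function names are assumptions made up for illustration, and the real Fact Sheet structure has not been defined yet.

```python
# Hypothetical sketch: render a simplified fact sheet as LaTeX source.
# The dictionary layout used here is an assumption, not a CDR schema.

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape characters which have special meaning in LaTeX."""
    # Work one character at a time so nothing gets escaped twice.
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def fact_sheet_to_latex(sheet):
    """Build the source of a complete LaTeX document for one fact sheet."""
    lines = [
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{%s}" % escape_latex(sheet["title"]),
    ]
    # Skip the list entirely when empty; an empty itemize is an error.
    if sheet.get("key_points"):
        lines.append(r"\begin{itemize}")
        for point in sheet["key_points"]:
            lines.append(r"\item " + escape_latex(point))
        lines.append(r"\end{itemize}")
    for question, answer in sheet.get("questions", []):
        lines.append(r"\subsection*{%s}" % escape_latex(question))
        lines.append(escape_latex(answer))
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

The string returned by fact_sheet_to_latex() could be written to a .tex file and handed to a LaTeX engine such as pdflatex in batch mode. Tables, boxed sections, images, footers, and bar codes are all left out of this sketch.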

Lakshmi/Margaret:

Could you correct any misrepresentations or omissions in the requirements above?

Alan/Volker:

Any other options you can think of? Or comments on the ones I've listed?

Comment entered 2008-11-21 12:36:59 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2008-11-21 12:36:59
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

> Any other options you can think of? Or comments on the ones I've listed?

I'm not clear about this:
Are we going to create PDF documents to be submitted from the CDR, are we going to create the PDF documents to be offered on Cancer.gov or both?

Comment entered 2008-11-26 14:36:31 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-11-26 14:36:31
BZCOMMENTOR::Bob Kline
BZCOMMENT::2

(In reply to comment #1)
> > Any other options you can think of? Or comments on the ones I've listed?
>
> I'm not clear about this:
> Are we going to create PDF documents to be submitted from the CDR, are we going
> to create the PDF documents to be offered on Cancer.gov or both?
>

I'll let Lakshmi field this question. Don't think it makes any difference, though, for design decisions on the task, as either way we should be able to generate the PDFs in batch mode.

Comment entered 2008-12-09 11:12:39 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-12-09 11:12:39
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

Next step for this task is for Lakshmi or Margaret to schedule a meeting with the content owners so they can describe their requirements.

Comment entered 2009-02-24 13:35:04 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-02-24 13:35:04
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

Removing Sheri from the CC list.

Comment entered 2009-03-05 10:27:13 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2009-03-05 10:27:13
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::5

Demo of CDR to Fact Sheet team is scheduled for March 13.

Comment entered 2009-03-16 11:23:13 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-03-16 11:23:13
BZCOMMENTOR::Bob Kline
BZCOMMENT::6

(In reply to comment #5)
> Demo of CDR to Fact Sheet team is scheduled for March 13.

How did it go?

Comment entered 2009-03-16 11:28:17 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-03-16 11:28:17
BZCOMMENTOR::Volker Englisch
BZCOMMENT::7

It went well, in my opinion. The fact sheet team seemed to be excited and they are eager to start the process of bringing in their documents and using the features the CDR has to offer.

Comment entered 2009-03-16 12:12:54 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2009-03-16 12:12:54
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::8

I agree with Volker. I thought the demo went very well. They had a lot of questions, and seemed very eager to get things moving. The next step will be for us to sit down with the users and get more details about what they will need the system to do.

Comment entered 2009-03-17 13:49:28 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-03-17 13:49:28
BZCOMMENTOR::Bob Kline
BZCOMMENT::9

Lakshmi indicated in yesterday's meeting with the Cancer.gov team that she hopes to be able to extract the fact sheet information from the live web site in order to ensure that we bring in the most current version of each document. As a first step in this direction, I have written a script which walks through the menus on Cancer.gov for the fact sheets, finds all of them, extracts the portion of the page containing the actual fact sheet information (leaving behind banners, sidebars, footers, etc.) and converts that information into well-formed XML documents. The results can be viewed at

http://mahler.nci.nih.gov/fact-sheets/

The documents for "Cancer Advances in Focus" have been extracted as fairly clean structural documents, but the others need further analysis in order to separate the structural essence of the documents from the markup used solely for visual presentation. There are two documents whose URLs are in the 'factsheet' tree but which were skipped because it does not appear that they are actually Fact Sheet documents:

http://cancer.gov/cancertopics/factsheet/NCI/cancer-centers
http://cancer.gov/cancertopics/factsheet/support/organizations

Let me know if I'm wrong about those two.

I'll continue to nibble away at teasing out the structure from the documents.
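
For anyone curious about the general shape of the extraction step, here is a rough sketch in Python. It is not the script described above. It assumes (hypothetically) that the fact sheet body sits inside a single element with a known id, called "factsheet-content" here; the real pages may be organized differently. It keeps that element's subtree, drops everything outside it, and serializes the result as well-formed XML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements which never have content or an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ContentExtractor(HTMLParser):
    """Collect the subtree of the element carrying a given id."""

    def __init__(self, content_id):
        super().__init__(convert_charrefs=True)
        self.content_id = content_id
        self.root = None
        self.stack = []  # elements currently open inside the content block

    def handle_starttag(self, tag, attrs):
        attrs = {name: value or "" for name, value in attrs}
        if self.root is None:
            # Still looking for the start of the content block.
            if attrs.get("id") == self.content_id:
                self.root = ET.Element(tag, attrs)
                self.stack.append(self.root)
            return
        if not self.stack:
            return  # content block already closed; ignore the rest
        node = ET.SubElement(self.stack[-1], tag, attrs)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Ignore stray end tags; implicitly close unclosed children.
        if tag not in [node.tag for node in self.stack]:
            return
        while self.stack:
            if self.stack.pop().tag == tag:
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):
            # Text following a closed child belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def extract_fact_sheet(html, content_id="factsheet-content"):
    """Return the content block as an XML string, or None if not found."""
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    if parser.root is None:
        return None
    return ET.tostring(parser.root, encoding="unicode")
```

Note that the sketch does not apply HTML's rules for implied end tags, so an unclosed p followed by another p ends up nested rather than as siblings. It also does nothing about separating structure from presentational markup, which is the hard part.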

Comment entered 2009-03-19 16:20:09 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-03-19 16:20:09
BZCOMMENTOR::Bob Kline
BZCOMMENT::10

Making some progress. Here's a sample:

http://mahler.nci.nih.gov/fact-sheets/cancertopics_factsheet_ALLinchildren.xml

It's a bit tedious, but we get a double benefit from this exploratory work: in addition to code we'll be able to use in an actual conversion, we also get useful information about what the FactSheet schema's structure will need to accommodate.

Note to myself: need to remember to pick up information from the Cancer.gov web pages about which menus each of these appears in on the web site (and with what synopsis text).

Comment entered 2009-03-20 12:57:00 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-03-20 12:57:00
BZCOMMENTOR::Bob Kline
BZCOMMENT::11

There are some anomalies which no amount of parsing will resolve. One is that although the standard format for the Fact Sheets is:

Title
Key Points
Q&A List
Additional Information

some of the documents don't have any questions. That may be OK for some documents, but surely not for those which have "Questions and Answers" in the title, such as

http://www.cancer.gov/cancertopics/factsheet/AvastinFactSheet
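
A check along the following lines could flag such documents automatically once they are in XML form. This is only a sketch in Python. The element names (FactSheet, Title, KeyPoints, QAList, QA, Question, AdditionalInformation) are placeholders invented for the example, since the FactSheet schema has not been designed yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for the four standard parts of a fact sheet.
EXPECTED = ["Title", "KeyPoints", "QAList", "AdditionalInformation"]


def find_anomalies(xml_text):
    """Return a list of structural problems found in one document."""
    root = ET.fromstring(xml_text)
    problems = []
    present = [child.tag for child in root]
    for name in EXPECTED:
        if name not in present:
            problems.append("missing " + name)
    # A title which promises questions should come with at least one.
    title = root.findtext("Title") or ""
    questions = root.findall("QAList/QA/Question")
    if "questions and answers" in title.lower() and not questions:
        problems.append("title promises questions but none were found")
    return problems
```

An empty list means the document has all four parts and, if its title mentions "Questions and Answers", at least one question. Anything else would go on a list for the content owners to review by hand.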

Comment entered 2009-04-21 16:37:07 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-04-21 16:37:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::12

We've been asked by Jonathan (via Reza) to put this task on hold.

Comment entered 2009-06-30 09:31:06 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-06-30 09:31:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::13

Priority now reflects "on-hold" status of task.

Comment entered 2009-06-30 17:18:14 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-06-30 17:18:14
BZCOMMENTOR::Bob Kline
BZCOMMENT::14

Adjusting priority again to test Bugzilla for Volker.

Comment entered 2012-01-03 10:23:45 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-01-03 10:23:45
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15

Does it make any sense to keep this issue open or should we just close it?

It's been on hold for over two years now.

Comment entered 2012-01-03 10:29:31 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2012-01-03 10:29:31
BZCOMMENTOR::Bob Kline
BZCOMMENT::16

(In reply to comment #15)
> Does it make any sense to keep this issue open or should we just close it?
>
> It's been on hold for over two years now.

Since the request came from Lakshmi, that would be a question for her.

Lakshmi?

Comment entered 2012-01-03 10:30:25 by Grama, Lakshmi (NIH/NCI) [E]

BZDATETIME::2012-01-03 10:30:25
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::17

It can be closed. Don't know if this will happen any time!

Comment entered 2012-01-03 10:32:28 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2012-01-03 10:32:28
BZCOMMENTOR::Bob Kline
BZCOMMENT::18

Per Lakshmi.

Comment entered 2012-01-03 10:32:45 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2012-01-03 10:32:45
BZCOMMENTOR::Bob Kline
BZCOMMENT::19

Closing issue.
