CDR Tickets

Issue Number 2220
Summary Publishing 2.0 - Collection of Issues
Created 2007-05-18 11:08:51
Issue Type Improvement
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status In Progress
Resolved
Resolution
Path /home/bkline/backups/jira/ocecdr/issue.106548
Description

BZISSUE::3265
BZDATETIME::2007-05-18 11:08:51
BZCREATOR::Volker Englisch
BZASSIGNEE::Volker Englisch
BZQACONTACT::Alan Meyer

Alan, in his infinite wisdom, had a good idea yesterday.
He suggested creating a Bugzilla entry to collect the various items, issues, and enhancement requests that we would like to see addressed should we ever decide to rewrite our publishing software.

I am setting this as a P4 for now but we might want to set it to P5 to remove from the weekly status report.
I'm giving this task the alias 'publish' so it would be easy to search for.

Comment entered 2007-05-18 11:11:18 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-05-18 11:11:18
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

Currently, the publishing software allows only the user who submitted a publishing job to release it once it has the status of 'Waiting for user'.
It would be useful to allow others (i.e. members of the admin group) to release waiting push jobs.

Comment entered 2007-05-18 11:16:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-05-18 11:16:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

When submitting a publishing job, it would be nice to seed the publishing document with a list of valid values (e.g. GateKeeper, Preview, Live for GKPubTarget) and possibly a hint on the format of parameters (e.g. for JobStartDateTime, GKServer).

Comment entered 2007-06-21 11:59:24 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-06-21 11:59:24
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

Rewrite the publishing software without the use of the non-standard module xmlproc if possible.

Comment entered 2007-07-13 18:06:46 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-07-13 18:06:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

I'm adding an earlier email thread listing an "oddity" of the system that may have to be handled differently in a new system.
In short: A Term document that has a status of 'Pending' will be published if it is part of an interim hot-fix but will then be removed with a following Export job. The document history report, however, doesn't recognize the Export event as a remove event.

-----------------------------------------------------------------------------
From: Englisch, Volker (NIH/NCI) [C]
Sent: Wednesday, July 11, 2007 6:51 PM
To: 'Lakshmi Grama (CIPS)'; Margaret Beckwith; 'Khanna, Sheri L'; 'Bob Kline (bkline@rksystems.com)'; Alan Meyer; Kidder, Charlie (NIH/NCI) [C]
Subject: Publishing Term documents

Sheri and I discovered a "feature" today in the CDR that I'd like to bring to everyone's attention.

I had looked at the spreadsheet from Olga listing all documents that were removed from production as part of the latest full load and compared this with the documents we have listed as removed.
I discovered a GlossaryTerm document (sleeve resection - CDR482346) that had been removed from Cancer.gov. However, the document history report does not indicate that this GlossaryTerm has been removed. Instead, it is listed as published.

Sheri and I looked at this for a while and came up with the following sequence of events as an explanation for the behavior:
1) The GlossaryTerm has a TermStatus of 'Pending' and a publishable version is being created.
We had discussed earlier that the publishable GlossaryTerm would not get published since the query to select GlossaryTerm documents only includes documents with the "correct" status.
2) Probably by accident, the CDR-ID for this GlossaryTerm is included in the list of interim updates and gets published.
The interim update does publish the GlossaryTerm because this publishing event assumes that all document-IDs that are passed to the program need to be updated regardless of the TermStatus, for instance.
3) The monthly export selects all GlossaryTerm documents that need to be updated (excluding those with a TermStatus='Pending') and identifies that the document that was published earlier as part of the interim update is not part of the list of GlossaryTerm documents anymore and removes it.
4) The document has never been blocked by CIAT (in order to be removed); therefore, the document history report lists this document as published twice, even though it actually got published and then removed.

We might want to introduce a check for GlossaryTerm documents to make sure they have the proper TermStatus before they are interim-updated. With the new MFP this problem is unlikely to appear again, but I thought we should at least be aware of it.

--------------------------------------------------------------------

From: Kline, Robert (NCI)
Sent: Wednesday, July 11, 2007 7:04 PM
To: Englisch, Volker (NIH/NCI) [C]
Cc: Grama, Lakshmi (NIH/NCI) [E]; Beckwith, Margaret (NIH/NCI) [E]; Khanna, Sheri L; Alan Meyer; Kidder, Charlie (NIH/NCI) [C]
Subject: Re: Publishing Term documents

Englisch, Volker (NIH/NCI) [C] wrote:
> The interim update does publish the GlossaryTerm because this publishing event assumes that all document-IDs that are passed to the program need to be updated regardless of the TermStatus, for instance.
>

This behavior is as designed. This is one of a number of places where
the users have a choice between having the software pick documents based
on canned selection criteria, or submitting a list of documents by
hand. When the users manually give the software a list of document IDs,
the software assumes the user knows what he/she is doing.

---------------------------------------------------------------------

From: Englisch, Volker (NIH/NCI) [C]
Sent: Wednesday, July 11, 2007 7:48 PM
To: Kline, Robert (NCI)
Cc: Grama, Lakshmi (NIH/NCI) [E]; Beckwith, Margaret (NIH/NCI) [E]; Khanna, Sheri L; Alan Meyer; Kidder, Charlie (NIH/NCI) [C]
Subject: Re: Publishing Term documents

On 07/11/2007 07:03 PM Bob Kline wrote:
> Englisch, Volker (NIH/NCI) [C] wrote:
>> The interim update does publish the GlossaryTerm because this
>> publishing event assumes that all document-IDs that are passed to the
>> program need to be updated regardless of the TermStatus, for instance.
>>
>
> This behavior is as designed. This is one of a number of places where
> the users have a choice between having the software pick documents based
> on canned selection criteria, or submitting a list of documents by
> hand. When the users manually give the software a list of document IDs,
> the software assumes the user knows what he/she is doing.
>

I can agree with the behavior from a publishing point of view. What I don't
agree with is the fact that the doc history report isn't
recognizing/marking the document as being removed as part of the export.
However, since this is a very special case it may be too expensive to
include this type of check.

Comment entered 2007-08-15 12:10:09 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-08-15 12:10:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

During publishing, a directory named
d:/cdr/Output/JobNNNN.InProcess
is created to store all licensee output files. After all documents have been filtered, this directory is renamed to
d:/cdr/Output/JobNNNN

If this directory is being accessed when the rename is attempted, the publishing job finishes but the rename fails silently. This causes the subsequent push job to fail because the specified output directory (d:/cdr/Output/JobNNNN) does not exist.

If the rename of the directory fails, we would want the publishing job to continue but issue a warning. In that case we should not submit a push job; instead, the warning should indicate that the push job needs to be initiated manually after the directory has been renamed.
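
A minimal sketch of the requested behavior, assuming the rename is done with os.rename. The function name and the return convention are hypothetical and not taken from the current publishing code:

```python
import os
import tempfile


def finalize_output_dir(in_process_dir, final_dir):
    """Rename the in-process directory and report a failure instead of hiding it.

    Returns (ok, message). When ok is False the caller is expected to skip
    the automatic push job and pass the message on as a warning.
    """
    try:
        os.rename(in_process_dir, final_dir)
    except OSError as exc:
        return False, (
            "Warning: could not rename %s to %s (%s); "
            "rename the directory and start the push job manually."
            % (in_process_dir, final_dir, exc)
        )
    return True, "Renamed %s to %s" % (in_process_dir, final_dir)


if __name__ == "__main__":
    # Demonstration in a scratch directory (not d:/cdr/Output).
    base = tempfile.mkdtemp()
    source = os.path.join(base, "JobNNNN.InProcess")
    os.mkdir(source)
    print(finalize_output_dir(source, os.path.join(base, "JobNNNN")))
```

The publishing job itself would still finish; only the decision to queue the push job depends on the first element of the returned pair.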

Comment entered 2007-10-09 11:09:42 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-10-09 11:09:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

The publishing control document uses two parameters, NumDocs and NumDocsPerDocType, which essentially do the same thing: limit the number of rows returned by a SELECT statement, as in
SELECT TOP $NumDocs doc_id
FROM document;
The 'NumDocsPerDocType' parameter is used for publishing jobs covering multiple document types, like the Export or Full Load, while the 'NumDocs' parameter is used for publishing single document types, like Export-Summary, Export-Country, etc.

In my opinion, we should not use two different parameters when we are doing the same thing - limiting the number of rows returned.
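
As an illustration only, a single limit could be applied by one helper regardless of whether the job covers one document type or several. The helper and the parameter name max_docs are hypothetical:

```python
def limited_select(columns, from_clause, max_docs=None):
    """Build a SELECT statement, adding TOP n only when a limit is given.

    max_docs stands in for a single, unified limit parameter (hypothetical
    name). int() rejects anything that is not a plain number.
    """
    top = ""
    if max_docs is not None:
        top = "TOP %d " % int(max_docs)
    return "SELECT %s%s FROM %s" % (top, columns, from_clause)


if __name__ == "__main__":
    print(limited_select("doc_id", "document", 25))
    print(limited_select("doc_id", "document"))
```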

Comment entered 2007-10-23 17:28:00 by alan

BZDATETIME::2007-10-23 17:28:00
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

The current mechanism for selecting documents using the
SubsetSelect element in the publishing control document is too
inflexible. We need a method that allows us either to run a
series of queries that build a temporary table and then select
the results from it, or perhaps execute a stored procedure that
can do that. Of the two approaches, the stored procedure is
potentially the more efficient, but the multi-query approach has
the virtue of putting the SQL logic in plain sight in the control
document.

We might even want to be able to execute Python code. I don't
have a use case for that and am not sure there is one, but I'd
like us to consider it before deciding we don't need it.

Being able to create and use multiple queries and temporary
tables will make the program more efficient as well as more
flexible.
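
The multi-query idea could look roughly like the sketch below. It uses Python's sqlite3 module purely as a stand-in for the CDR database, and the function name, table, and column names are made up for the example:

```python
import sqlite3


def select_documents(conn, setup_statements, final_query):
    """Run a series of setup statements, then return the IDs from the final query.

    setup_statements would come from the control document, in order, and
    typically build and prune a temporary table.
    """
    cursor = conn.cursor()
    for statement in setup_statements:
        cursor.execute(statement)
    cursor.execute(final_query)
    return [row[0] for row in cursor.fetchall()]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id INTEGER, doc_type TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO document VALUES (?, ?, ?)",
        [(1, "Term", "Pending"), (2, "Term", "Reviewed"), (3, "Summary", "Reviewed")],
    )
    steps = [
        "CREATE TEMP TABLE candidates AS "
        "SELECT id, status FROM document WHERE doc_type = 'Term'",
        "DELETE FROM candidates WHERE status = 'Pending'",
    ]
    print(select_documents(conn, steps, "SELECT id FROM candidates ORDER BY id"))
```

Each step stays visible as plain SQL, which is the advantage of the multi-query approach noted above; a stored procedure would hide the same logic behind a single call.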

Comment entered 2007-10-23 17:33:10 by alan

BZDATETIME::2007-10-23 17:33:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8

It turns out that many of the document selection queries in the
publishing control document are identical, but with slight
changes that may be nothing more than a doctype name, or
something like that.

If we could create named, parameterized, selection queries I
think we could invoke them from many places, dramatically
reducing the amount of SQL in the document and also, potentially,
reducing the maintenance burden if and when something needs to
change in all of them.

It would also be useful to have named SubsetParameters sets (or
whatever the new program uses) that could be re-used. Ideally,
we would want to be able to say something like: Use the
Interim-Export SubsetParameters but override this one and add
that one.
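
A rough sketch of both ideas, named parameterized queries and reusable parameter sets with overrides. Everything here is hypothetical: the query name, the SQL text, and the parameter values are placeholders, not the contents of the real control document:

```python
from string import Template

# Hypothetical named query; the SQL is illustrative only.
NAMED_QUERIES = {
    "docs-by-type": Template(
        "SELECT doc_id FROM document WHERE doc_type = '$doctype'"
    ),
}

# Hypothetical named parameter set; the values are placeholders.
PARAMETER_SETS = {
    "Interim-Export": {"GKPubTarget": "GateKeeper", "AbortOnError": "Yes"},
}


def build_query(name, **params):
    """Fill in a named query; raises KeyError if a placeholder is missing."""
    return NAMED_QUERIES[name].substitute(**params)


def parameters(base, overrides=None):
    """Copy a named parameter set, then override or add individual values."""
    merged = dict(PARAMETER_SETS[base])
    merged.update(overrides or {})
    return merged


if __name__ == "__main__":
    print(build_query("docs-by-type", doctype="Summary"))
    print(parameters("Interim-Export", {"GKPubTarget": "Preview", "NumDocs": "10"}))
```

Text substitution is used here only to keep the sketch short. A real implementation would more likely pass the values as bound query parameters.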

Comment entered 2007-10-24 16:06:12 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2007-10-24 16:06:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

During publishing one can monitor how many documents have already been processed.
The system prints out a message after every 1000 documents listing the number of processed and failed documents (failed in this context also includes "documents processed with warnings").
For the weekly export we don't need a message after every 1000 documents. Every 5000 or 10000 would be sufficient. For publishing jobs of small document types, however, it would be nice to get messages more frequently.

It would be nice if the interval between progress messages were configurable.
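
One possible shape for this, with the interval passed in as a parameter. The function name and the message wording are assumptions, not the current output format:

```python
def progress_messages(results, interval=1000):
    """Yield a progress line after every `interval` documents.

    `results` is an iterable of booleans: True for a processed document,
    False for a failed one (including documents processed with warnings).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    processed = failed = 0
    for ok in results:
        processed += 1
        if not ok:
            failed += 1
        if processed % interval == 0:
            yield "Processed %d documents, %d failed" % (processed, failed)


if __name__ == "__main__":
    for line in progress_messages([True, False, True, True, True], interval=2):
        print(line)
```

A weekly export could then run with interval=5000 while a small document type could use interval=50.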

Comment entered 2007-10-25 15:52:35 by alan

BZDATETIME::2007-10-25 15:52:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

In comment #8 I mentioned the possibility of named, parameterized
selection queries and parameter sets.

The general principle of named objects that can be invoked from
multiple places may apply elsewhere too, for example to whole
subsets.

Comment entered 2007-10-25 15:57:56 by alan

BZDATETIME::2007-10-25 15:57:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

IIRC, back in the early days of publishing we lamented the
fact that the publishing system had a fixed number of
hierarchical levels in the control document. I don't recall
whether we came up with a better plan or just decided that, since
the fixed system had already been implemented, it wasn't worth
thinking further about the problem.

However we should now consider whether fixed levels are the right
way to re-implement the software. Although I have nothing
specific in mind, my intuition is that we can do better with a
more flexible hierarchy.

Comment entered 2008-08-13 12:47:25 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2008-08-13 12:47:25
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12

We have the ability to allow a certain number of validation errors when submitting a publishing job by specifying the AbortOnError variable.

When a publishing job fails because the specified maximum number of allowed validation errors was reached it would be helpful to specify (in the logfile or as part of the messages) the reason for this failure such as
"Number of allowed document failures exceeded"

Currently, the only message given for the publishing failure is:
"publish: Aborting on error in thread 1, see logfile"
without any useful information in the logfile.
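
A sketch of how the abort reason could be carried in the error itself. The exception class, the function, and the idea of a numeric limit named allowed are assumptions about how the AbortOnError setting might be represented:

```python
class TooManyFailures(Exception):
    """Raised when a job exceeds its allowed number of document failures."""


def check_failure_limit(failures, allowed):
    """Raise with an explicit reason once the failure limit is exceeded."""
    if failures > allowed:
        raise TooManyFailures(
            "Number of allowed document failures exceeded "
            "(%d failures, %d allowed)" % (failures, allowed)
        )


if __name__ == "__main__":
    try:
        check_failure_limit(failures=6, allowed=5)
    except TooManyFailures as exc:
        # This text is what would go into the logfile and the job messages.
        print("publish: Aborting: %s" % exc)
```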

Comment entered 2009-02-13 15:48:47 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-02-13 15:48:47
BZCOMMENTOR::Volker Englisch
BZCOMMENT::13

On those pages allowing the user to modify publishing parameters, remove the column 'Default Value'.

We typically have a table listed with the columns
Name    Default Value    Current Value

The default value is redundant since the current value is populated with the default value.

Comment entered 2009-02-13 16:51:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-02-13 16:51:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14

If a publishing job finishes with a status of 'Failure', no email message is sent to the user. An email is only sent when the publishing job finishes successfully.

It would help to also submit an email message for failed jobs.

Comment entered 2010-05-26 13:28:14 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-05-26 13:28:14
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15

Something to think about:

If a media document has been picked up for publishing (by our selection criteria in the publishing document), the document is extracted from the CDR (using cdr.getDoc()). Only after it has been retrieved is a test performed to determine whether processing of the specific media type is supported.
For very large files, such as audio files, the cdr.getDoc() call might take 5 to 10 minutes for a single file.
We may want to think about testing for valid media types prior to retrieving the document.
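
The reordering could be sketched as follows. The two callables are hypothetical stand-ins: lookup_media_type for some cheap query of the media type, and fetch_document for the expensive cdr.getDoc() retrieval. The list of supported types is an assumption made up for the example:

```python
# Assumed list for illustration; the real set of supported types may differ.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/gif", "audio/mpeg"}


def publish_media(doc_id, lookup_media_type, fetch_document):
    """Check the media type first; fetch the document only if it is supported.

    Returns the fetched document, or None when the type is unsupported.
    """
    media_type = lookup_media_type(doc_id)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        return None
    return fetch_document(doc_id)


if __name__ == "__main__":
    types = {1: "audio/mpeg", 2: "video/x-unknown"}
    fetched = []

    def fetch(doc_id):
        fetched.append(doc_id)
        return "document %d" % doc_id

    print(publish_media(1, types.get, fetch))
    print(publish_media(2, types.get, fetch))
    print(fetched)  # only the supported document was retrieved
```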

Comment entered 2010-05-26 13:29:08 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-05-26 13:29:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::16

Removing Charlie as a CC from this issue.

Comment entered 2010-06-17 10:26:01 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-17 10:26:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17

Rolling this issue, which belongs to the Publishing 2.0 category, from its original OCECDR-2358 into this one.

Description:
It is possible that an updated filter could be promoted from the development
server to the production server during the course of a publication run.

If a filter changed over the course of a run, some documents might be filtered
according to one rule while others are filtered according to a different rule.
Inconsistent publications could be produced.

For additional information please see item (5) in the following attachment of
OCECDR-715:
http://verdi.nci.nih.gov/tracker/attachment.cgi?id=72

Comment entered 2011-06-08 12:43:15 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-06-08 12:43:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::18

There is a problem writing to the log file if multiple publishing jobs are running at the same time (at this point this only happens during testing).
Let's say there are two publishing events, one for Job8766 and one for Job8768. Both write records to the log file, and the output is interleaved like this:
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648989.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000445586.
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648991.
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648993.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000445605.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000446541.
!5648 Tue Jun 07 15:46:57 2011: Job 8766: Publishing CDR0000648995.

However, every so often it appears that one of these log strings is cut off and concatenated with the previous output line:
!5648 Tue Jun 07 15:46:57 2011: Job 8766: Publishing CDR0000649005.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000447049.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000447136.
!5648 Tue Jun 07 15:46:58 2011: Job 8766: Publishing CDR0000649010.
!5648 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000448620Job 8766: Publ
ishing CDR0000649011..

!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000449664.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000449683.
!5648 Tue Jun 07 15:46:59 2011: Job 8766: Publishing CDR0000649024.
!2944 Tue Jun 07 15:46:59 2011: Job 8768: Publishing CDR0000450954.

The missing text, which here would be at a minimum the string
!5648 Tue Jun 07 15:46:58 2011:
and maybe more lines, is dropped and doesn't appear anywhere else within the log file.
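
One way to reduce the chance of torn records is to build each record completely and hand it to the operating system in a single append. The sketch below shows the idea; the function name is hypothetical, and whether a single append is truly atomic depends on the operating system and file system, so a lock or one log file per job may still be needed:

```python
import os
import time


def write_log_record(path, job_id, message):
    """Append one complete log record with a single write call.

    The record is assembled in memory first, so the prefix (process ID and
    timestamp) and the message always go to the file together.
    """
    record = "!%d %s: Job %d: %s\n" % (os.getpid(), time.ctime(), job_id, message)
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record.encode("utf-8"))
    finally:
        os.close(fd)
    return record


if __name__ == "__main__":
    write_log_record("publish-demo.log", 8766, "Publishing CDR0000648989.")
    write_log_record("publish-demo.log", 8768, "Publishing CDR0000445586.")
```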

Comment entered 2014-11-07 16:17:42 by Englisch, Volker (NIH/NCI) [C]

I've looked at all of these issues/comments and - except for one or two - these are all still valid.

Comment entered 2014-11-20 15:13:31 by Englisch, Volker (NIH/NCI) [C]

We're waiting for the work on the rewrite/replatform of the CDR to start. At that point we will create individual sub-tickets for each of the issues identified.
