Issue Number | 2220 |
---|---|
Summary | Publishing 2.0 - Collection of Issues |
Created | 2007-05-18 11:08:51 |
Issue Type | Improvement |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | In Progress |
Resolved | |
Resolution | |
Path | /home/bkline/backups/jira/ocecdr/issue.106548 |
BZISSUE::3265
BZDATETIME::2007-05-18 11:08:51
BZCREATOR::Volker Englisch
BZASSIGNEE::Volker Englisch
BZQACONTACT::Alan Meyer
Alan, in his infinite wisdom, had a good idea yesterday.
He suggested creating a Bugzilla entry to collect the various
items/issues/enhancement requests/etc. that we would like to see
improved should we ever decide to rewrite our publishing software.
I am setting this as a P4 for now, but we might want to set it to P5
to remove it from the weekly status report.
I'm giving this task the alias 'publish' so it will be easy to search
for.
BZDATETIME::2007-05-18 11:11:18
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1
Currently, the publishing software allows only the user who submitted
a publishing job to release it once it has the status 'Waiting for
user'.
It would be useful to allow others (i.e., members of the admin group)
to release waiting push jobs as well.
BZDATETIME::2007-05-18 11:16:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2
When submitting a publishing job it would be nice to seed the publishing document with a list of valid values (e.g., for GKPubTarget: GateKeeper, Preview, Live) and possibly a hint on the format of parameters (e.g., for JobStartDateTime or GKServer).
BZDATETIME::2007-06-21 11:59:24
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3
Rewrite the publishing software without using the non-standard xmlproc module, if possible.
BZDATETIME::2007-07-13 18:06:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4
I'm adding an earlier email thread describing an "oddity" of the system
that may have to be handled differently in a new system.
In short: a Term document that has a status of 'Pending' will be
published if it is part of an interim hot-fix but will then be removed
by a following Export job. The document history report, however,
doesn't recognize the Export event as a remove event.
-----------------------------------------------------------------------------
From: Englisch, Volker (NIH/NCI) [C]
Sent: Wednesday, July 11, 2007 6:51 PM
To: 'Lakshmi Grama (CIPS)'; Margaret Beckwith; 'Khanna, Sheri L'; 'Bob
Kline (bkline@rksystems.com)'; Alan Meyer; Kidder, Charlie (NIH/NCI)
[C]
Subject: Publishing Term documents
Sheri and I discovered a "feature" today in the CDR that I'd like to bring to everyone's attention.
I had looked at the spreadsheet from Olga listing all documents that
were removed from production as part of the latest full load and
compared this with the documents we have listed as removed.
I discovered a GlossaryTerm document (sleeve resection - CDR482346) that
had been removed from Cancer.gov. However, when running the document
history report it is not indicated that this Glossary had been removed.
Instead, it is being listed as published.
Sheri and I looked at this for a while and came up with the following
sequence of events as an explanation for the behavior:
1) The GlossaryTerm has a TermStatus of 'Pending' and a publishable
version is created.
We had discussed earlier that this publishable GlossaryTerm would not
get published, since the query that selects GlossaryTerm documents only
includes documents with the "correct" status.
2) Probably by accident, the CDR-ID for this GlossaryTerm is included in
the list of interim updates and gets published.
The interim update does publish the GlossaryTerm because this
publishing event assumes that all document-IDs passed to the program
need to be updated, regardless of (for instance) the TermStatus.
3) The monthly export selects all GlossaryTerm documents that need to be
updated (excluding those with a TermStatus of 'Pending'), identifies
that the document published earlier as part of the interim update is no
longer in the list of GlossaryTerm documents, and removes it.
4) The document was never blocked by CIAT (in order to be removed);
therefore the document history report lists this document as published
twice, even though it actually got published and then removed.
We might want to introduce a check for GlossaryTerm documents to make sure they have the proper TermStatus before they are interim-updated (see the sketch after this thread). Although with the new MFP this problem is unlikely to ever appear again, I thought we should at least be aware of it.
--------------------------------------------------------------------
From: Kline, Robert (NCI)
Sent: Wednesday, July 11, 2007 7:04 PM
To: Englisch, Volker (NIH/NCI) [C]
Cc: Grama, Lakshmi (NIH/NCI) [E]; Beckwith, Margaret (NIH/NCI) [E];
Khanna, Sheri L; Alan Meyer; Kidder, Charlie (NIH/NCI) [C]
Subject: Re: Publishing Term documents
Englisch, Volker (NIH/NCI) [C] wrote:
> The interim update does publish the GlossaryTerm because this
> publishing event assumes that all document-IDs passed to the
> program need to be updated, regardless of (for instance) the
> TermStatus.
>
This behavior is as designed. This is one of a number of places where
the users have a choice between having the software pick documents
based on canned selection criteria, or submitting a list of documents
by hand. When the users manually give the software a list of document
IDs, the software assumes the user knows what he/she is doing.
---------------------------------------------------------------------
From: Englisch, Volker (NIH/NCI) [C]
Sent: Wednesday, July 11, 2007 7:48 PM
To: Kline, Robert (NCI)
Cc: Grama, Lakshmi (NIH/NCI) [E]; Beckwith, Margaret (NIH/NCI) [E];
Khanna, Sheri L; Alan Meyer; Kidder, Charlie (NIH/NCI) [C]
Subject: Re: Publishing Term documents
On 07/11/2007 07:03 PM Bob Kline wrote:
> Englisch, Volker (NIH/NCI) [C] wrote:
>> The interim update does publish the GlossaryTerm because this
>> publishing event assumes that all document-IDs passed to the
>> program need to be updated, regardless of (for instance) the
>> TermStatus.
>>
>
> This behavior is as designed. This is one of a number of places
> where the users have a choice between having the software pick
> documents based on canned selection criteria, or submitting a
> list of documents by hand. When the users manually give the
> software a list of document IDs, the software assumes the user
> knows what he/she is doing.
>
I can agree with the behavior from a publishing point of view. What I
don't agree with is the fact that the doc history report isn't
recognizing/marking the document as being removed as part of the
export.
However, since this is a very special case, it may be too expensive to
include this type of check.
BZDATETIME::2007-08-15 12:10:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
During publishing, a directory is created with the name
d:/cdr/Output/JobNNNN.InProcess
to store all licensee output files. After all documents have been
filtered, this directory is renamed to
d:/cdr/Output/JobNNNN
If the directory is being accessed at the time the rename is attempted, the publishing job finishes but the rename fails silently. This causes the following push job to fail because the specified output directory (d:/cdr/Output/JobNNNN) does not exist.
We would want to continue the publishing job but issue a warning if the rename of the directory fails; in that case we should not submit a push job but instead indicate that the push job should be initiated manually after the directory has been renamed.
BZDATETIME::2007-10-09 11:09:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6
The publishing control document uses two parameters, NumDocs and
NumDocsPerDocType, which essentially do the same thing: limit the
number of rows returned by a SELECT statement, as in
SELECT TOP $NumDocs doc_id
FROM document;
'NumDocsPerDocType' is used when publishing multiple document types
(e.g., the Export or Full Load), while 'NumDocs' is used when
publishing single document types (Export-Summary, Export-Country,
etc.).
In my opinion, we should not use two different parameters for the same thing - limiting the number of rows returned.
BZDATETIME::2007-10-23 17:28:00
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7
The current mechanism for selecting documents using the
SubsetSelect element in the publishing control document is too
inflexible. We need a method that allows us either to run a
series of queries that build a temporary table and then select
the results from it, or perhaps execute a stored procedure that
can do that. Of the two approaches, the stored procedure is
potentially the most efficient, but the multi-query approach has
the virtue of putting the SQL logic in plain sight in the control
document.
We might even want to be able to execute Python code. I don't
have a use case for that and am not sure there is one, but I'd
like us to consider it before deciding we don't need it.
Being able to create and use multiple queries and temporary
tables will make the program more efficient as well as more
flexible.
BZDATETIME::2007-10-23 17:33:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8
It turns out that many of the document selection queries in the
publishing control document are identical except for slight
variations that may be nothing more than a doctype name, or
something like that.
If we could create named, parameterized, selection queries I
think we could invoke them from many places, dramatically
reducing the amount of SQL in the document and also, potentially,
reducing the maintenance burden if and when something needs to
change in all of them.
It would also be useful to have named SubsetParameters sets (or
whatever the new program uses) that could be re-used. Ideally,
we would want to be able to say something like: Use the
Interim-Export SubsetParameters but override this one and add
that one.
BZDATETIME::2007-10-24 16:06:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9
During publishing one can monitor how many documents have already
been processed.
The system prints a message after every 1,000 documents listing the
number of processed and failed documents (where "failed" in this
context also includes documents processed with warnings).
For the weekly export we don't need a message after every 1,000
documents; every 5,000 or 10,000 would be sufficient. For publishing
jobs of small document types, however, it would be nice to get
messages more frequently.
It would be nice if the interval for displaying a progress record could be a variable.
BZDATETIME::2007-10-25 15:52:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10
In comment #8 I mentioned the possibility of named, parameterized
selection queries and parameter sets.
The general principle of named objects that can be invoked from
multiple places may apply elsewhere too, for example to whole
subsets.
BZDATETIME::2007-10-25 15:57:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11
IIRC, back in the early days of publishing we lamented the
fact that the publishing system had a fixed number of
hierarchical levels in the control document. I don't recall
whether we came up with a better plan or just decided that, since
the fixed system had already been implemented, it wasn't worth
thinking further about the problem.
However, we should now consider whether fixed levels are the
right way to re-implement the software. Although I have nothing
specific in mind, my intuition is that we can do better with a
more flexible hierarchy.
BZDATETIME::2008-08-13 12:47:25
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12
We have the ability to allow a certain number of validation errors when submitting a publishing job by specifying the AbortOnError variable.
When a publishing job fails because the specified maximum number of
allowed validation errors was reached, it would be helpful to state (in
the logfile or as part of the messages) the reason for this failure,
such as
"Number of allowed document failures exceeded"
Currently the only explanation given for the publishing failure is
"publish: Aborting on error in thread 1, see logfile"
without any useful information in the logfile.
BZDATETIME::2009-02-13 15:48:47
BZCOMMENTOR::Volker Englisch
BZCOMMENT::13
On those pages that allow the user to modify publishing parameters, remove the column 'Default Value'.
We typically display a table with the columns
Name | Default Value | Current Value
The default value is redundant since the current value is initially populated with the default value.
BZDATETIME::2009-02-13 16:51:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14
If a publishing job finishes with a status of 'Failure', no email message is sent to the user; an email is only sent when the publishing job finishes successfully.
It would help to also send an email message for failed jobs.
BZDATETIME::2010-05-26 13:28:14
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15
Something to think about:
If a media document has been picked up for publishing (by our
selection criteria in the publishing document), the document is
extracted from the CDR (using cdr.getDoc()). Then, after it has been
retrieved, a test is performed to determine whether processing of the
specific media type is supported.
For very large files, such as audio files, the cdr.getDoc() call
might take 5-10 minutes for a single file.
We may want to think about testing for valid media types prior to
retrieving the document.
BZDATETIME::2010-05-26 13:29:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::16
Removing Charlie as a CC from this issue.
BZDATETIME::2010-06-17 10:26:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17
Rolling this issue, which belongs to the Publishing 2.0 category, from its original OCECDR-2358 into this one.
Description ----------------------
It is possible that an updated filter could be promoted from the
development server to the production server during the course of a
publication run.
If a filter changed over the course of a run, some documents might be
filtered according to one rule while others are filtered according to
a different rule. Inconsistent publications could be produced.
For additional information please see item (5) in the following
attachment of OCECDR-715:
http://verdi.nci.nih.gov/tracker/attachment.cgi?id=72
BZDATETIME::2011-06-08 12:43:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::18
There is a problem writing to the log file if multiple publishing
jobs are running at the same time (at this point this only happens
during testing).
Let's say there are two publishing events, one for Job8766 and one for
Job8768. Both of these write records like this to the log file, where
the output is mixed:
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648989.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000445586.
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648991.
!5648 Tue Jun 07 15:46:56 2011: Job 8766: Publishing CDR0000648993.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000445605.
!2944 Tue Jun 07 15:46:56 2011: Job 8768: Publishing CDR0000446541.
!5648 Tue Jun 07 15:46:57 2011: Job 8766: Publishing CDR0000648995.
However, every so often it appears that one of these log strings is
cut off and concatenated with the previous output line:
!5648 Tue Jun 07 15:46:57 2011: Job 8766: Publishing CDR0000649005.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000447049.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000447136.
!5648 Tue Jun 07 15:46:58 2011: Job 8766: Publishing CDR0000649010.
!5648 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000448620Job 8766: Publishing CDR0000649011..
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000449664.
!2944 Tue Jun 07 15:46:58 2011: Job 8768: Publishing CDR0000449683.
!5648 Tue Jun 07 15:46:59 2011: Job 8766: Publishing CDR0000649024.
!2944 Tue Jun 07 15:46:59 2011: Job 8768: Publishing CDR0000450954.
The missing text here would be, at a minimum, the string
!5648 Tue Jun 07 15:46:58 2011:
and maybe more; the dropped text doesn't appear anywhere else within
the log file.
I've looked at all of these issues/comments and - except for one or two - these are all still valid.
We're waiting for the work on the rewrite/replatform of the CDR to start. At that point we will create individual sub-tickets for each of the issues identified.