CDR Tickets

Issue Number 3039
Summary Exception Handling of Failed Publishing Jobs
Created 2009-12-10 15:20:49
Issue Type Bug
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2011-01-17 12:41:32
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107367
Description

BZISSUE::4715
BZDATETIME::2009-12-10 15:20:49
BZCREATOR::Volker Englisch
BZASSIGNEE::Alan Meyer
BZQACONTACT::Volker Englisch

When we are running a publishing job the documents are being stored on the file system in a directory called
Job1234.InProcess
If a job fails part of the exception handling will rename the directory to
Job1234.FAILURE
This means that only one directory named Job1234.something should exist at any time.

When I was running a test job today which failed I suddenly had both of these directories in the file system
Job1234.InProcess
and
Job1234.FAILURE
The 'InProcess' directory contained exactly one file. I'm guessing that a document fails in one thread and starts the 'cleanup' while another thread is still processing and writing a different document.
Once I reduced the PUB_THREAD variable to '1', the second directory was no longer created after a failed job.
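
The suspected interleaving can be replayed deterministically with a small sketch. This is not the cdrpub code: the function name, the document file name, and the assumption that a worker recreates the output directory when it is missing are all hypothetical, chosen only to illustrate how both directories could end up on disk.

```python
import os
import tempfile
import threading


def reproduce_race(base_dir, job_id=1234):
    """Replay the suspected two-thread interleaving (illustrative only)."""
    in_process = os.path.join(base_dir, f"Job{job_id}.InProcess")
    failure = os.path.join(base_dir, f"Job{job_id}.FAILURE")
    os.makedirs(in_process)

    renamed = threading.Event()

    def failing_worker():
        # Thread 1: its document fails, so cleanup renames the job directory.
        os.rename(in_process, failure)
        renamed.set()

    def slow_worker():
        # Thread 2: still busy with another document. It finishes after the
        # rename and (ASSUMPTION) recreates the output directory if missing.
        renamed.wait()
        os.makedirs(in_process, exist_ok=True)
        # Hypothetical document file name.
        with open(os.path.join(in_process, "CDR0000000002.xml"), "w") as fp:
            fp.write("<doc/>")

    threads = [threading.Thread(target=slow_worker),
               threading.Thread(target=failing_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(os.listdir(base_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Both Job1234.FAILURE and Job1234.InProcess are listed.
        print(reproduce_race(tmp))
```

With a single worker thread there is no second writer left running after the rename, which would be consistent with the PUB_THREAD observation above.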

Bob and I decided that this would be the perfect task to assign to Alan, primarily because he's on vacation right now and can't veto. :-)

Comment entered 2009-12-21 12:02:10 by alan

BZDATETIME::2009-12-21 12:02:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

(In reply to comment #0)
...
> Bob and I decided that this would be the perfect task to assign to Alan,
> primarily because he's on vacation right now and can't veto. :-)

Since my veto power has been removed, I guess I should
accept this task.

Comment entered 2009-12-22 10:38:51 by alan

BZDATETIME::2009-12-22 10:38:51
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

Assuming Volker is right about the cause of this problem,
there are different solutions depending on whether we want the
last completed document to be written to the file system or not.

If thread 1 fails and aborts the job with status=FAILURE, but
thread 2 is still processing and completes another record,
should that record be written out or not?

I suspect that it doesn't make any real difference. If anyone
thinks otherwise, please say so here since it affects the
choice among alternative solutions to the problem.

I will implement a solution that assumes that it doesn't matter
what happens to the last record (when I've completed higher
priority tasks), unless someone has a reason to do otherwise.
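
One way to realize the "it doesn't matter what happens to the last record" option is to guard both the document write and the failure rename with a single lock, and to drop any record that arrives after the job has failed. The sketch below only illustrates that idea; the class and method names are hypothetical and it makes no claim about how cdrpub is actually structured.

```python
import os
import tempfile
import threading


class JobOutput:
    """Hypothetical helper: serializes document writes against cleanup."""

    def __init__(self, base_dir, job_id):
        self.in_process = os.path.join(base_dir, f"Job{job_id}.InProcess")
        self.failure = os.path.join(base_dir, f"Job{job_id}.FAILURE")
        self._lock = threading.Lock()
        self._failed = False
        os.makedirs(self.in_process)

    def write_doc(self, name, text):
        """Write one document; return False if the job has already failed."""
        with self._lock:
            if self._failed:
                return False  # drop the late record instead of writing it
            with open(os.path.join(self.in_process, name), "w") as fp:
                fp.write(text)
            return True

    def fail(self):
        """Mark the job failed and rename the directory exactly once."""
        with self._lock:
            if not self._failed:
                self._failed = True
                os.rename(self.in_process, self.failure)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = JobOutput(tmp, 1234)
        out.write_doc("a.xml", "<doc/>")
        out.fail()
        out.write_doc("b.xml", "<doc/>")  # dropped; nothing is recreated
        print(os.listdir(tmp))
```

Because the failed flag is checked under the same lock that protects the rename, a late writer can never recreate the InProcess directory. The cost is that document writes are serialized.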

Comment entered 2009-12-22 10:52:08 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-12-22 10:52:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

(In reply to comment #2)
> I will implement a solution that assumes that it doesn't matter
> what happens to the last record

I'm OK with that approach given the fact that the problem already happened in the thread that's finishing up.

Comment entered 2010-10-05 23:46:38 by alan

BZDATETIME::2010-10-05 23:46:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I have implemented a solution to this problem. I'll need to
merge it with the trunk when Volker checks in his changes.

Before I do that, I would like to do a walk through of what
I've done with all three of us. The changes are small and
simple, but they involve thread synchronization - which is
very hard to test and not so hard to get wrong.

Comment entered 2010-10-06 09:21:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-10-06 09:21:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

(In reply to comment #4)
> I have implemented a solution to this problem. I'll need to
> merge it with the trunk when Volker checks in his changes.

Which file is that?

Should I go ahead and try to test on MAHLER?

Comment entered 2010-10-14 22:11:44 by alan

BZDATETIME::2010-10-14 22:11:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

Volker and I walked through the code. While looking at it, we thought
of some improvements, which I have made.

Volker thought of another improvement to cdrpub which might possibly
be implemented in this code, or might not.

I'd like to do one more walkthrough with both Bob and Volker before
testing. I won't put the changes into subversion until we've had
a chance to do that.

Comment entered 2010-10-26 14:40:43 by alan

BZDATETIME::2010-10-26 14:40:43
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

Performed another walkthrough with Bob and Volker.

The code is in subversion and in the live directory on Mahler.

Volker will run a publishing job on Mahler and, if nothing is
broken, we'll move it to Bach and Franck.

I'm marking this issue as resolved-fixed.

Comment entered 2010-10-29 14:48:53 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-10-29 14:48:53
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8

I ran a publishing job which finished without problems.
I'm now trying to remember how I made the job fail, so that I can actually test that the new code is working properly.

Comment entered 2010-11-01 15:29:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-11-01 15:29:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

I ran a few publishing jobs on FRANCK trying to come up with a scenario that displays the reported error, but I wasn't successful in doing so.
However, all of the publishing jobs finished successfully, so at a minimum the changes didn't break anything.

I believe it's OK to move the changes to production.

Comment entered 2010-11-01 16:59:03 by alan

BZDATETIME::2010-11-01 16:59:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

(In reply to comment #9)
> I ran a few publishing jobs on FRANCK trying to come up with a scenario that
> displays the reported error but I wasn't successful in doing so.
> However, all of the publishing jobs finished successfully and at a minimum the
> changes didn't break anything.
>
> I believe it's OK to move the changes to production.

Not breaking anything on a successful job is the key. It follows that if worst comes to worst and my new code is wrong, it will only go wrong in the case of a failed publishing job, which is inconvenient but not problematic for publishing.

Comment entered 2010-11-22 10:12:49 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-11-22 10:12:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11

(In reply to comment #10)
> Not breaking anything on a successful job is the key.

Just wondering: Where do we go from here?

Is there anything else that I need to do?

Comment entered 2010-11-30 09:35:49 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-11-30 09:35:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12

Are these changes already in production and we should close this issue or do the changes still need to be moved, Alan?

Comment entered 2010-11-30 09:58:19 by alan

BZDATETIME::2010-11-30 09:58:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

(In reply to comment #12)
> Are these changes already in production and we should close this issue or do
> the changes still need to be moved, Alan?

Sorry, I missed the last message about this.

The changes are not in production and not even on Franck. I
guess each of us thought the other was going to move the changes
to Franck before the last publishing job run there.

I will move them to Franck now. We should run a job there
and, if nothing is broken, move them to Bach.

Comment entered 2010-11-30 10:00:46 by alan

BZDATETIME::2010-11-30 10:00:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

(In reply to comment #13)

> I will move them to Franck now.

Done.

Comment entered 2010-12-14 11:28:29 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-12-14 11:28:29
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15

I ran a Friday publishing job on FRANCK without failure.

We should move this to BACH once Alan comes back from vacation.

Comment entered 2011-01-06 16:20:50 by alan

BZDATETIME::2011-01-06 16:20:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

(In reply to comment #15)
...
> We should move this to BACH once Alan comes back from vacation.

Done.

Comment entered 2011-01-17 12:41:32 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2011-01-17 12:41:32
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17

Since publishing ran for over a week without any problems, we agreed at last Thursday's CDR meeting that this issue is now ready to be closed.
