Issue Number | 3039 |
---|---|
Summary | Exception Handling of Failed Publishing Jobs |
Created | 2009-12-10 15:20:49 |
Issue Type | Bug |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2011-01-17 12:41:32 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107367 |
BZISSUE::4715
BZDATETIME::2009-12-10 15:20:49
BZCREATOR::Volker Englisch
BZASSIGNEE::Alan Meyer
BZQACONTACT::Volker Englisch
When we run a publishing job, the documents are stored
on the file system in a directory called
Job1234.InProcess
If a job fails, part of the exception handling renames the directory
to
Job1234.FAILURE
This means that at any given time exactly one directory named
Job1234.something should exist.
When I was running a test job today which failed I suddenly had both
of these directories in the file system
Job1234.InProcess
and
Job1234.FAILURE
The 'InProcess' directory contained exactly one file. I'm guessing that
one thread fails while processing a document and starts the 'cleanup'
while another thread is still processing and writing a different
document.
Once I reduced the PUB_THREAD variable to '1' the second directory did
not get created anymore after a failed job.
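For illustration only, here is a minimal Python sketch of the suspected race. It assumes (nothing in this issue confirms it) that the document writer creates the output directory on demand. The function names are hypothetical, and the two threads are simulated as sequential steps so that the outcome is deterministic.

```python
import os
import tempfile


def write_doc(job_dir, name, text):
    # Hypothetical writer: creates the output directory if it is missing.
    os.makedirs(job_dir, exist_ok=True)
    with open(os.path.join(job_dir, name), "w") as fp:
        fp.write(text)


def mark_failed(job_dir):
    # Hypothetical cleanup: rename Job1234.InProcess to Job1234.FAILURE.
    stem, _suffix = os.path.splitext(job_dir)
    failed = stem + ".FAILURE"
    os.rename(job_dir, failed)
    return failed


def simulate_race(base):
    in_process = os.path.join(base, "Job1234.InProcess")
    write_doc(in_process, "doc1.xml", "<doc/>")  # thread 1 writes a document
    mark_failed(in_process)                      # thread 1 fails, cleanup renames
    write_doc(in_process, "doc2.xml", "<doc/>")  # thread 2 finishes late
    return sorted(os.listdir(base))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        print(simulate_race(base))
        # prints ['Job1234.FAILURE', 'Job1234.InProcess']
```

Under that assumption the late write recreates the 'InProcess' directory with exactly one file in it, which matches what I saw.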
Bob and I decided that this would be the perfect task to assign to Alan, primarily because he's on vacation right now and can't veto. :-)
BZDATETIME::2009-12-21 12:02:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1
(In reply to comment #0)
...
> Bob and I decided that this would be the perfect task to assign to Alan,
> primarily because he's on vacation right now and can't veto. :-)
Since my veto power has been removed, I guess I should
accept this task.
BZDATETIME::2009-12-22 10:38:51
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
Assuming Volker is right about the cause of this problem,
there are different solutions depending on whether we want the
last completed document to be written to the file system or not.
If thread 1 fails and aborts the job with status=FAILURE, but
thread 2 is still processing and completes another record,
should that record be written out or not?
I suspect that it doesn't make any real difference. If anyone
thinks otherwise, please say so here since it affects the
choice among alternative solutions to the problem.
I will implement a solution that assumes that it doesn't matter
what happens to the last record (when I've completed higher
priority tasks), unless someone has a reason to do otherwise.
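As a rough sketch of the "it doesn't matter" option (not the code that will go into cdrpub; the class and method names here are hypothetical), writers and the failure handler could share one lock and a failed flag. A document that completes after the failure is then simply dropped, and the directory is renamed only once.

```python
import os
import tempfile
import threading


class JobOutput:
    """Hypothetical guard around one publishing job's output directory."""

    def __init__(self, base, job_id):
        self.in_process = os.path.join(base, "Job%d.InProcess" % job_id)
        self.failure = os.path.join(base, "Job%d.FAILURE" % job_id)
        self._lock = threading.Lock()
        self._failed = False
        os.makedirs(self.in_process)

    def write_doc(self, name, text):
        # Returns True if the document was written, False if the job
        # had already failed and the document was dropped.
        with self._lock:
            if self._failed:
                return False
            with open(os.path.join(self.in_process, name), "w") as fp:
                fp.write(text)
            return True

    def fail(self):
        # Only the first caller renames; later calls are no-ops.
        with self._lock:
            if not self._failed:
                self._failed = True
                os.rename(self.in_process, self.failure)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        out = JobOutput(base, 1234)
        out.write_doc("doc1.xml", "<doc/>")  # written before the failure
        out.fail()                           # thread 1 aborts the job
        out.write_doc("doc2.xml", "<doc/>")  # late document is dropped
        print(os.listdir(base))
```

Because the write and the rename both happen under the same lock, a writer can never have a file open in the directory while it is being renamed, and it can never recreate the 'InProcess' directory afterwards.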
BZDATETIME::2009-12-22 10:52:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3
(In reply to comment #2)
> I will implement a solution that assumes that it doesn't matter
> what happens to the last record
I'm OK with that approach given the fact that the problem already happened in the thread that's finishing up.
BZDATETIME::2010-10-05 23:46:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
I have implemented a solution to this problem. I'll need to
merge it with the trunk when Volker checks in his changes.
Before I do that, I would like to do a walk through of what
I've done with all three of us. The changes are small and
simple, but they involve thread synchronization - which is
very hard to test and not so hard to get wrong.
BZDATETIME::2010-10-06 09:21:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
(In reply to comment #4)
> I have implemented a solution to this problem. I'll need to
> merge it with the trunk when Volker checks in his changes.
Which file is that?
Should I go ahead and try to test on MAHLER?
BZDATETIME::2010-10-14 22:11:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6
Volker and I walked through the code. While looking at it, we thought
of some improvements, which I have made.
Volker thought of another improvement to cdrpub which might possibly
be implemented in this code, or might not.
I'd like to do one more walkthrough with both Bob and Volker before
testing. I won't put the changes into subversion until we've had
a chance to do that.
BZDATETIME::2010-10-26 14:40:43
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7
Performed another walkthrough with Bob and Volker.
The code is in subversion and in the live directory on Mahler.
Volker will run a publishing job on Mahler and, if nothing is
broken, we'll move it to Bach and Franck.
I'm marking this issue as resolved-fixed.
BZDATETIME::2010-10-29 14:48:53
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8
I ran a publishing job which finished without problems.
I'm now trying to remember how I made the job fail, so that I can
actually test that the new code is working properly.
BZDATETIME::2010-11-01 15:29:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9
I've run a few publishing jobs on FRANCK trying to come up with a
scenario that displays the reported error, but I wasn't successful in
doing so.
However, all of the publishing jobs finished successfully, so at a
minimum the changes didn't break anything.
I believe it's OK to move the changes to production.
BZDATETIME::2010-11-01 16:59:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10
(In reply to comment #9)
> I've run a few publishing jobs on FRANCK trying to come up with a scenario that
> displays the reported error but I wasn't successful in doing so.
> However, all of the publishing jobs finished successfully and at a minimum the
> changes didn't break anything.
>
> I believe it's OK to move the changes to production.
Not breaking anything on a successful job is the key. It follows that if worst comes to worst and my new code is wrong, it will only go wrong in the case of a failed publishing job, which is inconvenient but not problematic for publishing.
BZDATETIME::2010-11-22 10:12:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11
(In reply to comment #10)
> Not breaking anything on a successful job is the key.
Just wondering: Where do we go from here?
Is there anything else that I need to do?
BZDATETIME::2010-11-30 09:35:49
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12
Are these changes already in production and we should close this issue or do the changes still need to be moved, Alan?
BZDATETIME::2010-11-30 09:58:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13
(In reply to comment #12)
> Are these changes already in production and we should close this issue or do
> the changes still need to be moved, Alan?
Sorry, I missed the last message about this.
The changes are not in production and not even on Franck. I
guess each of us thought the other was going to move the changes
to Franck before the last publishing job run there.
I will move them to Franck now. We should run a job there
and, if nothing is broken, move them to Bach.
BZDATETIME::2010-11-30 10:00:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14
(In reply to comment #13)
> I will move them to Franck now.
Done.
BZDATETIME::2010-12-14 11:28:29
BZCOMMENTOR::Volker Englisch
BZCOMMENT::15
I ran a Friday publishing job on FRANCK without failure.
We should move this to BACH once Alan comes back from vacation.
BZDATETIME::2011-01-06 16:20:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16
(In reply to comment #15)
...
> We should move this to BACH once Alan comes back from vacation.
Done.
BZDATETIME::2011-01-17 12:41:32
BZCOMMENTOR::Volker Englisch
BZCOMMENT::17
Since publishing has run for over a week without any problems, we agreed at last Thursday's CDR meeting that this issue is now ready to be closed.