Issue Number | 3481 |
---|---|
Summary | Modify Publishing Program |
Created | 2012-02-22 11:43:27 |
Issue Type | Improvement |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Englisch, Volker (NIH/NCI) [C] |
Status | Closed |
Resolved | 2012-05-11 15:53:12 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107809 |
BZISSUE::5176
BZDATETIME::2012-02-22 11:43:27
BZCREATOR::Volker Englisch
BZASSIGNEE::Volker Englisch
BZQACONTACT::Margaret Beckwith
Over the past six months the publishing process has frequently failed without any error message or indication of what the problem may be.
We need to investigate and possibly fix this problem.
BZDATETIME::2012-02-22 13:02:26
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1
After checking the log files for the publishing jobs, we identified that each of the past 5 or 6 publishing failures occurred while processing the Media/Audio documents. The audio documents are very small files and require little processing: we publish around 1,000 audio documents per minute (about 17 docs/sec), while other document types run at around 1-4 docs/sec.
We did have similar processing problems in the past when we tried to insert documents into a SQL table (pub_proc_doc_work) and we were able to resolve that issue by "slowing down" the system.
I've implemented the same approach for this problem:
The assumption is that the system can handle more than 1,000 documents per minute (how many more has not been tested).
The process has been modified to check, after every 250 documents, whether the rate of documents processed per second is above or below a certain threshold, currently set to 5 docs/sec. Once this threshold is exceeded, processing pauses for 15 seconds (for the given thread) before it continues.
Given that we run the publishing job with 2 threads, we will never process more than 1,000 documents within a minute without "taking a break" of 15 seconds for the system to catch up.
I have this implemented and tested on MAHLER as part of
cdrpub.py
I'm currently running a test job on FRANCK (because there are very
few audio files on MAHLER).
Unfortunately, only running these changes successfully on BACH will tell us for sure whether our assumption about the cause of the failures is correct, but the test runs will at least confirm that I haven't broken anything.
BZDATETIME::2012-02-24 17:21:16
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2
My test publishing jobs on MAHLER and FRANCK finished successfully.
The following program has been copied to BACH:
cdrpub.py - R10334
I will monitor the publishing job for a few weeks before closing this issue.
BZDATETIME::2012-02-27 10:22:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3
The changes to the program caused Friday's publishing job to finish
20 minutes later than usual. However, the publishing job did finish.
:-)
Obviously, at this point it is unclear whether the change to the program was responsible for the publishing job finishing successfully. Only time will tell.
BZDATETIME::2012-03-15 17:00:45
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4
The publishing job has finished without problems for the past three weeks. It appears we have solved the problem.
Closing issue.
BZDATETIME::2012-04-16 15:12:22
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
Last weekend the publishing job failed again with the same symptoms
that we had experienced earlier.
I may have set the parameters for slowing down the publishing job a little too generously and will have to adjust them.
When I restarted the publishing job the second time, it finished successfully (without any changes to the code) but failed pushing the documents to Gatekeeper (which is a different problem).
BZDATETIME::2012-04-20 15:22:53
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6
I modified the publishing job slightly, reducing the maximum docs/sec ratio from 5 to 4 and evaluating the rate after every 200 documents (down from 250).
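In terms of the sketch in comment #1, this amounts to tightening the two constants (again, the names are hypothetical, not the actual cdrpub.py identifiers):

```python
# Hypothetical constants reflecting the adjusted values described above.
CHECK_INTERVAL = 200     # was 250: evaluate the rate more frequently
MAX_DOCS_PER_SEC = 4     # was 5: throttle at a lower per-thread rate
```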
Hopefully this will not slow the publishing job down too much, but will slow it enough that it doesn't fail in the middle of the night.
Because this was not a programming change but merely a change of
parameters I went ahead and copied the change to production.
cdrpub.py - R10384
I will monitor the result after tonight's weekly publishing job.
BZDATETIME::2012-04-20 16:31:21
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7
When we move to CBIIT we should try to remove the brakes from the process and see what happens.
The whole environment will be different enough, running in a virtual machine, having the database on a separate machine, possibly with newer versions of Python and SQL Server, that the problem might go away. Or maybe it will get worse. We'll just have to see.
BZDATETIME::2012-04-30 12:46:55
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8
(In reply to comment #6)
> I will monitor the result after tonight's weekly publishing job.
And there it was: The shot in the foot that made publishing fail last
weekend.
We've successfully slowed down publishing enough that the publishing job
now thinks something is wrong and cancels the job automatically.
Knowing how long a typical publishing job runs (around 6 hours), I had included a check that compares the actual runtime against that expectation and stops the job when it runs much longer. By slowing the processing down we exceeded that window, so the publishing job was automatically stopped on the assumption that something was wrong with it.
I will try to think about a way to check processing more reliably.
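For context, the guard that tripped here amounts to an elapsed-time check along these lines; this is only a sketch under the assumptions described above (a polling loop and a roughly six-hour expectation), and the names are hypothetical rather than the actual SubmitPubJob.py code:

```python
import time

MAX_RUNTIME_SECONDS = 6 * 60 * 60   # a typical publishing job runs ~6 hours

def wait_for_job(job_is_finished, poll_seconds=60):
    """Poll the publishing job and stop it if it runs far longer than expected."""
    start = time.time()
    while not job_is_finished():
        if time.time() - start > MAX_RUNTIME_SECONDS:
            # The job has run well past the expected window; assume it is stuck.
            raise RuntimeError("Publishing job exceeded the expected runtime")
        time.sleep(poll_seconds)
```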
BZDATETIME::2012-05-07 15:20:05
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9
I've increased the time-to-wait before cancelling the job from around
6 to 8 hours.
SubmitPubJob.py - R10396
There are still two alternative options that could be employed if this change isn't reliable enough:
a) Testing the size of the log file: if the size stays constant, no documents are being published.
b) Writing a time stamp to a table at regular intervals and cancelling the job once the time stamp no longer gets updated (a rough sketch follows at the end of this comment).
For now we should be OK with the simple solution; an unusually large publishing job (35,000+ documents) finished without any CDR publishing problems.
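Should option (b) ever become necessary, it could look roughly like this; the table name pub_heartbeat, the 30-minute staleness limit, and the helper functions are all hypothetical and not part of the existing CDR code:

```python
import datetime

HEARTBEAT_TIMEOUT = datetime.timedelta(minutes=30)  # hypothetical staleness limit

def record_heartbeat(cursor, job_id):
    """Called by the publishing job at regular intervals."""
    cursor.execute(
        "UPDATE pub_heartbeat SET last_seen = GETDATE() WHERE job_id = ?",
        (job_id,))

def job_looks_stuck(cursor, job_id):
    """Called by a monitor: True if the job's heartbeat has gone stale."""
    cursor.execute(
        "SELECT last_seen FROM pub_heartbeat WHERE job_id = ?",
        (job_id,))
    row = cursor.fetchone()
    return row is None or datetime.datetime.now() - row[0] > HEARTBEAT_TIMEOUT
```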
BZDATETIME::2012-05-11 15:53:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::10
No problems encountered during last publishing job.
Closing issue.