CDR Tickets

Issue Number 3481
Summary Modify Publishing Program
Created 2012-02-22 11:43:27
Issue Type Improvement
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2012-05-11 15:53:12
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107809
Description

BZISSUE::5176
BZDATETIME::2012-02-22 11:43:27
BZCREATOR::Volker Englisch
BZASSIGNEE::Volker Englisch
BZQACONTACT::Margaret Beckwith

In the past six months the publishing process has frequently failed without any error message or indication of what the problem might be.

Need to investigate and possibly fix this problem.

Comment entered 2012-02-22 13:02:26 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-22 13:02:26
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

After checking the log files for the publishing jobs we identified that the last five or six publishing failures all occurred while processing the Media/Audio documents. The audio documents are very small files and require little processing. We are publishing around 1,000 documents per minute for audio files (around 17 docs/sec), while other document types run at around 1-4 docs/sec.

We did have similar processing problems in the past when we tried to insert documents into a SQL table (pub_proc_doc_work) and we were able to resolve that issue by "slowing down" the system.

I've implemented the same approach for this problem:
The assumption is that the system can handle more than 1,000 documents per minute (we haven't tested how many more).
The process has been modified to check, after every 250 documents, whether the rate of documents processed per second is above or below a certain threshold, currently set to 5 docs/sec. Once a thread exceeds this threshold, it pauses for 15 seconds before continuing.
Given that we're running the publishing job with two threads, we will never process more than 1,000 documents within a minute without "taking a break" of 15 seconds for the system to catch up.
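
For illustration, here is a minimal sketch of that throttling idea in Python. The constant names, the publish_documents/publish_one functions, and the single-threaded structure are invented for this sketch; the actual implementation lives in cdrpub.py and differs in its details.

    import time

    # Parameters as described above; the real values live in cdrpub.py.
    CHECK_INTERVAL = 250    # re-evaluate the rate after this many documents
    MAX_DOCS_PER_SEC = 5    # threshold above which a thread pauses
    PAUSE_SECONDS = 15      # how long a thread "takes a break"

    def publish_documents(documents, publish_one):
        # Publish documents, pausing whenever processing runs too fast.
        # 'publish_one' stands in for the real per-document work.
        start = time.time()
        for count, doc in enumerate(documents, start=1):
            publish_one(doc)
            if count % CHECK_INTERVAL == 0:
                elapsed = time.time() - start
                rate = count / elapsed if elapsed else float("inf")
                if rate > MAX_DOCS_PER_SEC:
                    # Give the database a chance to catch up.
                    time.sleep(PAUSE_SECONDS)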

I have this implemented and tested on MAHLER as part of
cdrpub.py

I'm currently running a test job on FRANCK (because there are very few audio files on MAHLER).
Unfortunately, only running these changes successfully on BACH will tell us for sure whether our assumption about the failures is correct, but the tests on MAHLER and FRANCK will at least ensure that I haven't broken anything.

Comment entered 2012-02-24 17:21:16 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-24 17:21:16
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

My test publishing jobs on MAHLER and FRANCK finished successfully. The following program has been copied to BACH:
cdrpub.py - R10334

I will monitor the publishing job for a few weeks before closing this issue.

Comment entered 2012-02-27 10:22:01 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-02-27 10:22:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

The changes to the program caused Friday's publishing job to finish 20 minutes later than usual. However, the publishing job did finish. :-)
Obviously, at this point it is unclear whether the change to the program was responsible for the publishing job finishing successfully. Only time will tell.

Comment entered 2012-03-15 17:00:45 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-03-15 17:00:45
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

The publishing job has finished without problems for the past three weeks. It appears we have solved the problem.

Closing issue.

Comment entered 2012-04-16 15:12:22 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-04-16 15:12:22
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

Last weekend the publishing job failed again with the same symptoms that we had experienced earlier.
I may have set the parameters for slowing down the publishing job a little too generously and will have to adjust them.
When I restarted the publishing job a second time it finished successfully (without any changes to the code), but it failed pushing the documents to Gatekeeper (which is a different problem).

Comment entered 2012-04-20 15:22:53 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-04-20 15:22:53
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

I modified the publishing job slightly to reduce the maximum docs/sec ratio from 5 to 4, and the rate is now evaluated after every 200 documents (down from 250).
Hopefully, this will slow the publishing job down only a little, but enough to keep it from failing in the middle of the night.
Because this was not a programming change but merely a change of parameters, I went ahead and copied the change to production.
cdrpub.py - R10384
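
In terms of the throttling sketch from comment #1, this amounts to nothing more than adjusting the two (hypothetical) constants:

    CHECK_INTERVAL = 200    # was 250: re-evaluate the rate more often
    MAX_DOCS_PER_SEC = 4    # was 5: pause at a lower processing rate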

I will monitor the result after tonight's weekly publishing job.

Comment entered 2012-04-20 16:31:21 by alan

BZDATETIME::2012-04-20 16:31:21
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

When we move to CBIIT we should try to remove the brakes from the process and see what happens.

The whole environment will be different enough, running in a virtual machine, having the database on a separate machine, possibly with newer versions of Python and SQL Server, that the problem might go away. Or maybe it will get worse. We'll just have to see.

Comment entered 2012-04-30 12:46:55 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-04-30 12:46:55
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8

(In reply to comment #6)
> I will monitor the result after tonight's weekly publishing job.

And there it was: The shot in the foot that made publishing fail last weekend.
We've successfully slowed down publishing enough that the publishing job now thinks something is wrong and cancels itself automatically.

Knowing how long a typical publishing job runs (around 6 hours), I had included a test that checks how long the process has actually been running and stops the job when it runs much longer than expected. By slowing the processing down we now run into this time limit, and the publishing job is stopped automatically because the check assumes there is a problem with the job.

I will try to think about a way to check processing more reliably.
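
For reference, the runtime check works roughly along these lines; the sketch below is illustrative only, with made-up names and a hard-coded limit rather than the actual code in SubmitPubJob.py:

    import time

    MAX_RUNTIME_SECONDS = 6 * 60 * 60   # a typical job finishes in about 6 hours

    def wait_for_job(job_is_done, poll_seconds=300):
        # Poll until the publishing job finishes or the time limit is hit.
        # 'job_is_done' stands in for however the real script checks the job
        # status; once the limit is exceeded the job is assumed to be stuck.
        start = time.time()
        while not job_is_done():
            if time.time() - start > MAX_RUNTIME_SECONDS:
                raise RuntimeError("Publishing job exceeded the expected "
                                   "runtime; assuming something is wrong.")
            time.sleep(poll_seconds)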

Comment entered 2012-05-07 15:20:05 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-05-07 15:20:05
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

I've increased the time-to-wait before cancelling the job from around 6 to 8 hours.
SubmitPubJob.py - R10396

There are still two alternative options that could be employed if this change isn't reliable enough:
a) Testing the size of the log file. If the size stays constant, no documents are being published.
b) Writing a time stamp to a table at regular intervals and cancelling the job once the time stamp stops being updated (a rough sketch of this approach follows below).
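
Option b) could look roughly like the following; the pub_heartbeat table, its columns, the timeout value, and the DB-API parameter style are assumptions made for the sketch and would have to match whatever the CDR schema and database driver actually use:

    import datetime

    HEARTBEAT_TIMEOUT = datetime.timedelta(minutes=30)   # assumed limit

    def record_heartbeat(cursor, job_id):
        # Update the job's time stamp; called at regular intervals while
        # documents are being published.
        cursor.execute(
            "UPDATE pub_heartbeat SET last_seen = GETDATE() WHERE job_id = ?",
            (job_id,))

    def job_appears_stuck(cursor, job_id):
        # Return True if the time stamp has not been updated recently.
        cursor.execute(
            "SELECT last_seen FROM pub_heartbeat WHERE job_id = ?",
            (job_id,))
        row = cursor.fetchone()
        if row is None:
            return True
        return datetime.datetime.now() - row[0] > HEARTBEAT_TIMEOUT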

For now we should be OK with the simple solution; an unusually large publishing job (35,000+ documents) finished without any CDR publishing problems.

Comment entered 2012-05-11 15:53:12 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-05-11 15:53:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::10

No problems encountered during last publishing job.
Closing issue.
