CDR Tickets

Issue Number 2546
Summary Approach for offline storage of old log entries
Created 2008-05-29 15:32:20
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2010-02-26 13:24:29
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.106874
Description

BZISSUE::4118
BZDATETIME::2008-05-29 15:32:20
BZCREATOR::Bob Kline
BZASSIGNEE::Volker Englisch
BZQACONTACT::Volker Englisch

Analyze options for backing up older rows in the debug_log and command_log tables and come up with recommended approach(es).

Comment entered 2008-10-01 00:38:19 by alan

BZDATETIME::2008-10-01 00:38:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

I've given this issue some thought. I think we want a command
line program that achieves the following outcomes:

Data from the command_log_history and the debug_log_history
tables should be extracted from, and then deleted from, the cdrArchive
database (or the cdr database; see the last item in this
analysis).

A programmer should be able to limit the amount extracted in
at least the following ways:

Maximum size of the output files.

The programmer specifies that no output file
should be larger than X bytes. This keeps us from
producing an unwieldy file that's tough to
compress, decompress, or copy.

If the output file reaches the maximum size, the
program should close the open output files and
begin new ones.

Latest date that will be extracted and deleted.

The programmer specifies that no records should be
removed after some date. This ensures that we
don't take out data that is likely to be used for
debugging and would be inconvenient to remove.

The data should be extracted to files that can be processed
with common tools, e.g., less and grep, and should be human
readable for checking.

The data should be delimited in such a way as to fully
preserve the field boundaries of the data. XML is a
reasonable way to store the data, but other approaches are
also possible. If we use XML, the "command" element from
the command_log_history can be stored as CDATA.
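
For illustration only, here is roughly how one such record might be
serialized; the element and column names (CommandRecord, id, received,
command) are hypothetical, not the actual schema:

from xml.sax.saxutils import quoteattr

def command_row_to_xml(row_id, received, command):
    # One command_log_history row; the raw command text goes in CDATA.
    # CDATA cannot contain the literal "]]>", so split it if present.
    safe_cmd = command.replace("]]>", "]]]]><![CDATA[>")
    return ("<CommandRecord id=%s received=%s><![CDATA[%s]]></CommandRecord>\n"
            % (quoteattr(str(row_id)), quoteattr(received), safe_cmd))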

Each file should have a name that reflects its contents, for
example:

CdrDebugLog_20020219-20030515.xml
CdrCommandLog_20020219-20030515.xml

The data extracted from the two logs should be coordinated.
If command_log_history records are extracted up to date X,
then debug_log_history records should be extracted up to the
exact same date.

Compressing the data file might be a separate step. That
way we can check the data before compressing it if desired.
I expect that bzip2 compression will reduce the command_log
file sizes by around 95%.

Whether to delete the data from the database should be a
run-time parameter. We should be able to run the program
with or without deleting the data. It can be
run without deletions a) to test the program itself and b)
to test the output file for integrity before performing the
deletions.

We might want an automatic integrity check. After
extracting the data and writing it to a file then:

Close the file.
Re-open it for reading.
Repeat:
Read a record.
Validate it using xml tools.
Close the file.
If there were no errors then:
Delete the data from the database if so requested.
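
As a rough sketch of that check, assuming each output file holds a single
well-formed XML document with CommandRecord children (names as in the
illustration above, not the real schema):

import xml.etree.ElementTree as ET

def verify_archive(path, expected_count):
    # Re-open the (still uncompressed) archive and confirm that it parses
    # and contains the expected number of records.
    tree = ET.parse(path)          # raises ParseError if the XML is malformed
    found = len(tree.getroot().findall("CommandRecord"))
    if found != expected_count:
        raise ValueError("expected %d records, found %d" % (expected_count, found))

Only if the check passes would the corresponding rows be deleted from the
database.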

A program should be available to restore data from the XML
backup to the database. The program would read one of our
XML files and insert all of the records either into the
cdrarchive database, or the cdr database - whichever is
desired. If the data is being restored for debugging
reasons it's more convenient to restore it to the cdr
database and that may, in fact be all we want. It can later
be deleted using an ad hoc query. If we use very large
output files, we might want the ability to only load
portions of a file back into the database.
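
A hedged sketch of such a restore loop, assuming a DB-API connection
(pyodbc here) and the same hypothetical element and column names as above:

import xml.etree.ElementTree as ET
import pyodbc

def restore_commands(xml_path, conn_str, table="command_log"):
    # Table and column names are assumptions, not the real schema.
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    for rec in ET.parse(xml_path).getroot().findall("CommandRecord"):
        cursor.execute("INSERT INTO " + table +
                       " (id, received, command) VALUES (?, ?, ?)",
                       rec.get("id"), rec.get("received"), rec.text)
    conn.commit()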

It should be straightforward to modify the program (it's
hard to see why it wouldn't be) to run against the cdr
database instead of the cdrarchive database. If so, we
might wish to dispense with the cdrarchive database
altogether. This needs discussion.

Comment entered 2008-10-01 00:57:54 by alan

BZDATETIME::2008-10-01 00:57:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

(In reply to comment #1)
> We might want an automatic integrity check. After
> extracting the data and writing it to a file then:
>
> Close the file.
> Re-open it for reading.
> Repeat:
> Read a record.
> Validate it using xml tools.

Another approach would be to do the above, but also compare
each record to the corresponding record in the database.

Comment entered 2009-03-06 14:48:22 by alan

BZDATETIME::2009-03-06 14:48:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

I've read the documentation on the Microsoft utility "BCP" and
done a number of experiments to see how the various parameters
work with it, how to control it with query output, and how it
performs in comparison with a Python program.

Not surprisingly, BCP is orders of magnitude faster than Python.
Copying out 10,000 command_log_history records from Bach while
running on Mahler took 6.5 seconds in Python and .031 seconds
using BCP. For the current 17+ million records in the database,
the difference is 189 minutes vs. one minute in BCP (assuming
everything scales proportionately, which it may not for very big
files.) It's faster if I run directly on Bach, but not hugely
so.

However, the difference in time may not matter, partly because
the program need only run perhaps once a month for about 20
minutes in Python vs. less than a minute in BCP, and mostly
because the really large time taken will be in bzipping the file.
The 10,000 record output file took 42 seconds to bzip on mahler.
That means that, in the worst case, the complete process
incorporating bzip2 plus a Python program will only run 15%
slower than one incorporating bzip2 plus BCP.

It seems to me that we will always want to maximally compress the
output files. Our whole reason for doing this is to save space.
So we'll always run bzip2.

Python gives us a little more flexibility in how we extract data
from the database. In particular, it allows us to control output
file size more easily. I don't think BCP can do that. Since the
files will be large and since we can run into problems both in
compressing and in copying files if we don't optimize file sizes,
it therefore seems to me that I should write the program in
Python.
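
For illustration, one way to cap the output file size (the record source
and the file-naming helper are stand-ins, not the actual program):

def extract_with_size_cap(records, open_new_file, max_bytes):
    # records: iterable of already-serialized XML strings.
    # open_new_file(part): returns a writable file for output part 'part'.
    # Start a new output file whenever max_bytes would be exceeded.
    out, written, part = open_new_file(1), 0, 1
    for rec in records:
        if written and written + len(rec) > max_bytes:
            out.close()
            part += 1
            out, written = open_new_file(part), 0
        out.write(rec)
        written += len(rec)
    out.close()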

One final alternative is to write the program in C#. I don't
plan to do that unless someone says to. It will probably run a
little faster but, again, the limiting speed will be bzip2 and I
don't know that we want to write some programs piecemeal in
Python and some in C# unless and until there is a firm decision
to switch platforms for client programming.

The level of effort in writing the program in Python and BCP is
about the same. The part that BCP does is only a dozen lines of
code in Python. Even with BCP we still need a Python script to
run everything with the proper parameters - which will probably
involve more than a dozen lines of code.

In conclusion (have I overanalyzed this?), I'm going to write the
program in Python. It will take slightly longer than using BCP,
but it's not a hard program to write either way.

Someone stop me if I've got any of this wrong. I'm starting now.

Comment entered 2009-03-06 14:55:37 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-03-06 14:55:37
BZCOMMENTOR::Bob Kline
BZCOMMENT::4

(In reply to comment #3)

> I'm going to write the program in Python.

Sounds like the right decision.

Comment entered 2009-03-17 16:52:38 by alan

BZDATETIME::2009-03-17 16:52:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

I did a speed test of the Python bz2 module against the stand-alone
bzip2.exe Windows executable. The results were that we
should be using the bz2 Python module. It's faster and requires
far less disk space (because you don't have to write an
uncompressed output file as an intermediate before compression.)

The performance of the Python module is better than that of the
stand-alone executable. My suspicion is that the actual
compression speed is identical, but the stand-alone executable
must read in the input and later delete it. Actual performance
numbers were:

External bzip2.exe:
Write the output file to be compressed: 5.859 seconds
Compress it: 41.953 seconds
Total: 47.812 seconds

Python bz2:
Write the compressed output file: 39.187 seconds

The output files were bit for bit identical.
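
For reference, the in-process approach amounts to something like the
following sketch (the record source is hypothetical); nothing uncompressed
ever hits the disk:

import bz2

def write_compressed(records, path):
    # e.g. path = "History_2008-10-19_2008-10-22_Cmd.xml.bz2"
    out = bz2.BZ2File(path, "w")
    try:
        for rec in records:        # each rec is one serialized XML record
            out.write(rec)
    finally:
        out.close()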

Comment entered 2009-03-24 16:17:56 by alan

BZDATETIME::2009-03-24 16:17:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

I am declaring this finished.

To test, run d:\cdr\Utilities\ArchiveCmdDbgLogs.py on Mahler.

All outputs will go to the d:\cdr\Log directory. Archived
information is placed in files that begin with the substring
"History_". A record of what was done is written to
HistoryArchive.log.

Type the command with no arguments for a usage summary describing
what the program does and what the options are.

The program is in CVS.

Comment entered 2009-03-24 22:26:44 by alan

BZDATETIME::2009-03-24 22:26:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

After discussing the program with Bob, I added the following
enhancements:

--outdir dirname - Writes to another drive/directory
--lastdate #nn - Archives to last date nn days ago
--quiet - Assumes "yes" to deletion prompts

The first of these should enable us to use the network attached
storage if we wish. Combining "-outdir wherever" with "-host
bach" would enable us to do the heavy lifting (xml formatting,
compression and output) for bach from mahler or franck.
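
For example, a run of that sort might look like this (the output directory
and the 30-day cutoff are placeholders; the double-dash flag spelling
matches the later comments):

python d:\cdr\Utilities\ArchiveCmdDbgLogs.py --host bach --outdir {NAS directory} --lastdate #30 copy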

The second two enhancements are to make the program easier to use
in a scheduled batch job.

Changes are on Mahler and in CVS.

Comment entered 2009-04-30 13:11:47 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-04-30 13:11:47
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

Volker's going to test this and work with Alan to get it hooked up with a scheduler.

Comment entered 2009-06-03 13:02:51 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-06-03 13:02:51
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

I started testing this program and found a few things I'm not clear about:

1. When copying records you're writing information to the HistoryArchive.log
file. You're reporting about
a) total records and
b) Number of records
The number of records is the number of table rows being copied/deleted/etc
What exactly is the number for (a)?

2. The default database is supposed to be "cdrarchive". However, neither
BACH nor MAHLER has a table named command_log or debug_log in
cdrarchive. Instead there are two tables, command_log_history and
debug_log_history.
I am now a little confused - especially after reading issue (3) - about
what the tool is supposed to do. Could you please explain?
Could you please explain?

3. Since I thought the goal was to remove records from the command_log and
debug_log table (which don't exist in the cdrarchive DB) I used the
option to specify the database to copy/delete records.
The copy process worked as expected. Records from cdr..debug_log and
cdr..command_log got copied.
The delete process, however, appeared to have a mind of its own.
This is the command and the console message:
ArchiveCmdDbgLogs.py --firstdate=2008-10-19 --lastdate=2008-10-20
--database=cdr --quiet delete
Deleting from command log:
DELETE FROM cdr..command_log
WHERE received >= '2008-10-19'
AND received < '2008-10-21'

Deleted 110414 rows from cdr..command_log
Deleting from command log:
DELETE FROM cdrarchive..debug_log_history
WHERE recorded >= '2008-10-19'
AND recorded < '2008-10-21'

Deleted 0 rows from cdrarchive..debug_log_history

First, it deleted the records from cdr..command_log properly. Then
it reports that it is deleting from the 'command' log again but actually
tries to delete from cdrarchive..debug_log_history! Wrong DB and wrong table.
This doesn't make sense to me.

4. When trying to delete records without the --quiet option you're stopping
for confirmation before both delete commands. However, since you've written
the program to always copy/delete the same time frame from both tables
I think a single confirmation to delete should be all that's necessary.
We would not want to cancel deleting only from one of the tables, I suppose.

5. The program fails with a
file already exists
error when a 'move' command follows the same 'copy' command. This error,
however, is only displayed after the records have been processed.
If possible, I would check for the existence of the file up front.

Comment entered 2009-06-06 12:17:23 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-06-06 12:17:23
BZCOMMENTOR::Bob Kline
BZCOMMENT::10

[From Big Brother's alert system:]

==========================================================================
[2167110] bach.nci.nih.gov.disk red Sat Jun 06 11:03:33 DST 2009 [BACH.nci.nih.gov]

&red /FIXED/D/ (95%) has reached the PANIC level (95%)

Filesystem 1K-blocks Used Avail Capacity Mounted
C 71644784 14625780 57019004 20% /FIXED/C ()
D 716707840 685190144 31517668 95% /FIXED/D (Lightning-BACH)
Q 1044193 7949 1036244 0% /FIXED/Q (Quorum)
R 1342174464 676504960 665669504 50% /FIXED/R (Data)
==========================================================================

So I'm bumping up the priority of this issue.

In answer to one of Volker's questions in the previous comment, the tables in the cdrArchive database for which rows need to be siphoned off are command_log_history and debug_log_history.

Comment entered 2009-06-25 22:33:12 by alan

BZDATETIME::2009-06-25 22:33:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

(In reply to comment #9)
> I started testing this program and found a few things I'm not clear about:
> ...

Volker and I went over these and I made some revisions.

I believe it's ready for retest.

Comment entered 2009-07-01 16:57:51 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-07-01 16:57:51
BZCOMMENTOR::Volker Englisch
BZCOMMENT::12

Retesting:

> 1. When copying records your writing information to the HistoryArchive.log
> file. You're reporting about
> a) total records and
> b) Number of records
> The number of records is the number of table rows being copied/deleted/etc
> What exactly is the number for (a)?

I think I remember what confused me:
a) "total records" is reported for each day
b) "Number of records" should probably be a sum of all of the total records
but it's not!
On MAHLER I'm submitting the following command:
ArchiveCmdDbgLogs.py --lastdate=#210 copy

with the output (abbreviated):
<LogOutput>
Finished 2008-11-02: type=command total records=56 total bytes=18043
Finished 2008-11-29: type=command total records=5465 total bytes=1980561
Finished 2008-11-30: type=command total records=10992 total bytes=3978102

Total output from the cdrarchive database:
Command history:
Number of records: 11615
Earliest date: 2008-11-02
Latest date: 2008-12-01
</LogOutput>

The "Number of records" is not the sum of the total records and I don't see what it could be.

2. It would be nice if you could also answer the prompt
Do you want to overwrite "filename": (yes/no)>
with a 'y' or 'n' instead of 'yes'/'no' only

3. Please remember to modify the help message: The --lastdate date parameter
is exclusive and not inclusive.

4. There are a few problems with the date boundaries
When I use the command
ArchiveCmdDbgLogs.py --database=cdr --lastdate=#251 copy

the log output specifies the date boundaries as
First date: 2008-10-19
Last date: 2008-10-23
but only the dates from the 19th through the 22nd are archived, yet the output
filename is created as
History_2008-10-23_2008-10-23_Cmd.xml.bz2.

When I specify the dates manually
ArchiveCmdDbgLogs.py --database=cdr --firstdate=2008-10-19
--lastdate=2008-10-22 copy
the lastdate is exclusive (only the 19th - 21st are archived), and I am
getting an error message saying
File History_2008-10-19_2008-10-23_Cmd.xml.bz2 already exists
but then the process creates a new file named
History_2008-10-19_2008-10-22_Cmd.xml.bz2

I will stop here and not test the 'delete' command in order to allow you to recreate these messages exactly.

Comment entered 2009-07-11 18:53:37 by alan

BZDATETIME::2009-07-11 18:53:37
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

(In reply to comment #12)
> Retesting:
>
> with the output (abbreviated):
> <LogOutput>
> Finished 2008-11-02: type=command total records=56 total bytes=18043
> Finished 2008-11-29: type=command total records=5465 total bytes=1980561
> Finished 2008-11-30: type=command total records=10992 total bytes=3978102
>
> Total output from the cdrarchive database:
> Command history:
> Number of records: 11615
> Earliest date: 2008-11-02
> Latest date: 2008-12-01

I see that this is confusing. The "Finished" lines are not
recording what you think they're recording. What they are is
daily progress lines. In other words:

56 records were found for 2008-11-02.
No records at all were found for 11-03 ... 11-28
5409 records were found for 11-29. The "total" processed up
to the end of that date was 5465.
5527 records were found for 11-30. The total processed up to
the end of that date was 10992.

Some more records were processed after that. You may have
specified a max number of bytes, or something else may have
caused the limitation. But the final total was 11615.

So I think all the numbers produced by the program are correct,
but maybe I should report them differently. I like having a
progress report so the user can estimate how long processing is
likely to take, but maybe I should just report a day total, not
a cumulative total, or maybe I should use different words. Or
maybe I should report the date but not the number of records.

We'll talk about it when you get back.

Comment entered 2009-07-12 17:33:56 by alan

BZDATETIME::2009-07-12 17:33:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

(In reply to comment #12)
> Retesting:
...
> 2. It would be nice if you could also answer the prompt
> Do you want to overwrite "filename": (yes/no)>
> with a 'y' or 'n' instead of 'yes'/'no' only

I thought that, when deleting data, an explicit "yes" might be a
good idea. But I trust you more than I trust myself on this. So
I now accept 'y' or 'n' as well.

> 3. Please remember to modify the help message: The --lastdate date
> parameter
> is exclusive and not inclusive.

This was a bug in my code. I meant for it to be inclusive, but I
had an interesting off-by-one error caused by the fact that the
following statement is false:

'2008-11-01 00:00:01' <= '2008-11-01'

I left the message the same and fixed the code so that I think it
is now correct.
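
For the record, one common way to keep --lastdate inclusive is to query with
a strict upper bound of midnight the following day, as the DELETE statements
quoted earlier already do; a sketch:

import datetime

def date_bounds(firstdate, lastdate):
    # firstdate/lastdate are datetime.date objects taken from the options.
    upper = lastdate + datetime.timedelta(days=1)
    return firstdate.isoformat(), upper.isoformat()

# e.g. bounds for --firstdate=2008-10-19 --lastdate=2008-10-22 become
#   WHERE received >= '2008-10-19' AND received < '2008-10-23'
# which covers Oct 19-22 inclusive regardless of time-of-day components.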

> 4. There are a few problems with the date boundaries
> When I use the command
> ArchiveCmdDbgLogs.py --database=cdr --lastdate=#251 copy
>
> the log output specifies the date boundaries as
> First date: 2008-10-19
> Last date: 2008-10-23
> but only the date between the 19th and 22nd are archived but the output
> filename is created as
> History_2008-10-23_2008-10-23_Cmd.xml.bz2.
>
> When I specify the dates manually
> ArchiveCmdDbgLogs.py --database=cdr --firstdate=2008-10-19
> --lastdate=2008-10-22 copy
> the lastdate is exclusive (only the 19th - 21st are archived), I am
> getting an error message saying
> File History_2008-10-19_2008-10-23_Cmd.xml.bz2 already exists
> but then the process creates a new file named
> History_2008-10-19_2008-10-22_Cmd.xml.bz2

I believe that these were due to the off by one error.

>
> I will stop here and not test the 'delete' command in order to allow you to
> recreate these messages exactly.

We've been having a disk space panic on Bach. I've done a lot of
additional testing of this function and I plan to run it on Bach
tonight to reduce the size of the cdrarchive database, possibly
followed by a DBCC SHRINKDATABASE.

I'm currently running a DBCC SHRINKDATABASE(CDR) on Mahler. I
should have run it on CDRARCHIVE but, well, I didn't and now I
better wait for it to complete.

I will email everyone, including our sysadmins, before I do
anything.

Comment entered 2009-07-14 15:47:39 by alan

BZDATETIME::2009-07-14 15:47:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

I ran the compressions last night, extracted data from the
database, deleted the data by hand (see below), and ran DBCC
SHRINKDATABASE to recover space.

I was unable to run the deletes because, on Bach, the cdrarchive
database is not deletable except by the SQLServer admin account.
I don't think we should embed the admin password in a program, so
we'll have to think about whether we want to change the
permissions on the data, or just always delete it by hand. If we
do change the database permissions we'll still need to do
something to cope with long timeouts. See timings below.

For the record, here are the statistics from the above actions:

Starting database size:
Data: ~240 GB
Log: ~9 GB

Deleted 43 days of log files, from May 18, 2008 through June
30, 2008.

Ending database size:
Data: ~220 GB
Log: ~1 GB

Time taken for deletions: ~1 hour.

Time taken for subsequent shrink of database: ~8 hours.

Total disk space recovered: 28 GB

Except for the permissions problem, everything seemed to work
okay except that my output file name conventions were
inconsistent. Two output files labeled:

...2008-05-18_2008-05-23... and
...2008-05-23_2008-05-28...

were produced. I should fix the first one to read ...05-22.

Comment entered 2009-07-16 21:08:32 by alan

BZDATETIME::2009-07-16 21:08:32
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

I plan to extract out another 2 or 3 months worth of data
this weekend. If I do three months, that will probably free
up another 40 GB of space.

I'll be curious to see if the amount of time it takes to
shrink the database will be less, more, or the same as when
I removed the last batch. My speculation is that it will
be the same or less, on the theory that shrink time varies
with the size of the database more than with the amount of
data removed.

We'll see.

Comment entered 2009-07-19 09:32:05 by alan

BZDATETIME::2009-07-19 09:32:05
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

I extracted two more months.

I was wrong about the time taken for the database shrink. It was
significantly longer. Here are the results:

Starting database size:
Data: ~220 GB
Log: ~18 GB

Deleted 63 days of log files, from July 1, 2008 through Sept.
1, 2008.

Ending database size:
Data: ~188 GB
Log: ~5 GB

Time taken for extractions: ~6 hours.

Time taken for deletions: ~1 hour 48 minutes.

Time taken for subsequent shrink of database: ~11 hours 24 minutes.

Total disk space recovered: 45 GB

The total of all compressed log files, stored on Mahler, is now
6.8 GB.

We now have about 76 GB free on Bach.

Comment entered 2009-08-14 00:12:38 by alan

BZDATETIME::2009-08-14 00:12:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18

One of the things I think we need to do to make the archiving
functions run in a more automated but still secure fashion is to
change the ownership of the cdrarchive database on Bach.
Currently, the database is owned by "yeqian". In order to shrink
the database we have to be logged on as Qian or as sa. I tried
granting all permissions to cdr in cdrarchive, but that isn't
enough. Only sa or the db_owner can shrink the database.

I think we want to change ownership to "cdr" so that we can
execute the dbcc shrinkdatabase command from within a Python
program that connects to it in our standard way. Otherwise we'd
either have to execute the shrink from outside of the Python
environment, or else modify cdrdb.py to permit logging on as
"sa", something I'd be very reluctant to do.

There is a built-in procedure for changing the database owner,
namely sp_changedbowner. However, it won't work as things stand
because "The proposed new database owner is already a user in the
database." It looks like we have to delete "cdr" as a user in
the database, then execute sp_changedbowner while connected to
cdrarchive as "sa".
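
If we do go that route, the sequence would roughly be the following, run
while connected to cdrarchive as sa (a sketch only; the exact user-management
command depends on the SQL Server version, and the connection string is a
placeholder):

import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=bach;"
                      "DATABASE=cdrarchive;UID=sa;PWD=...")
cur = conn.cursor()
cur.execute("EXEC sp_dropuser 'cdr'")        # remove 'cdr' as an ordinary user
cur.execute("EXEC sp_changedbowner 'cdr'")   # then make 'cdr' the owner
conn.commit()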

Since this all needs to be done on Bach I decided it might be
better to consult with others before doing it on my own
initiative.

For now, I have done the following:

Granted all permissions on cdrarchive to "cdr" using the
Enterprise Manager interface. I did this to try to get the
rights to shrink the database, but it wasn't enough. I could
set them back, but haven't for now.

Siphoned off 4 months worth of data from cdrarchive,
compressing it as xml via the ArchiveCmdDbgLogs.py program.
It's currently stored on Mahler, all in:

d:\cdr\log\CmdDbgArchives*

Deleted the corresponding data from cdrarchive.

Started a shrinkdatabase command at midnight tonight
(Thursday night / Friday morning) running as sa.

The shrink will very likely run into tomorrow's work day, but I
don't expect it to be a significant competitor for resources with
cdr users. I wanted it to complete before publishing started,
and I didn't want to worry about the SQL Server connection
problems we've been experiencing on Friday evenings, or the
network maintenance scheduled for Saturday. So it seemed to me
that this is a good time to do it.

Since we're doing more work on this, I'm changing the status of
the issue to "reopened".

Comment entered 2009-08-14 16:30:09 by alan

BZDATETIME::2009-08-14 16:30:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::19

My recent extraction and compression has completed. We now
have 148 GB free on Bach.

The shrinkdatabase command took 12 hours and 23 minutes after
removing 4 months worth of entries. That's about the same as
I saw after removing 2 months in the past. So I'm hopeful
that, as we get the size down, the total time taken will
dramatically decrease.

The order of effort seems to be:

Shrinkdatabase takes the longest.
Bzip compression of the files is next. That might go faster
if I did it on Bach.
Removing the copied-out data is next, but it's not
fast by any means.

Comment entered 2009-08-20 11:49:20 by alan

BZDATETIME::2009-08-20 11:49:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::20

(In reply to comment #18)
> One of the things I think we need to do to make the archiving
> functions run in a more automated but still secure fashion is
> to change the ownership of the cdrarchive database on Bach.
...

After discussing this with the rest of the team we are now
leaning towards abandoning the cdrarchive database altogether.
We can run the ArchiveCmdDbgLogs.py program once a week, directly
on the cdr database, allowing the command and debug logs to grow
to a maximum of 3 weeks of historical data, then siphon off one
week to bring it back to 2 weeks of data. The archive program
will simply run in place of the existing script that moves
records from the cdr command_log and debug_log tables
to the cdrarchive database.

This will result in much less data storage on Bach and much
faster archiving, since only a week's worth of data will ever be
extracted, compressed, and then deleted from the database at once,
and no shrinkdatabase command will need to be executed.

I will gradually archive off the rest of the cdrarchive database
and work with Volker on replacing the existing script with the
new one.

Comment entered 2009-09-02 00:40:53 by alan

BZDATETIME::2009-09-02 00:40:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::21

I modified the archiver program to take input from the cdr
database by default instead of the cdrarchive database. I tested
all of the options on Mahler that I think would be used in a
scripted production environment:

--lastdate #nn
--outdir
--quiet
command = "move"

I think this is ready for additional testing by Volker.

I'm going to drain off the rest of the cdrarchive data tonight
and, if Volker's testing looks good on Thursday, we can disable
the old archiving script and add a new one like this:

python \cdr\Utilities\ArchiveCmdDbgLogs.py
--host bach # this is optional if we're running on bach
--outdir {some directory on the SAN}
--lastdate #15
--quiet
move

That will extract all command_log and debug_log entries from bach
that are 15 days old or older, convert them to xml, write them to
bzip2 compressed output files on the SAN, then delete the data
from the database quietly, i.e., without requiring user
confirmation of the delete.

If we did it every week, I'm guessing it would run for somewhere
around a half hour each time, but we won't know for sure until we
try it on Bach.

Comment entered 2009-09-04 00:29:40 by alan

BZDATETIME::2009-09-04 00:29:40
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22

I had thought we might implement this today but the new issue,
OCECDR-2957 has taken up all my time tonight.

Maybe when I get back from vacation.

Comment entered 2009-10-06 23:39:04 by alan

BZDATETIME::2009-10-06 23:39:04
BZCOMMENTOR::Alan Meyer
BZCOMMENT::23

I've looked at this and have restarted the cleanups. Maybe we can get to the point of putting the cleanup program into regular scheduled production next Sunday or the Sunday after.

Comment entered 2009-10-27 22:29:19 by alan

BZDATETIME::2009-10-27 22:29:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::24

This is really ready to go.

Volker, do you need to do some more testing? We should
probably get that done and put this in production.

Comment entered 2009-10-27 22:34:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-10-27 22:34:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::25

I thought you said last time that you wanted to manually extract the records over the weekend or so and then put this in production?

If you want me to test again I'll do that but not before we had a successful production run.

Comment entered 2009-10-27 22:40:39 by alan

BZDATETIME::2009-10-27 22:40:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26

I agree, publishing is important. This isn't.

Let's talk about it on Thursday morning and not do anything
until then.

Comment entered 2009-10-29 18:47:00 by alan

BZDATETIME::2009-10-29 18:47:00
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27

We discussed this today and decided that we will run the script
as a scheduled job on Mahler for a few weeks using the following
parameters:

--lastdate #14 (leave 14 days of data on the server)
--outdir whatever (put the output somewhere, maybe the SAN)
--quiet (run as a batch job, no user confirmation)

command = "move" (copy out data to compressed file, then delete)

I'm re-assigning to Volker for this.

Comment entered 2009-11-02 17:57:15 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-11-02 17:57:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::28

I've created a scheduled job named
Archive Cmd Debug Logs
in SQL Server Agent. This job will submit the command script
ArchiveCmdDbgLogs.cmd
which will start
ArchiveCmdDbgLogs.py
to do the work.

It is currently scheduled to run tonight on MAHLER to check if everything is working. If there aren't any problems tonight I'll schedule it for Sundays on MAHLER for a few weeks.

The output has been directed to be stored on the SAN at
r:\DB_Backup\Mahler\cdr\CommandDebugLogs

Comment entered 2009-11-06 11:15:47 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-11-06 11:15:47
BZCOMMENTOR::Volker Englisch
BZCOMMENT::29

(In reply to comment #28)
> The output has been directed to be stored on the SAN at
> r:\DB_Backup\Mahler\cdr\CommandDebugLogs

During the last couple of days I've tried to run this job but it failed to create the output file on the SAN due to access/permission problems.
I was unable to access the output directory
a) when accessing the drive by using the drive letter 'R' or
b) when accessing the drive by using the UNC name

Last night I was able to create the output files by writing them to a local directory
D:\Backup\Cdr\CommandDebugLogs
The job finished in about 50 minutes.

I will now run this job once a week.

Alan,
there were two files created for the Cmd log but only one for the Dbg log. Is this expected behavior?

Comment entered 2009-11-10 18:05:45 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-11-10 18:05:45
BZCOMMENTOR::Volker Englisch
BZCOMMENT::30

I've added Alan as a CC. He probably never saw my comment #29.

Comment entered 2009-12-31 16:52:17 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2009-12-31 16:52:17
BZCOMMENTOR::Volker Englisch
BZCOMMENT::31

I've tested this again yesterday to confirm the permissions problem when trying to write to the SAN.
Waiting for John to get back to me.

Comment entered 2010-01-04 16:40:52 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-04 16:40:52
BZCOMMENTOR::Volker Englisch
BZCOMMENT::32

I've worked with Carbie and Qian to resolve the issue but we ran into problems on MAHLER. MAHLER is scheduled to be rebooted tomorrow morning, and we'll continue working/testing on this issue after the reboot.

Comment entered 2010-01-07 11:15:18 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-07 11:15:18
BZCOMMENTOR::Volker Englisch
BZCOMMENT::33

We've identified and fixed one permission problem which was caused by using an incorrect mapping to the SAN.

Note: Never use R or Q to map a drive!

A second permission problem is still being worked on.

Comment entered 2010-01-08 13:00:56 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-08 13:00:56
BZCOMMENTOR::Volker Englisch
BZCOMMENT::34

(In reply to comment #33)
> A second permission problem is still being worked on.

The second permission problem is only a problem when running the scripts as a test on FRANCK or MAHLER: SQL Server doesn't have permission to switch the cluster group from one server to another. However, on BACH, where this script is supposed to run, the SAN doesn't need to be moved since it points to BACH by default. This means we can run the script on BACH without problems but need to move the cluster manually if we want to test the script on one of the other servers.

I've copied the following programs to FRANCK and BACH:
ArchiveCmdDbgLogs.py - R9285
ArchiveCmdDbgLogs.cmd - R9465

I've also created the SQL agent job named
Archive Cmd Debug Logs
This scheduled job is going to run on Sundays at 1:30am.
Alan had decided that the job should leave 14 days' worth of data in the logs and run weekly.
I'm going to run the job manually first in order to extract the existing data until we are down to about 2 weeks at which point I'm going to enable the scheduled job.

Comment entered 2010-01-08 13:15:46 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-08 13:15:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::35

(In reply to comment #34)
> I'm going to run the job manually first in order to extract the existing data
> until we are down to about 2 weeks at which point I'm going to enable the
> scheduled job.

Bob, it appears that there is nothing to archive at the moment. Is there another process that already archived the data? I expected lots of data waiting to be copied.

Comment entered 2010-01-08 15:19:33 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-01-08 15:19:33
BZCOMMENTOR::Bob Kline
BZCOMMENT::36

(In reply to comment #35)
> (In reply to comment #34)
> > I'm going to run the job manually first in order to extract the existing data
> > until we are down to about 2 weeks at which point I'm going to enable the
> > scheduled job.
>
> Bob, it appears that there is nothing to archive at the moment. Is there
> another process that already archived the data? I expected lots of data
> waiting to be copied.

It doesn't look as if Alan finished doing everything that was going to happen for this task. If you look at comment #20 you'll see that he planned to siphon off the contents of the debug_log_history and command_log_history tables and then replace the existing scripts (which move rows to those tables from the command_log and debug_log tables in the cdr database) with the new script to archive directly from the debug_log and command_log tables. However, it looks like the old archive tables have several months of logs in them, and the old script is still running to populate those tables by moving rows from the corresponding tables in the main database:

select count(*) from cdr.dbo.debug_log
-----------
1224364

select count(*) from cdr.dbo.command_log
-----------
304984

select count(*) from cdrArchive.dbo.debug_log_history
-----------
31590753

select count(*) from cdrArchive.dbo.command_log_history
-----------
7838646

select max(received), min(received) from command_log
----------------------------------------------------
2010-01-08 15:09:11.857 2010-01-03 01:33:02.847

select max(recorded), min(recorded) from debug_log
----------------------------------------------------
2010-01-08 15:09:12.827 2010-01-03 01:33:02.847

select max(recorded), min(recorded) from cdrArchive.dbo.debug_log_history
----------------------------------------------------
2010-01-03 01:28:03.747 2009-10-01 00:00:38.650

select max(received), min(received) from cdrArchive.dbo.command_log_history
----------------------------------------------------
2010-01-03 01:28:03.200 2009-10-01 00:00:38.650

Comment entered 2010-01-08 17:28:09 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-08 17:28:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::37

(In reply to comment #36)
> It doesn't look as if Alan finished doing everything that was going to happen
> for this task.

Maybe I'd better wait for Mr. Meyer to come back before I turn on the automated archive process. We've waited long enough; an extra week won't hurt.

Comment entered 2010-01-19 10:45:42 by alan

BZDATETIME::2010-01-19 10:45:42
BZCOMMENTOR::Alan Meyer
BZCOMMENT::38

(In reply to comment #37)
> (In reply to comment #36)
> > It doesn't look as if Alan finished doing everything that was going to happen
> > for this task.
>
> Maybe I better wait for Mr. Meyer to come back before I turn on the automated
> archive process. We've waited long enough, an extra week won't hurt.

I had drained off all of the cdrarchive data and turned it
over to Volker for testing, but then he ran into the permissions
problems and turned to higher priority work. The cdrarchive
db started loading up again.

I'll start a job tonight to drain off from the archive again. Let's
aim for getting everything done for a switchover by this weekend.

Comment entered 2010-01-20 00:16:01 by alan

BZDATETIME::2010-01-20 00:16:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::39

(In reply to comment #38)

> ...
> I'll start a job tonight to drain off from the archive again.
> Let's aim for getting everything done for a switchover by this
> weekend.

I ran the job to archive all log entries for October. I ran it
on Mahler, extracting the data from Bach and archiving it on
Mahler, in order to avoid space problems on Bach.

The extractions for October are complete, but the deletions are
still running. When they finish I then need to run a database
compression. That won't happen tonight, but I'll try to do it
from home tomorrow night. That will give us a bit of breathing
space. Then I'll do succeeding months.

It can take a while but we'll eventually get there.

Comment entered 2010-01-26 23:06:56 by alan

BZDATETIME::2010-01-26 23:06:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::40

I've archived and compressed all of 2009, and shrunk the database
to recover space. I'm now running the Archiver to do all the
rest of the entries in cdrarchive, Jan 1-23, 2010.

I propose that we change the script on Thursday to the new one.

I also found and fixed a bug in the archiver and promoted the fix
to all 3 servers. I had never run the archiver with no ending
date before and had an uninitialized variable.

Comment entered 2010-01-28 15:20:11 by alan

BZDATETIME::2010-01-28 15:20:11
BZCOMMENTOR::Alan Meyer
BZCOMMENT::41

The cdrarchive database is now empty and all space has
been recovered on Bach. We can take the old script out
of production for moving data from cdr to cdrarchive and
put in the new one.

Volker,

If you're too busy to do both of these, or don't feel that
the archiver is tested enough, let's just disable the old
script and let stuff accumulate in the cdr for now. Then
install the new script in the next week or two.

Comment entered 2010-02-02 19:18:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-02-02 19:18:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::42

The old archiver has been disabled and the new one, which copies data to the SAN, has been started.
It will take two weeks before the new archiver actually copies data, because we're moving data older than 14 days and there is currently only 7 days' worth of data in the database.

Comment entered 2010-02-16 11:42:03 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-02-16 11:42:03
BZCOMMENTOR::Volker Englisch
BZCOMMENT::43

(In reply to comment #42)
> It will take two weeks before the new archiver will actually copy data because
> we're moving data older than 14 days but there is currently 7 days worth of
> data in the database.

Checking the log file I noticed that the job actually didn't run yet since it had not been enabled. The scheduler was enabled but that didn't enable the job itself. I've enabled the job and this weekend we should have some data to be archived.

Comment entered 2010-02-22 10:16:38 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-02-22 10:16:38
BZCOMMENTOR::Volker Englisch
BZCOMMENT::44

The process ran successfully over the weekend (Sunday morning) and created 4 debug/command archive files on the SAN. The job ran for a little more than 50 minutes.

I think we're ready to close this issue if Alan agrees.

Comment entered 2010-02-24 15:38:06 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-02-24 15:38:06
BZCOMMENTOR::Volker Englisch
BZCOMMENT::45

This is in production and is running successfully.
Closing issue.

Comment entered 2010-02-25 17:59:37 by alan

BZDATETIME::2010-02-25 17:59:37
BZCOMMENTOR::Alan Meyer
BZCOMMENT::46

There are a few more small tasks that we need to do to completely
finish this overall task.

1. There is a small residue of records on Bach (7 days worth)
that got copied into the cdrarchive database for data from
Jan. 24-31.

I'll siphon those off tonight.

2. The T-SQL procedure that copies records from cdr to
cdrarchive won't be needed any more.

Volker will remove the procedure from the task schedule on
any servers where it exists, and save a copy of the procedure
in version control. We don't really need it, but should have
copies of any code that was ever part of the system.

3. The cdrarchive database and its two tables won't be needed
any more after the data on Bach is taken out.

We are not running the new log compression utility on Mahler
or Franck, so the command and debug logs will grow without
constraint on those two servers. However, their total size
is limited by the limits on total commands run on those
servers, and will be completely reset each time the
production database is copied to either of them.

I'll drop the entire cdrarchive database on all three
servers.

Comment entered 2010-02-26 13:24:29 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-02-26 13:24:29
BZCOMMENTOR::Volker Englisch
BZCOMMENT::47

(In reply to comment #46)
> 2. The T-SQL procedure that copies records from cdr to
> cdrarchive won't be needed any more.

The procedure has been copied to SVN
TransferDebugLogCommandLog.sql - R9504
