Issue Number | 2949 |
---|---|
Summary | Scripts for incremental refresh of CDR version archive db |
Created | 2009-08-21 00:43:00 |
Issue Type | Improvement |
Submitted By | alan |
Assigned To | alan |
Status | Closed |
Resolved | 2011-09-13 09:32:18 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107277 |
BZISSUE::4624
BZDATETIME::2009-08-21 00:43:00
BZCREATOR::Alan Meyer
BZASSIGNEE::Alan Meyer
BZQACONTACT::Bob Kline
As a follow-on to Issue #3069, we need to periodically migrate
versions of documents from the cdr database doc_version table to
the cdr_archived_versions database.
To do that, we need to create a script that can be run to do the job smoothly and safely.
BZDATETIME::2009-08-26 01:08:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1
Here are a few questions about how to do this:
1. Do we need to keep the latest version and latest publishable version xml in the cdr database?
I remember that we discussed this but I don't recall the
decision we made in the first database split. It looks like
we decided to just move all of the xml versions, and can
repeat that decision in the "top up" script.
Is that right?
2. Do we need to be in single user mode or use locking while we run the top up?
It seems to me that, if we separate the copying from the
nulling out of the xml, we don't need to be in single user
mode or to lock any rows while copying data from cdr to
cdr_archived_versions. Any process attempting to read xml
will get it from the cdr database even if it also happens to
be in the cdr_archived_versions database.
The only issue is, what happens while we null out the xml
column in the cdr database? That may be a relatively fast
process since no indexes are updated and no rows are inserted
or deleted.
3. Is there any need to run the copy and the xml nulling near
each other in time?
It seems to me that there is not.
If we do want to lock the table while the nulling of the XML
is in progress, we might do something like this:
a. Run a copy script that inserts rows into the
cdr_archived_versions database.
That script might take a long time to run, but can
run while users are on the system.
Even if it takes two days and big batch jobs occur
while it is running, that should not be a problem.
Backup should also not be a problem since we can turn
off backup of the cdr_archived_versions database
during this process - though if backup of the cdr
puts that database into single user mode, we need to
think about the impact on any long running script (or
vice versa - the impact of a long running script on
backup.)
b. Perform a backup of cdr_archived_versions database. This
should be done before we null out any xml data in the cdr
database.
c. Run a script to null out the xml columns of all rows in
the all_doc_versions table where the xml is available in
the cdr_archived_versions database.
This might need to run in single user mode or with
locking. We need to think about this in the light of
the isolation level used in our configuration.
Hopefully, this will run reasonably quickly since
there are no index updates, and no row insertions or
deletions.
Note: An alternative to comparing the xml columns in the
two databases (to find rows whose xml is non-null in the
cdr_archived_versions database and can therefore be nulled
in cdr) is to coordinate processing by datetime stamps.
If we note the specific datetime at which the
copy processing begins, when the copy processing
is finished we can simply delete all xml from the
cdr.all_doc_versions table where the datetime is
less than the copy-start datetime.
If some versions have been created in the cdr
database while the copy is going on, and if they
are copied to the archive, it makes zero
difference if the xml is not nulled out for those
doc versions.
I would think that would be significantly faster
and, theoretically, should be 100% safe. Are
theory and practice the same in this case?
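A minimal T-SQL sketch of the datetime-driven idea, using the table names that appear later in this issue; the @cutoff variable and its reuse across steps are illustrative assumptions, not the final script:

    -- Record the cutoff before the copy starts.
    DECLARE @cutoff DATETIME;
    SET @cutoff = GETDATE();

    -- Copy every version saved before the cutoff.
    INSERT INTO cdr_archived_versions..doc_version_xml (id, num, xml)
    SELECT id, num, xml
      FROM all_doc_versions
     WHERE xml IS NOT NULL
       AND dt < @cutoff;

    -- Later, after verification and a backup, null out only what is
    -- older than the same cutoff (in a separate session the cutoff
    -- value would have to be recorded and passed back in).
    UPDATE all_doc_versions
       SET xml = NULL
     WHERE xml IS NOT NULL
       AND dt < @cutoff;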
BZDATETIME::2009-08-26 01:21:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
In the scenario described in the previous comment, we can
run a compare / verify step after completing the copy to be
sure that all XML created in the cdr.all_doc_versions before
the copy-start datetime matches the xml in the
cdr_archived_versions database. That can be run while users
are on the system.
If everything matches, we should have a very high confidence
level in a deletion that works solely on datetime.
Running such a verification is probably a smart thing to do
even if we don't use a datetime driven deletion method.
BZDATETIME::2009-08-26 09:30:10
BZCOMMENTOR::Bob Kline
BZCOMMENT::3
(In reply to comment #1)
> Here are a few questions about how to do this:
>
> 1. Do we need to keep the latest version and latest publishable
> version xml in the cdr database?
>
> I remember that we discussed this but I don't recall the
> decision we made in the first database split. It looks like
> we decided to just move all of the xml versions, and can
> repeat that decision in the "top up" script.
>
> Is that right?
Right.
BZDATETIME::2009-08-27 12:18:18
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4
(In reply to comment #2)
...
> If everything matches, we should have a very high confidence
> level in a deletion that works solely on datetime.
...
On second thought, I'm inclined against a datetime based
selection of data to delete. It's probably faster but has
more opportunity for error - particularly so if the deletions
are done separately from the copies.
BZDATETIME::2009-12-29 23:24:04
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5
(In reply to comment #4)
> ...
> On second thought, I'm inclined against a datetime based
> selection of data to delete. It's probably faster but has
> more opportunity for error - particularly so if the deletions
> are done separately from the copies.
I'm now inclined towards a method that uses both datetime and
cross checking xml values between database tables. I like having
a datetime limit that frees us from at least some considerations
of updates that occur while we're processing.
Here are the queries I'm now leaning towards, and the order of
processing for executing them:
1. Backup the cdr database.
Only the active cdr database needs to be backed up. If the
existing cdr_archived_versions database backup has been
verified, it should be current.
2. Update the archived versions, as follows:
INSERT INTO cdr_archived_versions..doc_version_xml
(id, num, xml)
SELECT id, num, xml
FROM all_doc_versions
WHERE xml IS NOT NULL
AND dt < ?
The dt parameter should be a datetime <= GETDATE() at the start of the backup.
There should be no locking implications whatever of this
statement. The all_doc_versions table will never be updated
with rows with dt < the starting datetime, and the archived
versions database will not be accessed by CDR applications
for rows beyond the datetime until step 4 is run.
3. Verify the insertions:
This verification should not be necessary, but on the belt
and suspenders principle ...
a. Verify that all rows that should have been copied, were.
SELECT av.id, av.num, cav.id, cav.num
FROM all_doc_versions av
LEFT OUTER JOIN cdr_archived_versions..doc_version_xml cav
ON av.id = cav.id
AND av.num = cav.num
WHERE av.xml IS NOT NULL
AND cav.xml IS NULL
AND av.dt < ?
Using the same datetime as the first query, we exclude any
new rows that were added after the update began.
The left outer join selects any rows that
weren't copied to the archive database but should have been.
The check for "cav.xml IS NULL" tests for any rows in the
archive database that don't have xml.
The query should return zero rows. If it doesn't, we've got
a problem and need to halt the process and investigate.
b. Verify that no spurious rows were created.
SELECT id, num
FROM cdr_archived_versions..doc_version_xml
WHERE NOT EXISTS (
SELECT *
FROM all_doc_versions
WHERE id = cdr_archived_versions..doc_version_xml.id
AND num = cdr_archived_versions..doc_version_xml.num
)
My intent here is to find any spurious rows in the archive
database that don't correspond to rows in the
all_doc_versions table. It should produce 0 rows.
This worked when I tried it on a pair of test tables I
created, finding all and only rows in one table that
weren't in the other. But when I ran it on Mahler with
the real tables it returned the right answer (0 rows
affected) in only two seconds - which is a lot faster than
I expected. So I'm wondering if it really worked.
4. Delete xml from the cdr database:
UPDATE all_doc_versions
SET xml = NULL
FROM all_doc_versions av
JOIN cdr_archived_versions..doc_version_xml cav
ON av.id = cav.id
AND av.num = cav.num
AND av.xml IS NOT NULL
AND cav.xml IS NOT NULL
AND av.dt < ?
This is the final step. The final line, with the dt parameter
set to the same value used in the earlier queries, would
appear to be unnecessary since the join should achieve the
same thing. We should discuss whether it is a reasonable
double check or a useless bag on the side.
BZDATETIME::2009-12-30 12:26:53
BZCOMMENTOR::Bob Kline
BZCOMMENT::6
For step 3, I like to use an alias, which makes the SQL tighter, and also makes it more obvious what's what in the inner select:
SELECT a.id, a.num
FROM cdr_archived_versions..doc_version_xml a
WHERE NOT EXISTS (
SELECT *
FROM all_doc_versions
WHERE id = a.id
AND num = a.num
)
For step 4, you could actually simplify by just using the dt test. I don't think it's necessary to check a third time to make sure the xml values are really in the archive table:
UPDATE all_doc_versions
SET xml = NULL
WHERE dt < ?
AND xml IS NOT NULL
Your version works, too; just more complicated than it needs to be.
BZDATETIME::2010-03-11 13:34:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::7
Don't want this to languish too much longer.
BZDATETIME::2010-03-19 00:31:24
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8
I started work on this today but suspended it when I saw the
problem with the global change.
I have started working on a new program named
ArchiveDocVersionXML.py that is packaged to work as follows:
usage: ArchiveDocVersionXML.py user_id password command
command is one of:
report = Report how many documents will move.
copy = Copy XML to cdr_archived_versions.
verify = Verify copy is okay.
delete = Delete copied XML from cdr database.
all = copy/verify/delete, halt if verify fails.
Errors are reported to stderr and %s.
BZDATETIME::2010-04-08 23:27:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9
I've thought of a different way to do this that has possible
advantages and thought I better write it up before continuing on
the path I was on.
I could write a trivial program that does the following:
SELECT id, num, xml
FROM all_doc_versions
WHERE xml IS NOT NULL
ORDER BY id ASC, num ASC
Get all the results in an in-memory list. As of this
writing, there would be 353,081 pairs of integers.
For each one:
Insert the id, num, xml into the cdr_archived_versions
database.
Set xml = NULL in the cdr database.
The advantages and disadvantages are:
+ It's profoundly simple.
+ There are no issues involved in breaking off the program if
it runs into a time when the machine is needed for higher
priority work. We just start it again when it's
convenient. Some new versions may be picked up, but that's
okay.
+ We can break the job off at any time and do an interim
backup if desired. We might be able to run the backup even
while it's running.
+ We can wrap each pair of operations in a transaction to
eliminate any possible conflicts.
+ With only a single row being processed at a time, the
process won't lock large blocks of data and interfere with
production. I think we can set a "nice" value for the
process without worrying about going to sleep with a big
lock in effect.
Total computer time would be greater than methods that move
blocks of data. If each pair of operations is wrapped in a
transaction, I would expect the time to be still greater.
I don't have estimates but the time might be very
substantially greater.
I speculate that it will finish within a few days, but I'm
not sure.
We may have discussed this and rejected it, but I can't remember that we did.
If this looks attractive to the three of us, maybe we can walk
through it with Min and get her expert opinions. I probably
should have consulted with her earlier.
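For the record, a sketch of the one-pair-per-transaction idea, written here in T-SQL rather than Python to keep to a single sketch language (table names are from this issue; a NOT EXISTS guard could be added if some qualifying rows might already be in the archive):

    DECLARE @id INT, @num INT;
    WHILE 1 = 1
    BEGIN
        -- Pick any one version whose xml is still in the cdr database.
        SELECT TOP 1 @id = id, @num = num
          FROM all_doc_versions
         WHERE xml IS NOT NULL;
        IF @@ROWCOUNT = 0 BREAK;

        -- Copy it and null it out as one tiny transaction.
        BEGIN TRANSACTION;
        INSERT INTO cdr_archived_versions..doc_version_xml (id, num, xml)
        SELECT id, num, xml
          FROM all_doc_versions
         WHERE id = @id AND num = @num;
        UPDATE all_doc_versions
           SET xml = NULL
         WHERE id = @id AND num = @num;
        COMMIT TRANSACTION;
    END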
BZDATETIME::2010-04-09 12:05:16
BZCOMMENTOR::Bob Kline
BZCOMMENT::10
(In reply to comment #9)
> I could write a trivial program that does the following:
>
> SELECT id, num, xml
> FROM all_doc_versions
> WHERE xml IS NOT NULL
> ORDER BY id ASC, num ASC
>
> Get all the results in an in-memory list. As of this
> writing, there would be 353,081 pairs of integers.
That would be a huge amount of memory, because you'd have more than the
integers, you'd also have the XML for all of the rows in the results
set. What's wrong with INSERT ... SELECT TOP ...?
BZDATETIME::2010-04-09 13:18:27
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11
(In reply to comment #10)
> (In reply to comment #9)
>
> > I could write a trivial program that does the following:
> >
> > SELECT id, num, xml
> > FROM all_doc_versions
> > WHERE xml IS NOT NULL
> > ORDER BY id ASC, num ASC
> >
> > Get all the results in an in-memory list. As of this
> > writing, there would be 353,081 pairs of integers.
>
> That would be a huge amount of memory, because you'd have more than the
> integers, you'd also have the XML for all of the rows in the results
> set.
Yes, of course. Fingers flew ahead of the brain.
> What's wrong with INSERT ... SELECT TOP ...?
If you mean what I think you mean, the idea was to do each
row individually, each wrapped as a tiny transaction. Wouldn't
a bulk insert defeat that?
Or are you saying the concept of tiny transactions is of no
special value?
Or perhaps I'm overthinking this whole problem.
BZDATETIME::2010-04-09 13:22:20
BZCOMMENTOR::Bob Kline
BZCOMMENT::12
Well, the TOP [n] part would mean that the bulk insert wouldn't be too bulky, and the INSERT ... SELECT removes any gap between when you collect information about what you want to do and when you do it.
BZDATETIME::2010-04-09 13:30:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13
(In reply to comment #12)
> Well, the TOP [n] part would mean that the bulk insert wouldn't be too bulky,
> and the INSERT ... SELECT removes any gap between when you collect information
> about what you want to do and when you do it.
I suspect that, unless the [n] was very large, the cost of performing the selections multiple times would overtake the cost of wrapping single rows in transactions. We'd lose the benefit of granularity without getting back a benefit in time.
As for the gap, I'm not concerned about that. Everything selected by an initial selection will be valid since no rows ever change in the all_doc_versions table; we only add to them. Unless I'm missing something, the selection will remain valid for the duration of the run of the program.
BZDATETIME::2010-04-09 14:21:44
BZCOMMENTOR::Bob Kline
BZCOMMENT::14
(In reply to comment #13)
> I suspect that, unless the [n] was very large, the cost of performing the
> selections multiple times would overtake the cost of wrapping single rows in
> transactions. We'd lose the benefit of granularity without getting back a
> benefit in time.
Well, with your method, you'll have to do a separate SELECT for each row to go get the XML you want to move, so I guess I don't understand your analysis. What am I missing?
> As for the gap, I'm not concerned about that. Everything selected by an
> initial selection will be valid since no rows ever change in the
> all_doc_versions table; we only add to them. Unless I'm missing something, the
> selection will remain valid for the duration of the run of the program.
True. I guess I was thinking more of the distance created by having the XML move from the DBMS to a Python variable and then back to the DBMS, and I made a muddle of saying what I was talking about.
BZDATETIME::2010-04-09 14:48:27
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15
(In reply to comment #14)
> (In reply to comment #13)
>
> > I suspect that, unless the [n] was very large, the cost of performing the
> > selections multiple times would overtake the cost of wrapping single rows in
> > transactions. We'd lose the benefit of granularity without getting back a
> > benefit in time.
>
> Well, with your method, you'll have to do a separate SELECT for each row to go
> get the XML you want to move, so I guess I don't understand your analysis.
> What am I missing?
Good point.
It's hard to say what the times will be without performing experiments. The WHERE xml IS NOT NULL clause will cause the selects to take a lot more time than the pinpoint SELECT xml FROM all_doc_versions WHERE id=? AND num=?. But we do a lot more of them.
In any case, you're right that we should do an insert directly from the all_doc_versions table to the cdr_archived_versions..doc_version_xml table without doing a round trip to Python.
BZDATETIME::2010-04-09 15:30:13
BZCOMMENTOR::Bob Kline
BZCOMMENT::16
(In reply to comment #15)
> It's hard to say what the times will be without performing experiments. The
> WHERE xml IS NOT NULL clause will cause the selects to take a lot more time
> than the pinpoint SELECT xml FROM all_doc_versions WHERE id=? AND num=?.
I wouldn't worry about that too much. The query
SELECT TOP 100 id, num
FROM all_doc_versions
WHERE xml IS NOT NULL
completed in zero seconds on Bach.
BZDATETIME::2010-04-09 16:17:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17
(In reply to comment #16)
> ... The query
>
> SELECT TOP 100 id, num
> FROM all_doc_versions
> WHERE xml IS NOT NULL
>
> completed in zero seconds on Bach.
It's all in when you do it.
I did the same query for 1,000 before posting my comment #15 and it took 6 seconds. When I did it again just now it took 0 seconds. Then I did it for 2000 and it took 4 seconds. Then again for 2000 and it took 0 seconds.
I think you saw an artificially low number as a result of caching from my earlier query. If we assume the actual time is 4 seconds per thousand to get each next batch of a thousand records, we're looking at 353 * 4 seconds = around 23.5 minutes, which is more than zero, but still probably not significant.
Will doing them one by one be significantly slower, where "significant" here probably means more than an hour or two slower?
Assuming the speed is okay, then the question in my mind is whether the benefits of single record transactions are worth anything. If so, I think we should do it that way, if not, not.
Let's discuss it on Tuesday.
BZDATETIME::2010-04-14 00:28:21
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18
We discussed what to do and concluded the following:
1. A verification step is not required.
If SQL Server can't copy from one table to another without
dropping data on the floor, we would have been in deep
trouble from the very start.
2. The number of documents to transfer in a single SQL statement should be parameterized.
We can set it high for fast operation or low for locking
granularity. The locking issue is probably the most
important thing to maximize in the trade-off since it's hard
to guarantee that we wouldn't run over into time when other
jobs or users want access to the system.
I need to do some research about SQL Server locking
behaviors.
3. Deletion of XML (setting all_doc_versions.xml = NULL) can be a separate step.
This is all very straightforward, but I have run into a problem
that I'm trying to think out and haven't resolved yet.
Assume we have a statement that does something like the following:
INSERT INTO cdr_archived_versions..doc_version_xml (id, num, xml)
SELECT TOP 100 v.id, v.num, v.xml
  FROM all_doc_versions v
 WHERE v.xml IS NOT NULL
I believe that the entire insertion of all 100 rows will fail on
a primary key duplication error if a single one of the rows
already exists in the target database. And I don't believe that
SQL Server will tell us which actual row was the culprit or if
there was only one. If it did, it would be via some non-portable
method.
There are different ways to handle this. One would be to skip
the parameterized count and do these one at a time. That's the
simplest but least efficient.
Another would be to have a fallback routine that inserts one row
at a time for just those hundred rows, failing on the ones that
fail but succeeding on all the good ones. That's not bad. It
would probably never be called though I could be wrong about
that. But it would be there if it were needed.
Another would be to use the SQL Server proprietary
IF @@ERROR GOTO ...
construction in a stored procedure. According to the SQL Server
documentation that construction will goto the error routine for
the one that failed but continue to complete all the rest of the
insertions. The error routine could be essentially empty, or
could log the error information. However I haven't tested that
yet to find out if my understanding of the docs is correct.
I'm inclined against writing significant stored procedures in
TRANSACT SQL just because it's not portable and it's something we
haven't generally done. I don't foresee us ever leaving SQL
Server, but on the other hand, I don't like using a non-portable
technique if there's no really significant advantage to it.
If anyone has a preference or a fourth way to do it, I'm open to suggestions.
BZDATETIME::2010-04-14 08:53:01
BZCOMMENTOR::Bob Kline
BZCOMMENT::19
(In reply to comment #18)
> If anyone has a preference or a fourth way to do it, I'm open to
> suggestions.
I think what you want here is a "WHERE NOT EXISTS (...)" clause.
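A rough T-SQL sketch of what that combination might look like, batched INSERT ... SELECT TOP with a NOT EXISTS guard. The batch size of 1,000 is arbitrary (it would be the parameterized count from comment #18), and this is only an illustration, not the actual ArchiveDocVersionXML.py logic:

    DECLARE @copied INT;
    SET @copied = 1;
    WHILE @copied > 0
    BEGIN
        BEGIN TRANSACTION;
        INSERT INTO cdr_archived_versions..doc_version_xml (id, num, xml)
        SELECT TOP 1000 v.id, v.num, v.xml
          FROM all_doc_versions v
         WHERE v.xml IS NOT NULL
           AND NOT EXISTS (
               SELECT 1
                 FROM cdr_archived_versions..doc_version_xml a
                WHERE a.id = v.id
                  AND a.num = v.num);
        SET @copied = @@ROWCOUNT;   -- stops when nothing is left to copy
        COMMIT TRANSACTION;
    END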
BZDATETIME::2010-04-14 10:57:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::20
(In reply to comment #19)
> I think what you want here is a "WHERE NOT EXISTS (...)" clause.
DUH.
Of course. Sometimes the obvious quite eludes me.
Thanks.
BZDATETIME::2010-04-23 00:03:13
BZCOMMENTOR::Alan Meyer
BZCOMMENT::21
I should be ready for a walkthrough on this sometime on Tuesday.
BZDATETIME::2010-04-30 00:38:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22
We did a walk through today and decided that I should offer to
put both databases in single user mode if they aren't already.
The code for that turns out to be as large as the rest of the
program, since I have to check database single user status,
stop dependent services, stop SQL Server, start it, put it
in the right modes, and check that each thing I did worked
and, if not, try to recover by restoring any service that
was taken down (but only those taken down, in the right
order), with logging everywhere, etc.
I've done most of it, but there's a bit more to do. I
hope to be ready for another walk through next Tuesday.
We don't need to walk through the parts we did today
since there is no change to them - it's just the new code.
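For reference, the database mode switch itself is only a couple of T-SQL statements; the bulk of the new code is the service stop/start and recovery logic around it. A minimal sketch, assuming the dependent services have already been stopped:

    -- Take both databases to single user mode, rolling back other sessions.
    ALTER DATABASE cdr SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    ALTER DATABASE cdr_archived_versions SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- ... run the migration steps here ...

    -- Return to normal operation.
    ALTER DATABASE cdr_archived_versions SET MULTI_USER;
    ALTER DATABASE cdr SET MULTI_USER;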
BZDATETIME::2010-05-10 21:07:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::23
I've tested everything with small numbers of records on Franck,
in multi-user mode. I think everything works fine. I've also
installed a lot of logging.
I plan to test going into single user mode on Franck for a small
number of records, then do a big test of the whole database on
Franck.
BZDATETIME::2010-05-13 11:24:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::24
I ran a test on Franck on Tuesday night, requesting that XML be
copied to the archive database for 50 batches of 1,000 documents
each. I ran with SQL Server in single user mode. Statistics for
the 50,000 documents are as follows:
Run time: 57 minutes, 52 seconds
The run time includes only about 10 seconds for turning
services off and on and switching database modes from
multi- to single- and back to multi-user. It should be a
little faster if we use larger batches, but not too much
I think.
I don't know how the average size of the first 50,000
docs compares to the average size of the rest of the docs
(another ~310,000), but if it's comparable, we'll need
between 7 and 8 hours to do the copy - which is very
manageable.
Starting SQL Server transaction log size: 22,675,456
Ending size: 853,016,576
The transaction log size increased in 5 steps of unequal
size. Since the total increase was less than a gigabyte,
I presume that we'll have no trouble with transaction
log size issues.
I ran the "reportnull" function afterwards and it duly
reported that 50,000 documents can be nulled out. I'm going
to do that next.
BZDATETIME::2010-05-13 15:53:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::25
The nulling out took 58 minutes, 2 seconds. The transaction logs
did not change size. I think a significant amount of that time
was devoted to finding the documents. When I ran the null out
process previously it took 29 minutes and 42 seconds to null out
just 20 documents. So the nulling process is probably just a
half hour to do each 50,000 docs, once they have been selected.
I now propose to do the following:
Perform a complete run on Franck, in single user mode,
copying the remaining 274,620 docs. I'll do a copy followed
by a nulling out process.
When everything is done, do a database shrink on both the cdr
and cdr_archived_versions databases.
Write up the results for Bugzilla.
I may or may not be able to complete the work before leaving
Saturday morning. The shrink commands can run for many hours.
If all goes well, I think we'll be ready to run for real on Bach when I get back.
BZDATETIME::2010-05-14 13:38:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26
Testing on Franck is complete.
Stats from before and after the shrink database command
are attached.
I think we're ready to run on Bach. I propose to coordinate
with John R to do it when I get back from vacation.
Attachment ArchiveBeforeAfter.txt has been added with description: Results of the test archiving of XML versions on Franck.
BZDATETIME::2010-05-27 22:52:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27
I sent the following email tonight to John Rehmert, Bob, Volker,
Lakshmi, and William. I'm pasting in a copy here to put it in
the Bugzilla record.
---------------------------------------------------------------
I need about six hours to modify the CDR database on Bach when no people and no batch jobs are using it.
Background
----------
As you all may recall, CDR data is stored in two separate
databases:
"CDR"
Contains active data that is modified daily, backed up
nightly, and periodically copied to Franck or Mahler so
that our test and development servers have recent data on
them.
"CDR_ARCHIVED_VERSIONS"
Contains static data comprising old versions of documents
that are saved for historical archiving purposes.
Because this data does not change, it is not backed up
frequently, and not copied to Franck or Mahler. If it is
backed up and copied once, that backup and those copies
remain good no matter what changes occur in the CDR
database.
As new versions of documents are created, existing versions are
archived, but we don't immediately move them to the
CDR_ARCHIVED_VERSIONS database. If we did, we'd have to backup
that big database nightly, which is expensive to do. Instead the
existing versions are archived within the CDR database.
Software that needs to find an archived version uses a database
view that looks in both the main and the archive database,
finding the version that it needs wherever it happens to be.
Then, at some point, when the CDR database has a large number of
static versioned documents in it, we need to move that static
data from the active to the archive database, making the main
database smaller and easier to backup and transfer.
I have a script to do that. If I run it with the database in
single user mode, which is the safest and fastest way to do it,
it will probably require 3-4 hours to copy all of the static
data, then I would do some checks, then another 2-3 hours to
delete the data from the main, active database. After that I'll
return to multiuser mode and run a SQL Server shrinkdatabase to
recover the freed space. Users can work, if needed, during the
space recovery process.
Requirements for a Run
----------------------
I think I should do this in the daytime, when no batch jobs are
running, and on Saturday or Sunday when users can be off the
system. Once I start, no users or jobs will be able to logon.
Before starting, there should be a current backup of the system.
Is this coming Saturday or Sunday a good time? If not, I can
delay it to another weekend.
Thanks.
--
Alan Meyer
AM Systems, Inc.
Randallstown, MD, USA
vrmeyer@comcast.net
BZDATETIME::2010-06-06 22:05:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::28
The database migration on Bach is complete. The CDR and other
services are running again and Big Brother is re-enabled.
I have not yet run the shrinkdatabase operation to recover free
space in the file system. I don't know whether it's safe to do
right now or not.
Bach currently has 86.8 GB free. If the shrinkdatabase is done
by copying the data to a new file, it will probably work on the
cdr database but will definitely not work on the
cdr_archived_versions database. So I'd like to confirm with
someone who knows (perhaps one of you does) whether the shrink is
done by copying the active data to a new file, or by moving and
recovering space in place, like a disk defragmentation. If a
copy is done, we'll need to free up some space first.
It might also be a good idea to do a database backup of both
databases before the shrink.
I'll see where we stand on Tuesday and probably run the shrink
Tuesday night.
BZDATETIME::2010-06-09 00:54:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::29
I've started running the database compaction routine on the CDR database on Bach at 8:15 pm tonight (Tuesday). It's still running and may continue to run until morning. It may still be running when users begin using the system tomorrow (Wednesday). I don't know when the compaction will finish, but I'm guessing that it will be done by around 10 am or before. Users can use the system while it's running.
The process appears to be using up to about 9 GB of previously free disk space on Bach, however the increase in space usage seems to have slowed down. Perhaps the process uses disk for scratch memory but only needs so much. At this rate, there doesn't seem to be any threat of running out of space as we still have almost 88 GB free and there's been no change for a long time.
William,
Users may experience a little more slowness than usual tomorrow morning if the process is still running. However I don't think it should be serious. I've run a number of things during the process and didn't notice any problems.
BZDATETIME::2010-06-09 10:36:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::30
(In reply to comment #29)
> ... I don't know when the compaction will finish, but I'm
> guessing that it will be done by around 10 am or before.
> Users can use the system while it's running.
It appears to have actually completed around 3:23 am. We're
done with that part and have recovered a ton of disk space.
I still have to run a compaction on the cdr_archived_versions
database; then the entire task will be completed.
BZDATETIME::2010-06-09 11:17:58
BZCOMMENTOR::Bob Kline
BZCOMMENT::31
(In reply to comment #30)
> I still have to run a compaction on the cdr_archived_versions
> database; then the entire task will be completed.
The new script is in the Database directory in the svn repository.
I believe we need to get John to back up the updated version database (he was going to set that up as a scheduled backup to run every six months, but we explained to him that the timing of the backups needs to be linked to the process of refreshing that database). Let's make sure the documentation for using your new script includes very prominent instructions for requesting a fresh backup.
My understanding of how issue #3069 came out is that synchronizing the archive database with Mahler and Franck is deferred until the next refresh jobs for those servers, right? I'm trying to find instructions on making sure this is done when necessary in the documentation [1][2] for refreshing Franck and Mahler, but no luck yet (though I do see the warning in [3] along these lines; does that documentation page need updating?).
[1] http://bach.nci.nih.gov/cgi-bin/cdr/Filter.py?DocId=CDR0000269608&Filter=name:Documentation+Help+Screens+Filter
[2] http://bach.nci.nih.gov/cgi-bin/cdr/Filter.py?DocId=CDR0000417891&Filter=name:Documentation+Help+Screens+Filter
[3] http://bach.nci.nih.gov/cgi-bin/cdr/Filter.py?DocId=CDR0000617955&Filter=name:Documentation+Help+Screens+Filter
BZDATETIME::2010-06-09 11:36:10
BZCOMMENTOR::Volker Englisch
BZCOMMENT::32
(In reply to comment #31)
> My understanding of how issue #3069 came out is that synchronizing the archive
> database with Mahler and Franck is deferred until the next refresh jobs for
> those servers, right?
The way I remember the process is that we can update the archive DB any time. We do not have to wait for the regular refresh. However, once the archive DB has been "topped-off" on BACH the regular refresh can only be done after the archive refresh has been performed.
BZDATETIME::2010-06-09 11:47:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::33
(In reply to comment #31)
> I believe we need to get John to back up the updated version
> database (he was going to set that up as a scheduled backup to
> run every six months, but we explained to him that the timing
> of the backups needs to be linked to the process of refreshing
> that database). Let's make sure the documentation for using
> your new script includes very prominent instructions for
> requesting a fresh backup.
Yes. I'll create a checklist and store it in the program itself
(so it won't get lost) and/or in the system documentation.
In the meantime, Volker took his own backup of the archive
database before I modified it.
Volker, would you do another backup of the archive database?
BZDATETIME::2010-06-09 11:51:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::34
(In reply to comment #32)
> (In reply to comment #31)
> > My understanding of how issue #3069 came out is that synchronizing the archive
> > database with Mahler and Franck is deferred until the next refresh jobs for
> > those servers, right?
>
> The way I remember the process is that we can update the archive DB any time.
> We do not have to wait for the regular refresh. However, once the archive DB
> has been "topped-off" on BACH the regular refresh can only be done after the
> archive refresh has been performed.
I agree.
We've had enough trouble with the copying of large files that
we probably ought to go ahead and copy the archive to mahler
and franck so it will be ready when we do need to refresh the
active database on either of those machines.
This could be done in conjunction with a backup of our own,
independent of John's backup.
I was premature in babbling about the entire task being completed in comment 30.
BZDATETIME::2010-06-09 11:52:10
BZCOMMENTOR::Volker Englisch
BZCOMMENT::35
(In reply to comment #33)
> In the meantime, Volker took his own backup of the archive
> database before I modified it.
I wouldn't say it this way: I ran the script on BACH which created
the backup.
If anybody ran the script again, the backup I had started manually would
now be overwritten.
> Volker, would you do another backup of the archive database?
When would you like me to run this?
BZDATETIME::2010-06-09 11:56:05
BZCOMMENTOR::Volker Englisch
BZCOMMENT::36
(In reply to comment #34)
> We've had enough trouble with the copying of large files that
> we probably ought to go ahead and copy the archive to mahler
> and franck
Isn't the backup file on the SAN? We just need to point the SAN to the other servers, use the same backup file and refresh the archive DB. There wouldn't be any additional copy involved. We just need to make sure the refresh is completed before the regular backup on BACH starts, since that process would try to grab the SAN again.
BZDATETIME::2010-06-09 12:04:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::37
(In reply to comment #35)
> (In reply to comment #33)
> > In the meantime, Volker took his own backup of the archive
> > database before I modified it.
>
> I wouldn't say it this way: I ran the script on BACH which created the backup.
> If anybody ran the script again, the backup I had started manually would now
> be overwritten.
>
> > Volker, would you do another backup of the archive database?
>
> When would you like me to run this?
The day before yesterday would be ideal.
Get as close to that as practical.
BZDATETIME::2010-06-09 12:07:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::38
(In reply to comment #36)
>
> Isn't the backup file on the SAN?
I don't know.
I somehow had it in my head that John's backup would work
either every six months, or when the archive database changed.
But I'm not sure that isn't just wishful remembering on my
part.
I'm not sure what's on the SAN.
BZDATETIME::2010-06-09 12:09:00
BZCOMMENTOR::Alan Meyer
BZCOMMENT::39
(In reply to comment #38)
> I'm not sure what's on the SAN.
The snapshot I asked you to take was from before the archive
migration. So if that's what you (Volker) were referring to,
it's not what we need for now.
BZDATETIME::2010-06-09 12:17:14
BZCOMMENTOR::Volker Englisch
BZCOMMENT::40
I just checked on the SAN. We do have a file there that's
called:
V:\DB_Backup\Bach\cdr\CDR_Archive\CDR_archived_versions.zip
with a size of 15GB (the CDR DB is 32GB compressed). The file was
created on 6/1, which is a Tuesday, so I bet that's the file we created,
Alan.
I renamed that file (to ..._before.zip) so that I won't overwrite it and will start the new backup. I would guess the users wouldn't be affected too much by the process or would you think I should wait until this evening to start the process?
BZDATETIME::2010-06-09 12:29:23
BZCOMMENTOR::Alan Meyer
BZCOMMENT::41
(In reply to comment #40)
> I would guess the users wouldn't be affected too much by
> the process or would you think I should wait until this evening to
> start the process?
I think it's fine to run it now. Microsoft has faced the problem
of taking backups of running systems for decades and should have it
pretty well optimized and, of course, none of the data will be
changed either by CDR users or by the backup, so locks should not
be a problem.
BZDATETIME::2010-06-09 12:41:23
BZCOMMENTOR::Bob Kline
BZCOMMENT::42
(In reply to comment #34)
> We've had enough trouble with the copying of large files that
> we probably ought to go ahead and copy the archive to mahler
> and franck so it will be ready when we do need to refresh the
> active database on either of those machines.
I don't have a strong feeling one way or the other. The most important thing is to ensure that whichever approach is settled on is clearly documented (by updating [1], which needs to incorporate information about the new script anyway) in the online help system.
BZDATETIME::2010-06-09 14:07:46
BZCOMMENTOR::Volker Englisch
BZCOMMENT::43
(In reply to comment #40)
> I just checked on the SAN. We do have a file there that's called:
> V:\DB_Backup\Bach\cdr\CDR_Archive\CDR_archived_versions.zip
> with a size of 15GB (the CDR DB is 32GB compressed).
The backup is complete. The file size of the new file is 21GB.
BZDATETIME::2010-06-10 23:35:42
BZCOMMENTOR::Alan Meyer
BZCOMMENT::44
I performed the shrinkdatabase on the cdr_archived_versions
database. As expected, the shrink only recovered a very small
amount of space.
The transaction log is still surprisingly large (42.5 GB). If
the last backup was done WITH COPY_ONLY, we might want to do
another backup without that clause to reduce the log size. We
should ask Qian or Min for advice on that.
I will work next on updating the system documentation to enable
operators in the future to run the script safely.
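A sketch of how the log situation could be checked and, after a regular (non-COPY_ONLY) log backup, how the log file could be shrunk. The logical log file name and the backup path below are placeholders, not the real names on Bach, and the final word should still come from Qian or Min:

    -- Show the size and percent used of each database's transaction log.
    DBCC SQLPERF (LOGSPACE);

    -- A regular (non-COPY_ONLY) log backup truncates the inactive part
    -- of the log; the path below is a placeholder.
    BACKUP LOG cdr_archived_versions
        TO DISK = 'D:\DB_Backup\cdr_archived_versions_log.bak';

    -- Then the log file itself can be shrunk (target size in MB).
    -- The logical file name here is a placeholder; check the database's
    -- file list for the real one.
    USE cdr_archived_versions;
    DBCC SHRINKFILE (cdr_archived_versions_log, 2048);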
BZDATETIME::2010-06-15 23:10:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::45
I've revised the first section of CDR617955 on Bach and created a
version of my changes. The changes I made are:
Changed the terminology to use "migration" for the copying of
data from the active to the archive database, leaving the
term "refresh" for copying data from production to test
and/or development. We were overloading the term "refresh"
in a possibly confusing way.
Beefed up the explanation of why we migrate the data.
Elaborated all the steps in the migration. That was the
major change.
Made some probably gratuitous changes to the capitalization
of names to match my own aesthetic sensibilities. After I
wrote my parts I realized I was using a little different
style from what was there so, in traditional arrogant
programmer fashion, I revised everything to my style. Sorry
about that.
Except for some capitalization changes affected by my global
search and replace operations, I have not touched the sections
below the migration section that discuss how to refresh the
databases on Mahler and Franck. The material there was based on
existing techniques and, presumably, was more up to date.
BZDATETIME::2010-06-16 10:29:55
BZCOMMENTOR::Volker Englisch
BZCOMMENT::46
(In reply to comment #45)
> I was using a little different style from what was there so, in
> traditional arrogant programmer fashion, I revised everything to
> my style.
> Sorry about that.
Don't worry, I'll change it back with the next update. :-)
BZDATETIME::2010-06-24 16:50:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::47
I have an email from Qian saying that the cdr_archived_versions database has been backed up. If I understand her correctly, it looks like she put the backup files in d:\DB_Backup\Daily\TransactionLogBackup, where they will sit (or have sat) until the file system backup sees them (or has seen them) and backs them up.
I see that the backup files have an extension .zip. My unzip utility can't read them but that's presumably because their size is too large (John thinks that's all it is too.)
I think we should copy that backup file to Mahler and restore it there. Our theory is that running with the new archived versions data should work fine even with the old cdr database. Besides testing the backup file and preparing Mahler for the next refresh, this will also verify whether an old cdr with a new cdr_archived_versions really works.
BZDATETIME::2010-07-08 23:48:47
BZCOMMENTOR::Alan Meyer
BZCOMMENT::48
I had planned to install the cdr_archived_versions database on
Mahler tonight but it's not going to happen. The preparation is
taking too long.
I've successfully (I think) copied the backup zip file from Bach
to Mahler. However I'm trying to complete some processes before
restoring the data on Mahler.
1. Making yet another backup of the existing
cdr_archived_versions database on Mahler.
We already have some of these lying around but it never hurts
to have a fresh one. That too is taking a long time. It has
run for about 2 hours already and I have no way to tell how
much longer it will take.
2. Unzip the backup file created on Bach.
This too is taking a very long time. I'm not even certain it
will be done by tomorrow morning.
I'm using WinZIP to do this since our versions of Info-zip
don't support large files. I have found Win32 binaries of
the latest Info-zip which do claim to support large files and
put the downloads on Mahler in d:\downloads\infozip. I have
not tried to install anything however since I think we should
test these and I don't have time to do that right now.
3. Estimate required disk resources.
I'll need to see how much space is left after the new backup
and the unzip, and compare that to the space taken by the
cdr_archived_versions database on Bach. If there's not
enough space, we'll need to do some more preparation, for
example, moving data off to the SAN before running.
Compressing the new backup may not be a good option for now
since, if we need it, we probably won't want to lose all the
time it will take to uncompress it.
If it looks like there might not be enough space, I won't
take any action without further consultations on what to do.
In any case, the preparation won't be complete in time for me to do the restore tonight.
Creating the archive data on Bach took about 5 hours. The
restore will create the same data on Mahler, plus all of the data
that already existed on Bach prior to our most recent migration.
So, while I don't know how long it will take to restore the
database, I think we should block out as much time as we can to
be sure of having everything done when we need it. The database
will be in single user mode during the restore.
Is Friday night or Saturday an okay time to start it? If things
go wildly wrong, or take too long, it gives me until Monday
morning to get things working again.
I'll only be doing this on Mahler. Bach is done and we'll do
Franck after we've seen that everything works fine on Mahler.
BZDATETIME::2010-07-09 06:34:12
BZCOMMENTOR::Bob Kline
BZCOMMENT::49
(In reply to comment #48)
> Is Friday night or Saturday an okay time to start it? If things
> go wildly wrong, or take too long, it gives me until Monday
> morning to get things working again.
Sounds reasonable.
BZDATETIME::2010-07-09 11:55:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::50
The preparation is complete.
We have 237 GB free on Mahler. Assuming the database uses the
same space required on Bach, we'll need 140 GB. Even assuming
some scratch space during the restore, I think it should be
enough.
I'll start the process tonight.
BZDATETIME::2010-07-22 18:31:33
BZCOMMENTOR::Alan Meyer
BZCOMMENT::51
Bob successfully loaded the new cdr_archived_versions database
from Bach onto Mahler using the utility he wrote for forcing
Windows to properly copy large files. He loaded it as
xcdr_archived_versions and ran the script (CreateLogins.sql?) to
grant the proper permissions to it.
Volker and I renamed the databases:
cdr_archived_versions -> old_cdr_archived_versions
xcdr_archived_versions -> cdr_archived_versions
We ran a few trivial tests:
Checked that the document history report worked on a random
document with an old last saved version. All versions of
this document (CDR65000) were in the old
cdr_archived_versions table.
Saved a new version of that document and checked that it
appeared correctly in the document version report.
Checked that it worked on the very last document version to
have been created before the rename. Many versions of this
document (CDR28327) were not in the old cdr_archived_versions
table.
Saved a new version of that document and checked that it
appeared correctly in the document version report.
I propose the following:
Run for a few days on Mahler, making sure nothing is broken.
Refresh the cdr database on Mahler from Bach.
Run for a few more days to make sure nothing is broken.
Do it all on Franck.
BZDATETIME::2010-07-22 18:55:08
BZCOMMENTOR::Bob Kline
BZCOMMENT::52
(In reply to comment #51)
> I propose the following:
>
> ....
>
> Do it all on Franck.
Before that step I suggest we discuss the possibility of using a different technique for propagating new rows to Franck (and eventually Mahler), namely just copying the rows added on Bach since the previous event.
BZDATETIME::2010-08-03 13:50:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::53
For the record: It appears that we have everything working now. We were able to update Mahler after working around the Windows file system bug for copying big files using Bob's utility to restart a copy at some known good point. We were able to update Franck using a new Python script by Bob that copied one record at a time from Bach to a destination machine (Franck in this case). Both methods were verified using a compare script written by Bob.
I propose that we gather all of the relevant scripts together in one place - I've been using d:\cdr\Database - and update the System Documentation to include the latest information about refreshing Franck and Bach and verifying the refresh.
BZDATETIME::2010-08-04 08:35:01
BZCOMMENTOR::Bob Kline
BZCOMMENT::54
(In reply to comment #53)
> I propose that we gather all of the relevant scripts together in one place -
> I've been using d:\cdr\Database - and update the System Documentation to
> include the latest information about refreshing Franck and Bach and verifying
> the refresh.
I have moved the two scripts I created into %CDRTRUNK%/Database:
GetNewArchivedVersions.py - brings Mahler or Franck in sync with Bach
check-archived-version-xml.py - verifies that local archive version table
    matches the one on Bach
Please update the documentation to refer to the use of these two tools.
BZDATETIME::2010-08-10 11:38:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::55
(In reply to comment #54)
> Please update the documentation to refer to the use of these two tools.
The document describing how to do all this has two parts, one for migrating XML versions from the cdr to the cdr_archived_versions database, and one for refreshing Franck and Mahler.
To facilitate management of larger documentation documents, I propose that we make a minor change to the DocumentationToC schema as follows:
From:
<complexType name = 'ToCEntry'>
<choice minOccurs = '1'
maxOccurs = '1'>
<element name = 'ConstructedTopic'
type = 'NotEmptyString' />
<element name = 'DocumentationRef'
type = 'DocumentationRef' />
</choice>
To:
<complexType name = 'ToCEntry'>
<choice minOccurs = '1'
maxOccurs = '1'>
<element name = 'ConstructedTopic'
type = 'NotEmptyString'/>
<element name = 'DocumentationRef'
type = 'DocumentationRef'/>
<!-- HERE'S THE NEW PART -->
<element name = 'DocumentationFragmentRef'
type = 'DocumentationFragmentRef'/>
</choice>
That enables us to link to a specific section inside a document from the table of contents. [I've also removed the odd whitespace found before "/>"]
BZDATETIME::2010-08-10 11:41:12
BZCOMMENTOR::Bob Kline
BZCOMMENT::56
(In reply to comment #55)
> To facilitate management of larger documentation documents, I propose that we
> make a minor change to the DocumentationToC schema as follows:
....
OK with me.
BZDATETIME::2010-08-10 14:49:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::57
(In reply to comment #55)
> <!-- HERE'S THE NEW PART -->
> <element name = 'DocumentationFragmentRef'
> type = 'DocumentationFragmentRef'/>
Bob pointed out that, for this to work, the filter that formats a
help document for display has to support construction of HTML id
attributes from cdr:id attributes.
It turns out that the filter (CDR0000339576) does already support
this for selected elements. I'm not sure every possible element
is covered, but many are, including <p>, <ul>, <ol>,
and <table>.
I think we should go ahead with the change recommended in comment
#55. If some cdr:id is not translated in the filter that we wish
to link to, we can modify the filter or the link (e.g., to point
to the first <p> after the desired element.) In the worst case,
the link will fail but the default browser behavior will take the
user to the top of the document.
I'll implement it.
BZDATETIME::2010-08-11 00:36:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::58
(In reply to comment #54)
> Please update the documentation to refer to the use of these two tools.
I have edited and re-organized the documentation and checked in
and made versions of three documents:
CDR256207 - The system documentation table of contents.
CDR617955 - Migrate XML Versions from the "cdr" to the
"cdr_archived_versions" Database.
CDR683474 - Refresh CDR_Archived_Versions Database on
MAHLER/FRANCK.
I implemented the change I suggested in comment #55 but I decided
not to use it. The number of instructions for the task of
copying data to Franck and Mahler was large enough that I thought
it was better to make it a separate document. I just have a few
short paragraphs in the migration document and refer to the
copying document for the details.
After looking at these documents all day long, changing this,
moving that, deleting the other, cutting and pasting text from
individual elements in one document to another, I cannot bear to
look at them any more tonight.
I'll try to do one more run through Thursday, then turn it over
to Bob for QA.
BZDATETIME::2010-08-26 13:31:46
BZCOMMENTOR::Bob Kline
BZCOMMENT::59
Adjusted title to reflect the actual scope of the task.
BZDATETIME::2010-08-27 00:28:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::60
I'm marking this resolved-fixed.
The only part that hasn't been reviewed is the final changes
to documentation.
BZDATETIME::2010-09-14 18:09:13
BZCOMMENTOR::Bob Kline
BZCOMMENT::61
I have reviewed the documentation, which looks good (fixed a couple of typos). Note that comment #58 above switched the CDR IDs for the second ("Migrate ...") and third ("Refresh ...") documents. Closing issue.
File Name | Posted | User |
---|---|---|
ArchiveBeforeAfter.txt | 2010-05-14 13:38:08 |