CDR Tickets

Issue Number 3151
Summary [HP Summary Section] Global to remove Purpose section from HP summaries
Created 2010-05-13 09:52:48
Issue Type Improvement
Submitted By Beckwith, Margaret (NIH/NCI) [E]
Assigned To alan
Status Closed
Resolved 2010-08-30 11:22:55
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107479
Description

BZISSUE::4838
BZDATETIME::2010-05-13 09:52:48
BZCREATOR::Margaret Beckwith
BZASSIGNEE::Alan Meyer
BZQACONTACT::William Osei-Poku

At the same time that we add the new summary section called About This PDQ Summary to each HP summary, we will need to remove the Purpose section from each HP summary.

Comment entered 2010-05-13 11:37:53 by alan

BZDATETIME::2010-05-13 11:37:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

Checking on Bach, there are 175 Summaries that contain a
SummarySection with a Title subelement with the value "Purpose of
this PDQ Summary". I presume that is the SummarySection that
should be removed.

I notice that at least some of them contain information that is
specific to the Summary, including section metadata and links to
patient and Spanish summaries. Wouldn't we need to preserve that
information somewhere?

Comment entered 2010-05-13 12:00:11 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-05-13 12:00:11
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::2

That's what we need to talk about--whether we need to preserve any of the information in there. I actually think since we have old versions that have it we are probably good to just delete the sections. We aren't going to use what is in there in the new section. But Lakshmi may have different ideas.

Comment entered 2010-06-10 10:27:46 by alan

BZDATETIME::2010-06-10 10:27:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

Maybe we can discuss this at today's status meeting.

Comment entered 2010-06-10 15:59:38 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-06-10 15:59:38
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::4

We decided that it is fine to completely get rid of the Purpose section; we don't need to keep any of the text.

Comment entered 2010-06-16 00:27:50 by alan

BZDATETIME::2010-06-16 00:27:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

Here is my understanding of the requirements for this task. I'd
like to confirm that I've got it right.

Once Volker's changes for OCECDR-3149 are approved and ready to
move to production, but just before they are actually moved to
production, I should run a global change to do the following:

For every health professional Summary:

If there is a SummarySection with the SectionTitle =
"Purpose of This PDQ Summary":

Delete the entire SummarySection.

Removing the entire SummarySection removes more than just the
paragraph that begins with:

"This PDQ cancer information summary for health professionals
provides comprehensive, peer-reviewed, evidence-based
information about the treatment of bladder cancer. This
summary is reviewed regularly and updated as necessary ..."

It also removes the paragraph on:

"Information about the following is included in this Summary"

and

"This summary is intended as a resource to inform and assist
clinicians who care for cancer patients. It does not provide
formal guidelines ..."

and

"This summary is available in a patient version ..."

All of that will be deleted from the document.

Once I get the program running and tested in test mode, we should
run it in live mode on Mahler on a few documents and then test
mine and Volker's changes using Publish Preview to be sure that
the changes that we have each made work together.

I'm adding a dependency on OCECDR-3149 and vice versa since both
need to be ready to run at the same time.
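
A minimal sketch of the deletion step described in this comment, in Python. It is not the actual global change program: the function name, the sample document, and the use of xml.etree.ElementTree are assumptions for illustration, and it matches on a Title child as described in comment #1.

```python
import xml.etree.ElementTree as ET

# Assumed: an exact match on the section's Title text identifies the section.
PURPOSE_TITLES = {"Purpose of This PDQ Summary"}


def drop_purpose_sections(xml_text, titles=PURPOSE_TITLES):
    """Return (new_xml, count) with matching SummarySection elements removed."""
    root = ET.fromstring(xml_text)
    removed = 0
    # Treat every element as a potential parent so nested sections are found.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "SummarySection":
                continue
            title = child.find("Title")
            if title is None:
                continue
            if "".join(title.itertext()).strip() in titles:
                # Removes the whole section, including every paragraph in it.
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Made-up sample document, not real CDR content.
sample = (
    "<Summary>"
    "<SummarySection><Title>Purpose of This PDQ Summary</Title>"
    "<Para>This summary is available in a patient version ...</Para>"
    "</SummarySection>"
    "<SummarySection><Title>General Information</Title>"
    "<Para>Kept.</Para></SummarySection>"
    "</Summary>"
)
new_xml, removed = drop_purpose_sections(sample)
```

Documents without a matching section come back unchanged with a count of 0, so they can simply be skipped.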

Comment entered 2010-06-17 10:08:28 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-06-17 10:08:28
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::6

That sounds right to me Alan. And while we want to coordinate this with the new section at the end of the summaries, it wouldn't be the end of the world if we had both sections in the summaries at the same time or if the Purpose section disappeared shortly before the new section got added.

We still have to work out the problem of inserting the correct text into the new PurposeText element in each summary, so I don't think any of this is going to happen before Volker gets back from vacation (end of July).

Comment entered 2010-06-17 10:16:03 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-17 10:16:03
BZCOMMENTOR::Volker Englisch
BZCOMMENT::7

(In reply to comment #6)
> I don't think any of this is going to happen before Volker
> gets back from vacation (end of July).

I wish. It's the 15th of July when I will be back.

Comment entered 2010-06-17 23:38:56 by alan

BZDATETIME::2010-06-17 23:38:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8

I wrote a first draft of the global change and ran it on Mahler.
Results are in:

http://mahler.nci.nih.gov/cgi-bin/cdr/ShowGlobalChangeTestResults.py?dir=2010-06-17_22-47-47

I discovered a few issues:

1. Not all documents have Purpose SummarySections.

For example CDR0000062772 does not.

I presume these are fine as they are and should be left alone.

2. There are 34 blocked documents.

Should these be processed? I don't see any obvious reason
why they should not.

3. Spanish documents were not transformed.

I selected all Health professional Summaries and dropped any
SummarySection with the title "Purpose of This PDQ Summary".

That skipped all the Spanish documents.

Should I also look for "Propósito de este sumario del PDQ"?

Comment entered 2010-06-18 00:17:01 by alan

BZDATETIME::2010-06-18 00:17:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

(In reply to comment #8)
> I wrote a first draft of the global change and ran it on Mahler.
> Results are in:
> ...

There were 11 locked documents on Mahler, with locks that were
presumably brought over from Bach during the last refresh.
However this is of no significance for now and I haven't bothered
to attach a log file that identifies them.

Comment entered 2010-06-29 14:41:39 by alan

BZDATETIME::2010-06-29 14:41:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

(In reply to comment #8)
> ...
> I discovered a few issues:
> ...

No one has weighed in on these. I presume we'll need to
wait for Margaret to return and comment on them.

Comment entered 2010-07-07 13:09:06 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-07-07 13:09:06
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::11

> I discovered a few issues:
> 1. Not all documents have Purpose SummarySections.
> For example CDR0000062772 does not.
> I presume these are fine as they are and should be left alone. COMMENT FROM MB: YES, leave them alone.

> 2. There are 34 blocked documents.
> Should these be processed? I don't see any obvious reason
> why they should not. COMMENT FROM MB: YES, okay to process them.

> 3. Spanish documents were not transformed.
> I selected all Health professional Summaries and dropped any
> SummarySection with the title "Purpose of This PDQ Summary".
> That skipped all the Spanish documents.
> Should I also look for "Propósito de este sumario del PDQ"? COMMENT FROM MB: YES

Comment entered 2010-07-08 21:21:48 by alan

BZDATETIME::2010-07-08 21:21:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

I've done another test run on Mahler. This one includes the Spanish summaries.

The log file is attached. The only problems reported are the usual record locks one expects on Mahler.

Comment entered 2010-07-08 21:21:48 by alan

Attachment Request4838.log has been added with description: Log file from test run on Mahler

Comment entered 2010-07-19 16:44:09 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-07-19 16:44:09
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::13

So does this mean we are ready to do a live run on Mahler which would actually remove the Purpose section from the summaries?

Comment entered 2010-07-20 14:25:16 by alan

BZDATETIME::2010-07-20 14:25:16
BZCOMMENTOR::Alan Meyer
BZCOMMENT::14

(In reply to comment #13)
> So does this mean we are ready to do a live run on Mahler which would actually
> remove the Purpose section from the summaries?

Yes. We can try it live on Mahler. We can't run on Bach however without coordinating with Volker's changes.

Comment entered 2010-07-22 20:04:00 by alan

BZDATETIME::2010-07-22 20:04:00
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

I've tested live on Mahler.

The log file is attached.

Comment entered 2010-07-22 20:04:00 by alan

Attachment Request4838.log has been added with description: Log file from live run on Mahler

Comment entered 2010-07-26 15:33:37 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-07-26 15:33:37
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::16

In looking at the live run log file I don't understand what a lot of the messages mean, and also why there are so many when there weren't any in the test run?

Comment entered 2010-07-27 11:43:42 by alan

BZDATETIME::2010-07-27 11:43:42
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

(In reply to comment #16)
> In looking at the live run log file I don't understand what a lot of the
> messages mean, and also why there are so many when there weren't any in the
> test run?

The answer to the second question has to do with the way validation is done in the global change programs.

When running in test mode, we first validate the old version of a document. If and only if the old untransformed version of the document was valid, we go on to validate the new version. If the old version was valid and the new version is invalid, we report that.

The reason for that behavior is that we decided at some time in the past that the purpose of validation in test mode is not to determine if documents are valid, but to determine if the change that we made in the global change caused a valid document to become invalid.

Since the global change in test mode did not make any valid documents invalid, no validation errors were reported. In live mode the behavior is different. We report the real validation messages, at least when saving a publishable version.

We could, of course, change this behavior. Maybe we should discuss it at the next status meeting.

As for your first question, I'll investigate further and post again.

Comment entered 2010-07-27 15:29:48 by alan

BZDATETIME::2010-07-27 15:29:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18

(In reply to comment #17)
> (In reply to comment #16)
> > In looking at the live run log file I don't understand what a
> > lot of the messages mean ...

I worked through a number of the Warning messages in the log
file. There appear to be two main sources of error:

1. Blocked documents.

3 of the first 15 I checked were blocked. Most of these are
probably also blocked on Bach. See below for a way to make
these clearer in the log file.

2. Documents that are wrong on Mahler but right on Bach.

These were cleaned up since the last refresh from Bach to
Mahler.

Maybe we should ignore the log files on Mahler and just run
the test mode on Bach before worrying about error messages.
There may or may not still be some, but there surely won't be
as many.

So the messages look alarming, but they appear to mostly be false
alarms.

We can reduce the confusion caused by errors in blocked documents
in several ways. We could not change them, not validate them, or
specially mark them in the log file. I like the third of those
options best. For example, we could do something like the
following (lines are wrapped for readability here.)

2010-07-22 18:38:02: Warning for CDR0000062698:
cdr:id "_282" used more than once
2010-07-22 18:39:55: Warning for BLOCKED CDR0000062735:
cdr:id matching fragment '_112' not found in this document

Alternatively, we might indicate document status in the initial
message for every processed document, e.g.:

2010-07-22 18:37:59: Processing CDR0000062698
2010-07-22 18:39:52: Processing BLOCKED CDR0000062735

or more tersely, using 'A' for Active and 'I' for Inactive
(blocked):

2010-07-22 18:37:59: Processing A CDR0000062698
2010-07-22 18:39:52: Processing I CDR0000062735

I can't think of any downside to doing that. It might also be
useful to put an indicator in the web page test results.

If there is a question about any specific error message, I can
explain it.

Comment entered 2010-07-27 17:12:46 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-07-27 17:12:46
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::19

I think indicating that a document is blocked is a good idea. Maybe we should do a test run on Franck before moving to Bach.

Comment entered 2010-07-27 17:17:09 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-07-27 17:17:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::20

(In reply to comment #19)
> Maybe we should do a test run on Franck before moving to Bach.

Please note that we're currently unable to refresh the CDR database on FRANCK until the CDR_archive_versions database has been restored there.
A test run on FRANCK would need to be run with the current data which is at least a few weeks old.

Comment entered 2010-07-29 15:00:46 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-07-29 15:00:46
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::21

Adding Robin H as a cc.

Comment entered 2010-07-29 22:05:07 by alan

BZDATETIME::2010-07-29 22:05:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22

(In reply to comment #18)
...
> We can reduce the confusion caused by errors in blocked documents
> in several ways.
...

I've implemented and tested the change to ModifyDocs.py to show
blocked documents as blocked. Here's a sample from the log file
of a test run.

2010-07-29 21:52:58: Processing CDR0000062707
2010-07-29 21:53:10: Processing CDR0000062726
2010-07-29 21:53:14: Processing CDR0000062734
2010-07-29 21:53:18: Processing CDR0000062735 (BLOCKED)
2010-07-29 21:53:20: Processing CDR0000062736

I've been testing with this global change and also the one for
issue #4863. It looks okay, so I'll move it into production.
All global changes, from now on, will have this feature.
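
For reference, the log line format shown above could be produced by something like the following sketch. It is not the actual ModifyDocs.py code; the function name and the 'A'/'I' status argument (Active/Inactive, as in comment #18) are assumptions.

```python
from datetime import datetime


def processing_line(doc_id, active_status, now=None):
    """Build a 'Processing' log line, flagging blocked documents."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    # 'I' (Inactive) is assumed to mean the document is blocked.
    suffix = " (BLOCKED)" if active_status == "I" else ""
    return "%s: Processing CDR%010d%s" % (stamp, doc_id, suffix)


line = processing_line(62735, "I", datetime(2010, 7, 29, 21, 53, 18))
```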

Comment entered 2010-08-12 10:24:46 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-08-12 10:24:46
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::23

Are we ready to do a test run on Franck now that it has been refreshed?

Comment entered 2010-08-12 10:43:48 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-08-12 10:43:48
BZCOMMENTOR::Volker Englisch
BZCOMMENT::24

I suggest refreshing FRANCK again today with last night's data.
Then I will run the first summary publishing job after which Alan will run his global changes and I will move the filters to run the second summary publishing job in order to diff the results.

I will start the refresh in a few minutes.

Comment entered 2010-08-12 11:21:42 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-08-12 11:21:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::25

(In reply to comment #24)
> I will start the refresh in a few minutes.

FYI: The refresh of the CDR database on FRANCK is complete.

Comment entered 2010-08-12 11:30:49 by alan

BZDATETIME::2010-08-12 11:30:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26

(In reply to comment #25)
> (In reply to comment #24)
> > I will start the refresh in a few minutes.
>
> FYI: The refresh of the CDR database on FRANCK is complete.

I'll run this global and the global to insert PurposeText in
"live" mode on Franck, starting in a few minutes.

Comment entered 2010-08-12 15:58:41 by alan

BZDATETIME::2010-08-12 15:58:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27

I ran the global in live mode on Franck using data refreshed from
the last nightly backup on Bach. The log file is attached.

There were many more warnings than I expected. Here are some
statistics from the run:

Documents saved = 300
Docs with warnings = 59
Blocked docs = 38
Blocked w/warnings = 9

If I counted everything correctly, we had 50 unique, active
documents saved that generated warnings. That's a lot more than
I expected.

When I looked at this problem after the live run on Mahler I
checked about a dozen documents that generated warnings and, in
each case, either the documents were blocked, or the warnings
were for problems that were not present in the corresponding
documents on Bach. That lulled me into a false sense that
everything was okay. The false sense of security was further
increased by the fact that, when running in test mode, no errors
were reported. Had the global change caused valid documents to
become invalid, the test mode run should have reported that.

Well, it turns out that all of my tests and assumptions were
wrong. The global change did make many of the documents invalid.

I haven't analyzed all of the errors yet. However I am sure that
many of them involve missing cdr:ids, i.e., the ids of specific
parts of a document that are targets of links.

What happened was that a significant number of documents link to
cdr:ids in the Purpose section. When the Purpose section is
deleted by the global change, those links are broken. In some
cases, the broken link is a self link, i.e., a link to some place
in the purpose section of document A that comes from within
document A itself. In other cases they are links from another
document B to the Purpose section of document A.

When I checked for these errors on Bach after the run on Mahler,
they weren't there, of course, because the Purpose sections with
their cdr:ids were still present.
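
A check for this kind of breakage could be run before the deletion rather than after it. The sketch below is one possible approach, not something that exists in the CDR code: the namespace URI bound to the cdr: prefix, the function names, and the sample document (including its second section title) are all assumptions.

```python
import xml.etree.ElementTree as ET

CDR_NS = "cips.nci.nih.gov/cdr"  # assumed namespace URI for the cdr: prefix
CDR_ID = "{%s}id" % CDR_NS
LINK_ATTRS = ("{%s}href" % CDR_NS, "{%s}ref" % CDR_NS)


def purpose_fragment_ids(root, title="Purpose of This PDQ Summary"):
    """Collect every cdr:id that would vanish with the Purpose section."""
    ids = set()
    for section in root.iter("SummarySection"):
        t = section.find("Title")
        if t is not None and "".join(t.itertext()).strip() == title:
            # iter() includes the section element itself.
            for el in section.iter():
                if CDR_ID in el.attrib:
                    ids.add(el.get(CDR_ID))
    return ids


def links_into(root, target_doc_id, fragment_ids):
    """List (tag, link) pairs in `root` that point at a doomed fragment."""
    hits = []
    for el in root.iter():
        for attr in LINK_ATTRS:
            doc_id, _, frag = el.get(attr, "").partition("#")
            if doc_id == target_doc_id and frag in fragment_ids:
                hits.append((el.tag, el.get(attr)))
    return hits


# Made-up sample with a self link into the Purpose section.
doc_a = ET.fromstring(
    '<Summary xmlns:cdr="cips.nci.nih.gov/cdr">'
    '<SummarySection cdr:id="_110"><Title>Purpose of This PDQ Summary</Title>'
    '<Para cdr:id="_112">...</Para></SummarySection>'
    '<SummarySection cdr:id="_117"><Title>Changes to This Summary</Title>'
    '<Para cdr:id="_118"><SummaryFragmentRef cdr:href="CDR0000062735#_112">'
    'Purpose of This PDQ Summary</SummaryFragmentRef></Para></SummarySection>'
    '</Summary>'
)
doomed = purpose_fragment_ids(doc_a)
broken = links_into(doc_a, "CDR0000062735", doomed)
```

Running `links_into` over other documents with the same `doomed` set would find links from document B into document A in the same way.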

It looks like all of these need to be fixed before we can run the
global. Otherwise the documents will become unpublishable after
the global change and the last publishable versions from before
the change will continue to be published - something with which
our new publishing filters might have a problem.

As for why the test run didn't reveal this problem - I don't know
yet. I will do more research on that and figure out if there's a
bug in the way the core global change module does validation.

The tasks now are:

[ ] Analyze the errors.

I'll do that and post a report showing all of the documents
that are problematic and why.

[ ] Consider a global change to fix the problem.

This has a technical part, determining if this is practical
(I'm pretty sure it is) and what needs to be done to do it.
It also has a content part. Someone needs to see if
removing the fragment links programmatically is safe from a
content point of view, or whether it renders some of the
links nonsensical.

I'll do the first part but William, Robin or Margaret would
be better at handling the content part.

If they can't be done programmatically, i.e., if a human has
to inspect each one then:

[ ] Fix all the documents by hand.

That's a job for CIAT, using the report from the first
task.

[ ] Research the global change test mode validation.

I'll tackle that one.

[ ] Consider the impact on publishing filters.

If it takes longer than one week to fix all of the
documents, or maybe even if it takes longer than a couple of
days (that requires analysis), then when we publish
Summaries we will have some Summaries that failed validation
after the Purpose section was removed. When that happens,
the publishing program will pick up old versions that still
have a Purpose section.

To overcome that problem I think we must do the following:

[ ] Run the global to add PurposeText before we run the
global to remove Purpose sections.

Adding PurposeText does not render documents invalid.
If we do that first, then all documents with publishable
versions will have publishable versions with
PurposeText. If any are made invalid by deleting the
Purpose sections, it won't be versions without
PurposeText that will be publishable.

[ ] Determine whether the publishing filters should do
something special with documents that have Purpose
sections.

If we think that we won't fix all errors in one fell
swoop, then maybe Volker should modify the new version
of the filter to do something reasonable when both a
Purpose section and PurposeText exist.

There may be more tasks, but I have to go to a meeting right now
and I'll think more about this later.

Comment entered 2010-08-12 15:58:41 by alan

Attachment Request4838.log has been added with description: Log file for live run on Franck

Comment entered 2010-08-12 19:39:17 by alan

BZDATETIME::2010-08-12 19:39:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::28

(In reply to comment #27)
...
> The tasks now are:
>
> [ ] Analyze the errors.

There are four documents that generated warnings unrelated to the
Purpose section deletion. They are:

CDR0000062775 (BLOCKED):
Expected child elements for empty PDQBoard element of type PDQBoard
Analysis:
This one has an empty PDQBoard element, just as stated,
but it's a blocked document, so I didn't look further.

CDR0000062789:
No match found in content model for type ItemizedList with
child elements of ItemizedList element
(ListItem,ListItem,ListItem,ListItem,ListItem,ListItem,Comment);
stopped at element Comment
Analysis:
The last publishable version (#187) was created
6/25/2010. The global change transformed that
successfully and produced another valid version. Edits
made since then, but before the recent global change,
resulted in validation errors that are still there. But
they are not related to the global change.

The last non-publishable version had an invalid comment
inside deletion markup. That was what generated the
message after the global change. When the deletion is
applied, the problem will go away.

CDR0000062921:
No match found in content model for type SummarySection with child
elements of SummarySection element
(SectMetaData,Title,Para,Para,Para,Title,ItemizedList,
ItemizedList,Para,...); stopped at element Title
Analysis:
Another validation error in deleted text in the last
non-publishable version. It will be corrected when the
deletion is applied.

CDR0000062927:
/Summary/SummarySection[7]/SummarySection[5]/OrderedList[12]
/ListItem[9]: This element must have text content.
Analysis:
This one is missing required text content in an element,
but the text is there inside insertion markup. Again,
when the change markup is applied, the error will go
away.

I think that the above four errors are insignificant.

All the rest of the errors are cdr:id related. I'll analyze them
next.

Comment entered 2010-08-13 00:46:07 by alan

BZDATETIME::2010-08-13 00:46:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::29

I've done some analysis of the fragment ID errors.

The attached Excel spreadsheet shows all of the errors that need
to be fixed. I created it by writing a program that processed
the Warning statements in the log file from the live run on
Franck. There are three columns:

Source CDR ID:
This is the document that links to a target fragment in
the Purpose section of a Summary.

If the Source and Target CDR IDs are the same, it is an
internal link, i.e., a link from one place in the
document to another place in the Purpose section of the
same document.

Target CDR ID:
This is the document from which the Purpose section was
removed.

Fragment ID:
This is the cdr:id that was inside the Purpose section
but has now disappeared because the Purpose section is
gone.

There were a total of 141 warnings in the global change logfile,
but many are duplicates with respect to this spreadsheet. For
example, if document A links to document B#_123 three times, the
error is reported three times in the log file. If A and B are
both Summaries (they probably are), it can be reported six times,
three times when validating A and three times when validating B.
I've eliminated all of these duplicates in the spreadsheet.
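
The de-duplication can be pictured with the short sketch below. It is not the program that produced the spreadsheet: it assumes the warnings have already been parsed into (source, target, fragment) tuples, and the IDs in the sample are made up.

```python
def summarize(rows):
    """De-duplicate (source, target, fragment) rows and classify sources."""
    unique = sorted(set(rows))
    # A row is internal when the document links into its own Purpose section.
    internal = {src for src, tgt, _ in unique if src == tgt}
    external = {src for src, tgt, _ in unique if src != tgt}
    return unique, internal | external, external, internal


# A links to B#_123 three times; the log repeats the same warning.
rows = [
    ("CDR0000000001", "CDR0000000002", "_123"),
    ("CDR0000000001", "CDR0000000002", "_123"),
    ("CDR0000000001", "CDR0000000002", "_123"),
    ("CDR0000000002", "CDR0000000002", "_5"),
]
unique, to_fix, external, internal = summarize(rows)
```

A document can appear in both the external and the internal set, which is why the two counts can add up to more than the number of documents to fix.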

There are a total of 35 documents that need to be fixed.

34 of those contain external links, i.e., links from the
document to another document.

22 contain internal links.

Some of these documents may be blocked and not in need of fixing,
but I don't have a count of those at this time.

We might want to completely exclude them from the global change
because, although they contain Purpose sections, we have not
written a PurposeText for them and, if we remove the Purpose
sections, they become purposeless (in a manner of speaking.)

I can write a global change to fix everything, but it's tricky.
For external links, I can just delete the fragment IDs, e.g.,

Change: cdr:href='CDR0000012345#_123'

To: cdr:href='CDR0000012345'

But for internal links, we probably want to completely delete the
link, leaving all related text, for example:

<Para cdr:id="_118"><Strong><SummaryFragmentRef
cdr:href="CDR0000062735#_112">Purpose of This PDQ
Summary</SummaryFragmentRef></Strong></Para><Para
cdr:id="_119 ">Added this new
section.</Para>

Becomes:

<Para cdr:id="_118"><Strong>Purpose of This PDQ
Summary</Strong></Para><Para cdr:id="_119 ">Added this
new section.</Para>

The "<SummaryFragmentRef>" wrapper around the text "Purpose
of This PDQ Summary" has been removed.

However in some cases, maybe including the example above, it
might be better to delete the two paragraphs that I've quoted.

I am inclined to think that these should all be done by hand.
However, doing it by hand in XMetal is problematic. Although
there are only 35 documents to be fixed, most of them have CWD's,
last versions, and last publishable versions, all of which are
different. At least the CWD and last publishable version would
need to be fixed, giving 70 documents that need editing.

Fixing them is not trivial either. We can't even search the docs
for attributes in XMetal except in text mode, which everyone
rightly stays away from, and we can't use the CDR validation
error locator because the documents would not have any errors
until we transform them.

It's a real headache.

At this point, I'll quit for the night and let us all think about
the problem.
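
The internal-link rewrite shown in this comment (dropping the SummaryFragmentRef wrapper but keeping its text) might look like the following sketch. The namespace URI, the function names, and the restriction to links that contain only text are assumptions, and this is not the fixup program itself.

```python
import xml.etree.ElementTree as ET

CDR_NS = "cips.nci.nih.gov/cdr"  # assumed namespace URI for the cdr: prefix
HREF = "{%s}href" % CDR_NS


def unwrap_text_only(parent, child):
    """Remove child's tags, leaving its text and tail in the same place."""
    kept = (child.text or "") + (child.tail or "")
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + kept
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + kept
    parent.remove(child)


def unwrap_internal_links(root, doc_id, dead_fragments):
    """Unwrap self links whose target fragment no longer exists."""
    count = 0
    for parent in list(root.iter()):
        for child in list(parent):
            # Links with child elements are skipped in this simple sketch.
            if child.tag != "SummaryFragmentRef" or len(child):
                continue
            target, _, frag = child.get(HREF, "").partition("#")
            if target == doc_id and frag in dead_fragments:
                unwrap_text_only(parent, child)
                count += 1
    return count


root = ET.fromstring(
    '<Summary xmlns:cdr="cips.nci.nih.gov/cdr">'
    '<Para cdr:id="_118"><Strong><SummaryFragmentRef '
    'cdr:href="CDR0000062735#_112">Purpose of This PDQ Summary'
    '</SummaryFragmentRef></Strong></Para>'
    '<Para cdr:id="_119">Added this new section.</Para></Summary>'
)
count = unwrap_internal_links(root, "CDR0000062735", {"_112"})
```

Whether the surrounding paragraphs should also go is a content decision that a sketch like this cannot make.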

Comment entered 2010-08-13 00:46:07 by alan

Attachment DocsRequiringFragmentFix.xls has been added with description: Documents that contain errors spreadsheet

Comment entered 2010-08-17 10:37:41 by alan

BZDATETIME::2010-08-17 10:37:41
BZCOMMENTOR::Alan Meyer
BZCOMMENT::30

I've thought about the problem.

My thinking is that this is hard to solve by hand editing of the
documents for two reasons:

1. We need to fix up to three versions of each document, the
last publishable version, the last version, and the CWD.

Working with XMetal, it is only possible to edit the CWD. To
do the others, a user would have to do the following:

Find the version numbers for last and last pub versions.

Save the CWD as a version.

Recall the oldest of the last and last pub version.

Edit and save it.

Recall the newer of the last and last pub versions.

Edit and save it.

Recall the old CWD.

Edit and save it.

That's a tricky thing for a user to manage by hand and is not
a common CDR editing task.

2. Editing the docs in XMetal is difficult.

It is only possible to search for an attribute value in text
mode.

After finding the attribute, the user must then either edit
in text mode, or switch back, edit in the attribute
inspector, then go back to text mode again. He might also
have to fiddle with the text mode line break undelete
problem.

He's got to do this for up to three versions of each
document, some of which can have several references that need
fixing.

Instead of all that, I propose to write a global change to do the
following:

For each document in our error list:

If the document refers to a fragment that has disappeared
in a different document:

Change the document by deleting the fragment portion
of the link.

Else (the document refers to a fragment that has
disappeared in itself, the same document - probably a
change log entry):

For a cdr:ref

Delete the entire link.

For a cdr:href

Delete all the markup. See the example in
comment #29.

I propose to run the global on Franck, in test mode and have
someone inspect the diff outputs. This will need some careful
inspection, but it should be easier than inspecting the docs in
XMetal because the diffs are highlighted.

We may find that everything is fine except for one or a few
documents for which this doesn't work well. In that case, those
docs can be edited by hand.

Only if the whole thing is a mess should we give it up and edit
everything by hand. I don't personally think that will happen.

I'm starting on the global change now and hope to have it working
and run in test mode before I leave today.

Comment entered 2010-08-18 00:13:22 by alan

BZDATETIME::2010-08-18 00:13:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::31

The deeper I get into the problem the more little twists and
turns I discover.

I think the plan outlined in comment #30 mostly worked, but there
were some minor complications. I had to not only drop the
fragment ID from external references (i.e., SummaryFragmentRef
from doc A to B#_whatever), but also change the element name:

From: SummaryFragmentRef
To: SummaryRef
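
As a rough sketch of that external-link fix (fragment dropped, element renamed), under the same caveats as any illustration here: the namespace URI, the function name, and the sample IDs are assumptions, not the fixup program's actual code.

```python
import xml.etree.ElementTree as ET

CDR_NS = "cips.nci.nih.gov/cdr"  # assumed namespace URI for the cdr: prefix
HREF = "{%s}href" % CDR_NS


def fix_external_links(root, own_doc_id, dead):
    """Retarget links to vanished fragments in other documents.

    `dead` is a set of (target_doc_id, fragment_id) pairs.
    """
    changed = 0
    for el in list(root.iter("SummaryFragmentRef")):
        target, _, frag = el.get(HREF, "").partition("#")
        if target != own_doc_id and (target, frag) in dead:
            el.set(HREF, target)      # drop the '#_nnn' fragment
            el.tag = "SummaryRef"     # a link to a whole Summary
            changed += 1
    return changed


root = ET.fromstring(
    '<Summary xmlns:cdr="cips.nci.nih.gov/cdr"><Para>See '
    '<SummaryFragmentRef cdr:href="CDR0000012345#_123">another summary'
    '</SummaryFragmentRef>.</Para></Summary>'
)
changed = fix_external_links(root, "CDR0000099999", {("CDR0000012345", "_123")})
```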

I also discovered that errors sometimes vary between versions.
For example, a CWD might have recently introduced fragment
references into the Purpose section that are not in the
publishable version.

Finally, I discovered something that I probably knew at one time
but had forgotten. In the global change report that has lines
like:

CDR ID         Ver.   Files         CWD Size  Diff size
CDR0000062875  CWD    Old New Diff     44964       1658
               LASTV  Old New Diff     44964       1658
               LASTP  Old New Diff     44964       1658

the last column, "Diff size" is only the size of the diff for the
CWD. At a minimum, the column header should say "CWD Diff Size",
but it would be much better to re-write the report so that it
showed actual file and diff sizes for each of the three possible
versions, while still sorting only by CWD size or Diff size.

I looked at what it might take to fix that, but it was
non-trivial. Since that's not in our critical path for this
task, I'm ignoring it for now.
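
For whenever that report is revisited, a per-version size calculation could be as simple as the sketch below. It uses Python's difflib rather than whatever the report currently uses to compute diffs, and the function names, the row layout, and the choice to report the size of the new text are assumptions.

```python
import difflib


def diff_size(old, new):
    """Size in characters of a unified diff between two texts."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        "old",
        "new",
    )
    return len("".join(lines))


def report_rows(doc_id, versions):
    """versions maps 'CWD', 'LASTV', 'LASTP' to (old_text, new_text)."""
    rows = []
    for label in ("CWD", "LASTV", "LASTP"):
        if label in versions:
            old, new = versions[label]
            rows.append((doc_id, label, len(new), diff_size(old, new)))
    return rows


rows = report_rows(
    "CDR0000062875",
    {
        "CWD": ("<a>\n<b/>\n</a>\n", "<a>\n</a>\n"),
        "LASTP": ("<a>\n</a>\n", "<a>\n</a>\n"),
    },
)
```

Sorting could still use only the CWD row of each document, as suggested above.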

So it's necessary for someone to inspect all of the diffs and
not be fooled by the fact that the sizes make it look like
they're all the same. If the diffs look okay, we're in good
shape. The ones I looked at all looked good to me.

I ran the global to fix references on Franck. The log file is
attached. Some of the entries in the log file show warnings of
the form:

... WARNING: CDR0000062875#_102 not found in doc 62875

That's not a validation warning, it's a warning that my fixup
program issued. It's actually a sort of anti-warning. It's
telling me that an error existed in one version of a document and
was recorded in the spreadsheet attached to comment #29, but
warning me that the same error was NOT found in another version.

Since our goal here is to drain the swamp rather than battle the
alligators, I'll leave the warning in place.

Our next steps, I think are:

1. Have Volker refresh Franck.

2. I will run the fixup program global.

3. I will run the PurposeText addition global for English and
Spanish.

4. I will run the Purpose section deletion global.

In order to get a head start on this, I can run the globals from
home tomorrow.

I'm hoping that the global will do all the fixup work for us.
There may be some errors that don't get corrected but I'm hoping
there will be only a few at most.

After we run with a new Franck refresh from Bach I'll look for
any errors in the log files that indicate there were errors in
Bach data that didn't get copied over in last week's refresh. If
there are, I'll add rows to the spreadsheet so that we can fix
them all automatically when we run on Bach.

If we're lucky, there won't be any manual cleanup required.

Comment entered 2010-08-18 00:13:22 by alan

Attachment Request4838FixRefs.log has been added with description: Log file from error fixup test global on Franck.

Comment entered 2010-08-18 19:12:09 by alan

BZDATETIME::2010-08-18 19:12:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::32

I did the following so far:

1. Fixed the spreadsheet to eliminate all blocked documents.

2. Re-ran the global fixup program in test mode to fix broken
references to fragments inside the Purpose sections, for
active documents only.

3. Looked at the results.

4. Modified the global to delete Purpose sections to not include
blocked documents. Ran it in test mode. Unlocked all the
locked documents. Ran it in live mode.

Six documents still showed errors. I tracked those down.

62687
This one was not in the spreadsheet. It appears that
Louise Mullican made changes on Bach over the last few
days, after we transferred the data to Franck last week.

I'll re-run the fixup on this document.

62789
This was unrelated to the fixup problem. See Comment
#28.

62921
See comment #28.

62927
See comment #28.

62938
Oops. My mistake. It looks like I dropped this from the
spreadsheet by accident when I dropped blocked documents.
I'll add it back to the spreadsheet and re-run.

256677
Looks like another recent edit, like 62687.

I'll add it to the spreadsheet and re-run.

For my re-run, I'll both add the docs to the spreadsheet for the
run on Franck, and make a separate small spreadsheet to just
re-run the fixup on the above three docs on Franck.

I'll report back again when more is done.

Comment entered 2010-08-18 22:37:49 by alan

BZDATETIME::2010-08-18 22:37:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::33

It looks like there were some more nasty little twists. I haven't
tracked it all down, but I think in one document, 62938, the identity
of the section marked _118 changed between versions. An automated fix
would delete the right href in one document, but the wrong one in
another.
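
A rough sketch of how one could test whether a fragment ID keeps the
same meaning across versions, by comparing the Title of the
SummarySection that carries it. This is illustrative only, not the
code used here. The helper names and the plain "id" attribute are
assumptions.

```python
import xml.etree.ElementTree as ET


def section_title(xml_text, fragment_id):
    """Return the Title of the SummarySection with this id, or None."""
    root = ET.fromstring(xml_text)
    for section in root.iter("SummarySection"):
        # Assumption: the fragment ID is stored in a plain "id" attribute.
        if section.get("id") == fragment_id:
            return (section.findtext("Title") or "").strip()
    return None


def id_is_stable(versions, fragment_id):
    """True if the ID names the same section title in every version."""
    titles = {section_title(v, fragment_id) for v in versions}
    return len(titles) == 1
```

A document for which id_is_stable returns False would be a candidate
for a fix by hand rather than an automated one.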

I'll go over all the versions carefully tomorrow and fix that
document by hand if necessary on Bach.

After I've figured that out, I'll run the last part of the global
change, the insertion of the PurposeText.

Comment entered 2010-08-18 22:40:26 by alan

BZDATETIME::2010-08-18 22:40:26
BZCOMMENTOR::Alan Meyer
BZCOMMENT::34

Also, 256677 looked like it was one of the errors for which our
fixup might be needed, but it's not. It's a different error that
will be caught and fixed the next time the document is edited. It
may be one of those in-progress edits that will be fixed routinely
as the document is made ready for publication.

I did not put it on the spreadsheet.

Comment entered 2010-08-19 17:25:01 by alan

BZDATETIME::2010-08-19 17:25:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::35

After more analysis of the errors, it turns out that documents
62687 and 62938 do not need to be "fixed" by my fixup
program. I've dropped them from the spreadsheet that controls
the fixup program. I believe that all of the errors reported in
comment #32 are now fully accounted for.

Assuming no more references are created to the Purpose section on
Bach, I think the fixup program will make everything come out
correctly in the global change on Bach. Any errors reported by
the globals should be genuine errors that should be reported and
are not caused by the global change.

At any rate, that's my current thinking on the matter.

A test mode run on Bach will not tell us much. We already know
that a lot of errors will occur unless we run the fixup in live
mode.

As far as I can tell, my programs are ready to run live on Bach
whenever Volker is ready. The live mode deletion of Purpose
sections should report the same six documents with errors that
were reported in comment #32. If any additional errors are
produced, I'll investigate and, probably, fix them by hand.

Before running this global it would be a good idea to ask the
users to check in any Summaries they are working on so that none
will be locked during the run.

I tentatively propose to run the globals on Tuesday night next
week, and suggest that we ask all users to check in all Summaries
before 5 or 6 pm on Tuesday evening.

Comment entered 2010-08-27 00:24:19 by alan

BZDATETIME::2010-08-27 00:24:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::36

I've completed all of the globals on Bach.

A number of warnings and validation errors were generated
involving the following documents:

CDR0000062762
CDR0000062789
CDR0000062833
CDR0000062835
CDR0000062875
CDR0000062910
CDR0000062921
CDR0000062927
CDR0000062929
CDR0000062938
CDR0000641243

Two of them were locked and couldn't be processed. Some others
had validation errors that had nothing to do with the globals.
Some had warnings generated by my program because it tried to fix
references to Purpose sections that would go away, but the
references existed in some versions but not others, so those
aren't validation errors, just notes to users. Some may have
been caused by edits to the documents after I identified the ones
needing fixing - though I suspect there aren't any of those.

I've created two log files.

One is a compendium of the raw logs for the four globals that
were run, namely:

Fix references
Remove Purpose sections
Add English PurposeText
Add Spanish PurposeText

The other one is a whittled down version of the compendium. I
tried to only include statistics and warning messages.
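
As an illustration of that kind of reduction, here is a small sketch
that keeps only the warning lines from a raw log and collects the CDR
IDs they mention. It assumes that warning lines contain the word
"WARNING", as in the message quoted earlier in this ticket. The layout
of the statistics lines is not shown in this ticket, so they are not
handled. The function name is hypothetical.

```python
import re

# CDR IDs appear as "CDR" followed by ten digits, e.g. CDR0000062875.
CDR_ID = re.compile(r"CDR\d{10}")


def extract_warnings(log_lines):
    """Return (warning_lines, sorted_unique_cdr_ids) from raw log lines."""
    kept = [line.rstrip("\n") for line in log_lines if "WARNING" in line]
    ids = sorted({cdr_id for line in kept for cdr_id in CDR_ID.findall(line)})
    return kept, ids
```

The list of IDs gives a quick way to search the full log, or this
ticket, for each document that produced a warning.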

It would be ideal if all of the warnings could be resolved before
publishing Friday night. There may not need to be any actual
changes. If Volker runs a test publishing job it might show that
everything is okay. But if not, the warning messages and, in
some cases, the documents may need to be checked. A review of
the checking I posted in previous comments in the Bugzilla record
may resolve all of them, or maybe not.

I'm not going to go over them tonight. It's too late for me to
do that properly, but I can help if needed tomorrow. Send me
email or call me at home at 410-833-5979.

The summary log is attached to this comment. I'll attach the
full compendium to the next one.

Comment entered 2010-08-27 00:24:19 by alan

Attachment SummaryPurposeWarnings.log has been added with description: Log file extract of warnings and stats from live run on Bach

Comment entered 2010-08-27 00:25:10 by alan

BZDATETIME::2010-08-27 00:25:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::37

This is the compendium referenced in the previous comment.

Comment entered 2010-08-27 00:25:10 by alan

Attachment SummaryPurpose.log has been added with description: Full log files from all four globals

Comment entered 2010-08-27 00:25:47 by alan

BZDATETIME::2010-08-27 00:25:47
BZCOMMENTOR::Alan Meyer
BZCOMMENT::38

Marking this resolved-fixed.

Comment entered 2010-08-27 12:33:38 by alan

BZDATETIME::2010-08-27 12:33:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::39

(In reply to comment #36)
>
> Two of them were locked and couldn't be processed. Some others
> had validation errors that had nothing to do with the globals.
> Some had warnings generated by my program because it tried to fix
> references to Purpose sections that would go away, but the
> references existed in some versions but not others, so those
> aren't validation errors, just notes to users. Some may have
> been caused by edits to the documents after I identified the ones
> needing fixing - though I suspect there aren't any of those.

We might want to fix the two documents that were locked by hand.
They are: 62762 and 62833. Those were not transformed and should
have been.

To fix them by hand, delete the Purpose section and add the PurposeText
from the English spreadsheet.

I could fix them programmatically, but I think it would be faster
to just do them by hand.

Comment entered 2010-08-27 14:00:05 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:00:05
BZCOMMENTOR::Volker Englisch
BZCOMMENT::40

(In reply to comment #39)
> We might want to fix the two documents that were locked by hand.
> They are: 62762 and 62833. Those were not transformed and should
> have been.

Is CIAT going to make the change or should I?

Comment entered 2010-08-27 14:01:29 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-08-27 14:01:29
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::41

What changes need to be made? Is it everything that the globals would have done?

Comment entered 2010-08-27 14:03:08 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:03:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::42

It's the deletion of the Purpose section and adding the PurposeText from the English spreadsheet.

Comment entered 2010-08-27 14:04:06 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:04:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::43

(In reply to comment #40)
> (In reply to comment #39)
> > We might want to fix the two documents that were locked by hand.
> > They are: 62762 and 62833. Those were not transformed and should
> > have been.
>
> Is CIAT going to make the change or should I?

CIAT is fixing them and looking at all the warning messages from the global.

Comment entered 2010-08-27 14:10:03 by alan

BZDATETIME::2010-08-27 14:10:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::44

(In reply to comment #43)

> CIAT is fixing them and looking at all the warning messages from the global.

William,

Before driving yourselves nuts, be sure to search this Bugzilla issue
page for any CDR ID that generated a warning.

I've already driven myself nuts in the past trying to track all of
these down. Hopefully, that effort will make things easier for
anyone at CIAT who is doing the same thing.

Comment entered 2010-08-27 14:41:29 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:41:29
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::45

Thanks Alan, yes I saw the cdr ids you posted but in order to get the actual warnings, I had to look at the logs. I used the cdr ids you posted to search the logs so I did not drive myself nuts :-). But I couldn't figure out what the messages were for these two included in the list: CDR0000062835 CDR0000062929

Comment entered 2010-08-27 14:42:12 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:42:12
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::46

(In reply to comment #43)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > We might want to fix the two documents that were locked by hand.
> > > They are: 62762 and 62833. Those were not transformed and should
> > > have been.
> >
> > Is CIAT going to make the change or should I?
>
> CIAT is fixing them and looking at all the warning messages from the global.

The two summaries have been fixed.

Comment entered 2010-08-27 14:45:43 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-08-27 14:45:43
BZCOMMENTOR::Volker Englisch
BZCOMMENT::47

(In reply to comment #45)
> But I couldn't figure out what the
> messages were for these two included in the list: CDR0000062835 CDR0000062929

I can only come up with a 'copy/paste' error.
I don't think these were supposed to be listed as problems.

Comment entered 2010-08-27 14:49:50 by alan

BZDATETIME::2010-08-27 14:49:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::48

(In reply to comment #47)
> (In reply to comment #45)
> > But I couldn't figure out what the
> > messages were for these two included in the list: CDR0000062835 CDR0000062929
>
> I can only come up with a 'copy/paste' error.
> I don't think these were supposed to be listed as problems.

It was a test.

William and Volker passed.

Alan, however, flunked.

Comment entered 2010-08-30 11:22:55 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-08-30 11:22:55
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::49

Everything live on Cancer.gov. Issue closed.

Attachments
File Name Posted User
DocsRequiringFragmentFix.xls 2010-08-13 00:46:07
Request4838.log 2010-08-12 15:58:41
Request4838.log 2010-07-22 20:04:00
Request4838.log 2010-07-08 21:21:48
Request4838FixRefs.log 2010-08-18 00:13:22
SummaryPurpose.log 2010-08-27 00:25:10
SummaryPurposeWarnings.log 2010-08-27 00:24:19
