CDR Tickets

Issue Number 3056
Summary Change in logic for pulling documents from cancer.gov
Created 2009-12-31 14:30:22
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2010-09-22 13:08:32
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107384
Description

BZISSUE::4732
BZDATETIME::2009-12-31 14:30:22
BZCREATOR::Bob Kline
BZASSIGNEE::Volker Englisch
BZQACONTACT::Alan Meyer

At some time in the past, the publication logic for determining when documents should be pulled from cancer.gov changed. We used to send 'drop' commands to CG for all the documents which weren't picked up by the publishing control document's query for a given document type. Now we do it for documents which are in the pub_proc_cg table but are blocked from publication. This means, for example, that while in the past setting the status of a glossary term name was sufficient to get it out of the dictionary, it is now necessary to set the document's active_status to 'I' (inactive), or to send a hotfix-remove job to CG by hand. When you get back in town, please take the lead in discussing this with the users, to make sure they're aware of this change, and to determine whether the current logic represents the optimal approach. This will affect the new software to publish genetics professional documents, for which blocking the Person document may be an unacceptable solution. I'm CC'ing a number of the users so they'll be aware of the issue right away.

Comment entered 2010-01-19 17:54:45 by alan

BZDATETIME::2010-01-19 17:54:45
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

(In reply to comment #0)
> ... We used to send 'drop' commands to CG for all the
> documents which weren't picked up by the publishing control
> document's query for a given document type....

Reading the code and looking at the publishing control document,
I see at least two reasons why this change needs to be retained.

One is that if the NumDocsPerDocType publishing job parameter is
used in a publishing job, it could result in a disastrous removal
of perfectly good documents from cancer.gov. For example, if the
operator specifies publishing a maximum of 1,000 protocols, the
old code would have published up to that maximum (actually much
less, see the next comment below), and removed all of the rest
from cancer.gov. This was the motivation for the change made on
2007-05-22 in SVN version 7420.

Another is that a change to the publishing document made on
2007-08-14 relies on the new behavior. On that date we
introduced new queries in support of nightly publishing that
selected new, never published documents. Six new queries were
introduced that do this, plus one that is commented out. Without
the current behavior, all existing documents would be removed
from cancer.gov when the new docs were added.

For these reasons, I believe that I should assume we want to keep
the current logic, not return to the old logic, and instead
decide how to handle the documents that used to be deleted but
are not now.

One final issue with the old removal selection technique
relates to the handling of the MaxDocUpdatedDate publishing job
parameter. That parameter, introduced after the change in the
removal logic, allows a user to publish all docs as they were as
of a particular date. Under the old code, a document not
selected for publication (because it was created after the
MaxDocUpdatedDate) would be removed from cancer.gov. The old
software might actually have been correct in that case and better
than the new - though we have never used MaxDocUpdatedDate for
any purpose other than to freeze the view of the database by the
publishing program as of the start of its run. I don't know that
we'd ever want to reach into the past, though I'm not aware of
any logic preventing it.

Comment entered 2010-01-19 18:06:05 by alan

BZDATETIME::2010-01-19 18:06:05
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

(In reply to comment #1)

> ... if the
> operator specifies publishing a maximum of 1,000 protocols, the
> old code would have published up to that maximum (actually much
> less, see the next comment below) ...

The NumDocsPerDocType publishing job parameter is used to limit
the number of documents selected for publishing, not the number
actually published. The number actually published may be much
less since the selection is made first and then denormalization
is done to find out which of the selected docs have actually
changed.

For example, assume:

cancer.gov requests that we limit our protocol publishing to,
say, 2,000 documents.

We specify 2,000 in the NumDocsPerDocType publishing job
parameter.

The program selects 2,000 protocols from the database.

The 2,000 are passed through the publishing filters to get
denormalized export forms.

The 2,000 are compared to the documents with the same IDs
already on cancer.gov. Perhaps only 100 have changed.

It is possible that if we had filtered all protocols and
compared them to the cancer.gov versions, more than 100 but
less than 2,000 would have changed.

So although cancer.gov said it could accept up to 2,000
documents, and a full publishing of all would have been
successful, we wind up publishing a much smaller number.

This is not necessarily a bug. I don't remember why the
NumDocsPerDocType parameter was created. It might have been more
for debugging than anything else. But we should note that it's
not an ideal way to throttle the number of documents sent to
cancer.gov.
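
A minimal sketch of the effect described in this comment, assuming a toy in-memory "database" in place of the real publishing tables. The function name, the document contents, and the sort-then-truncate selection are all hypothetical; the point is only that the limit applies to selection, so the number pushed can be smaller than a full run would have produced.

```python
# Hypothetical sketch: names and data are illustrative, not the CDR code.

def publish(candidates, on_cancer_gov, export, limit):
    """Select up to `limit` docs, export them, and return the changed ones."""
    selected = sorted(candidates)[:limit]            # the limit applies here
    exported = {doc_id: export(doc_id) for doc_id in selected}
    # Only docs whose exported form differs from the published copy are pushed.
    return [d for d, xml in exported.items() if on_cancer_gov.get(d) != xml]

# Ten docs in the database; docs 3 and 9 differ from what is on cancer.gov.
current = {
    i: f"<doc id='{i}' rev='2'/>" if i in (3, 9) else f"<doc id='{i}'/>"
    for i in range(1, 11)
}
published = {i: f"<doc id='{i}'/>" for i in range(1, 11)}

pushed_limited = publish(current, published, current.get, limit=5)
pushed_full = publish(current, published, current.get, limit=10)
print(pushed_limited)  # [3]
print(pushed_full)     # [3, 9]
```

With the limit at 5, only one changed document is pushed even though two would have been accepted.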

Comment entered 2010-01-19 18:14:33 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-19 18:14:33
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

(In reply to comment #2)
> The NumDocsPerDocType publishing job parameter is used to limit
> the number of documents selected for publishing, not the number
> actually published. The number actually published may be much
> less

I look at this as the number of documents published/created in the file system which is identical to the number of vendor documents created.
Obviously, of those documents created only the ones that changed will be pushed to Cancer.gov.

> I don't remember why the NumDocsPerDocType parameter was created.

I don't know why it was created but it is used very often for testing any filter changes or changes to the publishing software so that we don't have to wait for 8 hours to simulate a publishing job.

Comment entered 2010-01-19 23:54:08 by alan

BZDATETIME::2010-01-19 23:54:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I have worked through the entire publishing system control
document and read the selection queries for each of our
publishing subsets.

Most of the queries are generic. They simply select documents
based on the document type. A few are more complicated, looking
at the query_term table in order to restrict documents to those
with specific values.

Here is what I found about such document selections, and what I
would think we need to do to use the query_term values in
removals, if we decide to do that:

1. GlossaryTermName.

Selection:

Only those GlossaryTermNames are selected that have a
/GlossaryTermName/TermNameStatus in:

Approved
Revision pending

Other values are not selected.

Removal:

We should remove any that are on cancer.gov and now have
other TermNameStatus values.

2. Term.

Selection:

Only those Term documents are selected that have
/Term/TermStatus in:

Reviewed-Problematic
Reviewed-Retain
Unreviewed

Removal:

We should remove any that are on cancer.gov and now have
other TermStatus values.

3. InScopeProtocol.

There are two types of selection queries, depending on the
protocol status.

Selection:

Open protocols:

/InScopeProtocol/ProtocolAdminInfo/CurrentProtocolStatus
in:

Active
Approved-not yet active
Enrolling by invitation

Closed protocols:

CurrentProtocolStatus in:

Closed
Completed
Temporarily closed

Removal:

Only one removal selection is desirable. It should remove
any InScopeProtocols on cancer.gov that no longer have one
of the six status values enumerated above.

4. Person.

Selection:

All documents with /Person/Status/CurrentStatus = Active

OR with

CurrentStatus = Inactive

AND

there is a cdr:ref or cdr:href that links to the
Person.

Removal:

We should remove any that are on cancer.gov, are now
Inactive and are not linked to by any other documents.

5. Organization.

Selection:

These are exactly like Person except that the query_term
path is /Organization/Status/CurrentStatus.

Removal:

As for Persons.

6. GeneticsProfessional.

Selection:

Only those documents are selected with:

/Person/ProfessionalInformation/GeneticsProfessionalDetails
/AdministrativeInformation/Directory/Include

= 'Include'

Removal:

We should remove any that do not have a Directory/Include
element with a value = 'Include'.

I will write removal selection queries for each of the above and
run them, just the queries, not the removals, to find out how
many documents of each type are involved.
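
As an illustration of what one such removal selection might look like, here is a sketch for the GlossaryTermName case. It runs against an in-memory SQLite database as a stand-in for the production database, and the two tables are reduced to assumed, simplified columns. The status value "Rejected" is made up for the example.

```python
import sqlite3

# Assumed, simplified schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pub_proc_cg (id INTEGER PRIMARY KEY);
    CREATE TABLE query_term (doc_id INTEGER, path TEXT, value TEXT);
""")
conn.executemany("INSERT INTO pub_proc_cg (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO query_term (doc_id, path, value) VALUES (?, ?, ?)",
    [
        (1, "/GlossaryTermName/TermNameStatus", "Approved"),
        (2, "/GlossaryTermName/TermNameStatus", "Revision pending"),
        (3, "/GlossaryTermName/TermNameStatus", "Rejected"),  # made-up value
    ],
)

PATH = "/GlossaryTermName/TermNameStatus"
PUBLISHABLE = ("Approved", "Revision pending")

def glossary_removals(conn):
    """Return IDs on cancer.gov whose status is no longer publishable."""
    placeholders = ", ".join("?" for _ in PUBLISHABLE)
    rows = conn.execute(
        f"""
        SELECT c.id
          FROM pub_proc_cg c
          JOIN query_term q ON q.doc_id = c.id
         WHERE q.path = ?
           AND q.value NOT IN ({placeholders})
         ORDER BY c.id
        """,
        (PATH, *PUBLISHABLE),
    ).fetchall()
    return [row[0] for row in rows]

removals = glossary_removals(conn)
print(removals)  # [3]
```

The same shape would apply to Term and InScopeProtocol with their own paths and status lists.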

Comment entered 2010-01-20 00:32:52 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-01-20 00:32:52
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

(In reply to comment #4)
Alan, let's coordinate any changes you're about to make to the publishing document because I have many changes currently in my sandbox and on FRANCK.

> 4. Person.

This is one of the changes: We're not publishing Persons anymore. Any person document published in the future will be a GenProf document.

> 5. Organization.

Lakshmi is planning to stop publishing Organizations with the next licensee update that's adding new elements to the protocol DTD.

Comment entered 2010-01-20 15:53:34 by alan

BZDATETIME::2010-01-20 15:53:34
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

My inclination at this time is to make a modification to the
publishing program removal logic as follows:

Pull inactive documents from cancer.gov, as now.

Then check the publishing document to see if there is a
selection query for additional documents to pull. If so,
execute the selection query and pull any docs that it
identifies.

I don't think this would be a very hard change, and the number
of selection queries we'd need would be limited. It would
give us maximum flexibility in how to define the documents to
be pulled from cancer.gov.
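
A rough sketch of that two-step removal logic, assuming the optional selection query is represented by a plain callable. The function and parameter names are hypothetical and do not come from the publishing program.

```python
def select_removals(blocked_on_cg, extra_removal_query=None):
    """Return the sorted doc IDs to pull from cancer.gov.

    blocked_on_cg: IDs that are on cancer.gov but blocked (pulled as now).
    extra_removal_query: optional callable standing in for a removal
        selection query defined in the publishing control document.
    """
    to_pull = set(blocked_on_cg)
    if extra_removal_query is not None:
        # Only subsets that define a removal query pull additional docs.
        to_pull |= set(extra_removal_query())
    return sorted(to_pull)

only_blocked = select_removals([5, 2])
with_query = select_removals([5], lambda: [7, 5])
print(only_blocked)  # [2, 5]
print(with_query)    # [5, 7]
```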

Comment entered 2010-03-02 23:54:08 by alan

BZDATETIME::2010-03-02 23:54:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

I have looked at the problem of re-instating the old method of
selecting documents for removal from cancer.gov. Is there a way
to do it safely?

I think the answer is, Yes. But I think we would have to make
some changes in our job parameters and carefully document the
limitations.

The following publishing job parameters are problematic if we
reinstate the old method of selecting removals:

NumDocsPerDocType
NumDocs

and maybe also:

MaxDocUpdatedDate

The NumDocs... issue is described in comment #2. What was said
there about NumDocsPerDocType also applies to NumDocs. Either
parameter might be used in a SELECT TOP ?NumDocs...? query to
limit the retrievals. The existing values we see for these in
our publishing control document are:

For NumDocsPerDocType:

50,000
100,000
500,000

For NumDocs:

50 (I don't think this is actually used)
200 (Country)
10,000 (CTGovProtocol)
2,000 (DrugInfoSummary)
20,000 (GeneticsProfessional)
10,000 (GlossaryTerm)
1,000 (Media)
25,000 (Organization)
1,000 (PoliticalSubUnit)
50,000 (Protocol)
50,000 (Active Protocol)
50,000 (Closed Protocol)
1,000 (Summary)
20,000 (Term)

Some of those might conceivably be overrun one day. If so, using
the old technique, we could start removing random documents from
cancer.gov that shouldn't be removed. That would happen even
without a user altering the NumDocs... parameters in the
publishing user interface.

One way to overcome the problem is to modify the Python
publishing program to do the following:

Retrieve the total count of documents of the type of document
that is being published.

Retrieve the NumDocs and NumDocsPerDocType parameter values
for the current job from the pub_proc_parm table.

If the parameterized number is lower than the actual count,
then:

Don't process removals.

Maybe record a warning somewhere in hopes that, if this
happened accidentally, we're more likely to notice it.
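
The check outlined above could look roughly like this. It is a sketch under the assumption that the document count and the two parameter values have already been retrieved; the function name and logger name are hypothetical.

```python
import logging

logger = logging.getLogger("publishing-sketch")  # hypothetical logger name

def removals_allowed(total_docs, num_docs=None, num_docs_per_doctype=None):
    """Return False if either limit could have truncated the selection."""
    limits = [n for n in (num_docs, num_docs_per_doctype) if n is not None]
    if any(limit < total_docs for limit in limits):
        logger.warning(
            "Selection limit below document count (%d); skipping removals",
            total_docs,
        )
        return False
    return True

truncated = removals_allowed(1200, num_docs=1000)
complete = removals_allowed(1200, num_docs_per_doctype=50000)
print(truncated, complete)  # False True
```

Because the sketch checks both parameters without knowing which one the query uses, an unused parameter with a low value would suppress removals.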

To make this work, we would have to comb through the publishing
control document and clean up detritus. We have many publishing
subsets that specify parameters that aren't even used. There
could be a NumDocsPerDocType set at a number above the count of
docs of that doctype, and a NumDocs parameter that isn't used at
all with a lower value. In those cases, we should eliminate the
NumDocs parameter rather than raise its value. That seems more
sensible to me than trying to parse the selection query to find
out which parameter, if any, was actually used.

We should also, of course, review the values currently in the
publishing control document and raise at least some of them.

One possible problem with this is that if we ever truly need to
limit the number of published documents while still processing
removals, we'll need to re-design the code.

The MaxDocUpdatedDate parameter is entirely different. Right
now, the parameter is only used to set the maximum date of a
publishable version of a document, and of data denormalized into
it, to the date/time when publishing is started. That's a Good
Thing and I think it should work fine with either of the removal
techniques.

However if someone were to set it earlier than some previous
publication, I'm not sure what would happen. Removals might be
only one of the issues involved. It will take some serious work
to figure out what would happen, and I'm tempted to remove this
parameter from the user interface - though that might be a kludge
to do. Or perhaps we could compare the parameter to the
date/time of the last successful push job and use the maximum of
the two.
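
The "maximum of the two" idea is small enough to sketch directly. The function name and both timestamps below are made up for illustration.

```python
from datetime import datetime

def effective_cutoff(max_doc_updated_date, last_successful_push):
    """Never let the cutoff fall before the last successful push job."""
    return max(max_doc_updated_date, last_successful_push)

last_push = datetime(2010, 2, 27, 6, 0, 0)   # made-up timestamp
requested = datetime(2010, 1, 15, 0, 0, 0)   # earlier than the last push
cutoff = effective_cutoff(requested, last_push)
print(cutoff)  # 2010-02-27 06:00:00
```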

Assuming that the MaxDocUpdatedDate is not an issue, it's
beginning to look to me like this approach is better than the
earlier one I proposed.

Let's discuss it.

Comment entered 2010-03-03 12:14:55 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-03-03 12:14:55
BZCOMMENTOR::Volker Englisch
BZCOMMENT::8

(In reply to comment #7)
> The following publishing job parameters are problematic if we
> reinstate the old method of selecting removals:
>
> NumDocsPerDocType
> NumDocs

By the way, I never understood why we had to have two parameters for these. The NumDocs and NumDocsPerDocType parameters restrict the number of documents selected with a given query.
NumDocs is used when the subset contains a single SQL query.
NumDocsPerDocType is used when the subset contains multiple queries.
They both do the same thing but the second one is a little more descriptive.

> Some of those might conceivably be overrun one day.

We could possibly remove the NumDocs and NumDocsPerDocType parameters from the publishing document version on BACH completely.
We've never used these parameters on BACH, only on FRANCK and MAHLER in order to speed up testing when we wanted to run a test Export, for instance, and selected a maximum of 1000 docs per doc type.

There are actually some document types that would not need any restriction in terms of number of docs processed, like Country, PolSubUnit, Media, GP. The restriction is generally set to throttle down the number of Organization and protocols to be processed during a test run.

> If the parameterized number is lower than the actual count,
> then:
>
> Don't process removals.

This would mean that a change/test in the removal of an Organization or protocol document would always require running the full set for this document type.
However, I cannot foresee this to be a problem for our production publishing jobs.

> I'm tempted to remove this parameter from the user interface - though
> that might be a kludge to do.

Maybe we can make the form field read-only?
This field has never been used except for testing. I only see it as the insurance that all documents are being processed as of a fixed date/time.

Comment entered 2010-03-04 23:04:52 by alan

BZDATETIME::2010-03-04 23:04:52
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

Bob, Volker and I discussed this at length. We have not yet
reached a definite conclusion about the best method of handling
this problem. Here is a list of all the options with my
understanding of their advantages and disadvantages.

1. Re-instate the old technique and be careful with it.

The old technique worked as long as no one set the NumDocs or
NumDocsPerDocType job parameters lower than the total number
of docs in the available pool. To the best of our knowledge,
this only happened once in the life of the CDR.

Pros:
+ Simple - just re-instate the old code.
+ Performs complete removals of all docs that should be
removed.

Cons:

  • If someone forgets the risks and lowers the document
    selection count on a production publication, good
    documents will be removed from cancer.gov.

  • If the NumDocs... parameters are overtaken by the
    actual number of documents in the database, the same
    thing will happen. This can be completely prevented by
    setting extremely high default values (e.g., 1,000,000)
    for all counts.

  • If someone writes a selection query that, for whatever
    reason, does not include every publishable document,
    the publishing program will incorrectly remove any
    documents that aren't included in the selection from
    cancer.gov. This can happen if the query is written
    without keeping this issue in mind, or if the wrong
    Pubtype is assigned to the query without realizing the
    implications of doing that.

2. Re-instate the old technique, but add a test for queries that
failed to select all available docs. If a user specified too
few documents for the removal to work correctly, log a
warning and do not remove documents.

This is explained in comment #7.

Pros:
+ Removes the correct documents every time, if it removes
any at all, at least for the Pubtypes that select all
docs of a given type.
+ Behavior is safe if, for any reason, the NumDocs...
parameters are changed.

Cons:

  • Requires users to notice the warnings and take action
    if something went wrong.

2.a. Same technique but execute the current generic removal
for blocked documents when a full removal is not allowed.

Even if we don't publish all available publishable documents
of a given document type, and therefore can't remove any
that weren't published, we can still remove blocked
documents. In many cases, that's all that needs removal
anyway.

Pros:
+ Does everything method 2. does.
+ Does more in the rare case where it might be needed.

Cons:

  • Users must still be alert for warnings.

3. Write custom queries to select removals for each document
type.

This is explained in comment #4 and comment #6. If we do not
worry about Persons and Organizations, which will no longer
be published, then it looks like a total of four custom
queries will be required.

For all other document types, a generic selection to remove
blocked documents would be adequate and correct.

Pros:
+ We can remove the right set of documents under any
circumstances, regardless of the NumDocs... job
parameters.
+ The current logic that ties removals to the "Pubtype"
job parameter can be made less restrictive. An
automatic removal could be specified to be performed
with any publication type if desired. [Note: we can do
that now if we wish just for removing blocked
documents.]
+ Removal jobs could be created and run without having to
publish anything at all.
+ If a publishing selection query is edited to
incorrectly select fewer documents than it should, it
won't result in documents being removed that should not
be.

Cons:

  • If a removal selection query is edited incorrectly to
    pull too many or too few documents, too many or too few
    will be pulled. Currently, there is one query to get
    right or wrong per publication. With this change there
    will be two.

  • If there is a change in a selection query but the
    programmer forgets to create or change a corresponding
    removal query in a corresponding fashion, published
    documents could be withdrawn in the same job in which
    they are published.

4. Write a report query to identify documents that are not
blocked but should be removed. Users would hotfix remove
those documents.

The publishing program could remove blocked documents. Users
would have to hotfix remove any additional documents
identified in the report. Such a report query need not be
tied directly to publishing.

Pros:
I can't really think of any advantages in comparison with
the other approaches.

Cons:

  • Puts the most burden on users to remove wrong documents
    from cancer.gov. More labor is required. Careful
    attention to detail is required.

  • Introduces possible lag into the removal process,
    depending on how often users run the report.

  • The report must utilize versions of the same queries
    that would be written for method 3. above. The same
    issues exist regarding keeping those queries
    up-to-date.

Our next tasks are to determine:

Is the above analysis correct?

Is it complete? Are there other useful approaches not given
above?

Which is best?

My current inclinations are leaning towards method 2.a or 3. I
like 3. a little better. I think Bob likes 2.a a little better.
I don't think Volker has expressed a definite preference yet.

We need user input and further discussion, perhaps at a status
meeting when Lakshmi returns.

Comment entered 2010-03-10 00:50:37 by alan

BZDATETIME::2010-03-10 00:50:37
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

Bob had an idea about another option, not given in the list in
comment #9.

His idea was to find the document ID of every single document
that can be selected for publication, and remove (or report to
CIAT for removal) any documents from cancer.gov that are not in
that group. Finding every single document that can be selected
for publication means finding the ones that are both publishable
and meet the criteria for status values that are needed for the
particular document type.

Execute every single query in the entire publishing document

Place each unique CDR document ID found by any of the queries
in a single set of doc IDs.

Compare the doc IDs in the set to the doc IDs in the
pub_proc_cg table.

Remove (or just report) any document from cancer.gov that is
found in pub_proc_cg but is not in the retrieved set.
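
The four steps above amount to a set difference. Here is a minimal sketch, assuming each query's result is already available as a list of doc IDs; the function name and the sample IDs are hypothetical.

```python
def find_unselected(query_results, ids_on_cancer_gov):
    """Return IDs on cancer.gov that no publishing query selects.

    query_results: iterable of ID lists, one per publishing query.
    ids_on_cancer_gov: IDs read from the pub_proc_cg table.
    """
    selectable = set()
    for ids in query_results:
        selectable.update(ids)   # one combined set of unique doc IDs
    return sorted(set(ids_on_cancer_gov) - selectable)

results = [[101, 102], [102, 103], [200]]   # made-up query output
on_cg = [101, 103, 200, 999]                # made-up pub_proc_cg contents
candidates = find_unselected(results, on_cg)
print(candidates)  # [999]
```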

To test this idea I wrote a program to implement it. It executes
every one of the queries, performing parameter substitutions as
needed to make them work.

I ran it on Bach.

It produces a log file showing what queries were processed, how
many were found, and what the doc IDs are that this technique
says should be removed. I have attached it.

The program log revealed the following statistics:

Total unique IDs currently on cancer.gov: 64,572
Total unique IDs that are selected for publishing: 64,697
Total that should be removed: 1

Time taken for the whole process: 3 minutes 40 seconds.

The ID to remove is 658843.

I haven't yet examined the data to find out why there are so many
documents (126) that would be published if we ran all selections
but have not been.
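
The figure of 126 follows from the three totals reported above, assuming the single removal candidate is the only ID on cancer.gov that no query selects:

```python
on_cancer_gov = 64572   # unique IDs currently on cancer.gov
selected = 64697        # unique IDs selected by the publishing queries
to_remove = 1           # on cancer.gov but selected by no query

on_both = on_cancer_gov - to_remove     # IDs present in both sets
unpublished = selected - on_both        # selected but not on cancer.gov
print(on_both, unpublished)  # 64571 126
```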

The list of these IDs is at the end of the log file, which is
attached.

Comment entered 2010-03-10 00:50:37 by alan

Attachment AllPub.log has been added with description: Log file from selection of every doc from all publishing queries

Comment entered 2010-03-11 11:33:54 by alan

BZDATETIME::2010-03-11 11:33:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I revised my program to provide additional information about each
document that was identified as needing pulling or publishing.
For each doc it now lists:

DocID Version Status Datetime DocType: Doc Title

Version = The last version number, publishable or not.
Status = "A" for Active or "I" for Inactive/blocked.
Datetime = Date last version created

I've attached an excerpt of the log file that contains this
information. The Doc IDs are a little different from Tuesday, as
we would expect.

Comment entered 2010-03-11 11:33:54 by alan

Attachment AllPubDocs.txt has been added with description: Doc IDs with additional info for each doc

Comment entered 2010-03-19 00:36:53 by alan

BZDATETIME::2010-03-19 00:36:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

For the record, our decision regarding the changes to be made is
that:

The publishing program will be modified to run a subroutine
to identify any documents that should have been pulled from
cancer.gov but are still there. The routine will be run at
the end of the part of the publishing program that processes
removals of newly blocked documents from cancer.gov.

If and only if one or more documents is identified as still
needing removal, an email report will be sent to any members
of an email group defined to receive this report.
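
A sketch of the "if and only if" rule, using the standard library to build the message without sending it. The function name, sender address, subject line, and body layout are assumptions made for the example.

```python
from email.message import EmailMessage

def build_report(candidates, recipients, sender="cdr@example.org"):
    """Return a report message, or None when nothing needs reporting.

    candidates: list of (doc_type, doc_id) pairs still needing removal.
    recipients: addresses of the group members (hypothetical here).
    """
    if not candidates or not recipients:
        return None              # no documents found: send nothing
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "Documents still needing removal from cancer.gov"
    lines = [f"{doc_type} - {doc_id}" for doc_type, doc_id in candidates]
    msg.set_content("\n".join(lines))
    return msg

empty = build_report([], ["user@example.org"])
report = build_report([("InScopeProtocol", 658843)], ["user@example.org"])
print(empty is None)  # True
print(report.get_content().strip())  # InScopeProtocol - 658843
```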

Comment entered 2010-06-17 11:37:39 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-17 11:37:39
BZCOMMENTOR::Volker Englisch
BZCOMMENT::13

Assigned to me.

Comment entered 2010-06-21 14:46:12 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-21 14:46:12
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14

William, what information would you like to have reported in the email for these documents that may need to be manually removed from Cancer.gov?

I'm thinking the CDR-ID, Protocol Title, and Status should be enough.
Is there anything else that should be included?

Comment entered 2010-06-21 15:16:39 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-06-21 15:16:39
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::15

(In reply to comment #14)
> William, what information would you like to have reported in the email for
> these documents that may need to be manually removed from Cancer.gov?
>
> I'm thinking the CDR-ID, Protocol Title, and Status should be enough.
> Is there anything else that should be included?

Yes. That should be enough.
This will not affect only protocols, right? It would be good to label the various document types so that it is easy to know who needs to take care of each one.

Comment entered 2010-06-23 15:18:01 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-23 15:18:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::16

William reviewed the email that was created as a test.

Are we going to run this report as part of the nightly or the weekly publishing?
Please note that a document will be listed as long as it still needs to be removed. I am not checking if the document had already been reported or not.

Comment entered 2010-06-23 17:27:35 by alan

BZDATETIME::2010-06-23 17:27:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

(In reply to comment #16)
> William reviewed the email that was created as a test.
>
> Are we going to run this report as part of the nightly or the weekly
> publishing?

Weekly sounds adequate to me. We've gone a couple of years without running it and, apparently, only one document has needed to be removed.

> Please note that a document will be listed as long as it still needs to be
> removed. I am not checking if the document had already been reported or not.

That sounds like a good idea. If it was reported but not removed, it should be reported again.

Comment entered 2010-06-25 12:10:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-25 12:10:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::18

I created a group named 'Hotfix Remove Notification' for the people that receive this email.
Who should be listed in this group?

Comment entered 2010-06-25 12:19:54 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-06-25 12:19:54
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::19

(In reply to comment #18)
> I created a group named 'Hotfix Remove Notification' for the people that
> receive this email.
> Who should be listed in this group?

Since you've created the group, can I add those who need to receive the emails? Is the group in Mahler?

Comment entered 2010-06-25 12:23:44 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-25 12:23:44
BZCOMMENTOR::Volker Englisch
BZCOMMENT::20

I've updated the following files:
Jobmaster.py - R9724
CheckHotfixRemove.py - R9724

These files are currently on MAHLER but given the fact that I will be on vacation over the weekend I am not going to put these in production (typos do happen) but will do so once I'm back.
I will run the new script manually on BACH to provide CIAT with the list of documents that should be removed from Cancer.gov and maybe William can look it over before tonight to have me run the Hotfix-Remove job before I leave.

Other than that, this is ready on MAHLER and ready to be copied to production.

Comment entered 2010-06-25 12:36:42 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-25 12:36:42
BZCOMMENTOR::Volker Englisch
BZCOMMENT::21

(In reply to comment #19)
> Since you've created the group, can I add those who need to receive the emails.
> Is the group in Mahler?

Yes, the group is on BACH and on MAHLER and I believe you have the rights to add members to the group.

Comment entered 2010-06-25 15:09:30 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-25 15:09:30
BZCOMMENTOR::Volker Englisch
BZCOMMENT::22

(In reply to comment #20)
> I will run the new script manually on BACH to provide CIAT with the list of
> documents that should be removed from Cancer.gov

I ran the program on BACH. The email went to myself and William with a list of these candidates for removal from Cancer.gov:

List of Hotfix-Remove Candidates
--------------------------------
Term - 674133
InScopeProtocol - 649658
InScopeProtocol - 658843
Organization - 28045
Organization - 30805
Organization - 30851
Organization - 32861
Organization - 33307
Organization - 33403
Organization - 33458
Organization - 33620
Organization - 35279
Organization - 35412
Organization - 35814
Organization - 36972
Organization - 37036
Organization - 37192
Organization - 37416
Organization - 37656
Organization - 352175
CheckHotfixRemove - Finished
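
To make it easier to see who needs to handle which documents, the report lines can be grouped by document type. This is a sketch that assumes the "DocType - ID" layout shown in the list above; the function name is hypothetical and this is not the CheckHotfixRemove.py code.

```python
from collections import defaultdict

def group_by_doctype(lines):
    """Group 'DocType - ID' lines by doc type, skipping any other lines."""
    groups = defaultdict(list)
    for line in lines:
        doc_type, sep, doc_id = line.partition(" - ")
        if sep and doc_id.strip().isdigit():
            groups[doc_type.strip()].append(int(doc_id))
    return dict(groups)

sample = [
    "Term - 674133",
    "InScopeProtocol - 649658",
    "InScopeProtocol - 658843",
    "Organization - 28045",
    "CheckHotfixRemove - Finished",   # status line, not a document
]
grouped = group_by_doctype(sample)
for doc_type, ids in grouped.items():
    print(doc_type, ids)
```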

Comment entered 2010-07-13 15:55:54 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-07-13 15:55:54
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::23

I went through the list and found that only 6 records may need to be blocked so that they are removed from cancer.gov. We are investigating these 6 documents further. I will post a comment after they have been blocked from publication. The documents are:

InScope Protocol

CDR 649658
CDR 658843

Organization

CDR 35279
CDR 36972
CDR 37036
CDR 352175

Comment entered 2010-07-16 14:44:06 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-07-16 14:44:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::24

(In reply to comment #23)
> InScope Protocol
>
> CDR 649658
> CDR 658843
>
> Organization
>
>
> CDR 35279
> CDR 36972
> CDR 37036
> CDR 352175

Kim has blocked the above records so they should be off cancer.gov by now.

Comment entered 2010-07-16 14:58:40 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-07-16 14:58:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::25

Yes, these documents were removed from Cancer.gov yesterday and the day before yesterday.

Comment entered 2010-07-19 11:56:55 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-07-19 11:56:55
BZCOMMENTOR::Volker Englisch
BZCOMMENT::26

(In reply to comment #20)
> Jobmaster.py - R9724
>
> These files are currently on MAHLER but given the fact that I will be on
> vacation over the weekend I am not going to put these in production (typos do
> happen) but will do so once I'm back.

I have copied the file to production and will monitor publishing until this weekend.

Comment entered 2010-07-28 10:50:15 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-07-28 10:50:15
BZCOMMENTOR::Volker Englisch
BZCOMMENT::27

The weekend publishing job finished and produced a report listing multiple document candidates to be pulled from Cancer.gov.

If William agrees that the results of the report were correct we could close this issue.

Comment entered 2010-07-28 12:24:25 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-07-28 12:24:25
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::28

(In reply to comment #27)
> The weekend publishing job finished and produced a report listing multiple
> document candidates to be pulled from Cancer.gov.
>
> If William agrees that the results of the report were correct we could close
> this issue.

Yes. I agree. We can close this issue. Since I am not the QA for this issue, I will wait for Alan to close it.

Comment entered 2010-07-28 12:25:40 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-07-28 12:25:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::29

Closing issue.

Attachments
File Name Posted User
AllPub.log 2010-03-10 00:50:37
AllPubDocs.txt 2010-03-11 11:33:54
