CDR Tickets

Issue Number 2979
Summary New Summary Document Not Published to Cancer.gov
Created 2009-10-05 15:12:47
Issue Type Improvement
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2010-05-17 11:26:22
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107307
Description

BZISSUE::4655
BZDATETIME::2009-10-05 15:12:47
BZCREATOR::Volker Englisch
BZASSIGNEE::Volker Englisch
BZQACONTACT::William Osei-Poku

On Friday, Margaret and Anne identified a Summary document (CDR614165) on BACH that had been published to the vendors but did not appear on Cancer.gov.

After some investigation we identified that this summary - a new document - had been picked up for publishing on Aug 21st and got pushed to Gatekeeper.
However, the load of the document failed due to a 'pretty-URL unique value' failure. This failure was reported back to us.

At this point, the summary probably should have been corrected and manually republished.
Once the document was corrected the users expected the publishing job to pick up the summary for publishing because it had not been published yet. However, due to the earlier publishing event this document already existed in the table pub_proc_cg, but our queries of the publishing document identify a document as new (to be picked up as part of the nightly updates) if it doesn't exist in the table pub_proc_cg. Therefore, this document was not published as part of a nightly job but only as part of the weekly publishing job.

Obviously, this situation does not occur frequently but we could adjust the SQL queries for picking up new documents to adjust for this situation.

I've modified the original query (for summaries) from the following:
SELECT TOP ?NumDocsPerDocType? d.id, max(v.num)
FROM doc_version v
JOIN document d
ON v.id = d.id
JOIN doc_type t
ON d.doc_type = t.id
WHERE t.name = 'Summary'
AND d.active_status = 'A'
AND v.publishable = 'Y'
AND v.val_status = 'V'
AND v.updated_dt <= '?MaxDocUpdatedDate?'
AND v.id NOT IN (SELECT id FROM pub_proc_cg)
AND d.id NOT IN (
SELECT doc_id
FROM query_term_pub
WHERE path = '/Summary/WillReplace/@cdr:ref'
)
GROUP BY d.id
ORDER BY d.id

to this query, which would also pick up documents that do exist in the pub_proc_cg table but for which no successful push job to Cancer.gov exists.
SELECT d.id, max(v.num)
FROM doc_version v
JOIN document d
ON v.id = d.id
JOIN doc_type t
ON d.doc_type = t.id
WHERE t.name = 'Summary'
AND d.active_status = 'A'
AND v.publishable = 'Y'
AND v.val_status = 'V'

    • AND v.updated_dt <= '?MaxDocUpdatedDate?'
      AND d.id NOT IN (
      SELECT doc_id
      FROM query_term_pub
      WHERE path = '/Summary/WillReplace/@cdr:ref'
      )
      AND (v.id NOT IN (SELECT id FROM pub_proc_cg)
      OR
      v.id IN (SELECT DISTINCT ppc.id
      FROM pub_proc_cg ppc
      JOIN pub_proc_doc ppd
      ON ppc.id = ppd.doc_id
      JOIN pub_proc pp
      ON ppd.pub_proc = pp.id
      AND pp.pub_subset LIKE 'Push_%'
      WHERE ppd.failure = 'Y'
      AND NOT EXISTS (SELECT 'x'
      FROM pub_proc_doc i
      JOIN pub_proc ipp
      ON ipp.id = i.pub_proc
      AND ipp.pub_subset LIKE 'Push_%'
      WHERE i.doc_id = ppd.doc_id
      AND i.failure IS NULL
      )
      )
      )
      GROUP BY d.id
      ORDER BY d.id

The query adds about 2-5 minutes for each doc_type as it's written but it might be possible to speed things up a little more should we decide to modify the publishing queries.

I've tested this with the summaries only for now but more testing would be needed.

Comment entered 2009-10-12 10:53:50 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2009-10-12 10:53:50
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::1

I have added the email communication between Volker and me regarding the reported problem below. As we discussed in the CDR meeting last Thursday, we (CIAT) knew what the problem was and tried to fix it but failed to fix it correctly. I have taken note of this so that if another problem comes up similarly to this, we will adequately address it. Thank you.

---Original Message---
From: Englisch, Volker (NIH/NCI) [C] volker@mail.nih.gov
Sent: Monday, August 24, 2009 10:19 AM
To: Osei-Poku, William
Subject: RE: Problems with loading of job 6426 to Cancer.gov

The document CDR614165 could not get loaded because another document with the identical SummaryURL already exists: CDR62781.

Here is the error message from Gatekeeper:
Database Error: Checking prettyURL duplication failed.
Document CDRID=614165.
Database Error: The prettyURL
/cancertopics/pdq/treatment/child-cerebellar-astrocytoma/healthprofessional
is not unique. It is used by another Summary/DrugInfoSummary
document. Reference document CDRID=62781.


Volker Englisch
Contractor - Lockheed Martin
phone: (301) 496-0102 (CTB)
mailto:volker@mail.nih.gov

---Original Message---
From: Osei-Poku, William william.osei-poku@lmco.com
Sent: Monday, August 24, 2009 7:25 AM
To: Englisch, Volker (NIH/NCI) [C]
Subject: RE: Problems with loading of job 6426 to Cancer.gov

Hi Volker,
Do you know what the error means?
I filtered the document and there were no errors.

Validation results for CDR614165
Document is valid: no errors, no warnings!

Thanks,
William

---Original Message---
From: cdr@bach.nci.nih.gov cdr@bach.nci.nih.gov
Sent: Friday, August 21, 2009 11:58 PM
To: bkline@rksystems.com; vrmeyer@comcast.net; volker@mail.nih.gov; Osei-Poku, William
Subject: Problems with loading of job 6426 to Cancer.gov

1 failures and 0 warnings were encountered in the loading of documents
for job 6426 to Cancer.gov.

Please visit the following link for further details:

http://bach.nci.nih.gov/cgi-bin/cdr/GateKeeperStatus.py?jobId=6426&targetHost=gatekeeper.cancer.gov&flavor=all

Comment entered 2010-03-24 13:09:01 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-03-24 13:09:01
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

I'm changing this to a P6.
This situation comes up very rarely (maybe once or twice a year) and is fixed automatically by the weekly publishing job.

Comment entered 2010-04-29 11:20:28 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-04-29 11:20:28
BZCOMMENTOR::Volker Englisch
BZCOMMENT::3

The original query runs in 3 seconds.
The new query runs in 230 seconds.

I found out that the new query will run in 10 sec if the 'OR' statement is split into two queries joined by a UNION.
This is the version I am going to include into our publishing document.

Comment entered 2010-05-10 17:33:18 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-05-10 17:33:18
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

I've updated the Summaries SQL query for the Interim-Export publishing job.
The document CDR178 has been replaced on production (current CDR version 52).

I'll monitor during this week before closing the issue.

Comment entered 2010-05-17 11:26:22 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-05-17 11:26:22
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

Publishing is now running for a week without any problems.

Closing issue.

Elapsed: 0:00:00.000756