Issue Number | 5327 |
---|---|
Summary | Investigate missing revised summary from GovDelivery email report |
Created | 2024-06-10 13:40:00 |
Issue Type | Task |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2024-06-20 12:13:04 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.444777 |
Please investigate why CDR256680 was missing from the GovDelivery New/Changed Spanish Summaries Report for 2024-05-24 to 2024-05-31.
As far as my investigation is able to determine, the report was run
twice, once by the scheduler at 2:30 in the morning on the 2nd of June
(finding no documents to report) and once by hand at 2:27 in the
afternoon of June 6th. In February of last year new logic was added to
handle edge cases in which a publishable version of a document is
created after the most recent publishing job was started. In that case
the is_published() method is invoked to find out if an
earlier publishable version was created within the date range used to
generate the current report. That new method tries to parse the start
and end values for the report's date range, and makes the assumption
that those values are strings in the format "YYYY-MM-DD HH:MM:SS". That
assumption is valid when the default values for that range are used, or
if the values are overridden with a full ISO date and time string
matching that pattern. However, the manual run of the report on the
afternoon of the 6th provided only the dates for the range, without the
"HH:MM:SS" portion for the times. This raised an exception during the
processing of CDR256680, and as a result that summary document was
omitted from the report. The report could be modified to supply default
times when the start and end for the report's time range are overridden,
but it would be preferable if the manual invocation of the report
specified the full date/time strings, matching the push publishing jobs
for which the report should be run, and emulating what the job would
have done when it was originally scheduled to run.
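If the report were modified to supply default times, the parsing could look roughly like the sketch below. This is not the actual CDR code: the helper name `parse_range_value` and the choice of 00:00:00 for the start and 23:59:59 for the end are assumptions made for illustration only.

```python
from datetime import datetime

FULL_FORMAT = "%Y-%m-%d %H:%M:%S"
DATE_ONLY_FORMAT = "%Y-%m-%d"


def parse_range_value(value, end_of_day=False):
    """Parse a report range boundary (hypothetical helper).

    Accepts either "YYYY-MM-DD HH:MM:SS" or a bare "YYYY-MM-DD".
    For a bare date, the time defaults to 00:00:00, or to 23:59:59
    when end_of_day is True (an assumed convention, not confirmed
    by the report's code).
    """
    value = value.strip()
    try:
        return datetime.strptime(value, FULL_FORMAT)
    except ValueError:
        pass
    # Raises ValueError if the value matches neither format.
    parsed = datetime.strptime(value, DATE_ONLY_FORMAT)
    if end_of_day:
        parsed = parsed.replace(hour=23, minute=59, second=59)
    return parsed


start = parse_range_value("2024-05-24")
end = parse_range_value("2024-05-31", end_of_day=True)
```

With a fallback like this, the date-only values supplied for the manual run would no longer raise an exception, although supplying the full date/time strings remains the safer way to match the push publishing jobs.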
I see that there are actually three runs of this report, and the
third run actually did pick up the summary. So the question is, why did
this third run of the report (on June 6) pick up CDR256680 when the
scheduled run on the morning of June 2 failed to include it? The report
looks at the value of the DateLastModified element as captured in the
query_term_pub table to see if that date
falls in the report's date range. The version of the document which was
published (version number 207) had "2024-02-15" as the value of the
DateLastModified element. Therefore the report omitted the
document, as was appropriate. However, the following week someone went
in and created a new publishable version of the document (#208),
changing the value of the DateLastModified element to
"2024-05-31". If you recall that the query_term_pub table can
only store the values extracted from the most recently saved publishable
version, you will understand why the report generated on the 2nd and the
report generated on the 6th both behaved as they were supposed to, even
though they didn't match each other.
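The difference between the two runs can be illustrated with a toy model. This is only a sketch: the dictionary stands in for the query_term_pub table, the `in_report_range` helper is hypothetical, and treating the date range as inclusive at both ends is an assumption.

```python
from datetime import date

# Report range for the week in question (assumed inclusive).
RANGE_START = date(2024, 5, 24)
RANGE_END = date(2024, 5, 31)


def in_report_range(date_last_modified):
    """Return True if an ISO date string falls within the report range."""
    value = date.fromisoformat(date_last_modified)
    return RANGE_START <= value <= RANGE_END


# Stand-in for query_term_pub: one DateLastModified value per document,
# taken from the most recently saved publishable version.
query_term_pub = {256680: "2024-02-15"}  # version 207

# Scheduled run on June 2: the stored value is outside the range.
picked_up_on_june_2 = in_report_range(query_term_pub[256680])

# Version 208 is saved the following week and overwrites the stored value.
query_term_pub[256680] = "2024-05-31"

# Manual run on June 6: the same check now succeeds.
picked_up_on_june_6 = in_report_range(query_term_pub[256680])

print(picked_up_on_june_2, picked_up_on_june_6)
```

The same check gives different answers only because the stored value changed between the runs, not because the report logic did.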
Note that it would be possible (though considerably more expensive in
terms of processing time and resources) to modify the report so that it
completely parsed all the summaries to find the actual value of the
DateLastModified element of the version published by the
relevant push job, rather than relying on the
query_term_pub table. That would come closer to a strict
implementation of the intended logic, but I'm not convinced it would be
worth it. In this particular case, such a modification of the code would
have prevented CDR256680 from ever being picked up for the
report covering the week in question, no matter how many times it was
run manually.
Thank you, Bob!