PDQ Issues

Issue Number	5327
Summary	Investigate missing revised summary from GovDelivery email report
Created	2024-06-10 13:40:00
Issue Type	Task
Submitted By	Osei-Poku, William (NIH/NCI) [C]
Assigned To	Kline, Bob (NIH/NCI) [C]
Status	Closed
Resolved	2024-06-20 12:13:04
Resolution	Fixed
Path	/home/bkline/backups/jira/ocecdr/issue.444777

Description

Please investigate why CDR256680 was missing from the GovDelivery New/Changed Spanish Summaries Report for 2024-05-24 to 2024-05-31.

Comment entered 2024-06-20 12:13:04 by Kline, Bob (NIH/NCI) [C]

As far as my investigation is able to determine, the report was run twice: once by the scheduler at 2:30 in the morning on June 2 (finding no documents to report) and once by hand at 2:27 in the afternoon on June 6.

In February of last year new logic was added to handle edge cases in which a publishable version of a document is created after the most recent publishing job was started. In that case the is_published() method is invoked to find out whether an earlier publishable version was created within the date range used to generate the current report. That new method tries to parse the start and end values for the report's date range, and assumes that those values are strings in the format "YYYY-MM-DD HH:MM:SS". That assumption is valid when the default values for the range are used, or when the values are overridden with a full ISO date and time string matching that pattern.

However, the manual run of the report on the afternoon of the 6th provided only the dates for the range, without the "HH:MM:SS" portion for the times. This raised an exception during the processing of CDR256680, and as a result that summary document was omitted from the report. The report could be modified to supply default times when the start and end of the report's time range are overridden, but it would be preferable for a manual invocation of the report to specify the full date/time strings, matching the push publishing jobs for which the report should be run, and emulating what the job would have done when it was originally scheduled to run.
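For illustration only, here is a minimal sketch of the "supply default times" option, accepting either a full "YYYY-MM-DD HH:MM:SS" string or a bare "YYYY-MM-DD" date. The function names and the choice of 00:00:00 and 23:59:59 as defaults are assumptions made for this example; this is not the report's actual code.

```python
from datetime import datetime

FULL_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_boundary(value, default_time):
    """Parse a range boundary, appending default_time if only a date was given.

    Hypothetical helper: the real report may handle this differently.
    """
    value = value.strip()
    try:
        # Full date/time string, e.g. "2024-05-24 00:00:00".
        return datetime.strptime(value, FULL_FORMAT)
    except ValueError:
        # Date-only string, e.g. "2024-05-24"; still raises ValueError
        # if the value is not a valid date at all.
        return datetime.strptime(f"{value} {default_time}", FULL_FORMAT)


def normalize_range(start, end):
    """Return (start, end) as datetime objects covering whole days by default."""
    return parse_boundary(start, "00:00:00"), parse_boundary(end, "23:59:59")
```

With this approach a manual run given only "2024-05-24" and "2024-05-31" would be treated as covering both days in full, rather than raising an exception.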

Comment entered 2024-06-20 14:41:24 by Kline, Bob (NIH/NCI) [C]

I see that there were actually three runs of this report, and the third run did pick up the summary. So the question is, why did this third run of the report (on June 6) pick up CDR256680 when the scheduled run on the morning of June 2 failed to include it? The report looks at the value of the DateLastModified element as captured in the query_term_pub table to see if that date falls in the report's date range. The version of the document which was published (version number 207) had "2024-02-15" as the value of the DateLastModified element. Therefore the report omitted the document, as was appropriate. However, the following week someone went in and created a new publishable version of the document (#208), changing the value of the DateLastModified element to "2024-05-31". If you recall that the query_term_pub table can only store the values extracted from the most recently saved publishable version, you will understand why the report generated on the 2nd and the report generated on the 6th both behaved as they were supposed to, even though they didn't match each other.
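The effect can be reproduced with a small simulation. The dictionary below is only a stand-in for the single value which query_term_pub holds for the document; the function names are hypothetical, and the real report queries the database rather than a Python dict.

```python
from datetime import date


def in_report_range(indexed_value, start, end):
    """True if the indexed DateLastModified value falls inside the range."""
    return start <= date.fromisoformat(indexed_value) <= end


def simulate_runs():
    """Replay the two runs against a one-value-per-document index."""
    start, end = date(2024, 5, 24), date(2024, 5, 31)

    # Value extracted from publishable version 207.
    index = {"CDR256680": "2024-02-15"}
    june_2 = in_report_range(index["CDR256680"], start, end)

    # Version 208 is saved the following week and overwrites the value.
    index["CDR256680"] = "2024-05-31"
    june_6 = in_report_range(index["CDR256680"], start, end)

    return june_2, june_6
```

Calling simulate_runs() gives (False, True): the same date range yields different results because the indexed value changed between the runs.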

Note that it would be possible (though considerably more expensive in terms of processing time and resources) to modify the report so that it parsed each summary in full to find the actual value of the DateLastModified element in the version published by the relevant push job, rather than relying on the query_term_pub table. That would come closer to a strict implementation of the intended logic, but I'm not convinced it would be worth it. In this particular case, such a modification of the code would have prevented CDR256680 from ever being picked up for the report for the week in question, no matter how many times it was run manually.
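As a rough sketch of what that alternative would involve, the value could be read directly from the XML of the specific version that was published. The helper name and the sample document below are invented for illustration (including the root element name); fetching the XML for a given version number is left out, since that depends on the CDR API.

```python
import xml.etree.ElementTree as ET


def date_last_modified(xml_text):
    """Return the text of the first DateLastModified element, or None.

    Hypothetical helper operating on the XML of one document version.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//DateLastModified")
    if node is None or not node.text:
        return None
    return node.text.strip()


# Invented stand-in for the XML of published version 207.
version_207 = "<Summary><DateLastModified>2024-02-15</DateLastModified></Summary>"
published_value = date_last_modified(version_207)
```

Here published_value is "2024-02-15", which falls outside the 2024-05-24 to 2024-05-31 range regardless of what later versions contain.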

Comment entered 2024-06-25 08:18:03 by Osei-Poku, William (NIH/NCI) [C]

Thank you, Bob!

