Issue Number | 4479 |
---|---|
Summary | Add Date Last Accessed column to Content Partner report |
Created | 2018-05-24 13:21:39 |
Issue Type | Improvement |
Submitted By | Beckwith, Margaret (NIH/NCI) [E] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2018-06-07 11:45:12 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.226662 |
I think it would be helpful to add a column to the list of Active and Test PDQ Content Partners that showed the last time they accessed the data. I realize that it doesn't tell us what they do with it, or which data they actually pick up, but I still think it could give us clues about who is actively using the data. We can discuss specific requirements further at the meeting.
~volker: it looks to me as if cdr/Publishing/LicenseeList.py is obsolete (superseded by cdr/Scheduler/tasks/licensee_list.py). Unless you think otherwise, I plan to drop it from the repository.
~volker: Am I right in thinking that this enhancement blows a hole in our theory that it was safe to set up access to the partner SFTP logs just from DEV (based on the assumption that we would be looking at those logs ourselves from the development server)?
Following up on my previous question, ~volker, it seems that what we will need to do for this request is set up a service on the DEV server, which is the only server which has direct access to the cumulative historical SFTP logs, and have this service perform an rsync to fetch the latest logs from the SFTP server, and parse all of the log files to extract the latest date associated with each login name. This report, running on PROD, will invoke that DEV service, and use the information it gets back to match login names with data partners and populate the new "Last Access" date column. Does that sound right?
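For illustration only, here is a minimal sketch of the "latest date associated with each login name" step. The line layout (`YYYY-MM-DD HH:MM:SS login`), the function name, and the sample login names are all assumptions made for this example; the real SFTP log format is not shown in this ticket.

```python
from datetime import datetime


def latest_access_by_login(lines):
    """Map each login name to its most recent access time.

    Assumes a hypothetical line layout of 'YYYY-MM-DD HH:MM:SS login';
    the real SFTP log format will differ.
    """
    latest = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # too short to hold a date, a time, and a login
        try:
            when = datetime.strptime(" ".join(parts[:2]), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamp we recognize
        login = parts[2]
        if login not in latest or when > latest[login]:
            latest[login] = when
    return latest


# Made-up sample lines with made-up login names.
sample = [
    "2018-05-01 08:15:00 alpha",
    "2018-06-03 12:00:00 beta",
    "2018-05-20 09:30:00 alpha",
    "malformed line",
]
accesses = latest_access_by_login(sample)
```

The report would then look up each partner's login name in the returned dictionary and leave the column empty when no entry is found.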
~mbeckwit: There will be holes in the information (beyond the flaw that the access event dates don't tell us much about what the data partners actually retrieved or what they did with it), introduced by two conditions:
(1) CBIIT inadvertently and irretrievably wiped out several months' worth of information about data partner activity; and
(2) the login name associated with a specific data partner can change, and there is no record of previous login names used by the data partners (and there is even no mechanism that I'm aware of to prevent the re-use of a login name by a different data partner than was previously associated with that login name).
I know that I had written the original report on the FTP-DEV server as a cronjob but I don't know where the version running now as a scheduled job is running from. Does it also run on DEV or is it running on PROD?
The PDQ Partner List report (the one for this ticket) runs on all four tiers, I believe. The PDQ Partner SFTP Requests Report (the one which retrieves and parses the SFTP logs) only runs on DEV.
We discussed this yesterday but I'll add the main points here regarding your item (2):
Yes, login names for partners could change. There are a few partners for whom the transfer to the new environment did not work properly and a new login had been created. If the user's login name was dada, then the new login would be dada2.
No, login names are not recycled. Login names don't get deleted, only disabled. Therefore it's not possible to reuse a login name on this *NIX system.
There is one partner whose name did change. It is Gustav Quade, Univ. Bonn. He retired from the U. of Bonn and keeps using our data. His login name changed from univbonn to ubonngq.
Making good progress. CBIIT has changed the log format used by the sftp daemon so the year isn't discarded. I have parsed all the existing logs and converted all the unique entries into a single consolidated log. This log is used by the new service running on DEV at https://cdr-dev.cancer.gov/cgi-bin/cdr/last-pdq-data-partner-accesses.py. Next steps:
implement/install scheduled job to keep the new cumulative log current
modify the content partner report to use the new service for the added column
modify the existing PDQ access log report to handle the change in the log format
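As a rough sketch of the first step (keeping the cumulative log current), newly fetched lines can be merged into the consolidated log while dropping exact duplicates. The function name and the sample lines are hypothetical, and the sort order is only chronological under the assumption that every line begins with a sortable timestamp that includes the year.

```python
def merge_log_lines(cumulative, fetched):
    """Return the sorted union of two collections of log lines.

    Lines are compared as whole strings, so an entry counts as a
    duplicate only when it is identical character for character.
    """
    return sorted(set(cumulative) | set(fetched))


# Made-up sample lines in an assumed 'YYYY-MM-DD HH:MM:SS login' layout.
cumulative = ["2018-05-01 08:15:00 alpha", "2018-05-20 09:30:00 beta"]
fetched = ["2018-05-20 09:30:00 beta", "2018-06-03 12:00:00 alpha"]
merged = merge_log_lines(cumulative, fetched)
```

Because the merge is idempotent, the scheduled job could safely re-read log files it has already processed.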
licensees-20180607T105620.html shows the report as run on DEV. Running on DEV uses Licensee documents from that tier but the data in the last column is pulled from the logs for activity on the production SFTP server. I used the access dates, based on the title of the issue, rather than access date/time, which seems to be implied by the issue description ("... the last time they accessed the data"). Let me know if you want the time included.
Completed and installed on DEV.
https://github.com/NCIOCPL/cdr-scheduler/commit/8fa6506
https://github.com/NCIOCPL/cdr-admin/commit/315e24b9
Note to self: after the current month log set no longer contains any entries without the year date, I will want to simplify the parsing in Scheduler/tasks/sftp_logs.py to remove the guesswork about which year a line belongs to.
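To illustrate the kind of guesswork involved, here is one possible heuristic, not necessarily the one used in sftp_logs.py: for a timestamp with no year, pick the most recent year that does not place the entry after a reference time (such as the time the log file was fetched). The `Mon DD HH:MM:SS` layout and the function name are assumptions made for this sketch.

```python
from datetime import datetime


def infer_year(stamp, reference):
    """Guess the full date for a stamp like 'Jun  7 11:45:12'.

    Picks the latest year for which the result is not after
    `reference`. The stamp layout is an assumption for this sketch.
    """
    # Parse against a leap year so that 'Feb 29' is accepted.
    parsed = datetime.strptime(f"2000 {stamp}", "%Y %b %d %H:%M:%S")
    for year in range(reference.year, reference.year - 8, -1):
        try:
            candidate = parsed.replace(year=year)
        except ValueError:
            continue  # Feb 29 does not exist in this year
        if candidate <= reference:
            return candidate
    raise ValueError(f"cannot place {stamp!r} before {reference}")


reference = datetime(2018, 6, 7, 12, 0, 0)
same_year = infer_year("Jun  7 11:45:12", reference)
previous_year = infer_year("Dec 31 23:59:59", reference)
```

A heuristic like this breaks down for entries more than a year old, which is why logging the year directly makes the guesswork unnecessary.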
Is there any way for me to test this on QA? I think you have to run it for me.
I've submitted the report to run on QA and forwarded the result.
QA verified.
File Name | Posted | User |
---|---|---|
licensees-20180607T105620.html | 2018-06-07 11:02:15 | Kline, Bob (NIH/NCI) [C] |