EBMS Tickets

Issue Number 94
Summary [Calendar] Downloading Event Documents All at Once
Created 2013-11-05 11:34:17
Issue Type Improvement
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To alan
Status Closed
Resolved 2014-07-10 11:23:27
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.114583
Description

Some Board members have expressed an interest in downloading all of the event documents for a meeting in one fell swoop (e.g., in a zip file) rather than having to click on each item and download it individually. We may adjust the priority of this once we receive the results of the Board member review activity.

Comment entered 2014-06-26 16:05:55 by alan
It looks like there are three possible places where a document could be
identified as part of a meeting:

    Agenda
    Notes
    Event documents

Should all of those be examined to look for files to download, or just
the Agenda and Event documents?  Maybe it's safest to look in all three
places since different board managers might work differently.  Or maybe
it's safest to not include Notes on the theory that Board Managers might
want to put links to documents in there that aren't really intended for
the meeting, for example:

    "If you recall our discussion of the article {link: Prostate Cancer
    in China} from last meeting, ..."

If I examine all of them, I'll check for duplication in the filenames in
order to handle the case of one document appearing in more than one
category.

Comment entered 2014-06-26 16:15:27 by alan

For the user interface I suggest that we have a link on the page with a
name like "Download all meeting documents in one zip file", or whatever
better link name we can come up with. It might appear just above the
"ADD TO PERSONAL CALENDAR" link.

I propose that we consider whether to display the link based on two
criteria:

1. Is there an agenda and is it marked as "Published"?

If there's no agenda yet, or if it's not yet marked Published, is
there a risk that board members could download docs in a zip file
and think they've got everything when, in fact, there's more to
come?

2. Are there any documents to download?

We don't want to put a link to download docs that, if clicked, can
only lead to a message like "No documents available yet."
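
The two criteria above amount to a single yes/no check. Below is a minimal sketch in Python (the EBMS itself is written in PHP). The `Meeting` type and its field names are hypothetical, invented for illustration, and are not taken from the EBMS code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Meeting:
    # Hypothetical model; these field names are assumptions for illustration.
    agenda: Optional[str] = None      # None means no agenda exists yet
    agenda_published: bool = False    # True once the agenda is marked "Published"
    document_paths: List[str] = field(default_factory=list)


def show_download_link(meeting: Meeting) -> bool:
    """Return True only when both proposed criteria are met."""
    has_published_agenda = meeting.agenda is not None and meeting.agenda_published
    has_documents = len(meeting.document_paths) > 0
    return has_published_agenda and has_documents
```

With this check, a meeting with a draft agenda, or with a published agenda but no documents, would not show the link.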

Comment entered 2014-06-26 16:19:16 by alan
Robin,

While you're pondering answers to the questions in the above two
comments, I'll work on the following assumptions:

    Agenda and Event documents will be examined, not Notes, to determine
    what documents are to be downloaded.  I'll add Notes later if you
    say we need them.

    The agenda must exist and be marked 'Published' before putting up
    the link.

I don't think it will be hard for me to change things if you decide on
different answers to the questions.

Comment entered 2014-06-27 00:02:20 by alan

I've worked up a design for this based on a discussion with Bob - who knows much more about our calendar than I do. I then did a number of experiments to try out things that need to work.

One unfortunate outcome of the experiments is that our copy of PHP is not compiled with support for zipping files. I found an open source package that I tested and it does work, but we'll probably prefer to use the official PHP package if CBIIT agrees to recompile it for us and add the supporting zlib shared library.

This will introduce some delay into the implementation while we sort this out.

Comment entered 2014-06-27 17:02:41 by Juthe, Robin (NIH/NCI) [E]

Hi Alan,

Your assumptions are all correct. Clearly, you don't need us :-)

Thanks!

Comment entered 2014-07-06 17:26:05 by alan
In theory, it is possible that a board member might attempt to download
a file that can't be found.  I don't know if the problem ever will
happen, but if it does I am creating a file in place of the missing file
to alert the board member to the problem.

Assume for example that the name of the unreachable file is:

   "13_Hiraki J Genet Couns 2014 24599651_0.pdf"

I'll replace it in the zip file with a file named:

   "13_Hiraki J Genet Couns 2014 24599651_0.pdf_ERROR.txt"

That error file will contain the following contents:

---------------------------------------------------------------
We're sorry, an error has occurred while attempting to add the file named:

   13_Hiraki J Genet Couns 2014 24599651_0.pdf

to the zip archive file.

Please contact the PDQ editorial board Board Manager for help getting
a copy of this document.
---------------------------------------------------------------

If they open the file in the archive, that's what they'll see.

If you wish to have different wording, or if you prefer to just abort
the whole download (which seems less desirable to me), please let me
know.
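
The substitution described above could look roughly like the following. This is only a sketch, written in Python with the standard `zipfile` module rather than the PHP used by the EBMS. The function name `add_file_or_placeholder` and the constant `ERROR_TEMPLATE` are hypothetical, and the message text is the draft wording quoted above.

```python
import os
import zipfile

# Draft wording from this comment; {name} is replaced by the missing file's name.
ERROR_TEMPLATE = (
    "We're sorry, an error has occurred while attempting to add the file named:\n"
    "\n"
    "   {name}\n"
    "\n"
    "to the zip archive file.\n"
    "\n"
    "Please contact the PDQ editorial board Board Manager for help getting\n"
    "a copy of this document.\n"
)


def add_file_or_placeholder(archive: zipfile.ZipFile, path: str) -> bool:
    """Add `path` to the archive, or a <name>_ERROR.txt note if it is unreadable.

    Returns True if the real file was added, False if a placeholder was written.
    """
    name = os.path.basename(path)
    try:
        with open(path, "rb") as handle:
            data = handle.read()
    except OSError:
        # The file is missing or unreadable: store the explanatory note instead.
        archive.writestr(name + "_ERROR.txt", ERROR_TEMPLATE.format(name=name))
        return False
    archive.writestr(name, data)
    return True
```

Writing a placeholder instead of raising an error keeps the rest of the download intact, which is the behavior preferred above over aborting the whole archive.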

Comment entered 2014-07-07 13:54:44 by Juthe, Robin (NIH/NCI) [E]

This seems like a good solution to me. I made a few edits to your suggested wording below.

------
We're sorry, but an error has occurred while attempting to add the file named:

13_Hiraki J Genet Couns 2014 24599651_0.pdf

to the zip archive file.

Please contact the PDQ Board Manager for help getting a copy of this document.

Comment entered 2014-07-08 19:29:06 by alan
I came across a problem while testing the document download for the
calendar record for the Genetics Board Meeting of April 28, 2014.

It turns out that every document in the meeting documents section is
duplicated.  There are two full text PDF copies of each one of them
stored on our server.  One copy is linked from the AGENDA section of the
calendar event page and the other is linked from the EVENT DOCUMENTS
section.  The ones in the AGENDA section were created first, then the
copies were created for the EVENT DOCUMENTS section 15 - 20 minutes
later.

It seems clear that the intent was to create two links to the same
document, one from the agenda and one from the meeting documents.  But
perhaps the only way that the Board Manager or admin assistant could
find to do that was to store the document twice.

I think that we want to change our procedures to only store each
document once and to have both types of links in the above case point to
the same physical document.  One way looks like it might be to create
the document list first, then put links in the agenda, but I'm not sure
what other ways exist.

The next question is what to do about it.  My software currently treats
every reference to a unique physical file as a unique link.  If we have
two separate physical copies of a PDF, the software will pack both of
them into the zip file and download both to a user's workstation.
That's obviously not desirable.

Some alternatives might be:

 1. Ignore the problem.

    If the number of events stored in this way is not large, and if we
    change our procedures to not do them this way again, we could just
    ignore the problem.  Users downloading packages of the old events
    will get duplicates but over time, and maybe quite soon, they'll
    only be looking at the new data.

 2. Try to de-duplicate.

    I already de-duplicate physical files.  If the same physical file is
    referenced in two places, I only put one copy of the file in the zip
    archive.  However these references aren't to the same physical file
    and are actually stored with separate names, e.g., 
    
        "Document XYZ.pdf"
        "Document XYZ_0.pdf"

    This is analogous to the way Windows handles duplicate names in
    downloads, namely:

        "Document XYZ.pdf"
        "Document XYZ(1).pdf"

    I could rely on the original filename (which is preserved for both
    the regular and "_0", "_1", "_2", etc. versions), perhaps in
    combination with the file size.  For most cases that should be
    highly accurate in finding duplicates, though mistakes are
    theoretically possible.  For example, imagine two agenda documents
    that are exactly the same except for these two strings:

        The meeting will begin at 2 pm.

        The meeting will begin at 3 pm.

    The only way to detect the difference would be to compare the
    contents of the two files, for example character by character.

 3. Try to fix the data going backward.

    I'm not sure how hard that would be.  Locating and deleting
    duplicates might not be too hard but it might be quite hard to track
    down all of the links to the files that were dropped and
    programmatically fix them to link to the surviving member of the
    duplicate pair.
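
For alternative 2, the name-plus-size test can be combined with a content comparison so that near-identical files, like the "2 pm" and "3 pm" example, are not merged. The following is a minimal sketch in Python, not the routine used in the EBMS (which is PHP). It assumes each reference is available as an (original filename, stored path) pair, and the function name `unique_documents` is hypothetical.

```python
import filecmp
import os
from typing import Dict, List, Tuple


def unique_documents(refs: List[Tuple[str, str]]) -> List[str]:
    """Return stored paths with duplicates removed, keeping first occurrences.

    `refs` holds (original_filename, stored_path) pairs. Two files count as
    duplicates only if the original name and size match AND a byte-by-byte
    comparison confirms that the contents are identical.
    """
    kept: Dict[Tuple[str, int], List[str]] = {}
    result: List[str] = []
    for original_name, path in refs:
        key = (original_name, os.path.getsize(path))
        candidates = kept.setdefault(key, [])
        # shallow=False makes filecmp compare file contents, not just metadata.
        if any(filecmp.cmp(path, other, shallow=False) for other in candidates):
            continue
        candidates.append(path)
        result.append(path)
    return result
```

The name and size check narrows the candidates cheaply, so the full content comparison only runs on files that already look like copies of one another.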

Unless and until I hear otherwise I plan to do the following:

 a. Make everything work using alternative 1 - ignore the problem.

    My current code ignores the problem.  I can proceed to test
    everything else and make sure it works.

 b. When everything is working I will go back and implement 
    alternative 2 - try to de-duplicate.

Please let me know if you agree with this plan and, in the meantime,
let's implement a change in procedures in order to avoid creating more
duplicate records in this way.

Comment entered 2014-07-09 01:07:07 by alan
I worked on a de-duplication routine but decided not to finish it at
this time, working instead on getting everything else into version
control, installing it onto the Dev server (I did development on my own
workstation), and testing.

I think it's working on Dev and can be examined there.

It works like this:

    If a meeting has an agenda and the agenda is not marked as a draft,
    and there are some docs to download, I present a DOWNLOAD ALL
    DOCUMENTS link after the list of meeting documents and before the
    ADD TO PERSONAL CALENDAR link.

    If the user clicks the link, the software creates a zip archive with
    a name composed of the title of the meeting, the date, and a
    suffix.  Spaces are changed to underscores, which avoids certain
    problems with spaces in file and directory names.  An example zip
    archive is:

        Colorectal_WG_2014-04-28_Docs.zip

    If the user saves the archive and extracts all of the files from it,
    the extraction will create a directory named:

        Colorectal_WG_2014-04-28_Docs

    All of the files should be there.
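
The naming rule above can be sketched as follows, in Python rather than the PHP used by the EBMS. The function names are hypothetical. Prefixing each archive member with the base name is an assumption about one way to get the extracted directory described above; the comment does not say how the EBMS code achieves it.

```python
import os


def archive_base_name(title: str, date: str, suffix: str = "Docs") -> str:
    """Join title, date and suffix with underscores, replacing spaces with "_"."""
    return "_".join([title, date, suffix]).replace(" ", "_")


def member_name(base: str, path: str) -> str:
    """Place a file under a top-level directory named after the archive.

    Assumption: storing members as "<base>/<filename>" is what makes
    extraction produce a directory with the archive's base name.
    """
    return f"{base}/{os.path.basename(path)}"


base = archive_base_name("Colorectal WG", "2014-04-28")
zip_name = base + ".zip"
```

For the meeting in the example, `base` is `Colorectal_WG_2014-04-28_Docs` and `zip_name` is `Colorectal_WG_2014-04-28_Docs.zip`.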

Comment entered 2014-07-09 16:15:29 by Juthe, Robin (NIH/NCI) [E]

We discussed the duplication of event documents today and we are fine with posting unique documents in a single place. We still plan to use both methods - posting within a meeting agenda and/or posting under event documents - but we will change our process to avoid posting the same documents in both places, recognizing that they would not be de-duped appropriately.

Could you please change the wording from "download all documents" to "download all documents to a .zip file"? We think that will be clearer to our Board members.

Comment entered 2014-07-10 11:21:49 by alan

The wording is now changed as requested.

Comment entered 2014-07-10 11:23:27 by alan

We decided at our meeting not to try to de-duplicate documents posted twice. Board managers plan to avoid this problem for the future.

I'm marking the issue as resolved fixed, ready for testing on dev.

Comment entered 2014-07-23 13:34:06 by Juthe, Robin (NIH/NCI) [E]

This looks good on DEV, but I don't think the zip file script has been loaded on QA yet. I just received the following error message on QA:

Fatal error: Class 'ZipArchive' not found in /local/content/web/appdev/sites/ebms.nci.nih.gov/modules/custom/ebms/calendar.inc on line 503

Comment entered 2014-07-23 14:12:27 by alan

Yes, the ZipArchive software has to be loaded by CBIIT onto QA, Stage, and Prod. I had hoped it would be done by last week but they have requested delays. The last I heard it was supposed to be done today.

I'll prod them again and try to get them to do Stage and schedule Prod for a good time.

The system will be down, I think for just a few minutes, while they install the software.

I'll let you know what I find out from them.

Comment entered 2014-07-23 15:05:16 by alan

I just spoke to Kiran Bojja, the CBIIT person who will do the install.

He plans to do it sometime between 6 and 9 pm tonight. If there are any of us who might be testing this evening they should be aware that the system may be inaccessible during part of that time.

Kiran is going to take a snapshot of the system when he starts so that, if something is royally screwed up, he'll restore the snapshot before giving up for the evening. However I don't expect that to happen. He's done this once already on Dev and made extensive notes on how to do it.

I'll post again when everything is okay.

Comment entered 2014-07-24 00:30:23 by alan

I believe that CBIIT has successfully installed the PHP extension required on QA for the zip function to work. Unfortunately, they made a different mistake while doing it (forgot to restore the drush utility), but I've informed them about it.

I'll check again tomorrow and try to get this fully resolved and get the ball rolling to make the installs on Stage and Prod.

The guy who did the work is working from home and was facing the horrible connectivity issues of the NIH terminal server tonight (seems to happen whenever there are thunderstorms). That may have hampered his work.

Comment entered 2014-07-24 10:07:31 by alan

My testing on QA for this is complete. The CBIIT guy had a typo in his documentation that caused the error, but he's fixed it. I've asked him to do Stage next.

Comment entered 2014-07-24 10:47:23 by Juthe, Robin (NIH/NCI) [E]

Verified on QA. Thanks!

Comment entered 2014-08-29 15:57:06 by Juthe, Robin (NIH/NCI) [E]

Verified on prod.
