EBMS Searching

                           Alan Meyer
                             Draft 1
                        November 13, 2012


Here is how the search functionality is implemented in the EBMS
code as of this date:

EDITORIAL BOARD:

     Selecting one limits the search by editorial board and
     populates the Summary Topic pick list.

SUMMARY TOPIC:

     One or more topics can be selected if and only if a board
     has been selected.  If a board is selected but no topic is
     selected, all topics for that board qualify.

     If one or more topics are selected, the board selection is
     ignored.  That way, if a topic was once assigned to board A
     but is now part of board B, a search on that topic will find
     its articles whether the topic belongs to board A or B.  The
     search will also be faster.

     By default, if more than one topic is selected, an article
     in any of them qualifies.  This is the meaning of the OR
     radio button.  Selecting the AND radio button instead
     requires an article to be in all of the selected topics.
     This is the ONLY place on the search form where OR can be
     used.  All other search criteria are AND'ed together,
     meaning that an article must match ALL of the criteria in
     order to be retrieved.
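
     As an illustration only (this is not the EBMS code), the
     OR/AND choice for topics can be sketched in Python.  The
     function and argument names are made up for the example,
     and the sketch covers only the topic criterion, not the
     board:

```python
def topics_match(article_topics, selected_topics, mode="OR"):
    """Return True if an article qualifies under the selected topics.

    mode "OR":  the article is in at least one selected topic (default).
    mode "AND": the article is in every selected topic.
    """
    article = set(article_topics)
    selected = set(selected_topics)
    if not selected:
        # No topic selected: this criterion restricts nothing.
        return True
    if mode == "AND":
        return selected <= article
    return bool(selected & article)
```

     With topics "a" and "c" selected, an article in topics "a"
     and "b" qualifies under OR but not under AND.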

PMID:

     The Pubmed article ID.

CMS ID:

     The EBMS article ID.  This is NOT the article ID in the old
     CMS.  We might be better off with the label "EBMS ID", which
     is what is really being searched.

AUTHOR:

     One or more authors may be entered with last name and
     optional initials, separated by semicolons and optional
     spaces.  Examples:

          "Smith"
          "Smith AH"
          "Smith AH; Jones B"
          etc.

     Full first names must not be used.  For example, use "Jones
     B" rather than "Jones Barrett" or "Barrett Jones".  This is
     because all Pubmed records have initials but only some of
     them have full first names.  Initials should always work.

     The order of entry of the authors is insignificant.  For
     example:

          "Smith; Jones"
        and
          "Jones; Smith"

     should retrieve the same articles.

     SQL % "wildcards" are supported.  For example:

          "Johns%n"

        retrieves:

          "Johnson"
          "Johnston"
          "Johnsen"
          "Johnsarden"

     etc.  But be prepared for longer search times.
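
     A minimal sketch of how an AUTHOR entry could be split up,
     assuming nothing about the real EBMS code.  The function
     name is hypothetical.  Returning a set reflects the rule
     that the order of entry is insignificant:

```python
def parse_authors(field):
    """Split an AUTHOR field into a set of normalized author strings."""
    authors = set()
    for part in field.split(";"):
        # Trim the entry and collapse runs of whitespace inside it.
        name = " ".join(part.split())
        if name:
            # Case is not significant in searching, so fold it here.
            authors.add(name.lower())
    return frozenset(authors)
```

     Under this sketch "Smith; Jones" and "Jones;Smith" produce
     the same set, so they would describe the same search.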

TITLE:

     Searches article titles.

     Wildcards are supported.  See the notes on string searching
     below.

ADVANCED SEARCH:

  FYI CITATION:

     Selects articles with an active FYI state.

  NCI REVIEWER DECISION:

     Selects articles that have passed board manager review.
     They may have just passed it or long passed it and gone on
     to other states - including states that rejected the
     article.

  FULL TEXT RETRIEVED:

     Selects articles for which the EBMS has the full text (i.e.,
     a PDF version) of the article.

     Note: No legacy article from the old CMS will be found
     unless a PDF version of the article is stored in the EBMS -
     which will not happen automatically during conversion.  So
     this works differently from the old CiteMS.

  COMMITTEE DECISION:

     Selects articles that have passed full text review, i.e., a
     review of the full text of the article.  No actual committee
     is involved but, for historical reasons, the users have
     opted to call this "committee decision".  The article may be
     in any state after that one.

  PUBLISHED TO CITEMS:

     Selects articles that have the "Published" state in their
     status history.  The article may be in any state after that
     one.

  CORE JOURNALS:

     Currently unimplemented and ignored.  The intent is to
     restrict the search to just those articles that were
     published by a journal that is in a set of highly respected
     journals.

  INCLUDE NOT LISTS:

     By default, we exclude any journal article that was rejected
     because it was published in a journal on a NOT list, i.e., a
     journal that we have declared is not to be used.

     Checking the box overrides the default and includes articles
     from these journals.

     No extra articles from the legacy database will be found by
     checking this box because the old CMS rejected those journal
     articles before loading them, so they are not in the
     database at all.

  JOURNAL:

     Searches journal titles, retrieving articles from journals
     with titles that match the search string.

     Wildcards are supported.  See notes on string searching.

  PUBLICATION YEAR/MONTH:

     Searches for articles in journal issues with the specified
     year and, optionally, month.

     We depend on Pubmed for the year and month, which in turn
     depends on the publishers.  If the journal title page says
     that it was published in May, then it can only be found
     under May, even if it actually came out in March or July.

     Some journals cannot be searched by month because no month
     is given in the Pubmed record.  They may, for example, use
     Spring, Summer, Fall and Winter instead, or something else.
     Those journals can still be searched by year.

     If a year is entered without a month, articles published at
     any time during that year will be selected.

     If a month is entered without a year, it is silently (i.e.,
     with no error message) ignored.  We do not search for months
     without years.
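
     The year/month rules above can be written as a small
     Python predicate.  This is a sketch under assumptions: the
     function name and arguments are invented, and a Pubmed
     record with no usable month (e.g., a "Spring" issue) is
     represented as issue_month=None:

```python
def matches_publication_date(issue_year, issue_month, year=None, month=None):
    """Return True if a journal issue satisfies the year/month criteria."""
    if year is None:
        # A month without a year is silently ignored, so with no
        # year this criterion restricts nothing.
        return True
    if issue_year != year:
        return False
    if month is None:
        # Year only: any time during that year qualifies.
        return True
    # Issues with no month in the record can never match a month.
    return issue_month == month
```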

  REVIEW CYCLE:

     Selects articles imported during the specified review cycle.
     The software for this search uses the new data structures
     created in the new import software.  It cannot find review
     cycles for the legacy data.  Unless we change that, a user
     should use the input date or publication year/month searches
     for legacy data.

  EDITORIAL BOARD REVIEWER:

     Selects articles that have ever been in a packet for the
     specified editorial board member.  If the member was dropped
     from that packet, the article still qualifies only if the
     member has already submitted a response.

     This search uses the new packet oriented data structures in
     the EBMS and does not work with legacy data.
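
     Read as a boolean condition, the reviewer rule looks
     roughly like this.  The three flags are hypothetical
     stand-ins for whatever the packet tables actually record:

```python
def reviewer_qualifies(in_packet, dropped, responded):
    """Return True if an article counts for the selected board member.

    in_packet: the article was ever in a packet for the member.
    dropped:   the member was later dropped from that packet.
    responded: the member submitted a response for the article.
    """
    if not in_packet:
        return False
    # A dropped member still counts if a response was submitted.
    return (not dropped) or responded
```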

  EDITORIAL BOARD RESPONSE:

     Selects articles for which at least one reviewer has
     submitted the selected response.  This search uses the new
     packet oriented data structures in the EBMS and does not
     work with legacy data.

  EDITORIAL BOARD DECISION:

     Selects articles for which the selected board decision has
     been made after board member review.  This search uses the
     new packet oriented data structures in the EBMS and does not
     work with legacy data.

  COMMENTS:

     Selects articles that have an article state comment (not a
     tag comment) matching the specified string.

     Wildcards are supported.  Given the size of comments, it is
     probably important to use them.  See the notes on string
     searching at the end of this discussion.

  DATE COMMENT ADDED / RANGE:

     Selects articles that have a comment with the specified date
     or date range.

     If both a comment string and a date range are specified, an
     article with a comment matching the string will only be
     selected if the comment was created within the range.

     If no comment string is specified, then the dates will
     find articles with any comment in the specified date range.

     The rules for dates were described in a Bugzilla posting, a
     relevant excerpt from which is reproduced here:

     Date values
     -----------
 
     A user must enter dates as year + optional month + optional
     day.
 
     If no value for year is entered, month and day will be
     ignored.  If no value for month is entered, day will be
     ignored.
 
     If there is a year but no month or day, or a year + month but
     no day, an implicit range search will be performed.  For
     example:
 
         Look for comments entered any time in 2012:
         [2012] [   ] [  ]
 
         Look for comments entered during November of 2012:
         [2012] [Nov] [  ]
 
         Look for comments entered November 8, 2012:
         [2012] [Nov] [ 8]
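
     The implicit range rule can be sketched in Python as
     follows.  This is an illustration, not the EBMS code; the
     function name is invented, and the end of the returned
     range is treated as exclusive:

```python
from datetime import date, timedelta

def implicit_range(year=None, month=None, day=None):
    """Return (start, end) with end exclusive, or None if ignored."""
    if year is None:
        # No year: month and day are ignored, so nothing to search.
        return None
    if month is None:
        # Year only (any day value is ignored): the whole year.
        return date(year, 1, 1), date(year + 1, 1, 1)
    if day is None:
        # Year + month: the whole month.
        if month == 12:
            return date(year, 12, 1), date(year + 1, 1, 1)
        return date(year, month, 1), date(year, month + 1, 1)
    # Full date: that single day.
    start = date(year, month, day)
    return start, start + timedelta(days=1)
```

     For example, implicit_range(2012, 11) covers Nov 1, 2012
     up to, but not including, Dec 1, 2012.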
 
     Explicit date ranges
     --------------------
 
     If the RANGE checkbox has been checked, the program will look
     at values in the three date fields for the end of the range.
 
     If there are no values in the fields, no range search will be
     performed.
 
     If there are values, the range search will look at values up
     to, but not including, the specified end date.  For example:
 
         Find comments entered any time from Jan 1, 2012 through
         Dec 31, 2013.
         [2012] [   ] [  ]
         [2014] [   ] [  ]
 
         Find comments entered any time from Feb 1, 2012 through
         Jun 30, 2012
         [2012] [Feb] [  ]
         [2012] [Jul] [  ]
 
         Find comments entered any time from Feb 10, 2012 through
         Feb 11, 2012
         [2012] [Feb] [10]
         [2012] [Feb] [12]
 
     Again, a user must enter dates in order.  A date like the
     following will be ignored:
 
         [    ] [Feb] [12]
 
     Starting date MUST be entered.  For example:
 
         Find comments entered before Feb 12, 2014.  
         The "1900" start year in the example is the earliest one
         in the drop down list but any date before the first
         comment was created will do.
         [1900] [   ] [  ]
         [2014] [Feb] [12]
 
         But this will be ignored:
         [    ] [   ] [  ]
         [2014] [Feb] [12]
 
     These conventions are arbitrary.  I don't claim they're better
     than other conventions.  But they look as reasonable and easy to
     understand as any others.
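
     A sketch of the explicit range rules in Python, with
     invented names.  Each set of three date fields is passed
     as a (year, month, day) tuple with None for an empty
     field.  The sketch returns None wherever the rules say the
     range is ignored or not performed; what the caller does
     next is not modeled here:

```python
from datetime import date

def first_day(year=None, month=None, day=None):
    """Earliest date covered by a partial date, or None if no year."""
    if year is None:
        return None
    if month is None:
        # Without a month, any day value is ignored.
        return date(year, 1, 1)
    return date(year, month, day if day is not None else 1)

def explicit_range(start_fields, end_fields):
    """Return (start, end) with end exclusive, or None if ignored."""
    start = first_day(*start_fields)
    end = first_day(*end_fields)
    if start is None or end is None:
        # A usable start and end date are both required.
        return None
    return start, end
```

     For example, a start of (2012, 2, None) and an end of
     (2012, 7, None) gives Feb 1, 2012 up to, but not
     including, Jul 1, 2012, i.e., through Jun 30, 2012.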

     The original Bugzilla posting said that a comment must be
     entered in order to search a range.  However, users wanted
     to be able to search for any comment entered during the
     range regardless of what it is.  So that has been
     implemented.  If no comment text is entered we'll search for
     any comment at all within the date range.

     All comments are searched, active or not [Is that right?  I
     thought it might be since we are often looking for ANY
     comment.]

  TAG:

     Selects articles with the specified tag.  Only active tags
     are considered.

     Tags are mostly relevant for articles that are new to the
     EBMS.  However, the legacy data conversion sometimes added
     "Legacy conversion note" tags to older articles.

  DATE TAG ADDED:

     Functions exactly as for comment dates.

ADMINISTRATOR SEARCH:

  SUMMARY TOPICS ADDED:

     Selects articles for which a summary topic was added after
     the initial import via a second import for another topic
     that brought in the same article.

     This uses the data structures created for the new EBMS
     import functions and does not work with legacy data.

     Note: This may or may not be what was intended for this kind
     of search specification.  It does not find articles with
     summary topics added during review - e.g., by initial or
     board manager.  I tried out some SQL that would find all
     articles that had more than one summary topic.  That could
     be made to work but would be much slower.

  INCLUDE REJECTED ARTICLES:

     Currently unimplemented and ignored.  We need to discuss
     what this should do.

  INPUT DATE:

     Selects articles for which the original (first ever) import
     date was within the date range, with dates specified as for
     comment dates (see above).

  MODIFIED DATE:

     Selects articles that had any change on the specified date
     or within the specified date range.  Possible changes that
     are searched are:
     
          A state was added.
          A state comment was added.
          A tag was added.
          A tag comment was added.

     Note: I'm not checking import date but probably should be.

DISPLAY OPTIONS:

  SORT BY:

     Sorts the articles found by the specified criterion.  Note
     that "CMS ID#" means EBMS article ID number.

  FORMAT:

     What gets displayed for each article found.

  PER PAGE:

     How many articles to display on each results page.

     Due to the way that Drupal manages search result displays,
     the search is re-executed each time the next page is
     fetched.  Therefore, if the search is hard and slow, and if
     a user intends to look at a large number of results,
     searching can be significantly faster if a higher PER PAGE
     number is selected.
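
     To put rough numbers on that, here is a small Python
     sketch.  It assumes one search execution per results page
     viewed, which is how I read the Drupal behavior described
     above; the function name is invented:

```python
import math

def search_executions(total_results, per_page):
    """Number of times the search runs to page through all results."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # The search runs at least once, even if nothing is found.
    return max(1, math.ceil(total_results / per_page))
```

     Under that assumption, paging through 200 hits runs the
     search 20 times at 10 per page but only twice at 100 per
     page.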


NOTES ON STRING SEARCHING:

  "WILDCARDS":

     The database management system allows searching for strings
     using "wildcards" to represent substrings with any
     characters.  We have implemented searching using the '%'
     character to stand for a substring of any number of any
     characters.  For example:

          "ABC%XYZ" Finds any string that begins with ABC and
          ends with XYZ, with anything at all between them.

     This string will be found:

          "ABC123XYZ"

     This string will not be found because 'A' is not the first
     character:

          "123ABC XYZ"

     This string will not be found because 'Z' is not the last
     character:

          "ABC XYZ."

     If no wildcards are specified, the program will attempt to
     do an exact match.

     If a user does not want an exact match, it is necessary to
     use wildcards.  Be sure to put one at the start of the
     search string if you don't require the first characters of
     your search string to exactly match the start of the target
     string, and put one at the end if you don't require the
     last characters of your search string to exactly match the
     end of the target.

     The following search string will find any target string with
     ABC somewhere in the string and XYZ somewhere after that:

          "%ABC%XYZ%"

     The old system sometimes looked for exact matches and
     sometimes put implicit wildcards in searches.  I didn't do
     that in the EBMS in order to make it very clear exactly what
     would happen in a search.  Wildcards will be used if the
     user enters them.  They won't if the user doesn't.
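
     The matching rules above can be imitated in Python to try
     out search strings.  This is only a sketch of the '%'
     behavior described here, not the database's own matching
     code, and it deliberately handles no wildcard other than
     '%'.  The function name is invented:

```python
import re

def like_match(pattern, target):
    """Match target against a pattern where '%' means any substring."""
    # Escape the literal pieces so only '%' acts as a wildcard.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    regex = ".*".join(pieces)
    # fullmatch: without wildcards the whole string must match exactly.
    flags = re.IGNORECASE | re.DOTALL
    return re.fullmatch(regex, target, flags) is not None
```

     like_match("ABC%XYZ", "ABC123XYZ") is true, while the same
     pattern fails against "123ABC XYZ" and "ABC XYZ.".  The
     pattern "%ABC%XYZ%" matches both of those.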

  CASE:

     Case is not significant in searching.  Searches on upper,
     lower, or mixed case inputs will all retrieve the same hits.

  CHARACTER SET:

     There are some non-ASCII characters in our data.  These may
     be characters with diacritics, Greek letters, math symbols,
     etc.  For purposes of searching, we have translated all of
     these into US ASCII, and all search strings will be
     similarly translated using the same algorithm.

     It is therefore not necessary to try to input non-English
     characters but, if a user does, the search may still work -
     if we got our translate tables right.
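
     The general idea of such a translation can be sketched in
     Python.  The EXTRA table below is a made-up stand-in, not
     the real EBMS translate tables, and stripping diacritics
     via Unicode decomposition is my assumption about one way
     to do it, not a description of the actual algorithm:

```python
import unicodedata

# Hypothetical translate table for characters that have no ASCII
# base letter.  The real EBMS tables are not reproduced here.
EXTRA = {
    "\u03b1": "alpha",   # Greek small letter alpha
    "\u03b2": "beta",    # Greek small letter beta
    "\u00b1": "+/-",     # plus-minus sign
}

def to_ascii(text):
    """Translate text to US ASCII for searching."""
    out = []
    for ch in text:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        # Decompose accented letters, then drop the combining marks
        # (and anything else that has no ASCII equivalent).
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append(decomposed.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)
```

     Applying the same function to both the stored text and the
     search string is what lets an accented and an unaccented
     spelling of a name find the same records.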