EBMS Searching

                           Alan Meyer
                       Draft 2 (Bob Kline)
                        January 20, 2013


Here is how the search functionality is implemented in the EBMS
code as of this date:

EDITORIAL BOARD:

     Selecting one limits the search by editorial board and
     populates the Summary Topic pick list.  More than one board
     can be selected.  Only articles assigned to all selected
     boards will be found by the search.

SUMMARY TOPIC:

     One or more topics can be selected if and only if a board
     has been selected.  If a board is selected but no topic is
     selected, all topics for that board qualify.

     If one or more topics are selected, the board(s) are
     ignored.  That way, if a topic was once assigned to board A
     but is now part of board B, a search on that topic will find
     it whether it belongs to board A or B.  The search will also
     be faster.

     By default, if more than one topic is selected, an article
     in any of them qualifies.  This is the meaning of the OR box
     (actually a radio button).  That can be changed to AND by
     selecting the other button.  That is the ONLY place on the
     search form where OR can be used.  All other search criteria
     are AND'ed together, meaning that an article must match ALL
     of the criteria in order to be retrieved.
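
     The OR/AND choice for topics can be pictured with sets.  The
     Python sketch below is illustrative only.  The function name
     and the data layout are assumptions, not the EBMS code, and
     board handling is not modelled.

```python
def matches_topics(article_topics, selected_topics, mode="OR"):
    """Return True if an article qualifies under the topic selection.

    article_topics  -- set of topic names assigned to the article
    selected_topics -- set of topic names picked on the search form
    mode            -- "OR" (the default: any topic) or "AND" (all)
    """
    if not selected_topics:
        return True  # no topic criterion to apply in this sketch
    if mode == "AND":
        # Every selected topic must be assigned to the article.
        return selected_topics <= article_topics
    # OR: one shared topic is enough.
    return bool(selected_topics & article_topics)
```

     With topics {"A", "B"} on an article and {"A", "C"} selected,
     the OR setting qualifies the article and the AND setting
     does not.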

PMID:

     The PubMed article ID.

EBMS ID:

     The primary identifier for articles in the EBMS.  This is not
     the same ID as used in the legacy CiteMS system.

AUTHOR:

     One or more authors may be entered with last name and
     optional initials, separated by semicolons and optional
     spaces.  Examples:

          "Smith"
          "Smith AH"
          "Smith AH; Jones B"
          etc.

     Full first names must not be used.  For example, use "Jones
     B" rather than "Jones Barrett" or "Barrett Jones".  This is
     because all PubMed records have initials but only some of
     them have full first names.  Initials should always work.

     The order of entry of the authors is insignificant.  For
     example:

          "Smith; Jones"
        and
          "Jones; Smith"

     should retrieve the same articles.

     SQL % "wildcards" are supported.  For example:

          "Johns%n"

        retrieves:

          "Johnson"
          "Johnston"
          "Johnsen"
          "Johnsarden"

     etc.  But be prepared for longer search times.
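
     The entry format can be sketched as follows.  This is a
     hypothetical helper written for illustration, not the EBMS
     parser.  It only shows how a semicolon-separated entry splits
     into individual author strings.

```python
def parse_authors(text):
    """Split an AUTHOR field value into individual author strings."""
    # Authors are separated by semicolons; spaces around each
    # author are optional, so strip them and skip empty pieces.
    return [part.strip() for part in text.split(";") if part.strip()]


# Order of entry does not matter, so compare as sets.
first = set(parse_authors("Smith; Jones"))
second = set(parse_authors("Jones;Smith"))
same_search = first == second
```

     Each resulting string (for example "Smith AH" or "Johns%n")
     would then be matched separately against the author names.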

TITLE:

     Searches article titles.

     Wildcards are supported.  See the notes on string searching
     below.

ADVANCED SEARCH:

  JOURNAL:

     Searches journal titles, retrieving articles from journals
     with titles that match the search string.  Note that this
     search is against the actual titles of the journals, not
     the journal title abbreviations.

     Wildcards are supported.  See notes on string searching.

  PUBLICATION YEAR/MONTH:

     Searches for articles in journal issues with the specified
     year and, optionally, month.

     We depend on PubMed for the year and month, which in turn
     depends on the publishers.  If the journal issue says that
     it was published in May, then its articles can only be found
     under May, even if the issue actually came out in March or
     July.

     Some journals cannot be searched by month because no month
     is given in the PubMed record.  They may, for example, use
     Spring, Summer, Fall and Winter instead, or something else.
     Those journals can still be searched by year.

     If a year is entered without a month, articles published at
     any time during that year will be selected.

     If a month is entered without a year, it is silently (i.e.,
     with no error message) ignored.  We do not search for months
     without years.

  REVIEW CYCLE:

     If one or more topics are specified in the search criteria,
     this field narrows the results to those articles having those
     topics for the review cycle selected.  Otherwise, if one or
     more boards are specified, the software narrows the results
     to those articles having any topic for each of those boards,
     with the topics assigned for the review cycle selected.
     Otherwise, articles are selected having any topic assigned
     for the review cycle selected.

  FYI CITATION:

     Selects articles with an active or inactive FYI state (note
     that this is the only state criterion which accepts inactive
     states as qualifying for the search results).

  NCI REVIEWER DECISION:

     Selects articles that have passed (if "YES" is checked) or
     failed ("NO") the board manager's review of the articles
     based on the abstract, before full text has been retrieved.
     They may have just passed it or long passed it and gone on
     to other states - including states that rejected the
     article.

  FULL TEXT RETRIEVED:

     Selects articles for which the EBMS has the full text (i.e.,
     a PDF version) of the article (if "YES" is checked) or for
     which the EBMS does not have the full text ("NO").

     Note: No legacy article from the old CiteMS will be found
     unless a PDF version of the article is stored in the EBMS -
     which will not happen automatically during conversion.  So
     this works differently from the old CiteMS.

     There is no provision for selecting the subset of "NO"
     results for which it has been determined that the full
     text for the articles cannot be obtained.

  COMMITTEE DECISION:

     Selects articles that have passed full text review, i.e., a
     review of the full text of the article (if "YES" is checked)
     or failed full text review ("NO").  No actual committee is
     involved but, for historical reasons, the users have opted
     to call this "committee decision".  The article may be in
     any state after that one.

  EDITORIAL BOARD REVIEWER:

     Selects articles that have ever been in a packet for the
     specified editorial board member.  An article is excluded
     if the member was dropped from that packet, unless the
     member has already submitted a response.
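
     The rule reduces to a small boolean test.  The sketch below
     uses hypothetical flag names and is only a restatement of
     the rule, not the EBMS query.

```python
def reviewer_qualifies(in_packet, dropped, responded):
    """True if an article counts for the selected board member.

    in_packet -- the article was in a packet for the member
    dropped   -- the member was later dropped from that packet
    responded -- the member submitted a response for the article
    """
    # A drop only disqualifies the article when no response exists.
    return in_packet and (not dropped or responded)
```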

  EDITORIAL BOARD RESPONSE:

     Selects articles for which at least one reviewer has
     submitted the selected response.

  EDITORIAL BOARD DECISION:

     Selects articles for which the selected board decision has
     been made after board member review.

  COMMENTS:

     Selects articles with the specified string as found in the
     article state comments (not the tag comments).

     Wildcards are supported.  Given the size of comments, it is
     probably important to use them.  See the notes on string
     searching at the end of this discussion.

  DATE COMMENT ADDED / RANGE:

     Selects articles that have a comment with the specified date
     or date range.

     If both a comment string and a date range are specified, an
     article with a comment matching the string will only be
     selected if the comment was created within the range.

     If no comment string is specified, then the dates will
     find articles with any comment in the specified date range.

     The rules for dates were described in a Bugzilla posting, a
     relevant excerpt from which is reproduced here:

     Date values
     -----------
 
     A user must enter dates as year + optional month + optional
     day.
 
     If no value for year is entered, month and day will be
     ignored.  If no value for month is entered, day will be
     ignored.
 
     If there is a year but no month or day, or a year + month but
     no day, an implicit range search will be performed.  For
     example:
 
         Look for comments entered any time in 2012:
         [2012] [   ] [  ]
 
         Look for comments entered during November of 2012:
         [2012] [Nov] [  ]
 
         Look for comments entered November 8, 2012:
         [2012] [Nov] [ 8]
 
     Explicit date ranges
     --------------------
 
     If the RANGE checkbox has been checked, the program will look
     at values in the three date fields for the end of the range.
 
     If there are no values in the fields, no range search will be
     performed.
 
     If there are values, the range search will look at values up
     to, but not including, the specified end date.  For example:
 
         Find comments entered any time from Jan 1, 2012 through
         Dec 31, 2013.
         [2012] [   ] [  ]
         [2014] [   ] [  ]
 
         Find comments entered any time from Feb 1, 2012 through
         Jun 30, 2012
         [2012] [Feb] [  ]
         [2012] [Jul] [  ]
 
         Find comments entered any time from Feb 10, 2012 through
         Feb 11, 2012
         [2012] [Feb] [10]
         [2012] [Feb] [12]
 
     Again, a user must fill in the date fields in order (year,
     then month, then day).  A date like the following will be
     ignored:
 
         [    ] [Feb] [12]
 
     Starting date MUST be entered.  For example:
 
         Find comments entered before Feb 12, 2014.  
         The "1900" start year in the example is the earliest one
         in the drop down list but any date before the first
         comment was created will do.
         [1900] [   ] [  ]
         [2014] [Feb] [12]
 
         But this will be ignored:
         [    ] [   ] [  ]
         [2014] [Feb] [12]
 
     These conventions are arbitrary.  I don't claim they're better
     than other conventions.  But they look as reasonable and easy to
     understand as any others.

     The original Bugzilla posting said that a comment must be
     entered in order to search a range.  However, users wanted
     to be able to search for any comment entered during the
     range regardless of what it is.  So that has been
     implemented.  If no comment text is entered we'll search for
     any comment at all within the date range.
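
     The date rules amount to turning the form fields into a
     half-open window (start included, end excluded).  The Python
     sketch below is one way to express that, under assumptions:
     the function names are hypothetical, months are passed as
     numbers rather than names, and an empty end date with RANGE
     checked is treated like an implicit range.

```python
from datetime import date, timedelta


def normalize(parts):
    """Drop ignored fields: no year means no date at all, and no
    month means the day is ignored."""
    year, month, day = parts
    if year is None:
        return None
    if month is None:
        day = None
    return year, month, day


def period_start(year, month, day):
    """First day covered by a (possibly partial) date."""
    return date(year, month or 1, day or 1)


def period_end(year, month, day):
    """First day AFTER the period covered by a partial date."""
    if month is None:
        return date(year + 1, 1, 1)
    if day is None:
        if month == 12:
            return date(year + 1, 1, 1)
        return date(year, month + 1, 1)
    return date(year, month, day) + timedelta(days=1)


def date_window(start, end=None):
    """Return a half-open (low, high) pair of dates, or None.

    start, end -- (year, month, day) tuples, None for empty fields.
    Pass end only when the RANGE box is checked.
    """
    start = normalize(start)
    if start is None:
        return None  # no starting year: the criterion is ignored
    end = normalize(end) if end is not None else None
    if end is None:
        # Implicit range covering the whole year, month or day.
        return period_start(*start), period_end(*start)
    # Explicit range: up to, but not including, the end date.
    return period_start(*start), period_start(*end)
```

     For example, date_window((2012, 11, None)) covers November
     1, 2012 up to (not including) December 1, 2012, and
     date_window((None, 2, 12)) returns None because there is
     no year.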

     All comments are searched, active or not [Is that right?  I
     thought it might be since we are often looking for ANY
     comment.]

  TAG:

     Selects articles with the specified tag.  Only active tags
     are considered.

     Tags are mostly relevant for articles that are new to the
     EBMS; however, the legacy data conversion sometimes added
     "Legacy conversion note" tags to older articles during
     conversion.

  DATE TAG ADDED:

     Functions exactly as for comment dates.

  CORE JOURNALS:

     Currently unimplemented and ignored.  The intent is to
     restrict the search to just those articles that were
     published in one of a set of highly respected journals.

ADMINISTRATOR SEARCH:

  INPUT DATE:

     Selects articles for which the original (first ever) import
     date was within the date range, with dates specified as for
     comment dates (see above).

  MODIFIED DATE:

     Selects articles that had any change.  Possible changes that
     are searched are:
     
          A state was added.
          A state comment was added.
          A tag was added.
          A tag comment was added.

     Note: I'm not checking import date but probably should be.

  INCLUDE UNPUBLISHED:

     By default, we include only articles which have an active
     Published state for the topics/boards specified (or for
     any topic if no topics or boards are selected on the search
     request form).  If this box is checked, that restriction is
     lifted.  Subject to the other criteria specified on the
     search request form, the following articles also become
     eligible for inclusion in the search results:

          Articles excluded by our list of journals not to be used.
          Articles rejected in the librarian's initial review.
          Articles approved by the librarian but not yet published.

     Note that "Published" in this context has nothing to do
     with the publication of the article in its journal.  It
     refers instead to the processing step used by the librarian
     to batch up the current set of approved articles for review
     by NCI, so that they are visible to the NCI reviewer all at
     once instead of each showing up in that reviewer's queue
     separately as it gets the librarian's approval.  Think of
     "published" in this context as shorthand for "published to
     the NCI reviewer's queue."

  INCLUDE NOT LISTED:

     This option is a subset of the "INCLUDE UNPUBLISHED" option
     described immediately above.  Checking the box adds articles
     which would otherwise not be included because, having been
     excluded by the board's list of journals whose articles are
     not considered useful for PDQ literature surveillance, they
     would not have made it to the stage of processing where they
     would have received the Published state.

     No extra articles from the legacy database will be found by
     checking this box because the old CiteMS rejected those
     journal articles before loading them, so they are not in the
     database at all.

     If the INCLUDE UNPUBLISHED box is also checked, this option
     has no additional effect.

  INCLUDE REJECTED:

     This option is another subset of the "INCLUDE UNPUBLISHED"
     option described above.  Checking this box adds articles
     which were rejected by the librarian.

     If the INCLUDE UNPUBLISHED box is also checked, this option
     has no additional effect.
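
     The way the three INCLUDE boxes combine can be sketched as
     a set of eligible article statuses.  The status labels and
     the function name below are hypothetical and chosen only for
     illustration; the EBMS stores these as article states.

```python
def eligible_statuses(include_unpublished=False,
                      include_not_listed=False,
                      include_rejected=False):
    """Return the set of (hypothetical) statuses a search accepts."""
    allowed = {"published"}  # the default restriction
    if include_unpublished:
        # Lifts the restriction entirely, so the other two boxes
        # have no additional effect.
        return allowed | {"not_listed", "rejected",
                          "approved_unpublished"}
    if include_not_listed:
        allowed.add("not_listed")   # excluded by the journal list
    if include_rejected:
        allowed.add("rejected")     # rejected by the librarian
    return allowed
```

     Note that articles the librarian has approved but not yet
     published are only reachable through INCLUDE UNPUBLISHED.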

  SUMMARY TOPICS ADDED:

     Selects articles for which a summary topic was added after
     the initial import via a second import for another topic
     that brought in the same article.

     This uses the data structures created for the new EBMS
     import functions and does not work with legacy data.

     Note: This may or may not be what was intended for this kind
     of search specification.  It does not find articles with
     summary topics added during review - e.g., by the initial
     reviewer or the board manager.  I tried out some SQL that
     would find all articles that had more than one summary
     topic.  That could be made to work but would be much slower.

     See also the Citation Summary Topic Changes report on the
     librarians' Citations Report page, which provides similar
     functionality.

DISPLAY OPTIONS:

  SORT BY:

     Sorts the articles found by the specified criterion.

  PER PAGE:

     How many articles to display on each results page.

     Due to the way that Drupal manages search result displays,
     the search is re-executed each time the next page is
     fetched.  Therefore, if the search is hard and slow, and if
     a user intends to look at a large number of results,
     searching can be significantly faster if a higher PER PAGE
     number is selected.
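
     The cost is simple arithmetic, sketched below with a
     hypothetical helper name.  It assumes one execution of the
     search per page displayed, as described above.

```python
import math


def search_executions(results_viewed, per_page):
    """How many times the search runs while paging through
    results_viewed articles at per_page articles per page."""
    return math.ceil(results_viewed / per_page)
```

     Looking through 200 results at 10 per page runs the search
     20 times.  At 100 per page it runs twice.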


NOTES ON STRING SEARCHING:

  "WILDCARDS":

     The database management system allows searching for strings
     using "wildcards" to represent substrings with any
     characters.  We have implemented searching using the '%'
     character to stand for a substring of any number of any
     characters.  For example:

          "ABC%XYZ" Finds any string that begins with ABC and
          ends with XYZ, with anything at all between them.

     This string will be found:

          "ABC123XYZ"

     This string will not be found because 'A' is not the first
     character:

          "123ABC XYZ"

     This string will not be found because 'Z' is not the last
     character:

          "ABC XYZ."

     If no wildcards are specified, the program will attempt to
     do an exact match.

     If a user does not want an exact match, it is necessary to
     use wildcards.  Be sure to put one at the start of the
     search string if you don't require the first characters of
     your search string to exactly match the start of the target
     string, and be sure to put one at the end if you don't
     require the last characters of your search string to exactly
     match the end of the target.

     The following search string will find any target string with
     ABC somewhere in the string and XYZ somewhere after that:

          "%ABC%XYZ%"

     The old system sometimes looked for exact matches and
     sometimes put implicit wildcards in searches.  I didn't do
     that in the EBMS in order to make it very clear exactly what
     would happen in a search.  Wildcards will be used if the
     user enters them.  They won't if the user doesn't.
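
     The matching rules above can be imitated outside the
     database for experimentation.  The Python sketch below
     translates a '%' search string into a regular expression.
     It is an approximation under assumptions: the function names
     are hypothetical, only '%' is treated as a wildcard, and
     matching ignores case as described under CASE below.

```python
import re


def like_to_regex(pattern):
    """Compile a '%'-wildcard search string into a regex."""
    # Escape each literal piece, then let every '%' stand for
    # any run of characters, including an empty one.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(pieces), re.IGNORECASE | re.DOTALL)


def like_match(pattern, target):
    """True if the whole target string matches the pattern."""
    # fullmatch anchors both ends, so without wildcards this is
    # an exact (case-insensitive) comparison.
    return like_to_regex(pattern).fullmatch(target) is not None
```

     With this sketch, like_match("ABC%XYZ", "ABC123XYZ") is
     True, while "123ABC XYZ" and "ABC XYZ." do not match that
     pattern.  Both of those strings do match "%ABC%XYZ%".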

  CASE:

     Case is non-significant in searching.  Searches on upper,
     lower, or mixed case inputs will all retrieve the same hits.

  CHARACTER SET:

     There are some non-ASCII characters in our data.  These may
     be characters with diacritics, Greek letters, math symbols,
     etc.  For purposes of searching, we have translated all of
     these into US ASCII, and all search strings will be
     similarly translated using the same algorithm.

     It is therefore not necessary to try to input non-English
     characters but, if a user does, the search may still work -
     if we got our translate tables right.
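
     The general idea of such a translation can be sketched in
     Python.  This is not the EBMS algorithm, and the real
     translate tables are not reproduced here.  The EXTRA table
     and the function names are hypothetical examples.

```python
import unicodedata

# Hypothetical mappings for characters that have no ASCII
# decomposition.  The real EBMS tables are not shown here.
EXTRA = {
    "\u03b1": "alpha",  # Greek small letter alpha
    "\u03b2": "beta",   # Greek small letter beta
    "\u00df": "ss",     # German sharp s
}


def to_ascii(text):
    """Fold text to US ASCII for comparison purposes."""
    out = []
    for ch in text:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        # NFKD splits an accented letter into a base letter plus
        # a combining mark; keeping only ASCII drops the mark.
        # Characters with no ASCII form are dropped in this sketch.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


def same_for_search(a, b):
    """Fold both strings the same way, then compare without case."""
    return to_ascii(a).lower() == to_ascii(b).lower()
```

     Because the stored data and the search string pass through
     the same function, a name typed with or without its accents
     compares equal.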