EBMS Searching
Alan Meyer
Draft 2 (Bob Kline)
January 20, 2013

Here is how the search functionality is implemented in the EBMS code as of this date:

EDITORIAL BOARD:
Selecting one limits the search by editorial board and populates the Summary Topic pick list. More than one board can be selected. Only articles assigned to all selected boards will be found by the search.

SUMMARY TOPIC:
One or more topics can be selected if and only if a board has been selected. If a board is selected but no topic is selected, all topics for that board qualify. If one or more topics are selected, the board(s) are ignored. That way, if a topic was once assigned to board A but is now part of board B, a search on that topic will find it whether it belongs to board A or B. The search will also be faster.

By default, if more than one topic is selected, an article in any of them qualifies. This is the meaning of the OR box (actually a radio button). That can be changed to AND by checking the other box. That is the ONLY place on the search form where OR can be used. All other search criteria are AND'ed together, meaning that an article must match ALL of the criteria in order to be retrieved.

PMID:
The Pubmed article ID.

EBMS ID:
The primary identifier for articles in the EBMS. This is not the same ID as used in the legacy CiteMS system.

AUTHOR:
One or more authors may be entered with last name and optional initials, separated by semicolons and optional spaces. Examples:

    "Smith"
    "Smith AH"
    "Smith AH; Jones B"
    etc.

Full first names must not be used. For example, use "Jones B" rather than "Jones Barrett" or "Barrett Jones". This is because all Pubmed records have initials but only some of them have full first names. Initials should always work.

The order of entry of the authors is insignificant. For example: "Smith; Jones" and "Jones; Smith" should retrieve the same articles.

SQL % "wildcards" are supported. For example: "Johns%n" retrieves:

    "Johnson"
    "Johnston"
    "Johnsen"
    "Johnsarden"
    etc.
But be prepared for longer search times.

TITLE:
Searches article titles. Wildcards are supported. See the notes on string searching below.

ADVANCED SEARCH:

JOURNAL:
Searches journal titles, retrieving articles from journals with titles that match the search string. Note that this search is against the actual titles of the journals, not the journal title abbreviations. Wildcards are supported. See notes on string searching.

PUBLICATION YEAR/MONTH:
Searches for articles in journal issues with the specified year and, optionally, month. We depend on Pubmed for the year and month, which in turn depends on the publishers. If the journal title page says that it was published in May, then it can only be found under May, even if it actually came out in March or July.

Some journals cannot be searched by month because no month is given in the Pubmed record. They may, for example, use Spring, Summer, Fall and Winter instead, or something else. Those journals can still be searched by year.

If a year is entered without a month, articles published at any time during that year will be selected. If a month is entered without a year, it is silently (i.e., with no error message) ignored. We do not search for months without years.

REVIEW CYCLE:
If one or more topics are specified in the search criteria, this field narrows the results to those articles having those topics for the review cycle selected. Otherwise, if one or more boards are specified, the software narrows the results to those articles having any topic for each of those boards, with the topics assigned for the review cycle selected. Otherwise, articles are selected having any topic assigned for the review cycle selected.

FYI CITATION:
Selects articles with an active or inactive FYI state (note that this is the only state criterion which accepts inactive states as qualifying for the search results).
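
The three-way precedence described under REVIEW CYCLE above (topics first, then boards, then any topic) can be sketched as follows. This is only an illustration, not the EBMS code: the function name, the (topic, board, cycle) tuple layout, and the cycle labels are all hypothetical. It also assumes that, when topics are specified, an article with any one of them in the selected cycle qualifies, which the text above does not state explicitly.

```python
def matches_review_cycle(assignments, cycle, topics=(), boards=()):
    """Decide whether one article passes the REVIEW CYCLE filter.

    assignments: iterable of (topic, board, cycle) tuples for the article
                 (a hypothetical layout, not the real EBMS tables).
    """
    # Keep only the topic assignments made for the selected review cycle.
    in_cycle = [(t, b) for (t, b, c) in assignments if c == cycle]
    if topics:
        # Assumption: any one of the selected topics in this cycle qualifies.
        return any(t in topics for t, _ in in_cycle)
    if boards:
        # Each selected board needs at least one topic assigned in this cycle.
        cycle_boards = {b for _, b in in_cycle}
        return all(board in cycle_boards for board in boards)
    # No topics or boards selected: any topic assigned in this cycle will do.
    return bool(in_cycle)


# Hypothetical article with two topic assignments in different cycles.
article = [
    ("Topic A", "Board 1", "2013-01"),
    ("Topic B", "Board 2", "2012-12"),
]
print(matches_review_cycle(article, "2013-01", topics={"Topic A"}))
print(matches_review_cycle(article, "2013-01", boards={"Board 1", "Board 2"}))
```

In this sketch the second call fails because Board 2 has no topic assigned to the article for the 2013-01 cycle, which reflects the "for each of those boards" wording above.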
NCI REVIEWER DECISION:
Selects articles that have passed (if "YES" is checked) or failed ("NO") the board manager's review of the articles based on the abstract, before full text has been retrieved. They may have just passed it or long passed it and gone on to other states - including states that rejected the article.

FULL TEXT RETRIEVED:
Selects articles for which the EBMS has the full text (i.e., a PDF version) of the article (if "YES" is checked) or for which the EBMS does not have the full text ("NO").

Note: No legacy article from the old CMS will be found unless a PDF version of the article is stored in the EBMS - which will not happen automatically during conversion. So this works differently from the old CiteMS.

There is no provision for selecting the subset of "NO" results for which it has been determined that the full text for the articles cannot be obtained.

COMMITTEE DECISION:
Selects articles that have passed full text review, i.e., a review of the full text of the article (if "YES" is checked) or failed full text review ("NO"). No actual committee is involved but, for historical reasons, the users have opted to call this "committee decision". The article may be in any state after that one.

EDITORIAL BOARD REVIEWER:
Selects articles that have ever been in a packet for the specified editorial board member. An article in a packet from which the member was dropped is excluded, unless the member had already submitted a response for it.

EDITORIAL BOARD RESPONSE:
Selects articles for which at least one reviewer has submitted the selected response.

EDITORIAL BOARD DECISION:
Selects articles for which the selected board decision has been made after board member review.

COMMENTS:
Selects articles with the specified string as found in the article state comments (not the tag comments). Wildcards are supported. Given the size of comments, it is probably important to use them. See the notes on string searching at the end of this discussion.
DATE COMMENT ADDED / RANGE:
Selects articles that have a comment with the specified date or date range. If both a comment string and a date range are specified, an article with a comment matching the string will only be selected if the comment was created within the range. If no comment string is specified, then the dates will find articles with any comment in the specified date range.

The rules for dates were described in a Bugzilla posting, a relevant excerpt from which is reproduced here:

Date values
-----------

A user must enter dates as year + optional month + optional day. If no value for year is entered, month and day will be ignored. If no value for month is entered, day will be ignored.

If there is a year but no month or day, or a year + month but no day, an implicit range search will be performed. For example:

Look for comments entered any time in 2012:
    [2012] [   ] [  ]

Look for comments entered during November of 2012:
    [2012] [Nov] [  ]

Look for comments entered November 8, 2012:
    [2012] [Nov] [ 8]

Explicit date ranges
--------------------

If the RANGE checkbox has been checked, the program will look at values in the three date fields for the end of the range. If there are no values in the fields, no range search will be performed. If there are values, the range search will look at values up to, but not including, the specified dates. For example:

Find comments entered any time from Jan 1, 2012 through Dec 31, 2013:
    [2012] [   ] [  ]    [2014] [   ] [  ]

Find comments entered any time from Feb 1, 2012 through Jun 30, 2012:
    [2012] [Feb] [  ]    [2012] [Jul] [  ]

Find comments entered any time from Feb 10, 2012 through Feb 11, 2012:
    [2012] [Feb] [10]    [2012] [Feb] [12]

Again, a user must enter dates in order. A date like the following will be ignored:
    [    ] [Feb] [12]

Starting date MUST be entered. For example:

Find comments entered before Feb 12, 2014.
The "1900" start year in the example is the earliest one in the drop down list but any date before the first comment was created will do.
    [1900] [   ] [  ]    [2014] [Feb] [12]

But this will be ignored:
    [    ] [   ] [  ]    [2014] [Feb] [12]

These conventions are arbitrary. I don't claim they're better than other conventions. But they look as reasonable and easy to understand as any others.

The original Bugzilla posting said that a comment must be entered in order to search a range. However, users wanted to be able to search for any comment entered during the range regardless of what it is. So that has been implemented. If no comment text is entered we'll search for any comment at all within the date range.

All comments are searched, active or not [Is that right? I thought it might be since we are often looking for ANY comment.]

TAG:
Selects articles with the specified tag. Only active tags are considered. Tags are mostly relevant for articles that are new to the EBMS; however, the legacy data conversion sometimes added "Legacy conversion note" tags to older articles.

DATE TAG ADDED:
Functions exactly as for comment dates.

CORE JOURNALS:
Currently unimplemented and ignored. The intent is to restrict the search to just those articles that were published by a journal that is in a set of highly respected journals.

ADMINISTRATOR SEARCH:

INPUT DATE:
Selects articles for which the original (first ever) import date was within the date range, with dates specified as for comment dates (see above).

MODIFIED DATE:
Selects articles that had any change. Possible changes that are searched are:

    A state was added.
    A state comment was added.
    A tag was added.
    A tag comment was added.

Note: I'm not checking import date but probably should be.

INCLUDE UNPUBLISHED:
By default, we include only articles which have an active Published state for the topics/boards specified (or for any topic if no topics or boards are selected on the search request form).
If this box is checked, that restriction is lifted, and articles which were excluded by the list of journals which we have declared are not to be used, or which were rejected by the librarian's initial review, or which the librarian approved but has not yet published will also be eligible for inclusion in the search results, subject to the other criteria specified on the search request form.

Note that "Published" in this context has nothing to do with the publication of the article in its journal, but instead refers to the processing step used by the librarian to batch up the current set of approved articles for review by NCI so that they are visible to the NCI reviewer all at once, instead of having each show up in that reviewer's queue separately as it gets the librarian's approval. Think of "published" in this context as shorthand for "published to the NCI reviewer's queue."

INCLUDE NOT LISTED:
This option is a subset of the "INCLUDE UNPUBLISHED" option described immediately above. Checking the box adds articles which would otherwise not be included because, having been excluded by the board's list of journals whose articles are not considered useful for PDQ literature surveillance, they would not have made it to the stage of processing where they would have received the Published state.

No extra articles from the legacy database will be found by checking this box because the old CMS rejected those journal articles before loading them, so they are not in the database at all.

If the INCLUDE UNPUBLISHED box is also checked, this option has no additional effect.

INCLUDE REJECTED:
This option is another subset of the "INCLUDE UNPUBLISHED" option described above. Checking this box adds articles which were rejected by the librarian. If the INCLUDE UNPUBLISHED box is also checked, this option has no additional effect.
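
The way the three INCLUDE boxes relate to each other can be sketched as a small function that returns which kinds of articles are eligible. This is a rough model only: the Status names and the function are hypothetical labels for the four situations described above, not identifiers from the EBMS code or database.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical labels for the processing outcomes described above."""
    PUBLISHED = "published to the NCI reviewer's queue"
    NOT_LISTED = "excluded by the board's journal list"
    REJECTED = "rejected in the librarian's initial review"
    APPROVED_UNPUBLISHED = "approved by the librarian, not yet published"


def eligible_statuses(include_unpublished=False,
                      include_not_listed=False,
                      include_rejected=False):
    """Return the set of statuses a search with these boxes checked may return."""
    if include_unpublished:
        # The Published restriction is lifted; the other two boxes add nothing.
        return set(Status)
    allowed = {Status.PUBLISHED}
    if include_not_listed:
        allowed.add(Status.NOT_LISTED)
    if include_rejected:
        allowed.add(Status.REJECTED)
    return allowed


print(sorted(s.name for s in eligible_statuses(include_rejected=True)))
```

Note that in this model, articles approved by the librarian but not yet published only become eligible through INCLUDE UNPUBLISHED, since neither of the two subset options covers them. The other search criteria still apply on top of whatever this returns.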
SUMMARY TOPICS ADDED:
Selects articles for which a summary topic was added after the initial import via a second import for another topic that brought in the same article. This uses the data structures created for the new EBMS import functions and does not work with legacy data.

Note: This may or may not be what was intended for this kind of search specification. It does not find articles with summary topics added during review - e.g., by initial or board manager. I tried out some SQL that would find all articles that had more than one summary topic. That could be made to work but would be much slower.

See also the Citation Summary Topic Changes report on the librarians' Citations Report page, which provides similar functionality.

DISPLAY OPTIONS:

SORT BY:
Sorts the articles found by the specified criterion.

PER PAGE:
How many articles to display on each results page. Due to the way that Drupal manages search result displays, the search is re-executed each time the next page is fetched. Therefore, if the search is hard and slow, and if a user intends to look at a large number of results, searching can be significantly faster if a higher PER PAGE number is selected.

NOTES ON STRING SEARCHING:

"WILDCARDS":
The database management system allows searching for strings using "wildcards" to represent substrings with any characters. We have implemented searching using the '%' character to stand for a substring of any number of any characters. For example:

    "ABC%XYZ"

Finds any string that begins with ABC and ends with XYZ, with anything at all between them.

This string will be found:
    "ABC123XYZ"

This string will not be found because 'A' is not the first character:
    "123ABC XYZ"

This string will not be found because 'Z' is not the last character:
    "ABC XYZ."

If no wildcards are specified, the program will attempt to do an exact match. If a user does not want an exact match, it is necessary to use wildcards.
Be sure to put one at the start of the search string if you don't require the first characters in your search string to exactly match the start of the target string, and be sure to put one at the end if you don't require the last characters of your search string to exactly match the last characters of the target. The following search string will find any target string with ABC somewhere in the string and XYZ somewhere after that:

    "%ABC%XYZ%"

The old system sometimes looked for exact matches and sometimes put implicit wildcards in searches. I didn't do that in the EBMS in order to make it very clear exactly what would happen in a search. Wildcards will be used if the user enters them. They won't if the user doesn't.

CASE:
Case is non-significant in searching. Searches on upper, lower, or mixed case inputs will all retrieve the same hits.

CHARACTER SET:
There are some non-ASCII characters in our data. These may be characters with diacritics, Greek letters, math symbols, etc. For purposes of searching, we have translated all of these into US ASCII, and all search strings will be similarly translated using the same algorithm. It is therefore not necessary to try to input non-English characters but, if a user does, the search may still work - if we got our translate tables right.
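
The wildcard, case, and character set rules above can be tried out with a short sketch. It is an approximation written for illustration, not the EBMS implementation: the function names are made up, the real search is done by the database rather than by regular expressions, and the ASCII folding here only strips diacritics (it drops Greek letters and math symbols instead of translating them the way the EBMS translate tables presumably do). Only the '%' wildcard is handled.

```python
import re
import unicodedata


def to_ascii(text):
    """Fold accented characters to plain ASCII (a rough stand-in for the translate tables)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def like_match(pattern, target):
    """Emulate a case-insensitive match in which '%' stands for any run of characters."""
    # Escape everything except '%', which becomes "match anything".
    parts = [re.escape(part) for part in to_ascii(pattern).split("%")]
    regex = ".*".join(parts)
    # fullmatch enforces an exact match wherever no wildcard was entered.
    flags = re.IGNORECASE | re.DOTALL
    return re.fullmatch(regex, to_ascii(target), flags) is not None


print(like_match("ABC%XYZ", "ABC123XYZ"))      # found
print(like_match("ABC%XYZ", "123ABC XYZ"))     # not found: 'A' is not first
print(like_match("ABC%XYZ", "ABC XYZ."))       # not found: 'Z' is not last
print(like_match("%ABC%XYZ%", "123ABC XYZ."))  # found
```

Because no wildcards are added implicitly, like_match("Smith", "Smithson") is false in this sketch, which mirrors the exact-match behavior described above.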