EBMS Searching
Alan Meyer
Draft 1
November 13, 2012

Here is how the search functionality is implemented in the EBMS code as of this date:

EDITORIAL BOARD:
    Selecting one limits the search by editorial board and populates the Summary Topic pick list.

SUMMARY TOPIC:
    One or more topics can be selected if and only if a board has been selected. If a board is selected but no topic is selected, all topics for that board qualify.

    If one or more topics are selected, the board(s) are ignored. That way, if a topic was once assigned to board A but is now part of board B, a search on that topic will find it whether it belongs to board A or B. The search will also be faster.

    By default, if more than one topic is selected, an article in any of them qualifies. This is the meaning of the OR box (actually a radio button). That can be changed to AND by checking the other box.

    That is the ONLY place on the search form where OR can be used. All other search criteria are AND'ed together, meaning that an article must match ALL of the criteria in order to be retrieved.

PMID:
    The Pubmed article ID.

CMS ID:
    The EBMS article ID. This is NOT the article ID in the old CMS. We might be better off with the label "EBMS ID", which is what is really being searched.

AUTHOR:
    One or more authors may be entered with last name and optional initials, separated by semicolons and optional spaces. Examples:

        "Smith"
        "Smith AH"
        "Smith AH; Jones B"
        etc.

    Full first names must not be used. For example, use "Jones B" rather than "Jones Barrett" or "Barrett Jones". This is because all Pubmed records have initials but only some of them have full first names. Initials should always work.

    The order of entry of the authors is insignificant. For example, "Smith; Jones" and "Jones; Smith" should retrieve the same articles.

    SQL % "wildcards" are supported. For example, "Johns%n" retrieves:

        "Johnson"
        "Johnston"
        "Johnsen"
        "Johnsarden"
        etc.

    But be prepared for longer search times.

TITLE:
    Searches article titles.
    Wildcards are supported. See the notes on string searching below.

ADVANCED SEARCH:

FYI CITATION:
    Selects articles with an active FYI state.

NCI REVIEWER DECISION:
    Selects articles that have passed board manager review. They may have only just passed it, or passed it long ago and since gone on to other states - including states that rejected the article.

FULL TEXT RETRIEVED:
    Selects articles for which the EBMS has the full text (i.e., a PDF version) of the article.

    Note: No legacy article from the old CMS will be found unless a PDF version of the article is stored in the EBMS - which will not happen automatically during conversion. So this works differently from the old CiteMS.

COMMITTEE DECISION:
    Selects articles that have passed full text review, i.e., a review of the full text of the article. No actual committee is involved but, for historical reasons, the users have opted to call this "committee decision". The article may be in any state after that one.

PUBLISHED TO CITEMS:
    Selects articles that have the "Published" state in their status history. The article may be in any state after that one.

CORE JOURNALS:
    Currently unimplemented and ignored. The intent is to restrict the search to just those articles that were published in one of a set of highly respected journals.

INCLUDE NOT LISTS:
    By default, we exclude any journal article that was rejected because it was published in a journal that we have declared is not to be used. Checking the box overrides the default and includes articles from these journals.

    No extra articles from the legacy database will be found by checking this box because the old CMS rejected those journal articles before loading them, so they are not in the database at all.

JOURNAL:
    Searches journal titles, retrieving articles from journals with titles that match the search string. Wildcards are supported. See notes on string searching.

PUBLICATION YEAR/MONTH:
    Searches for articles in journal issues with the specified year and, optionally, month.
    We depend on Pubmed for the year and month, which in turn depends on the publishers. If the journal title page says that it was published in May, then it can only be found under May, even if it actually came out in March or July.

    Some journals cannot be searched by month because no month is given in the Pubmed record. They may, for example, use Spring, Summer, Fall and Winter instead, or something else. Those journals can still be searched by year.

    If a year is entered without a month, articles published at any time during that year will be selected.

    If a month is entered without a year, it is silently (i.e., with no error message) ignored. We do not search for months without years.

REVIEW CYCLE:
    Selects articles imported during the specified review cycle.

    The software for this search uses the new data structures created by the new import software. It cannot find review cycles for the legacy data. Unless we change that, a user should use the input date or publication year/month searches for legacy data.

EDITORIAL BOARD REVIEWER:
    Selects articles that have ever been in a packet for the specified editorial board member. If the member was dropped from a packet, the articles in that packet are not selected unless the member had already submitted a response.

    This search uses the new packet oriented data structures in the EBMS and does not work with legacy data.

EDITORIAL BOARD RESPONSE:
    Selects articles for which at least one reviewer has submitted the selected response.

    This search uses the new packet oriented data structures in the EBMS and does not work with legacy data.

EDITORIAL BOARD DECISION:
    Selects articles for which the selected board decision has been made after board member review.

    This search uses the new packet oriented data structures in the EBMS and does not work with legacy data.

COMMENTS:
    Selects articles with the specified string as or in the article state comments (not the tag comments). Wildcards are supported. Given the size of comments, it is probably important to use them.
    See the notes on string searching at the end of this discussion.

DATE COMMENT ADDED / RANGE:
    Selects articles that have a comment with the specified date or date range. If both a comment string and a date range are specified, an article with a comment matching the string will only be selected if the comment was created within the range. If no comment string is specified, then the dates will find articles with any comment in the specified date range.

    The rules for dates were described in a Bugzilla posting, a relevant excerpt from which is reproduced here:

        Date values
        -----------
        A user must enter dates as year + optional month + optional day. If no value for year is entered, month and day will be ignored. If no value for month is entered, day will be ignored.

        If there is a year but no month or day, or a year + month but no day, an implicit range search will be performed. For example:

        Look for comments entered any time in 2012:

            [2012] [   ] [  ]

        Look for comments entered during November of 2012:

            [2012] [Nov] [  ]

        Look for comments entered November 8, 2012:

            [2012] [Nov] [ 8]

        Explicit date ranges
        --------------------
        If the RANGE checkbox has been checked, the program will look at values in the three date fields for the end of the range. If there are no values in the fields, no range search will be performed. If there are values, the range search will look at values up to, but not including, the specified dates. For example:

        Find comments entered any time from Jan 1, 2012 through Dec 31, 2013.

            [2012] [   ] [  ]
            [2014] [   ] [  ]

        Find comments entered any time from Feb 1, 2012 through Jun 30, 2012.

            [2012] [Feb] [  ]
            [2012] [Jul] [  ]

        Find comments entered any time from Feb 10, 2012 through Feb 11, 2012.

            [2012] [Feb] [10]
            [2012] [Feb] [12]

        Again, a user must enter dates in order. A date like the following will be ignored:

            [    ] [Feb] [12]

        A starting date MUST be entered. For example:

        Find comments entered before Feb 12, 2014.
        The "1900" start year in the example is the earliest one in the drop down list, but any date before the first comment was created will do.

            [1900] [   ] [  ]
            [2014] [Feb] [12]

        But this will be ignored:

            [    ] [   ] [  ]
            [2014] [Feb] [12]

        These conventions are arbitrary. I don't claim they're better than other conventions. But they look as reasonable and easy to understand as any others.

    The original Bugzilla posting said that a comment must be entered in order to search a range. However, users wanted to be able to search for any comment entered during the range regardless of what it is. So that has been implemented. If no comment text is entered we'll search for any comment at all within the date range.

    All comments are searched, active or not. [Is that right? I thought it might be since we are often looking for ANY comment.]

TAG:
    Selects articles with the specified tag. Only active tags are considered.

    Tags are mostly relevant for articles that are new to the EBMS; however, the legacy data conversion sometimes added "Legacy conversion note" tags to older articles.

DATE TAG ADDED:
    Functions exactly as for comment dates.

ADMINISTRATOR SEARCH:

SUMMARY TOPICS ADDED:
    Selects articles for which a summary topic was added after the initial import via a second import for another topic that brought in the same article. This uses the data structures created for the new EBMS import functions and does not work with legacy data.

    Note: This may or may not be what was intended for this kind of search specification. It does not find articles with summary topics added during review - e.g., by initial or board manager. I tried out some SQL that would find all articles that had more than one summary topic. That could be made to work but would be much slower.

INCLUDE REJECTED ARTICLES:
    Currently unimplemented and ignored. We need to discuss what this should do.
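As an illustration of the date conventions described under DATE COMMENT ADDED / RANGE, which the tag date and input date searches share, here is a minimal sketch in Python. It is not the EBMS code. The function names and the (year, month, day) tuple representation are hypothetical, and the sketch assumes that a checked RANGE box with empty end fields simply falls back to the implicit range of the starting date.

```python
from datetime import date, timedelta


def first_day(year, month=None, day=None):
    """Earliest date covered by a partial date, or None if the year is blank."""
    if year is None:
        return None                  # month and day are ignored without a year
    if month is None:
        return date(year, 1, 1)      # day is ignored without a month
    return date(year, month, day or 1)


def day_after_last(year, month=None, day=None):
    """First date after the period that a partial date covers."""
    if year is None:
        return None
    if month is None:
        return date(year + 1, 1, 1)
    if day is None:
        return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return date(year, month, day) + timedelta(days=1)


def search_interval(start, end=None):
    """Return the half-open interval [low, high) to search, or None if ignored.

    start and end are (year, month, day) tuples with None for blank fields.
    The end date itself is excluded, matching "up to, but not including".
    """
    low = first_day(*start)
    if low is None:
        return None                  # no usable starting date: criterion ignored
    high = first_day(*end) if end is not None else None
    if high is None:
        # Assumption: no usable end date means an implicit range on the start.
        high = day_after_last(*start)
    return low, high
```

For example, `search_interval((2012, 2, None), (2012, 7, None))` covers Feb 1, 2012 through Jun 30, 2012, and a blank starting year returns `None` regardless of the end fields.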
INPUT DATE:
    Selects articles for which the original (first ever) import date was within the date range, with dates specified as for comment dates (see above).

MODIFIED DATE:
    Selects articles that had any change. Possible changes that are searched are:

        A state was added.
        A state comment was added.
        A tag was added.
        A tag comment was added.

    Note: I'm not checking import date but probably should be.

DISPLAY OPTIONS:

SORT BY:
    Sorts the articles found by the specified criterion. Note that "CMS ID#" means EBMS article ID number.

FORMAT:
    What gets displayed for each article found.

PER PAGE:
    How many articles to display on each results page.

    Due to the way that Drupal manages search result displays, the search is re-executed each time the next page is fetched. Therefore, if the search is hard and slow, and if a user intends to look at a large number of results, searching can be significantly faster if a higher PER PAGE number is selected.

NOTES ON STRING SEARCHING:

"WILDCARDS":
    The database management system allows searching for strings using "wildcards" to represent substrings with any characters. We have implemented searching using the '%' character to stand for a substring of any number of any characters. For example:

        "ABC%XYZ"

    Finds any string that begins with ABC and ends with XYZ, with anything at all between them.

    This string will be found:

        "ABC123XYZ"

    This string will not be found because 'A' is not the first character:

        "123ABC XYZ"

    This string will not be found because 'Z' is not the last character:

        "ABC XYZ."

    If no wildcards are specified, the program will attempt to do an exact match. If a user does not want an exact match, it is necessary to use wildcards.
    Be sure to put one at the start of the search string if you don't require the first characters in your search string to exactly match the start of the target string, and be sure to put one at the end if you don't require the last characters of your search string to exactly match the last characters of the target.

    The following search string will find any target string with ABC somewhere in the string and XYZ somewhere after that:

        "%ABC%XYZ%"

    The old system sometimes looked for exact matches and sometimes put implicit wildcards in searches. I didn't do that in the EBMS in order to make it very clear exactly what would happen in a search. Wildcards will be used if the user enters them. They won't if the user doesn't.

CASE:
    Case is not significant in searching. Searches on upper, lower, or mixed case inputs will all retrieve the same hits.

CHARACTER SET:
    There are some non-ASCII characters in our data. These may be characters with diacritics, Greek letters, math symbols, etc. For purposes of searching, we have translated all of these into US ASCII, and all search strings will be similarly translated using the same algorithm.

    It is therefore not necessary to try to input non-English characters but, if a user does, the search may still work - if we got our translate tables right.
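
The wildcard, case, and character set rules in these notes can be sketched in a few lines of Python. This is only an illustration, not the EBMS implementation, which relies on the database. The names to_ascii and like_match are hypothetical. The Unicode decomposition used here is a stand-in for the EBMS translate tables: it strips diacritics but simply drops characters such as Greek letters instead of translating them. Only '%' is treated as a wildcard.

```python
import re
import unicodedata


def to_ascii(text):
    """Fold text to US ASCII (hypothetical stand-in for the translate tables)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop anything that is still non-ASCII after decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def like_match(pattern, target):
    """Return True if target matches pattern, where '%' matches any substring.

    Without a '%' the whole target must equal the pattern. Case is ignored,
    and both strings are folded to ASCII before they are compared.
    """
    literal_parts = [re.escape(part) for part in to_ascii(pattern).split("%")]
    regex = ".*".join(literal_parts)
    flags = re.IGNORECASE | re.DOTALL
    return re.fullmatch(regex, to_ascii(target), flags) is not None
```

With this sketch, like_match("ABC%XYZ", "ABC123XYZ") is true, like_match("ABC%XYZ", "ABC XYZ.") is false because of the trailing period, and like_match("%ABC%XYZ%", "123ABC XYZ.") is true.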