API for Drupal access to CiteMS functions

Draft 1.0

1 Changes
2 Concepts
3 API
- 3.1 Article related APIs
- 3.2 Article import related APIs

1 Changes

This draft contains the application programmer interface (API) definitions for Article information and for importing Articles from the National Library of Medicine. The draft incorporates revisions from two walkthroughs between Bob and me.

An API for searching for articles has been removed from the draft. I think it will be necessary to learn more about the Drupal search API and do some prototyping with it before deciding what or how much functionality should be programmed independently of Drupal's existing functions.

2 Concepts

2.1 PHP

Functionality is implemented in pure PHP modules, with no HTML involved.

2.2 Isolation of database from user interface

The user interface ("view" in MVC parlance) calls those modules to manipulate article data ("model").

2.3 Logging

The update functions described here should log their actions. Logging will be primarily for debugging.

2.4 Transactions

Many of the functions documented here can update multiple rows in the database. We don't yet know how much transaction support Drupal will give us for controlling this, but Drupal 7 appears to offer more support than Drupal 6. Good logging might help us if transactions are not fully supported.

2.5 Authorization control

The API retrieves information that is not security sensitive. We therefore envision placing no authorization control on functions that retrieve article data from the database.

However, all functions that update the database will require that the ID of an authorized user be passed to the function. Authorization will be checked and the user ID logged with the update.

2.6 Optimization notes

Preference will be given to simplicity and maintainability over optimization until experience shows the need for optimization. Some optimization ideas that can be considered include:

- Caching lookups in static class variables
  Assuming classes (not objects) are initialized once for each HTTP request, we can use static variables to hold such things as the human readable name associated with a controlled identifier like a Tag or Status or Response value. Another possibility is to cache the result of an authorization check for a user.
  In a batch operation that calls a function multiple times, like an object constructor or a name lookup, the first call can cache its results in the class variables and subsequent calls can get those results without having to query the database again and again.
- Batching operations
  Some functions may be written to perform an action, say, 100 times on 100 different article records. These could be rewritten to process all 100 in memory and read all of the database information into a single result set.
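
The static caching idea can be sketched as follows. This is a minimal illustration, not a proposed implementation: the class name TagNames, the loader callback, and the sample tag names are hypothetical, and an array stands in for the database.

```php
<?php
// Hypothetical lookup class; TagNames and the loader callback are
// illustrative only and are not part of the draft API.
class TagNames
{
    // Cache shared by all callers within one request: tagId => name.
    private static $cache = array();

    // Number of simulated database queries, kept only for demonstration.
    public static $queryCount = 0;

    // Return the human readable name for a tag ID, querying at most once per ID.
    public static function getName($tagId, callable $loader)
    {
        if (!array_key_exists($tagId, self::$cache)) {
            self::$queryCount++;
            self::$cache[$tagId] = $loader($tagId);
        }
        return self::$cache[$tagId];
    }
}

// Stand-in for a database query.
$fakeDb = array(1 => 'Board manager comment', 2 => 'High priority');
$loader = function ($id) use ($fakeDb) {
    return $fakeDb[$id];
};

// Five lookups, but only two distinct IDs, so only two "queries" run.
foreach (array(1, 2, 1, 1, 2) as $id) {
    echo TagNames::getName($id, $loader), "\n";
}
echo TagNames::$queryCount, " queries\n";
```

The cache lives only as long as the PHP process handling the request, so it needs no invalidation logic as long as names do not change mid-request.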

3 API

3.1 Article related APIs

class Article
An Article object encapsulates a single article in the database. Many of the functions are intended only to be used during import or replacement operations. Others are for EBMS programs running in support of users.
- Constructor
  There are several forms of constructor: one using a unique EBMS/CiteMS ID, one using a source name ("Pubmed") and source ID, and one using XML.
- Import related methods()
  These are defined in EbmsCite.php and are not relevant here. They do not require an Article object.
- User service related methods()
  Getters for information about an article. Some of these simply return information from the Article object. Some go back to the database to get information which is not brought into memory unless and until it is requested.
  - getId()
  - getSource()
    Initially just "Pubmed"
  - getSourceId()
    Initially this is just the Pubmed ID
  - getSourceStatus()
    Tells whether this is the final Pubmed version of the article, or something preliminary. May not be relevant for other data sources.
  - getSourceJrnlId()
    For Pubmed, this is the NLM journal ID, not the ISSN
  - getJrnlTitle()
    Title as given in the article record. This is not an authority controlled title.
  - getBrfJrnlTitle()
    As found in the article record.
  - getImportDate()
    Date first imported for any board or topic
  - getUpdateDate()
    Date data was last updated from source (i.e. Pubmed for now). It may be the import date or may be later.
  - getTitle()
    Article title
  - getAuthors()
    - Purpose
      Retrieve author names for an article. The author names are NOT authority controlled. They are whatever appears in the Pubmed record (or other record if we ever have others) for the article.
    - Pass
      Optional limit on the number to retrieve. The caller may only want one or a few.
    - Return
      Array of ArticleAuthor objects (q.v. - we may prefer simple strings instead), one for each author. Objects are returned in the order that they appeared in the source article. The first object identifies and names the first author, and so on.
    - Logic
      The data comes from the authors table, not the article record.
  - getPubDate()
    Return the publication date as a string. This is NOT a SQL date and is not easy to compare.
    - Note
      We need to discuss what to produce here. Pubmed XML includes different fields for JournalIssue/PubDate. The fields can include Year, Month, Day, Season, MedlineDate (e.g. "2004 Jul-Aug", or "2004 Oct 2-8").
      I have always seen Year present, either as an element or as the first four characters of the MedlineDate. However the sample I observed was only a fraction of the total database. I also saw 3 Season="Autumn" and 78 Season="Fall" in my sample of 51,845 records.
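
If we decide that the year is the one part we can always rely on, extracting it might look like the sketch below. The helper name extractPubYear() is hypothetical, and the sample XML fragments are made up to match the field names listed in the note; real records may differ.

```php
<?php
// Hypothetical helper, not part of the draft API: pull the four-digit year
// out of a Pubmed JournalIssue/PubDate element.
function extractPubYear(SimpleXMLElement $pubDate)
{
    // Prefer the explicit Year element when it is present.
    if (isset($pubDate->Year)) {
        return (string) $pubDate->Year;
    }
    // Otherwise fall back to the first four characters of MedlineDate.
    if (isset($pubDate->MedlineDate)) {
        $medlineDate = trim((string) $pubDate->MedlineDate);
        if (preg_match('/^\d{4}/', $medlineDate, $m)) {
            return $m[0];
        }
    }
    return null; // no usable year found
}

$samples = array(
    '<PubDate><Year>2004</Year><Month>Jul</Month><Day>15</Day></PubDate>',
    '<PubDate><Year>2004</Year><Season>Fall</Season></PubDate>',
    '<PubDate><MedlineDate>2004 Jul-Aug</MedlineDate></PubDate>',
    '<PubDate><MedlineDate>2004 Oct 2-8</MedlineDate></PubDate>',
);
foreach ($samples as $xml) {
    var_dump(extractPubYear(new SimpleXMLElement($xml)));
}
```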
  - getAbstract()
    Formatted as text.
  - getCitation()
    - Note
      This is the human readable citation to the article, for example something like: "BriefJrnlTitle, pubdate; vol(issue);pages" - or whatever we think is better. I'm conceiving this as a string that is stable and stored, not something we change for different purposes and create by re-parsing XML.
      We may not be able to have just one of these. The old system had just one such string for each article, derived from the Medline Print format "SO" field. However the new EBMS has two different strings used as brief citations. We may have to store two, or store one and construct the other on the fly, or store the parts and construct any of them on the fly.
  - getXml()
  - hasPdf()
    Returns true if we have one, else false.
  - getPdf()
    Returns PDF as a string of bytes, null if we don't have one.
  - setPdf()
    - Purpose
      Store a PDF in association with an article. If a PDF is already associated with an article, this one replaces it - for example if a wrong or corrupt PDF had been previously stored.
    - Pass
      - PDF
        TBD: This might be the PDF or the URL of a PDF in the file system.
      - userId
        Used for checking and logging. Not otherwise stored.
    - Return
      Void
    - Throws
      - EbmsException
        If IO, authorization, or other(?) error.
  - getHistory()
    - Purpose
      Retrieve a complete history of everything that happened to this article - import, topics assigned, status, tags, etc.
    - Pass
      - filters
        An array of keyword => value arguments used to filter out what might not be wanted. Not sure what these are yet, but they would probably include at least the following. More may be added later if required.
        - startDateTime
          Retrieve history information from this datetime forward. Default if no startDateTime specified is from the first appearance of the article in the database.
        - endDateTime
          Retrieve history information up to but not including this datetime. Default if no endDateTime specified is to retrieve information up to and including the latest.
        - topicId
          Unique ID specifying a Summary topic. If specified, retrieve only the history under that single topic. Default is to retrieve history for this article in association with any and all topics.
          - Note:
            Any information not specific to a topic, e.g., import or update date, will always be retrieved. For this purpose, the import datetime and the datetime of the association with the topic for which the import was performed are treated as separate events - even though they have identical datetime stamps and were specified by a user in the same import operation.
          - Note:
            We may want to be able to specify more than one topic, but not necessarily all topics for a board. If there is a use case for this then we can allow an array of topic IDs and the software will test the parameter to see if it's a single object or an array.
        - boardId
          Unique ID specifying an editorial board. If specified, only retrieve the history as it pertains to a particular editorial board. If the article is associated with only one topic for this board the result is identical to specifying that topic ID. However if two or more topics from the same board are associated with the article then they will all appear in the history.
          The notes under topicId on non-topic specific items and multiple topics also apply to non-board specific items.
        - get_status
          - True
            Get all status values. The default.
          - False
            Do not include status values
        - get_tags
          - True
            Get tags, the default
          - False
            Leave out tag info
        - get_responses
          - True
            Get board member responses, the default
          - False
            Leave out response info
        - sortOrder
          Tells how to order results.
          - "date"
            Order objects by datetime. This is the default - though we can change the way it works if it is not the most commonly requested sort order.
            If datetimes are identical, priorities are:
            - non-board, non-topic specific items
              An import event is shown before any status or tag assignment with the same datetime value.
            - status before tag
              If a status assignment and a tag assignment are made in one operation with the same datetime, the status assignment will be ordered first.
          - "board"
            Objects are partitioned by editorial board, then sorted by date within board. Non-board specific items (imported, modified) are shown before all board specific items. Boards are in alphabetical order.
          - "topic"
            Objects are partitioned by topic. Non-topic specific items come first. Topics are in Summary topic name alphabetical order.
    - Return
      Array of objects, in specified (or defaulted) sort order.
      The caller should use the PHP instanceof operator or the get_class() function to determine how to interpret each object returned. Possible object types are:
      - ArticleImportEvent - also used for updates.
        These are potentially independent of Summary topic and are therefore not "status" records.
      - ArticleTopic
      - ArticleStatus
      - ArticleTag
      - ArticleResponse
      - CiteDecision?
        If final decisions are "status" values, then we don't need this.
    - Note on logic optimization
      History is not expected to be large for most articles. It is therefore desirable to program for simplicity and maintainability and only optimize if experience shows the need.
    - Logic
      - Get import and modified dates
        Construct ArticleImportEvent objects.
      - Get list of topic assignments
        Select those of interest or fetch all and delete any that are of no interest. Construct ArticleTopic objects.
      - Get list of status assignments
        Select those of interest or filter out others after select. Construct ArticleStatus objects.
      - Get list of tag assignments
        Select those of interest or filter out others after select. Construct ArticleTag objects.
      - Get list of board member responses
        Construct ArticleResponse objects.
      - Create an in-memory index of all objects for sorting
        The sort keys are defined in a sortOrder dependent way and extracted in an object dependent way.
      - Sort the key=>object list
      - Produce a simple array
        0=>first object in sortOrder 1=>next object …
      - Return it
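
The sorting steps for sortOrder "date" can be sketched as follows. This is an illustration under stated assumptions: the HistoryItem base class and the datetime field name are hypothetical, only three of the object types are modelled, and the tie-break ranks cover just the two priorities described under sortOrder. It also shows PHP's instanceof operator being used to tell the object types apart.

```php
<?php
// Minimal stand-ins for the history classes. Only the field needed for
// sorting is modelled; HistoryItem and "datetime" are assumptions.
class HistoryItem
{
    public $datetime;

    public function __construct($datetime)
    {
        $this->datetime = $datetime;
    }
}
class ArticleImportEvent extends HistoryItem {}
class ArticleStatus extends HistoryItem {}
class ArticleTag extends HistoryItem {}

// Tie-break rank for identical datetimes: import, then status, then tag.
function historyRank($item)
{
    if ($item instanceof ArticleImportEvent) {
        return 0;
    }
    if ($item instanceof ArticleStatus) {
        return 1;
    }
    return 2;
}

// Sort history objects by datetime and return a simple 0-based array.
function sortHistoryByDate(array $items)
{
    usort($items, function ($a, $b) {
        // 'Y-m-d H:i:s' strings order correctly under string comparison.
        $cmp = strcmp($a->datetime, $b->datetime);
        if ($cmp !== 0) {
            return $cmp;
        }
        return historyRank($a) - historyRank($b);
    });
    return $items;
}

$history = sortHistoryByDate(array(
    new ArticleTag('2012-09-15 09:01:55'),
    new ArticleStatus('2012-09-15 09:01:55'),
    new ArticleImportEvent('2012-09-15 09:01:55'),
    new ArticleStatus('2012-09-16 10:00:00'),
));
foreach ($history as $i => $item) {
    echo $i, ' ', get_class($item), ' ', $item->datetime, "\n";
}
```

The "board" and "topic" orders would need a different comparison function but could reuse the same structure.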
  - getTopics()
    - Purpose
      Get a list of all of the topics associated with this article.
    - Pass
      Void.
    - Return
      Array of ArticleTopic objects containing information about the association.
  - hasTopic()
    - Purpose
      Tell whether the article is associated with a specific topic.
    - Pass
      - topicId
    - Return
      - True - article is associated with this topic
      - False - it's not
    - Note:
      It doesn't matter what the status is for this topic.
    - Throws
      - EbmsException
        If topicId or Name is invalid.
  - getBoards()
    - Purpose
      Get a list of all of the editorial boards associated with this article.
    - Pass
      Void
    - Return
      Array of ArticleBoard objects containing information about the association.
  - hasBoard()
    - Purpose
      Tell whether the article is currently associated with a specific board.
    - Pass
      - boardId
    - Return
      - True - article is associated with this board
      - False - it's not
    - Throws
      - EbmsException
        If boardId or Name is invalid.
  - getStatus()
    - Purpose
      Find the current (i.e. last assigned) status of an article. Status is topic specific and there is a current status for each topic associated with an article.
    - Pass
      - topicId
        Caller must specify a topic. A caller that needs to see the status for more than one topic that may have been assigned to the article should call getHistory() instead.
    - Return
      - Status object for the last assigned status
        Includes information about when assigned, who assigned it, etc.
      - or Null
        If the article is not associated with that topic. This is not an error and does not raise an exception.
    - Throws
      - EbmsException
        If database fails. Failure to find an association between the passed topic and the article ID does not throw an exception.
  - setStatus()
    - Purpose
      Set a new current status for an article.
    - Pass
      - statusId
      - topicId
        A status is always associated with a specific topic identified by unique ID. This is checked. The article must already be associated with the topic.
      - userId
        Unique ID of person. Authorization is checked.
      - Comment
        Optional comment to record with the status assignment.
      - dt
        Optional datetime of assignment. Probably null when called from a user interface. May be used by batch programs like the import program to ensure that the assignment of the initial topic for the article has an identical datetime to the import datetime.
        If null, use DBMS current datetime.
    - Return
      - uniqueRowId
        This is the unique ID of the database row representing the association of the article and the status, not the unique ID of the status.
    - Throws
      - EbmsException
        - If status not valid or not allowed at this point in processing
        - If article not associated with the passed topic
        - If user not authorized
    - Logic
      - Check user authorization
        No database update allowed unless user is authorized.
      - Check that article is associated with the passed topic.
      - Check that status makes sense
        Not sure what all the checks here will be, but some status values might be illogical, for example requesting a PDF when the PDF has already been stored, or setting a status value with a topic that is not associated with this article (unless the status being set is the association itself).
      - Create row in the status table
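
The order of the checks in the setStatus() logic can be sketched as follows. This is only an illustration: StatusStore is a hypothetical in-memory stand-in for the database, the article ID is passed explicitly because the sketch is not a method of Article, the sample IDs are invented, and PHP's date() stands in for the DBMS current datetime.

```php
<?php
// EbmsException is named in the draft; this empty class is a stand-in.
class EbmsException extends Exception {}

// Hypothetical in-memory stand-in for the status table and its lookups.
class StatusStore
{
    public $authorizedUsers = array(7);
    public $validStatusIds = array(1, 2, 3);
    public $articleTopics = array(100 => array(5)); // articleId => topic IDs
    public $rows = array();                          // rowId => row

    public function setStatus($articleId, $statusId, $topicId, $userId,
                              $comment = null, $dt = null)
    {
        // 1. No database update is allowed unless the user is authorized.
        if (!in_array($userId, $this->authorizedUsers, true)) {
            throw new EbmsException("User $userId is not authorized");
        }
        // 2. The article must already be associated with the topic.
        $topics = isset($this->articleTopics[$articleId])
            ? $this->articleTopics[$articleId] : array();
        if (!in_array($topicId, $topics, true)) {
            throw new EbmsException("Article is not associated with topic $topicId");
        }
        // 3. The status must be valid (further consistency checks are TBD).
        if (!in_array($statusId, $this->validStatusIds, true)) {
            throw new EbmsException("Status $statusId is not valid");
        }
        // 4. Create the row; a null $dt falls back to the current datetime.
        $rowId = count($this->rows) + 1;
        $this->rows[$rowId] = array(
            'articleId' => $articleId,
            'statusId'  => $statusId,
            'topicId'   => $topicId,
            'userId'    => $userId,
            'comment'   => $comment,
            'dt'        => $dt === null ? date('Y-m-d H:i:s') : $dt,
        );
        return $rowId; // unique ID of the new status row
    }
}

$store = new StatusStore();
echo $store->setStatus(100, 2, 5, 7, 'Initial status'), "\n";
try {
    $store->setStatus(100, 2, 5, 99); // user 99 is not authorized
} catch (EbmsException $e) {
    echo 'Rejected: ', $e->getMessage(), "\n";
}
```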
  - deleteStatus() ?
    - Purpose
      Normally, a status would never be deleted; it would be superseded by the next status value assigned. We would only want to delete a status if an error occurred and the status was assigned by mistake.
      Deletions would be performed by updating the column in the status table that says whether the row is active or not.
    - Pass
      - statusRowId
        The unique ID of the row in the status table that is to be deleted. This is NOT the ID of the status name in a table of valid status values. It is the ID of a database row that associates a status ID and an article ID. The caller should get this id by calling getStatus() or getHistory() or otherwise reading the table.
      - userId
        User with authorization to set status who is performing the action.
      - comment
        Reason for the inactivation. Calls updateStatusComment() to insert the reason.
    - Return
      Void
    - Throws
      - EbmsException
        If user not authorized or the requested operation is invalid. See Logic.
    - Logic
      - Check user authorization
      - Perform any consistency checks
        Don't know what these are yet. There may be references to the status row somewhere that would make it invalid to inactivate the status. There may be DBMS constraints to enforce these, or applications code. Or there may not be any.
      - Update the row.
  - updateStatusComment()
    - Purpose
      It may be desirable to allow comments associated with status to be updated, for example to explain some special circumstance regarding the status assignment, or to correct a previous comment.
      Existing comments are appended to, not overwritten. See updateTagComment() for a description.
    - Pass
      - userId
      - comment
        Text to append.
      - dt
        Optional datetime for use by batch programs that synchronize datetime stamps with other operations. Default is database current datetime.
  - addTag()
    - Purpose
      Associate a new descriptive tag with an article. Tags are for search and comment purposes, used when it is desired to mark an article for some purpose without changing its status.
    - Pass
      - tagId
        Unique ID of the tag in a table of tags.
      - topicId
        Unlike status, tags aren't necessarily associated with a particular topic, but may be. Topic ID is optional and may be null.
        Our current thinking is that some tags might be topic specific but some might not. Therefore we plan to store a topic_id column in the database but make it nullable.
        
        XXX - This requires discussion with users.
      - userId
        Unique ID of person. Authorization is checked.
      - Comment
        Optional comment to record with the tag assignment. See also updateTagComment().
      - dt
        Optional datetime of assignment. Probably null when called from a user interface. May be used by batch programs.
    - Return
      Void.
    - Throws
      - EbmsException
        - If tag or topic not valid
        - If user not authorized
  - deleteTag()
    - Purpose
      Remove a tag from an article by setting its active_status value to inactive.
    - Pass
      - tagRowId
        We need this because one tag can be associated with an article multiple times. For example if we have a tag like "Board manager comment", there might be multiple occurrences from the same board manager.
        The calling program would typically find the unique row ID by using the getHistory() function.
      - userId
        The user must be authorized to create the tag in order to delete it.
        The user ID does not replace the ID of the original creator of the tag. However it will appear in the log file and in a default comment appended to the existing tag comment.
      - comment
        Optional comment associated with the deletion. We will append this comment to the existing tag comment (see updateTagComment()) and then mark the entire row in the table as inactive/deleted.
      - dt
        Optional datetime of the deletion. Else use current database datetime.
    - Return
      Void.
    - Throws
      - EbmsException
        If user not authorized or tagRowId does not exist.
  - updateTagComment()
    - Purpose
      Add more information to a tag comment. This is a useful function but it could get ridiculously complicated. To avoid too much complication, I propose that the program should do the following when updating a comment.
      - Append username to comment
      - Append date to comment
      - Append new comment text
      - Example
        A comment might look something like this:
        "[Alan Meyer, 2012-09-15 16:20:05] I think this is an interesting article that we should consider with regard to treatment of all lymphomas. [Bob Kline, 2012-09-15 09:01:55] It's not really relevant to all of them and, besides, the author has a dubious publication history."
    - Pass
      - tagRowId
        From getHistory().
      - userId
        Must be authorized.
      - comment
        Text to append
      - dt
        Optional datetime, normally defaulted to database datetime but may be set by batch programs that add tags.
    - Return
      New ArticleTag object with the updated comment information.
    - Throws
      - EbmsException
        - If user or authorization error
        - If updated comment would exceed space constraints
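
The append rule proposed for updateTagComment() can be sketched as a small string helper. The function name appendComment(), the single-space separator, and the 2000-character limit are assumptions for illustration; the real limit depends on the column definition, and the sample comment text is invented.

```php
<?php
// EbmsException is named in the draft; this empty class is a stand-in.
class EbmsException extends Exception {}

// Hypothetical helper: append a new entry in the
// "[user name, datetime] text" layout to an existing comment.
function appendComment($existing, $userName, $dt, $newText, $maxLength = 2000)
{
    $entry = sprintf('[%s, %s] %s', $userName, $dt, trim($newText));
    $updated = ($existing === null || $existing === '')
        ? $entry
        : $existing . ' ' . $entry;
    // 2000 is a placeholder; the real limit comes from the table definition.
    if (strlen($updated) > $maxLength) {
        throw new EbmsException('Updated comment would exceed space constraints');
    }
    return $updated;
}

$comment = appendComment('', 'Alan Meyer', '2012-09-15 16:20:05', 'First note.');
$comment = appendComment($comment, 'Bob Kline', '2012-09-16 09:01:55', 'Second note.');
echo $comment, "\n";
```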
Utility classes - NO
Used in multiple places.
- IdName - NO, WE DECIDED AGAINST THIS (or at least I did)
  - Purpose
    Used for passing information to and from functions.
  - Contains
    - itemId
      The unique internal identifier of something. It might be a status name, tag name, user, or something else. It can be used as a lookup in the database.
      If itemId exists, it is used to identify the item. In that case, itemName is always ignored.
    - itemName
      For items that are guaranteed to have unique names, if the itemId is not present but the name is present, the name will be used to lookup the unique ID when required. This is a convenience intended to provide better code self-documentation. Passing status
Other related classes
These are all representations of rows in the database. The methods for manipulating them are in the Article class.
Note: The names of fields given below are intended to describe the function of the field. They are not necessarily the names that will be used in the definitions of the fields in our PHP code.
- Issues
  The following issue(s) apply to all of these classes.
  - Visibility of contents
    We may want to make some or all fields in these objects private, accessible only by getter methods. That would enable us to provide lazy evaluation of data that is not stored in the rows themselves, for example user, status, or tag names that are represented in the database rows only by their IDs.
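
A sketch of what private fields with lazy evaluation could look like, using a cut-down ArticleStatus. The constructor arguments and the lookup callback are assumptions; in the real class the lookup would be a database query.

```php
<?php
// Sketch of a row class with private fields and a lazily evaluated name.
// The constructor signature and the lookup callback are assumptions.
class ArticleStatus
{
    private $statusId;
    private $statusName = null; // filled in on first request only
    private $nameLookup;

    public function __construct($statusId, callable $nameLookup)
    {
        $this->statusId = $statusId;
        $this->nameLookup = $nameLookup;
    }

    public function getStatusId()
    {
        return $this->statusId;
    }

    // Look the name up the first time it is asked for, then reuse it.
    public function getStatusName()
    {
        if ($this->statusName === null) {
            $this->statusName = call_user_func($this->nameLookup, $this->statusId);
        }
        return $this->statusName;
    }
}

$lookups = 0;
$status = new ArticleStatus(3, function ($id) use (&$lookups) {
    $lookups++;
    return "Status #$id"; // stand-in for a database query
});
echo $status->getStatusName(), "\n";
echo $status->getStatusName(), "\n";
echo $lookups, " lookup(s)\n";
```

Callers that never ask for the name never pay for the lookup, which matters when getHistory() builds many objects at once.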
- ArticleAuthor
  - Purpose
    Describe what we know about the identity of an author. Note that authors are not authority controlled. "Bush, G", "Bush, GHW", "Bush, George", "Bush, George Herbert Walker", etc. may in fact all be the same actual author but they will create separate author records in our database.
    The fields in our records are taken from the corresponding fields in Pubmed XML. Whatever they give us is what we have.
  - Contains
    - unique row ID
      Unique in EBMS/CiteMS. This is our ID, not something from NLM.
    - Last name
    - Forename
    - Initials
- ArticleImport
  - Purpose
    Describe an import of an article from an external source - always Pubmed in the initial version of the system. May be something else later.
  - Contains
    - unique row ID
    - article ID
    - source
      Initially "Pubmed".
    - user ID
    - user name
    - datetime of import
    - comment (if any)
    - importType
      "import" or "update". Updates are, initially, replacements of previously imported XML by newer XML from NLM.
- ArticleStatus
  - Purpose
    Describe the association of a single combination of article, topic, status, and date.
  - Contains
    - unique row ID
    - article ID
    - summary topic ID
    - summary topic name
    - status ID
    - status name
    - user ID
    - user name
    - datetime created
    - comment
    - active status
- ArticleTag
  - Purpose
    Describe the association of a single combination of an article, descriptive tag, and date.
  - Contains
    - unique row ID
    - article ID
    - summary topic ID?
      To be determined
    - tag id
    - tag name
    - user ID
      Of user creating the tag record. Other users may update the comment.
    - user name
    - datetime created
      Of original creation
    - datetime last updated
    - comment
    - active status
- ArticleTopic
  - Purpose
    Describe the association of an article and a summary topic.
  - Contains
    - article ID
    - summary topic id
    - summary topic name
    - user ID
    - user name
    - datetime association made
    - comment
    - active status
- ArticleImportEvent
  - Purpose
    These objects are used in the getHistory() results but they don't exist as such in the database. Import and update dates come from the article rows.
  - Contains
    - article ID
    - import type - "import" or "update"
    - user ID
    - user name
    - datetime
- ArticleResponse
  - Purpose
    Describe a board member's response to an article sent to him or her for review. "Responses" are currently controlled values picked from a list.
  - Issues
    - What is a response?
      Recording and retrieving responses is complicated by the fact that a board member can perform a single form submission with multiple responses. Currently, the multiple responses are encoded in a bitmap stored in a single row in the database - and associated with one optional free text comment. That might or might not change.
      - Multiple rows?
        We might treat each response in a single submission as a separate row in the response table, possibly with something that identifies them as all coming from the same board member response. User ID + datetime might be sufficient for that.
      - Multiple comments?
        An advantage (or is it a disadvantage?) of having multiple rows is that we can allow the board member to attach a comment to each response.
        
        Advantages
        It produces more precise, complete information.
        
        Disadvantage
        It puts potential additional burdens on the board members to produce more detailed responses. Current thinking is that this outweighs any advantage.
      - Impact on this function
        How we answer these questions will determine what the content of an ArticleResponse needs to be.
  - Contains
    - article ID
    - topic ID
    - topic name
    - board member ID
    - board member name
    - response ID
      This requires discussion. See above.
    - response text
      See above
    - Datetime

3.2 Article import related APIs

class PubmedInterface (for getCitesFromNLM())
- Purpose
  - Used in fetching articles from NLM
  - Maintains state between batches
- Fields
  - idList
    Array of IDs in the batch to get
  - nextId
    Index of next ID to fetch
  - lastFetchStart
    Datetime of our last transmission to NLM. We may need to throttle our interactions with NLM to avoid breaking their speed limits. We can consult this and the next two fields to decide whether we need to introduce a sleep time before the next fetch.
  - lastFetchEnd
    Date-time the last response from NLM was completely received
  - lastFetchCount
    Number of articles we requested in the last fetch.
- Methods
  - Constructor
    - Pass
      - idList
        Array of IDs to fetch, number limited only by available memory. IDs are numeric strings, e.g., array("12345678", "12345679", …)
    - Logic
      - Initialize
        - lastFetchStart = null
        - nextId = 0
  - getCitesFromNLM()
    - Purpose
      - Fetch articles from NLM
      - Used for:
        - Importing articles
        - Replacing articles with later versions
        - Any other reason we might have
    - Pass
      - fetchCount
        Count of articles to retrieve in next batch
      - rawMode
        - True = return the XML as retrieved from NLM
          Caller has to figure out what's inside, including figuring out whether some items were not returned. This is just raw data.
        - False (default) = return Article objects
    - Return
      - Raw XML, or
      - Array of Article objects
        - If PubmedID not found:
          The Article object will have nothing in it except the Pubmed ID and a status value = 'ERROR_NOT_FOUND'. It will be possible to have an array of, e.g., 50 article objects, some number of which have full data and some number of which contain this error.
        - If returned array has fewer cites than requested:
          - We've reached the end of the list
          - Returned array will be empty if called again
            But no error will be raised. It is okay to keep calling until the length of the returned array == 0.
    - Throws
      - EbmsException
        If we cannot connect to Pubmed or other Pubmed error. If we can't parse the returned information and rawMode == False. If HTTP return code indicates error.
    - Logic
      - Determine Pubmed throttle numbers
        These are probably just constants.
        - Max articles to get in one query
        - Min time between queries
      - returnList = array()
      - countRequested = 0
      - moreToRequest = length of ID list - nextId
      - if fetchCount > moreToRequest
        - fetchCount = moreToRequest
      - while moreToRequest
        - Construct a request to Pubmed
          - Fill it with some requested Pubmed IDs
            - Count is min of
              - Number of IDs remaining to be fetched
              - Maximum we allow in one fetch
            - Starting with nextId
          - Update nextId to point after the last requested ID
        - Apply throttling rules
          These might take day of week and time of day into account as well as lastFetchStart, lastFetchEnd, and lastFetchCount
          - If it's too soon to request more cites
            - Compute how long we have to wait
            - Sleep for that amount of time
        - Update lastFetchStart
        - Send the request to Pubmed
          - Throw exception if we cannot connect or get other error.
        - Get results
        - Update lastFetchEnd
          Indicates last
        - If rawMode
          - Set lastFetchCount = countToRetrieve
            This may be an overestimate since we might have gotten back fewer than we requested, but in raw mode we aren't parsing results to get an actual count. Using the countToRetrieve could cause us to sleep when we don't need to but won't cause us to upset NLM.
          - Return XML string
        - Parse data
        - For each Article
          - Construct an Article object (see Article object)
            Data is NOT saved in the database. To save the data call the object's save() method for each object.
          - Append the object to the array
        - Check for any PMIDs not returned by Pubmed
          - Construct an Article object for each one
          - Set
            - source = 'Pubmed'
            - sourceId = Pubmed ID
            - errMsg = "Not found in Pubmed on {current datetime}"
          - Append it to the array
            Or I can put all of these at the beginning or end if desired.
        - Update the lastFetchCount
        - Update moreToRequest
      - Return the array
      - Return the array
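
The batching and throttling part of this logic can be sketched without any network access. Everything here is a stand-in: the class name PubmedBatchSketch, the fetch callback that replaces the HTTP request to NLM, the plain arrays used in place of Article objects, and the limit values. The sketch caps each call at fetchCount, which is one reading of the logic above; it keeps not-found entries in request order and leaves out rawMode, lastFetchEnd, and lastFetchCount.

```php
<?php
// Sketch only: the fetch callback stands in for the HTTP request to NLM and
// plain arrays stand in for Article objects. Limits are placeholder values.
class PubmedBatchSketch
{
    const MAX_PER_QUERY = 2;       // assumed max articles in one query
    const MIN_SECONDS_BETWEEN = 0; // assumed min time between queries

    private $idList;
    private $nextId = 0;            // index of next ID to fetch
    private $lastFetchStart = null; // time of our last request
    private $fetch;

    public function __construct(array $idList, callable $fetch)
    {
        $this->idList = array_values($idList);
        $this->fetch = $fetch;
    }

    public function getCites($fetchCount)
    {
        $returnList = array();
        $moreToRequest = min($fetchCount, count($this->idList) - $this->nextId);
        while ($moreToRequest > 0) {
            // Take the next slice of IDs, never more than one query allows.
            $count = min($moreToRequest, self::MAX_PER_QUERY);
            $ids = array_slice($this->idList, $this->nextId, $count);
            $this->nextId += $count;

            // Throttle: sleep if the previous query started too recently.
            if ($this->lastFetchStart !== null) {
                $wait = self::MIN_SECONDS_BETWEEN - (time() - $this->lastFetchStart);
                if ($wait > 0) {
                    sleep($wait);
                }
            }
            $this->lastFetchStart = time();

            // The callback returns pmid => title for the IDs it found.
            $found = call_user_func($this->fetch, $ids);
            foreach ($ids as $pmid) {
                if (isset($found[$pmid])) {
                    $returnList[] = array('sourceId' => $pmid, 'title' => $found[$pmid]);
                } else {
                    $returnList[] = array('sourceId' => $pmid, 'status' => 'ERROR_NOT_FOUND');
                }
            }
            $moreToRequest -= $count;
        }
        return $returnList;
    }
}

// Fake "Pubmed" that knows two of the three requested IDs.
$known = array('11111111' => 'First', '33333333' => 'Third');
$requests = 0;
$sketch = new PubmedBatchSketch(
    array('11111111', '22222222', '33333333'),
    function (array $ids) use ($known, &$requests) {
        $requests++;
        return array_intersect_key($known, array_flip($ids));
    }
);

// Keep calling until an empty array signals the end of the list.
$all = array();
do {
    $batch = $sketch->getCites(2);
    $all = array_merge($all, $batch);
} while (count($batch) > 0);
print_r($all);
```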
extractPubmedIds()
- Purpose
  Read a file and extract Pubmed IDs from it. The file typically contains search results from a Pubmed search.
- Pass
  - idHandle
    Could be a file handle or a file-handle-like memory object (such as an IOStream).
    Initially, we need to support files in Medline print format, which is the format currently used by CIAT to retrieve the results of searches. Later we can add parsers for any other format that we need to support.
  - format
    Default = "Pubmed". Others supported later if needed.
- Return
  - Array of Pubmed IDs as strings, e.g., '12345678'
- Throws EbmsException if
  - Could not open file
  - Could not parse file (bad format)
- Logic
  Logic depends on input format. For Medline print format we:
  - idList = array()
  - While we can get another line from the idHandle
    - If line matches "^PMID- (\d{x,8})$"
      - Get numeric digits as a string
      - Append string to idList
  - Return idList
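The Medline print format logic above can be sketched as follows. This is a minimal Python illustration, not the PHP implementation; the function name extract_pubmed_ids and the lower bound of 1 digit (the draft leaves the minimum as "x") are assumptions.

```python
import io
import re

# Assumed lower bound of 1 digit; the draft leaves the minimum unspecified.
PMID_LINE = re.compile(r"^PMID- (\d{1,8})$")


def extract_pubmed_ids(id_handle):
    """Collect PMIDs, as strings, from a file-like object in Medline print format."""
    id_list = []
    for line in id_handle:
        # Strip the line ending so the pattern's end anchor sees only the digits.
        match = PMID_LINE.match(line.rstrip("\r\n"))
        if match:
            id_list.append(match.group(1))
    return id_list


# Usage with an in-memory handle instead of a real file:
sample = io.StringIO("PMID- 12345678\nOWN - NLM\nTI  - Some title\n\nPMID- 999\n")
ids = extract_pubmed_ids(sample)
```

Because the function only iterates over lines, the same code accepts an open file or an in-memory stream, which matches the idHandle description.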
importArticles()
- Purpose
  - Load articles identified by Pubmed ID into the database
    The load can be live, to actually add the articles, or test, to report what the effect of loading the articles would be without changing the database.
- Pass
  - revCycleId
  - summaryTopicId
  - userId
  - pubmedIds
    Can be a single Pubmed ID string, or an array of Pubmed ID strings. The array can come from extractPubmedIds(), or from an interface that allows direct user PMID input.
  - mode - string
    - "live" = load the data
    - "test" = report only
  - ignoreNotList - boolean
    True = don't check the NOT lists. Allows user to import record(s) that otherwise could not be imported. Defaults to False.
  - tagArray - optional array of pairs of
    - tagId - integer
      Unique ID of a descriptive "tag" to assign to each imported cite.
    - comment - string
      Optional comment string to associate with that tag
- Return
  When running in test or live mode we return the same information.
  - Note
    In live mode we also save some of this information in tables that enable us to determine what happened in any import batch, so that we can reconstruct the return values for report generation after the fact. If the user dismisses the report she can still request a reconstruction.
    We'll also need an interface that Bob can call to get the reconstructed data.
  - array of
    - Array of tuples of
      - Article object
      - Array of result identifier strings, one or more of:
        
        "new"
        Article was not in the database, would be (test mode) or was (live mode) added.
        
        "new topic"
        Article was in the database but did not have this topic assigned. New topic would be (test mode), or was, assigned.
        
        "duplicate"
        Article was in the database for this topic. It may be that the topic was de-assigned at one time but we still regard it as a duplicate. See Concepts below.
        
        "NOT listed"
        Article was published in a journal that is on the NOT list.
        
        "replacement"
        Article fetched from Pubmed is newer than what we've got. In live mode it would replace the record we have.
        
        "error"
        Document could not be loaded. No articleId. otherId might or might not be present, depending on the error. messages is an array of one or more strings, each containing one error. A typical example of an error is that we searched Pubmed for this ID but didn't find it (should be extremely rare.)
- Throws EbmsException
  If the whole thing fails, e.g., can't connect to NLM, maybe others.
- Logic
  - OBSOLETE LOGIC - TO BE REPLACED
    - Create arrays to hold different categories
      Categories are "new", "new topic", etc. Each starts as an empty array.
    - Filter out duplicate articles that will not be loaded
      - Concepts
        Concept is to search the database for this article by its source (i.e. pubmed) ID and get any and all summary topic ids that have been assigned to it.
        If this topic was ever assigned to this article, we will reject the import, even if the assignment was changed later on. The article has already been considered for the topic at hand, and a reviewer decided that, even though it may have looked like it was about this topic based on Pubmed indexing, it was not.
        
        If the user changes her mind, the right way to do that is not via the import process. The user should visit the article in the database and add a summary topic to it. The reason for this is that re-assigning a topic to an article for which it was already rejected based on automatic search criteria would likely produce more errors than re-assigning by hand.
        
        We also have to decide whether to search all articles in one fell swoop (SELECT … WHERE source_id IN (… all from the batch …)), or to search one at a time, which takes more time but simplifies the logic. The simplification may be worthwhile, but the fell swoop approach is not that hard to do (unless MySQL has small limits on the number of values in an IN clause).
      - Search for article ID and (ever) associated topics
      - For each hit
        
        Append (articleId, sourceId, article) to the list of duplicates
    - While there are articles to fetch from NLM
      - Construct a search requesting MAX_NLM_FETCH hits from NLM
        We put up to MAX_NLM_FETCH IDs in one request
      - Connect to NLM and execute search
      - Fetch results in XML
      - Parse results into individual articles
      - For each article
        
        If problem with result
        e.g., Pubmed says it doesn't have it, we're missing critical information in the parse, whatever
        
        Append Pubmed ID + error message(s) to "error" array
        
        Continue
        
        Get Pubmed ID, Pubmed Journal ID, brief citation string
        
        Create array of articleId, sourceId (pubmed ID), brief citation
        articleId is still null at this point
        
        Check database for pubmed ID and all associated topic IDs
        
        If cite in database with same topic
        
        Append information to "duplicate" array
        
        Continue
        
        If not ignoreNotList
        
        Check if journal ID is in not list
        
        If NOT listed
        
        Append information to "NOT listed" array
        
        Continue
        
        If in database but not with this topic
        
        Append to "new topic" array
        
        Continue
        
        Append to "new" array
      - If "live" mode
        
        For each cite in "new" array
        
        Save the data, getting new articleId
        
        For each cite in "new topic" array
        
        Get the internal cite ID
        
        Add existing topics to messages?
        
        Replace null with internal EBMS/CiteMS cite ID in array
        
        Create association with summaryTopic, reviewCycle
        
        If any tags exist, create associations between tag and articleId
    - Construct array of arrays
    - Return it
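The order of checks in the per-article loop above can be sketched as a single classification function. This is a minimal Python illustration of the draft logic as written (which is marked as obsolete and may change); classify_article and its parameters are hypothetical names, and the "replacement" and "error" results are left out because the logic above does not say where they are decided.

```python
def classify_article(existing_topics, topic_id, journal_not_listed,
                     ignore_not_list=False):
    """Return the result identifier string for one fetched article.

    existing_topics: None if the article is not in the database, otherwise
        the set of summary topic IDs ever assigned to it (including topics
        that were later de-assigned).
    journal_not_listed: True if the article's journal is on the NOT list.
    """
    # A topic that was ever assigned counts as a duplicate, even if removed later.
    if existing_topics is not None and topic_id in existing_topics:
        return "duplicate"
    # The NOT list is consulted only when the caller has not overridden it.
    if not ignore_not_list and journal_not_listed:
        return "NOT listed"
    # Known article, but never considered for this topic.
    if existing_topics is not None:
        return "new topic"
    return "new"
```

In test mode the caller would only collect these strings; in live mode it would also save "new" articles and create the topic associations.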
getImportInfo()
- Purpose
  Retrieve information from a past live importArticles() run.
- Pass
  - batchId
    Locates the specific import batch
    Caller queries the database to see what batches exist, then selects a batch ID to pass to this function.
- Return
  See importArticles()
checkCitesFromNLM()
- Purpose
  Get a list of articles that are not yet in final Medline status in our database. Called by updateCitesFromNLM() but also accessible to other callers.
- Pass
  - Void
- Return
  - Array of
    - Array of
      - article ID
      - Pubmed status in our database
- Logic
  - Precondition
    - We've stored the latest status visible to MySQL
      We will put a column in the cite/article table for the latest status of any article in our database. When an article is replaced, the latest status is also replaced.
  - Search
    SELECT article_id, pubmed_status FROM article WHERE pubmed_status IN (the list, see Issues)
  - Return array of IDs
    May be empty.
- Issues
  - Do we need more details?
    Is there any value to further subdividing these by specific status, e.g., array of arrays of different PubStatus values? The updateCitesFromNLM() function probably doesn't need these.
  - What are the PubMedPubDate/@PubStatus values?
    - Found in the data I've examined:
      accepted aheadofprint entrez epublish medline ppublish pubmed received revised
    - But the documentation only lists 4 possible
      received accepted revised aheadofprint
      See http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.History_O
    - What status values should we check for replacements?
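The search step can be sketched with a parameterized IN clause. This is a minimal Python illustration that uses an in-memory SQLite database as a stand-in for MySQL. The table and column names come from the SELECT above, but the function name and the default status list are assumptions, since which statuses to check is still an open issue.

```python
import sqlite3

# Assumption for illustration only; the real list is an open issue in this draft.
PRELIMINARY_STATUSES = ("aheadofprint", "pubmed")


def check_cites_from_nlm(conn, statuses=PRELIMINARY_STATUSES):
    """Return (article_id, pubmed_status) pairs for articles not in final status."""
    if not statuses:
        return []
    # One placeholder per status keeps the values out of the SQL text.
    placeholders = ", ".join("?" for _ in statuses)
    sql = (
        "SELECT article_id, pubmed_status FROM article "
        f"WHERE pubmed_status IN ({placeholders}) ORDER BY article_id"
    )
    return [tuple(row) for row in conn.execute(sql, tuple(statuses))]
```

The result may be an empty list, which matches the "May be empty" note in the logic.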
updateCitesFromNLM()
- Purpose
  Replace articles in pre-medline or other preliminary status with the versions containing full and complete processing.
- Pass
  - Void
    We're going to check all of the articles that are now in an incomplete status. If the checking is done regularly (e.g. weekly or even monthly) the total number to check won't be that great.
- Return
  - Array of
    - Count of replaced articles
    - Count of those we searched for that were not replaced
- Throws
  - Exception from PubmedInterface if we can't connect, etc.
- Logic
  - idList = checkCitesFromNLM()
  - nlmInterface = new PubmedInterface(idList)
  - citeList = array()
  - updatedCount = unchangedCount = 0
  - Do
    - citeList = nlmInterface.getCitesFromNLM(some reasonable number)
      Exception bubbles up if thrown
    - For each cite in the citeList
      - If cite was not found (pubmedStat == ERROR…
        
        Increment unchangedCount
      - Else if cite.pubmedStatus != status of id in idList
        
        cite.save()
        
        Increment updatedCount
      - Else
        
        Increment unchangedCount
  - while count(citeList) > 0
  - Return array(updatedCount, unchangedCount)
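The loop above can be sketched as follows. This is a minimal Python illustration, not the PHP implementation. The function name, the dictionary shape of a cite, and the fetch_batch and save callables are hypothetical stand-ins for PubmedInterface.getCitesFromNLM() and the Article save() method. The sketch counts every cite that is not replaced, whether not found or unchanged, as unchanged.

```python
def update_cites(known_status, fetch_batch, save, batch_size=100):
    """Replace cites whose Pubmed status changed; return (updated, unchanged).

    known_status: dict mapping Pubmed ID to the status stored in our database.
    fetch_batch: callable returning the next list of cites, empty when done.
    save: callable that stores one replacement cite.
    """
    updated = unchanged = 0
    while True:
        cites = fetch_batch(batch_size)  # any exception propagates to the caller
        if not cites:
            break
        for cite in cites:
            if cite.get("error"):
                # Not found in Pubmed: nothing to replace.
                unchanged += 1
            elif cite["status"] != known_status[cite["pmid"]]:
                save(cite)
                updated += 1
            else:
                unchanged += 1
    return updated, unchanged
```

Passing the fetch and save steps in as callables keeps the counting logic testable without a connection to NLM or the database.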

Author: Alan Meyer <alan@NCI-01802749>

Date: December 29, 2011
