API for Drupal access to CiteMS functions
Draft 1.0
1 Changes
This draft contains the application programmer interface (API) definitions for Article information and for importing Articles from the National Library of Medicine. The draft incorporates revisions from two walkthroughs between Bob and myself.
An API for searching for articles has been removed from the draft. I think it will be necessary to learn more about the Drupal search API and do some prototyping with it before deciding what or how much functionality should be programmed independently of Drupal's existing functions.
2 Concepts
2.1 PHP
Functionality is implemented in pure PHP modules, with no HTML involved.
2.2 Isolation of database from user interface
The user interface ("view" in MVC parlance) calls those modules to manipulate article data ("model").
2.3 Logging
The update functions described here should log their actions. Logging will be primarily for debugging.
2.4 Transactions
Many of the functions documented here can update multiple rows in the database. We don't yet know how much transaction support we will get from Drupal for controlling this, but Drupal 7 appears to offer more support than Drupal 6. Good logging might help us if transactions are not fully supported.
2.5 Authorization control
The API retrieves information that is not security sensitive. We therefore envision no authorization control on functions that retrieve article data from the database.
However, all functions that update the database will require that the ID of an authorized user be passed to the function. Authorization will be checked and the user ID logged with the update.
2.6 Optimization notes
Preference will be given to simplicity and maintainability over optimization until experience shows the need for optimization. Some optimization ideas that can be considered include:
- Caching lookups in static class variables
Assuming classes (not objects) are initialized once for each HTTP request, we can use static variables to hold such things as the human readable name associated with a controlled identifier like a Tag or Status or Response value. Another possibility is to cache the result of an authorization check for a user. In a batch operation that calls a function multiple times, like an object constructor or a name lookup, the first call can cache its results in the class variables and subsequent calls can get those results without having to query the database again and again.
- Batching operations
Some functions may be written to perform an action, say, 100 times on 100 different article records. These could be rewritten to process all 100 in memory and read all of the database information into a single result set.
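The idea of caching lookups in static class variables can be sketched as follows. This is only an illustration, written in Python to keep it short; a PHP version would hold the dictionary in a static class property in the same way. The class name `NameCache`, the `fetch_name` callback, and the sample status names are all hypothetical.

```python
from typing import Callable, Dict


class NameCache:
    """Caches ID -> human readable name lookups at class level.

    In PHP the dictionary below would be a static class property, which
    lives for one HTTP request. All names here are hypothetical.
    """

    _names: Dict[int, str] = {}   # shared by every caller (class level)
    db_queries = 0                # counts real lookups, for illustration

    @classmethod
    def get_name(cls, item_id: int, fetch_name: Callable[[int], str]) -> str:
        # Only the first request for an ID reaches the database.
        if item_id not in cls._names:
            cls.db_queries += 1
            cls._names[item_id] = fetch_name(item_id)
        return cls._names[item_id]


# Stand-in for a database table of status names (made-up values).
FAKE_STATUS_TABLE = {1: "Imported", 2: "Passed abstract review"}


def fetch_from_db(item_id: int) -> str:
    return FAKE_STATUS_TABLE[item_id]


# Five lookups, but only two distinct IDs, so only two "queries" run.
names = [NameCache.get_name(i, fetch_from_db) for i in (1, 2, 1, 1, 2)]
```

After the five calls, `NameCache.db_queries` is 2, which is the saving the text describes for batch operations.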
3 API
3.1 Article related APIs
- class Article
An Article object encapsulates a single article in the database. Many of the functions are intended only to be used during import or replacement operations. Others are for EBMS programs running in support of users.
- Constructor
There are several forms of constructor, using a unique EBMS/CiteMS ID, using a source name ("Pubmed") and source ID, and one using XML.
- Import related methods()
These are defined in EbmsCite.php and are not relevant here. They do not require an Article object.
- User service related methods()
Getters for information about an article. Some of these simply return information from the Article object. Some go back to the database to get information which is not brought into memory unless and until it is requested.
- getId()
- getSource()
Initially just "Pubmed"
- getSourceId()
Initially this is just the Pubmed ID
- getSourceStatus()
Tells whether this is the final Pubmed version of the article, or something preliminary. May not be relevant for other data sources.
- getSourceJrnlId()
For Pubmed, this is the NLM journal ID, not the ISSN
- getJrnlTitle()
Title as given in the article record. This is not an authority controlled title.
- getBrfJrnlTitle()
As found in the article record.
- getImportDate()
Date first imported for any board or topic
- getUpdateDate()
Date data was last updated from source (i.e. Pubmed for now). It may be the import date or may be later.
- getTitle()
Article title
- getAuthors()
- Purpose
Retrieve author names for an article. The author names are NOT authority controlled. They are whatever appears in the Pubmed record (or other record if we ever have others) for the article.
- Pass
Optional limit on the number to retrieve. The caller may only want one or a few.
- Return
Array of ArticleAuthor objects (q.v. - we may prefer simple strings instead), one for each author. Objects are returned in the order that they appeared in the source article. The first object identifies and names the first author, and so on.
- Logic
The data comes from the authors table, not the article record.
- getPubDate()
Return the publication date as a string. This is NOT a SQL date and is not easy to compare.
- Note
We need to discuss what to produce here. Pubmed XML includes different fields for JournalIssue/PubDate. The fields can include Year, Month, Day, Season, MedlineDate (e.g. "2004 Jul-Aug", or "2004 Oct 2-8").
I have always seen Year present, either as an element or as the first four characters of the MedlineDate. However the sample I observed was only a fraction of the total database. I also saw 3 Season="Autumn" and 78 Season="Fall" in my sample of 51,845 records.
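One way to act on the observation about Year and MedlineDate is sketched below. It is not a decision about what getPubDate() should return; it only shows how a year could be pulled from either field. Python is used for brevity, the function name `pub_year` is hypothetical, and the XML fragments are simplified samples rather than full Pubmed records.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def pub_year(pubdate_xml: str) -> Optional[str]:
    """Return the four digit year from a PubDate element, or None."""
    pubdate = ET.fromstring(pubdate_xml)
    year = pubdate.findtext("Year")
    if year:
        return year.strip()
    medline_date = pubdate.findtext("MedlineDate")
    # Per the observation above, the year is the first four characters.
    if medline_date and medline_date[:4].isdigit():
        return medline_date[:4]
    # No year found. The sample suggests this is rare, but it is possible.
    return None


a = pub_year("<PubDate><Year>2004</Year><Month>Jul</Month></PubDate>")
b = pub_year("<PubDate><MedlineDate>2004 Jul-Aug</MedlineDate></PubDate>")
c = pub_year("<PubDate><Season>Fall</Season></PubDate>")
```

Here `a` and `b` are both "2004" and `c` is None, so a caller would still need a policy for records with no recoverable year.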
- getAbstract()
Formatted as text.
- getCitation()
- Note
This is the human readable citation to the article, for example something like: "BriefJrnlTitle, pubdate; vol(issue);pages" - or whatever we think is better. I'm conceiving this as a string that is stable and stored, not something we change for different purposes and create by re-parsing XML.
We may not be able to have just one of these. The old system had just one such string for each article, derived from the Medline Print format "SO" field. However the new EBMS has two different strings used as brief citations. We may have to store two, or store one and construct the other on the fly, or store the parts and construct any of them on the fly.
- getXml()
- hasPdf()
Returns true if we have one, else false.
- getPdf()
Returns PDF as a string of bytes, null if we don't have one.
- setPdf()
- Purpose
Store a PDF in association with an article. If a PDF is already associated with an article, this one replaces it - for example if a wrong or corrupt PDF had been previously stored.
- Pass
- PDF
TBD: This might be the PDF or the URL of a PDF in the file system.
- userId
Used for checking and logging. Not otherwise stored.
- Return
Void
- Throws
- EbmsException
If IO, authorization, or other(?) error.
- getHistory()
- Purpose
Retrieve a complete history of everything that happened to this article - import, topics assigned, status, tags, etc.
- Pass
- filters
An array of keyword => value arguments used to filter out what might not be wanted. Not sure what these are yet, but they would probably include at least the following. More may be added later if required.
- startDateTime
Retrieve history information from this datetime forward. Default if no startDateTime specified is from the first appearance of the article in the database.
- endDateTime
Retrieve history information up to but not including this datetime. Default if no endDateTime specified is to retrieve information up to and including the latest.
- topicId
Unique ID specifying a Summary topic. If specified, only the history under that single topic is retrieved. Default is to retrieve history for this article in association with any and all topics.
- Note:
Any information not specific to a topic, e.g., import or update date, will always be retrieved. For this purpose, the import datetime and the datetime of the association with the topic for which the import was performed are treated as separate events - even though they have identical datetime stamps and were specified by a user in the same import operation.
- Note:
We may want to be able to specify more than one topic, but not necessarily all topics for a board. If there is a use case for this then we can allow an array of topic IDs and the software will test the parameter to see if it's a single object or an array.
- boardId
Unique ID specifying an editorial board. If specified, only retrieve the history as it pertains to a particular editorial board. If the article is associated with only one topic for this board the result is identical to specifying that topic ID. However if two or more topics from the same board are associated with the article then they will all appear in the history.
The notes on non-topic specific items and multiple topics under topicId also apply to non-board specific items.
- get_status
- True
Get all status values. The default.
- False
Do not include status values
- get_tags
- True
Get tags, the default
- False
Leave out tag info
- get_responses
- True
Get board member responses, the default
- False
Leave out response info
- sortOrder
Tells how to order results.
- "date"
Order objects by datetime. This is the default - though we can change the way it works if it is not the most commonly requested sort order.
If datetimes are identical, priorities are:
- non-board, non-topic specific items
An import event is shown before any status or tag assignment with the same datetime value.
- status before tag
If a status assignment and a tag assignment are made in one operation with the same datetime, the status assignment will be ordered first.
- "board"
Objects are partitioned by editorial board, then sorted by date within board. Non-board specific items (imported, modified) are shown before all board specific items. Boards are in alphabetical order.
- "topic"
Objects are partitioned by topic. Non-topic specific items come first. Topics are in Summary topic name alphabetical order.
- Return
Array of objects, in specified (or defaulted) sort order.
The caller should use the PHP instanceof operator or the get_class() function to determine how to interpret each object returned. Possible object types are:
- ArticleImportEvent - also used for updates.
These are potentially independent of Summary topic and are therefore not "status" records.
- ArticleTopic
- ArticleStatus
- ArticleTag
- ArticleResponse
- CiteDecision?
If final decisions are "status" values, then we don't need this.
- Note on logic optimization
History is not expected to be large for most articles. It is therefore desirable to program for simplicity and maintainability and only optimize if experience shows the need.
- Logic
- Get import and modified dates
Construct ArticleImport objects.
- Get list of topic assignments
Select those of interest or fetch all and delete any that are of no interest. Construct ArticleTopic objects.
- Get list of status assignments
Select those of interest or filter out others after select. Construct ArticleStatus objects.
- Get list of tag assignments
Select those of interest or filter out others after select. Construct ArticleTag objects.
- Get list of board member responses
Construct ArticleResponse events.
- Create an in-memory index of all objects for sorting
The sort keys are defined in a sortOrder dependent way and extracted in an object dependent way.
- Sort the key=>object list
- Produce a simple array
0=>first object in sortOrder 1=>next object …
- Return it
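The "date" sort order and its tie-break rules could be implemented with a compound sort key, as in the sketch below. Python is used only for brevity. The `HistoryEvent` class and the `kind` strings are hypothetical stand-ins for the object types listed under Return. The position of event kinds that the tie-break rules do not mention is an assumption: here they sort last.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class HistoryEvent:
    kind: str          # e.g. "ArticleImportEvent", "ArticleStatus", "ArticleTag"
    when: datetime
    label: str = ""


# Tie-break order for identical datetimes: import, then status, then tag.
# Where other event kinds fall is an assumption: they sort after these.
TIE_BREAK = {"ArticleImportEvent": 0, "ArticleStatus": 1, "ArticleTag": 2}


def sort_by_date(events):
    """Return a plain list ordered by datetime, then by tie-break priority."""
    def key(event):
        return (event.when, TIE_BREAK.get(event.kind, len(TIE_BREAK)))
    return sorted(events, key=key)


t1 = datetime(2012, 9, 15, 9, 0, 0)
t2 = datetime(2012, 9, 16, 9, 0, 0)
events = [
    HistoryEvent("ArticleTag", t1, "tag"),
    HistoryEvent("ArticleStatus", t2, "later status"),
    HistoryEvent("ArticleStatus", t1, "status"),
    HistoryEvent("ArticleImportEvent", t1, "import"),
]
ordered = [e.label for e in sort_by_date(events)]
```

The three events sharing `t1` come out as import, status, tag, followed by the later status. The "board" and "topic" orders would only need a different leading element in the key.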
- getTopics()
- Purpose
Get a list of all of the topics associated with this article.
- Pass
Void.
- Return
Array of ArticleTopic objects containing information about the association.
- hasTopic()
- Purpose
Tell whether the article is associated with a specific topic.
- Pass
- topicId
- Return
- True - article is associated with this topic
- False - it's not
- Note:
It doesn't matter what the status is for this topic.
- Throws
- EbmsException
If topicId or Name is invalid.
- getBoards()
- Purpose
Get a list of all of the editorial boards associated with this article.
- Pass
Void
- Return
Array of ArticleBoard objects containing information about the association.
- hasBoard()
- Purpose
Tell whether the article is currently associated with a specific board.
- Pass
- boardId
- Return
- True - article is associated with this board
- False - it's not
- Throws
- EbmsException
If boardId or Name is invalid.
- getStatus()
- Purpose
Find the current (i.e. last assigned) status of an article. Status is topic specific and there is a current status for each topic associated with an article.
- Pass
- topicId
Caller must specify a topic. If the caller needs to see the status for more than one topic that may have been assigned to the article, he should call getHistory() instead.
- Return
- Status object for the last assigned status
Includes information about when assigned, who assigned it, etc.
- or Null
If the article is not associated with that topic. This is not an error and does not raise an exception.
- Throws
- EbmsException
If database fails. Failure to find an association between the passed topic and the article ID does not throw an exception.
- setStatus()
- Purpose
Set a new current status for an article.
- Pass
- statusId
- topicId
A status is always associated with a specific topic identified by unique ID. This is checked. The article must already be associated with the topic.
- userId
Unique ID of person. Authorization is checked.
- Comment
Optional comment to record with the status assignment.
- dt
Optional datetime of assignment. Probably null when called from a user interface. May be used by batch programs like the import program to ensure that the assignment of the initial topic for the article has an identical datetime to the import datetime.
If null, use DBMS current datetime.
- Return
- uniqueRowId
This is the unique ID of the database row representing the association of the article and the status, not the unique ID of the status.
- Throws
- EbmsException
- If status not valid or not allowed at this point in processing
- If article not associated with the passed topic
- If user not authorized
- Logic
- Check user authorization
No database update allowed unless user is authorized.
- Check that article is associated with the passed topic.
- Check that status makes sense
Not sure what all the checks here will be, but there are some status values that might be illogical, for example requesting a PDF when the PDF has already been stored, or setting a status value with a topic that is not associated with this article (unless the status being set is the association itself.)
- Create row in the status table
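The order of checks in the setStatus() logic can be sketched as follows. This is a rough illustration in Python, not the PHP implementation. `StatusStore` is a hypothetical in-memory stand-in for the database. Its `EbmsException` is a local placeholder for the exception class named in this document. The "status makes sense" step is reduced to a simple validity check because the real rules are still open.

```python
from datetime import datetime
from typing import Optional


class EbmsException(Exception):
    """Local placeholder for the exception class named in this document."""


class StatusStore:
    """In-memory stand-in for the status table (hypothetical)."""

    def __init__(self, authorized_users, article_topics, valid_statuses):
        self.authorized_users = set(authorized_users)
        self.article_topics = set(article_topics)   # (article_id, topic_id)
        self.valid_statuses = set(valid_statuses)
        self.rows = []                               # one dict per status row

    def set_status(self, article_id, status_id, topic_id, user_id,
                   comment: Optional[str] = None,
                   dt: Optional[datetime] = None) -> int:
        # 1. No database update unless the user is authorized.
        if user_id not in self.authorized_users:
            raise EbmsException("user not authorized")
        # 2. The article must already be associated with the topic.
        if (article_id, topic_id) not in self.article_topics:
            raise EbmsException("article not associated with topic")
        # 3. The status must make sense (only validity is checked here).
        if status_id not in self.valid_statuses:
            raise EbmsException("status not valid")
        # 4. Create the row; a null dt means "use the current datetime".
        row_id = len(self.rows) + 1
        self.rows.append({
            "row_id": row_id, "article_id": article_id,
            "status_id": status_id, "topic_id": topic_id,
            "user_id": user_id, "comment": comment,
            "dt": dt or datetime.now(), "active": True,
        })
        return row_id


store = StatusStore(authorized_users=[7], article_topics=[(100, 3)],
                    valid_statuses=[1, 2])
row_id = store.set_status(article_id=100, status_id=2, topic_id=3, user_id=7)

try:
    # Topic 99 was never associated with article 100, so this must fail.
    store.set_status(article_id=100, status_id=2, topic_id=99, user_id=7)
    rejected = False
except EbmsException:
    rejected = True
```

The returned `row_id` identifies the new status row, and the second call is rejected without creating a row.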
- deleteStatus() ?
- Purpose
Normally, status would never be deleted; it would be superseded by the next status value assigned. Only if an error occurred and a status was assigned by mistake would we want to delete the status.
Deletions would be performed by updating the column in the status table that says whether the row is active or not.
- Pass
- statusRowId
The unique ID of the row in the status table that is to be deleted. This is NOT the ID of the status name in a table of valid status values. It is the ID of a database row that associates a status ID and an article ID. The caller should get this id by calling getStatus() or getHistory() or otherwise reading the table.
- userId
User with authorization to set status who is performing the action.
- comment
Reason for the inactivation. Calls updateStatusComment() to insert the reason.
- Return
Void
- Throws
- EbmsException
If user not authorized or the requested operation is invalid. See Logic.
- Logic
- Check user authorization
- Perform any consistency checks
Don't know what these are yet. There may be references to the status row somewhere that would make it invalid to inactivate the status. There may be DBMS constraints to enforce these, or applications code. Or there may not be any.
- Update the row.
- updateStatusComment()
- Purpose
It may be desirable to allow comments associated with status to be updated, for example to explain some special circumstance regarding the status assignment, or to correct a previous comment.
Existing comments are appended to, not overwritten. See updateTagComment() for a description.
- Pass
- userId
- comment
Text to append.
- dt
Optional datetime for use by batch programs that synchronize datetime stamps with other operations. Default is database current datetime.
- addTag()
- Purpose
Associate a new descriptive tag with an article. Tags are for search and comment purposes, used when it is desired to mark an article for some purpose without changing its status.
- Pass
- tagId
Unique ID of the tag in a table of tags.
- topicId
Unlike status, tags aren't necessarily associated with a particular topic, but may be. Topic ID is optional and may be null.
Our current thinking is that some tags might be topic specific but some might not. Therefore we plan to store a topic_id column in the database but make it nullable.
XXX - This requires discussion with users.
- userId
Unique ID of person. Authorization is checked.
- Comment
Optional comment to record with the tag assignment. See also updateTagComment().
- dt
Optional datetime of assignment. Probably null when called from a user interface. May be used by batch programs.
- Return
Void.
- Throws
- EbmsException
- If tag or topic not valid
- If user not authorized
- deleteTag()
- Purpose
Remove a tag from an article by setting its active_status value to inactive.
- Pass
- tagRowId
We need this because one tag can be associated with an article multiple times. For example if we have a tag like "Board manager comment", there might be multiple occurrences from the same board manager.
The calling program would typically find the unique row ID by using the getHistory() function.
- userId
The user must be authorized to create the tag in order to delete it.
The user ID does not replace the ID of the original creator of the tag. However it will appear in the log file and in a default comment appended to the existing tag comment.
- comment
Optional comment associated with the deletion. We will append this comment to the existing tag comment (see updateTagComment()) and then mark the entire row in the table as inactive/deleted.
- dt
Optional datetime of the deletion. Else use current database datetime.
- Return
Void.
- Throws
- EbmsException
If user not authorized or tagRowId does not exist.
- updateTagComment()
- Purpose
Add more information to a tag comment. This is a useful function but it could get ridiculously complicated. To avoid too much complication, I propose that the program should do the following when updating a comment.
- Append username to comment
- Append date to comment
- Append new comment text
- Example
A comment might look something like this:
"[Alan Meyer, 2012-09-15 16:20:05] I think this is an interesting article that we should consider with regard to treatment of all lymphomas. [Bob Kline, 2012-09-15 09:01:55] It's not really relevant to all of them and, besides, the author has a dubious publication history."
- Pass
- tagRowId
From getHistory().
- userID
Must be authorized.
- comment
Text to append
- dt
Optional datetime, normally defaulted to database datetime but may be set by batch programs that add tags.
- Return
New ArticleTag object with the updated comment information.
- Throws
- EbmsException
- If user or authorization error
- If updated comment would exceed space constraints
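The append rule proposed for updateTagComment() (user name, then date, then new text) can be sketched as a single string operation. The example is in Python for brevity only. The function name `append_comment`, the space used as a separator, and the `MAX_COMMENT_LENGTH` value are assumptions. The `EbmsException` class is a local placeholder for the one named in this document.

```python
from datetime import datetime


class EbmsException(Exception):
    """Local placeholder for the exception class named in this document."""


MAX_COMMENT_LENGTH = 2000   # assumption: the real column size is not yet known


def append_comment(existing: str, user_name: str, text: str,
                   dt: datetime) -> str:
    """Append "[user, datetime] text" to an existing comment string."""
    stamp = dt.strftime("%Y-%m-%d %H:%M:%S")
    entry = "[{}, {}] {}".format(user_name, stamp, text)
    # Earlier entries are kept; the new entry goes at the end.
    updated = entry if not existing else existing + " " + entry
    if len(updated) > MAX_COMMENT_LENGTH:
        raise EbmsException("updated comment would exceed space constraints")
    return updated


first = append_comment("", "Alan Meyer", "Worth considering.",
                       datetime(2012, 9, 15, 16, 20, 5))
both = append_comment(first, "Bob Kline", "Not relevant to all of them.",
                      datetime(2012, 9, 15, 16, 45, 0))
```

Because entries are only ever appended, the stored comment doubles as a small audit trail of who said what and when.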
- Utility classes - NO
Used in multiple places.
- IdName - NO, WE DECIDED AGAINST THIS (or at least I did)
- Purpose
Used for passing information to and from functions.
- Contains
- itemId
The unique internal identifier of something. It might be a status name, tag name, user, or something else. It can be used as a lookup in the database.
If itemId exists, it is used to identify the item. In that case, itemName is always ignored.
- itemName
For items that are guaranteed to have unique names, if the itemId is not present but the name is present, the name will be used to lookup the unique ID when required. This is a convenience intended to provide better code self-documentation. Passing status
- Other related classes
These are all representations of rows in the database. The methods for manipulating them are in the Article class.
Note: The names of fields given below are intended to describe the function of the field. They are not necessarily the names that will be used in the definitions of the fields in our PHP code.
- Issues
The following issue(s) apply to all of these classes.
- Visibility of contents
We may want to make some or all fields in these objects private, accessible only by getter methods. That would enable us to provide lazy evaluation of data that is not stored in the rows themselves, for example user, status, or tag names that are represented in the database rows only by their IDs.
- ArticleAuthor
- Purpose
Describe what we know about the identity of an author. Note that authors are not authority controlled. "Bush, G", "Bush, GHW", "Bush, George", "Bush, George Herbert Walker", etc. may in fact all be the same actual author but they will create separate author records in our database.
The fields in our records are taken from the corresponding fields in Pubmed XML. Whatever they give us is what we have.
- Contains
- unique row ID
Unique in EBMS/CiteMS. This is our ID, not something from NLM.
- Last name
- Forename
- Initials
- ArticleImport
- Purpose
Describe an import of an article from an external source - always Pubmed in the initial version of the system. May be something else later.
- Contains
- unique row ID
- article ID
- source
Initially "Pubmed".
- user ID
- user name
- datetime of import
- comment (if any)
- importType
"import" or "update". Updates are, initially, replacements of previously imported XML by newer XML from NLM.
- ArticleStatus
- Purpose
Describe the association of a single combination of article, topic, status, and date.
- Contains
- unique row ID
- article ID
- summary topic ID
- summary topic name
- status ID
- status name
- user ID
- user name
- datetime created
- comment
- active status
- ArticleTag
- Purpose
Describe the association of a single combination of an article, descriptive tag, and date.
- Contains
- unique row ID
- article ID
- summary topic ID?
To be determined
- tag id
- tag name
- user ID
Of user creating the tag record. Other users may update the comment.
- user name
- datetime created
Of original creation
- datetime last updated
- comment
- active status
- ArticleTopic
- Purpose
Describe the association of an article and a summary topic.
- Contains
- article ID
- summary topic id
- summary topic name
- user ID
- user name
- datetime association made
- comment
- active status
- ArticleImport
- Purpose
These objects are used in the getHistory() results but they don't exist as such in the database. Import and update dates come from the article rows.
- Contains
- article ID
- import type - "import" or "update"
- user ID
- user name
- datetime
- ArticleResponse
- Purpose
Describe a board member's response to an article sent to him or her for review. "Responses" are currently controlled values picked from a list.
- Issues
- What is a response?
Recording and retrieving responses is complicated by the fact that a board member can perform a single form submission with multiple responses. Currently, the multiple responses are encoded in a bitmap stored in a single row in the database - and associated with one optional free text comment. That might or might not change.
- Multiple rows?
We might treat each response in a single submission as a separate row in the response table, possibly with something that identifies them as all coming from the same board member response. User ID + datetime might be sufficient for that.
- Multiple comments?
An advantage (or is it a disadvantage?) of having multiple rows is that we can allow the board member to attach a comment to each response. It's an advantage from an information
- Advantages
It produces more precise, complete information.
- Disadvantage
It puts potential additional burdens on the board members to produce more detailed responses. Current thinking is that this outweighs any advantage.
- Impact on this function
How we answer these questions will determine what the content of an ArticleResponse needs to be.
- Contains
- article ID
- topic ID
- topic name
- board member ID
- board member name
- response ID
This requires discussion. See above.
- response text
See above
- Datetime
3.2 Article import related APIs
- class PubmedInterface (for getCitesFromNLM())
- Purpose
- Used in fetching articles from NLM
- Maintains state between batches
- Fields
- idList
Array of IDs in the batch to get
- nextId
Index of next ID to fetch
- lastFetchStart
Datetime of our last transmission to NLM. We may need to throttle our interactions with NLM to avoid breaking their speed limits. We can consult this and the next two fields to decide whether we need to introduce a sleep time before the next fetch.
- lastFetchEnd
Date-time the last response from NLM was completely received
- lastFetchCount
Number of articles we requested in the last fetch.
- Methods
- Constructor
- Pass
- idList
Array of IDs to fetch, number limited only by available memory. IDs are numeric strings, e.g., array("12345678", "12345679", …)
- Logic
- Initialize
- lastQueryStart = null
- nextFetch = 0
- getCitesFromNLM()
- Purpose
- Fetch articles from NLM
- Used for:
- Importing articles
- Replacing articles with later versions
- Any other reason we might have
- Pass
- fetchCount
Count of articles to retrieve in next batch
- rawMode
- True = return the XML as retrieved from NLM
Caller has to figure out what's inside, including figuring out whether some items were not returned. This is just raw data.
- False (default) = return Article objects
- Return
- Raw XML, or
- Array of Article objects
- If PubmedID not found:
The Article object will have nothing in it except the Pubmed ID and a status value = 'ERROR_NOT_FOUND'. It will be possible to have an array of, e.g., 50 article objects, some number of which have full data and some number of which contain this error.
- If returned array has fewer cites than requested:
- We've reached the end of the list
- Returned array will be empty if called again
But no error will be raised. It is okay to keep calling until the length of the returned array == 0.
- Throws
- EbmsException
If we cannot connect to Pubmed or other Pubmed error. If we can't parse the returned information and rawMode == False. If HTTP return code indicates error.
- Logic
- Determine Pubmed throttle numbers
  These are probably just constants.
  - Max articles to get in one query
  - Min time between queries
- returnList = array()
- countRequested = 0
- moreToRequest = length of ID list - nextFetch
- if fetchCount > moreToRequest
  - fetchCount = moreToRequest
- while moreToRequest
  - Construct a request to Pubmed
  - Fill it with some requested Pubmed IDs
    - Count is min of
      - Number of IDs remaining to be fetched
      - Maximum we allow in one fetch
    - Starting with nextId
    - Update nextId to point after the last requested ID
  - Apply throttling rules
    These might take day of week and time of day into account as well as lastQueryStart, lastFetchEnd, and lastFetchCount
    - If it's too soon to request more cites
      - Compute how long we have to wait
      - Sleep for that amount of time
  - Update lastQueryStart
  - Send the request to Pubmed
    - Throw exception if we cannot connect or get other error.
  - Get results
  - Update lastFetchEnd
    Indicates last
  - If rawMode
    - Set lastFetchCount = countToRetrieve
      This may be an overestimate since we might have gotten back fewer than we requested, but in raw mode we aren't parsing results to get an actual count. Using the countToRetrieve could cause us to sleep when we don't need to but won't cause us to piss off NLM.
    - Return XML string
  - Parse data
  - For each Article
    - Construct an Article object (see Article object)
      Data is NOT saved in the database. To save the data call the object's save() method for each object.
    - Append the object to the array
    - Check for any PMIDs not returned by Pubmed
      - Construct an Article object for each one
        - Set
          - source = 'Pubmed'
          - sourceId = Pubmed ID
          - errMsg = "Not found in Pubmed on {current datetime}"
        - Append it to the array
          Or I can put all of these at the beginning or end if desired.
  - Update the lastFetchCount
  - Update moreToRequest
- Return the array
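The batching and throttling part of this logic can be sketched as below. It is a simplified illustration in Python, not the PHP class. It leaves out parsing, raw mode, and error handling. The class name `BatchFetcher`, the injected `fetch`, `clock`, and `sleep` callables, and both throttle constants are hypothetical. A fake transport and clock are used so the example runs without contacting NLM.

```python
import time
from typing import Callable, List

MAX_PER_QUERY = 3          # assumption: real limit would be a tuned constant
MIN_SECONDS_BETWEEN = 1.0  # assumption: real value depends on NLM's rules


class BatchFetcher:
    """Walks an ID list in throttled batches; the transport is injected."""

    def __init__(self, id_list: List[str],
                 fetch: Callable[[List[str]], List[str]],
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.id_list = id_list
        self.next_id = 0                 # index of next ID to request
        self.last_query_start = None     # time of our last transmission
        self.fetch, self.clock, self.sleep = fetch, clock, sleep

    def get_batch(self, fetch_count: int) -> List[str]:
        results: List[str] = []
        remaining = min(fetch_count, len(self.id_list) - self.next_id)
        while remaining > 0:
            count = min(remaining, MAX_PER_QUERY)
            ids = self.id_list[self.next_id:self.next_id + count]
            self.next_id += count
            # Throttle: wait if the previous query started too recently.
            if self.last_query_start is not None:
                elapsed = self.clock() - self.last_query_start
                if elapsed < MIN_SECONDS_BETWEEN:
                    self.sleep(MIN_SECONDS_BETWEEN - elapsed)
            self.last_query_start = self.clock()
            results.extend(self.fetch(ids))
            remaining -= count
        return results


# Fake transport and clock so the example runs without a network.
sent, naps = [], []
now = [0.0]


def fake_sleep(seconds):
    naps.append(seconds)
    now[0] += seconds


def fake_fetch(ids):
    sent.append(list(ids))
    return ["<article pmid='%s'/>" % i for i in ids]


fetcher = BatchFetcher([str(n) for n in range(1, 8)], fake_fetch,
                       clock=lambda: now[0], sleep=fake_sleep)
first = fetcher.get_batch(5)    # two queries: 3 IDs, then 2
second = fetcher.get_batch(5)   # only 2 IDs are left
third = fetcher.get_batch(5)    # list exhausted: empty, no error
```

Injecting the clock and sleep functions is a design choice made for this sketch so that the throttle can be exercised in a test without real delays.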
- extractPubmedIds()
- Purpose
Read a file and extract Pubmed IDs from it. The file typically contains search results from a Pubmed search.
- Pass
- idHandle
Could be a file handle or file handle like memory object (like an IOStream.)
Initially, we need to support files in Medline print format, which is the format currently used by CIAT to retrieve the results of searches. Later we can add parsers for any other format that we need to support.
- format
Default = "Pubmed". Others supported later if needed.
- Return
- Array of Pubmed IDs as strings, e.g., '12345678'
- Throws EbmsException if
- Could not open file
- Could not parse file (bad format)
- Logic
Logic depends on input format. For Medline print format we:
- idList = array()
- While we can get another line from the idHandle
  - If line matches "^PMID- (\d{x,8})$"
    - Get numeric digits as a string
    - Append string to idList
- Return idList
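The Medline print format logic amounts to a line-by-line regular expression scan, sketched here in Python for brevity. The function name `extract_pubmed_ids` is hypothetical, and the minimum digit count of 1 is an assumption, since the pattern above leaves that minimum open as "x".

```python
import io
import re
from typing import IO, List

# Assumption: 1 to 8 digits. The minimum is left open ("x") in the
# pattern described above.
PMID_LINE = re.compile(r"^PMID- (\d{1,8})$")


def extract_pubmed_ids(id_handle: IO[str]) -> List[str]:
    """Collect Pubmed IDs from a file-like object in Medline print format."""
    id_list = []
    for line in id_handle:
        # Strip the line ending so "$" anchors at the last digit.
        match = PMID_LINE.match(line.rstrip("\r\n"))
        if match:
            id_list.append(match.group(1))
    return id_list


sample = io.StringIO(
    "PMID- 12345678\n"
    "TI  - A sample title\n"
    "\n"
    "PMID- 12345679\n"
)
ids = extract_pubmed_ids(sample)
```

Because the function only iterates over lines, it accepts an open file or an in-memory stream alike, which matches the idHandle description.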
- importArticles()
  - Purpose
    - Load articles identified by Pubmed ID into the database
      The load can be live, to actually bring the articles in, or test, to see what the effect of loading the articles would be if it were done in live mode.
  - Pass
    - revCycleId
    - summaryTopicId
    - userId
    - pubmedIds
      Can be a single Pubmed ID string, or an array of Pubmed ID strings. The array can come from extractPubmedIds(), or from an interface that allows direct user PMID input.
    - mode - string
      - "live" = load the data
      - "test" = report only
    - ignoreNotList - boolean
      True = don't check the NOT lists. Allows the user to import record(s) that otherwise could not be imported. Defaults to False.
    - tagArray - optional array of pairs of
      - tagId - integer
        Unique ID of a descriptive "tag" to assign to each imported cite.
      - comment - string
        Optional comment string to associate with that tag
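To make the parameter rules above concrete, here is a small sketch of how the arguments could be checked and put into one canonical shape before any work is done. The helper name normalizeImportArgs(), the array keys it returns, and the default of "live" for mode are all my own assumptions for illustration; they are not part of the API as drafted.

```php
<?php
/**
 * Hypothetical helper: validate and normalize importArticles() arguments.
 * Returns the arguments in canonical form or throws on bad input.
 */
function normalizeImportArgs($pubmedIds, string $mode = 'live',
                             bool $ignoreNotList = false, array $tagArray = []): array
{
    // Accept a single Pubmed ID string or an array of them.
    $ids = is_array($pubmedIds) ? array_values($pubmedIds) : [$pubmedIds];
    foreach ($ids as $id) {
        if (!is_string($id) || preg_match('/^\d+$/D', $id) !== 1) {
            throw new InvalidArgumentException('Pubmed IDs must be digit strings');
        }
    }

    if (!in_array($mode, ['live', 'test'], true)) {
        throw new InvalidArgumentException("Unknown mode: $mode");
    }

    // Each tag entry is a pair: an integer tagId plus an optional comment.
    $tags = [];
    foreach ($tagArray as $pair) {
        if (!isset($pair['tagId']) || !is_int($pair['tagId'])) {
            throw new InvalidArgumentException('Each tag needs an integer tagId');
        }
        $tags[] = ['tagId' => $pair['tagId'], 'comment' => $pair['comment'] ?? null];
    }

    return ['pubmedIds' => $ids, 'mode' => $mode,
            'ignoreNotList' => $ignoreNotList, 'tags' => $tags];
}
```

Normalizing a lone ID into a one-element array up front means the rest of the import logic only ever has to deal with arrays.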
  - Return
    When running in test or live mode we return the same information.
    - Note
      In live mode we also save some of this information in tables that enable us to determine what happened in any import batch, thereby enabling us to reconstruct the return values for report generation after the fact. If the user dismisses the report she can still request a reconstruction. We'll also need an interface that Bob can call to get the reconstructed data.
    - Array of
      - Array of tuples of
        - Article object
        - Array of result identifier strings, one or more of:
          - "new"
            Article was not in the database, would be (test mode) or was (live mode) added.
          - "new topic"
            Article was in the database but did not have this topic assigned. New topic would be (test mode), or was (live mode), assigned.
          - "duplicate"
            Article was in the database for this topic. It may be that the topic was de-assigned at one time but we still regard it as a duplicate. See Concepts below.
          - "NOT listed"
            Article is in a journal that is on the NOT list.
          - "replacement"
            Article fetched from Pubmed is newer than what we've got. In live mode it would replace the record we have.
          - "error"
            Document could not be loaded. No articleId. otherId might or might not be present, depending on the error. messages is an array of one or more strings, each containing one error. A typical example of an error is that we searched Pubmed for this ID but didn't find it (should be extremely rare).
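As a rough illustration of how a report could consume this return value, the sketch below tallies the result identifiers. It assumes each tuple is a two-element array of (article, result identifier strings); the function name and the plain arrays standing in for Article objects are hypothetical.

```php
<?php
/**
 * Hypothetical report helper: count how often each result identifier
 * occurs in an importArticles()-style return value.
 * Each tuple is assumed to be [article, array of result identifier strings].
 */
function summarizeImportResults(array $tuples): array
{
    $counts = array_fill_keys(
        ['new', 'new topic', 'duplicate', 'NOT listed', 'replacement', 'error'],
        0
    );
    foreach ($tuples as [$article, $results]) {
        foreach ($results as $result) {
            if (!array_key_exists($result, $counts)) {
                throw new UnexpectedValueException("Unknown result: $result");
            }
            $counts[$result]++;   // one article may carry several identifiers
        }
    }
    return $counts;
}
```

Since an article can carry more than one identifier (for example "new topic" together with "replacement"), the counts can add up to more than the number of articles.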
  - Throws EbmsException
    If the whole thing fails, e.g., can't connect to NLM, maybe others.
  - Logic
    - OBSOLETE LOGIC - TO BE REPLACED
      - Create arrays to hold different categories
        Categories are "new", "new_topic", etc. Each starts as an empty array.
      - Filter out duplicate articles that will not be loaded
        - Concepts
          The concept is to search the database for this article by its source (i.e., Pubmed) ID and get any and all summary topic IDs that have been assigned to it. If this topic was ever assigned to this article, we will reject the import, even if the assignment was changed later on. The article has already been considered for the topic at hand and, even though it may have looked like it was about this topic based on Pubmed indexing, a reviewer decided it was not.
          If the user changes her mind, the import process is not the right way to act on it. The user should visit the article in the database and add a summary topic to it. The reason is that re-assigning a topic to an article for which it was already rejected, based on automatic search criteria, would likely produce more errors than re-assigning by hand.
          We also have to decide whether to search for all articles in one fell swoop (SELECT … WHERE source_id IN (… all from the batch …)), or to search one at a time, which takes more time but simplifies the logic. The simplification may be worthwhile, but the fell swoop approach is not that hard (unless MySQL has small limits on the number of values in an IN clause).
        - Search for article ID and (ever) associated topics
        - For each hit
          - Append (articleId, sourceId, article) to the list of duplicates
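One way to get most of the benefit of the fell swoop approach while sidestepping any worry about very long IN lists is to split the batch into chunks and issue one parameterized query per chunk. The sketch below only builds the SQL and parameter lists; it does not run them. The helper name, the chunk size of 500, and the selected columns are assumptions for illustration, and the real schema may differ.

```php
<?php
/**
 * Hypothetical helper: build parameterized lookups for a batch of source IDs.
 * Table and column names are illustrative, as is the default chunk size.
 */
function buildSourceIdQueries(array $sourceIds, int $chunkSize = 500): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }

    // Drop repeated IDs so each one is looked up only once.
    $unique = array_values(array_unique($sourceIds));

    $queries = [];
    foreach (array_chunk($unique, $chunkSize) as $chunk) {
        // One "?" placeholder per ID keeps the values out of the SQL text.
        $placeholders = implode(', ', array_fill(0, count($chunk), '?'));
        $queries[] = [
            'sql'    => "SELECT article_id, source_id FROM article "
                      . "WHERE source_id IN ($placeholders)",
            'params' => $chunk,
        ];
    }
    return $queries;
}
```

Each returned entry could then be handed to whatever database layer we end up using, with the params bound to the placeholders in order.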
      - While there are articles to fetch from NLM
        - Construct a search requesting MAX_NLM_FETCH hits from NLM
          We put up to MAX_NLM_FETCH IDs in one request
        - Connect to NLM and execute search
        - Fetch results in XML
        - Parse results into individual articles
        - For each article
          - If problem with result
            E.g., Pubmed says it doesn't have it, we're missing critical information in the parse, whatever.
            - Append Pubmed ID + error message(s) to "error" array
            - Continue
          - Get Pubmed ID, Pubmed Journal ID, brief citation string
          - Create array of articleId, sourceId (Pubmed ID), brief citation
            articleId is still null at this point
          - Check database for Pubmed ID and all associated topic IDs
          - If cite in database with same topic
            - Append information to "duplicate" array
            - Continue
          - If not ignoreNotList
            - Check if journal ID is in NOT list
            - If NOT listed
              - Append information to "NOT listed" array
              - Continue
          - If in database but not with this topic
            - Append to "new topic" array
            - Continue
          - Append to "new" array
        - If "live" mode
          - For each cite in "new" array
            - Save the data, getting new articleId
          - For each cite in "new topic" array
            - Get the internal cite ID
            - Add existing topics to messages?
          - Replace null with internal EBMS/CiteMS cite ID in array
          - Create association with summaryTopic, reviewCycle
          - If any tags exist, create associations between tag and articleId
      - Construct array of arrays
      - Return it
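Even though this logic is marked obsolete, the order of the per-article checks is worth pinning down. The sketch below follows that order (duplicate, then NOT list, then new topic, then new) with the database lookups replaced by plain arrays. The function names and the array shapes are hypothetical, and the error branch, NLM fetching, and live-mode saving are left out.

```php
<?php
/**
 * Sketch of the per-article classification step.
 *
 * $article:      ['pmid' => string, 'journalId' => string] (hypothetical shape)
 * $topicsByPmid: Pubmed ID => topic IDs ever assigned to that article
 * $notList:      journal IDs on the NOT list
 */
function classifyForImport(array $article, int $summaryTopicId,
                           array $topicsByPmid, array $notList,
                           bool $ignoreNotList = false): string
{
    $pmid = $article['pmid'];
    $inDatabase = array_key_exists($pmid, $topicsByPmid);

    // The same topic was assigned at some point: reject as a duplicate.
    if ($inDatabase && in_array($summaryTopicId, $topicsByPmid[$pmid], true)) {
        return 'duplicate';
    }
    // The NOT list check is skipped when the caller asks to ignore it.
    if (!$ignoreNotList && in_array($article['journalId'], $notList, true)) {
        return 'NOT listed';
    }
    // Known article, but this topic has never been assigned to it.
    if ($inDatabase) {
        return 'new topic';
    }
    return 'new';
}

/** Group a batch of parsed articles into the category arrays. */
function classifyBatch(array $articles, int $summaryTopicId, array $topicsByPmid,
                       array $notList, bool $ignoreNotList = false): array
{
    $categories = ['new' => [], 'new topic' => [], 'duplicate' => [], 'NOT listed' => []];
    foreach ($articles as $article) {
        $category = classifyForImport($article, $summaryTopicId, $topicsByPmid,
                                      $notList, $ignoreNotList);
        $categories[$category][] = $article['pmid'];
    }
    return $categories;
}
```

Keeping the decision in a function with no database access would let test mode and live mode share exactly the same classification, with live mode adding only the save step.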
- getImportInfo()
  - Purpose
    Retrieve information from a past live importArticles() run.
  - Pass
    - batchId
      Locates the specific import batch. The caller queries the database to see what batches exist, then selects a batch ID to pass to this function.
  - Return
    See importArticles()
- checkCitesFromNLM()
  - Purpose
    Get a list of articles that are not yet in final Medline status in our database. Called by updateCitesFromNLM() but also accessible to other callers.
  - Pass
    - Void
  - Return
    - Array of
      - Array of
        - article ID
        - Pubmed status in our database
  - Logic
    - Precondition
      - We've stored the latest status visible to MySQL
        We will put a column in the cite/article table for the latest status of any article in our database. When an article is replaced, the latest status is also replaced.
    - Search
      SELECT article_id, pubmed_status FROM article WHERE pubmed_status IN (the list, see Issues)
    - Return array of IDs
      May be empty.
  - Issues
    - Do we need more details?
      Is there any value to further subdividing these by specific status, e.g., array of arrays of different PubStatus values? The updateCitesFromNLM() function probably doesn't need these.
    - What are the PubMedPubDate/@PubStatus values?
      - Found in the data I've examined:
        accepted, aheadofprint, entrez, epublish, medline, ppublish, pubmed, received, revised
      - But the documentation only lists 4 possible:
        received, accepted, revised, aheadofprint
        http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.History_O
      - What status values should we check for replacements?
- updateCitesFromNLM()
  - Purpose
    Replace articles in pre-Medline or other preliminary status with the versions containing full and complete processing.
  - Pass
    - Void
      We're going to check all of the articles that are now in an incomplete status. If the checking is done regularly (e.g., weekly or even monthly) the total number to check won't be that great.
  - Return
    - Array of
      - Count of replaced articles
      - Count of those we searched for that were not replaced
  - Throws
    - Exception from PubmedInterface if we can't connect, etc.
  - Logic
    - idList = checkCitesFromNLM()
    - nlmInterface = new PubmedInterface(idList)
    - citeList = array()
    - updatedCount = unchangedCount = 0
    - Do
      - citeList = nlmInterface.getCitesFromNLM(some reasonable number)
        Exception bubbles up if thrown
      - For each cite in the citeList
        - If cite was not found (pubmedStat == ERROR…
          - Increment unchangedCount
        - Else if cite.pubmedStatus != status of id in idList
          - cite.save()
          - Increment updatedCount
        - Else
          - Increment unchangedCount
    - while count(citeList) > 0
    - Return array(updatedCount, unchangedCount)
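The control flow of updateCitesFromNLM() can be tried out without NLM or a database by passing the fetch and save steps in as callables, as in the sketch below. Everything here is hypothetical scaffolding: $fetchBatch stands in for PubmedInterface's getCitesFromNLM(), $save stands in for cite.save(), cites are plain arrays, and the literal 'ERROR' is only a placeholder for whatever the not-found status turns out to be. The sketch also counts a found article whose status has not changed as unchanged, which is my reading of the second return value.

```php
<?php
/**
 * Sketch of the updateCitesFromNLM() loop.
 *
 * $storedStatus: article ID => Pubmed status currently in our database
 * $fetchBatch(): returns a list of ['id' => ..., 'pubmedStatus' => ...] arrays,
 *                or an empty array when nothing is left to fetch
 * $save($cite):  persists one replacement record
 */
function updateCitesSketch(array $storedStatus, callable $fetchBatch, callable $save): array
{
    $updatedCount = 0;
    $unchangedCount = 0;
    do {
        $citeList = $fetchBatch();   // any exception bubbles up to the caller
        foreach ($citeList as $cite) {
            $oldStatus = $storedStatus[$cite['id']] ?? null;
            if ($cite['pubmedStatus'] === 'ERROR') {
                $unchangedCount++;   // placeholder check for "not found at NLM"
            } elseif ($cite['pubmedStatus'] !== $oldStatus) {
                $save($cite);        // status moved on, so replace our copy
                $updatedCount++;
            } else {
                $unchangedCount++;   // same status, nothing to replace
            }
        }
    } while (count($citeList) > 0);

    return [$updatedCount, $unchangedCount];
}
```

The loop ends on the first empty batch, so the fetch callable has to return an empty array once the ID list is exhausted.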
Date: December 29, 2011