EBMS Tickets

Issue Number 270
Summary Citations with Changed PMIDs
Created 2015-02-02 12:31:33
Issue Type Improvement
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2016-01-25 17:09:37
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.146197
Description

We've run across an interesting issue. Sharon has encountered at least a couple of citations whose PMIDs have changed.

Here is an example:

The citation was initially given a PMID of 25513469 (and imported to the EBMS). The PMID is now 25113753 on PubMed.

This is for the following paper:

J Clin Oncol. 2014 Sep 20;32(27):3059-68.
Recommendations for initial evaluation, staging, and response assessment of Hodgkin and non-Hodgkin lymphoma: the Lugano classification.
Cheson BD, Fisher RI, Barrington SF, et al.

A couple of questions:
1. Is it possible to change the PMID of a citation?
2. When we receive automatic updates from NLM, do these updates include changes to PMIDs?

Comment entered 2015-02-02 12:47:39 by Juthe, Robin (NIH/NCI) [E]

Here is another example:

The first PMID was 25513462. The current PMID is 25049327.

This is for the following paper:

J Clin Oncol. 2014 Sep 20;32(27):3012-20.
Outcomes of children with BCR-ABL1–like acute lymphoblastic leukemia treated with risk-directed therapy based on the levels of minimal residual disease.
Roberts KG, Pei D, Campana D, Payne-Turner D, Li Y, Cheng C, Sandlund JT, Jeha S, Easton J, Becksfort J, Zhang J, Coustan-Smith E, Raimondi SC, Leung WH, Relling MV, Evans WE, Downing JR, Mullighan CG, Pui CH.

Comment entered 2015-08-13 10:28:12 by Juthe, Robin (NIH/NCI) [E]

Here's another example of this problem:

The PMID for the following citation changed from 25513459 to 25049325. The citation is now in the EBMS twice.

Eijzenga W; Aaronson NK; Hahn DE; Sidharta GN; van der Kolk LE; Velthuizen ME; Ausems MG; Bleiker EM
Effect of routine assessment of specific psychosocial problems on personalized communication, counselors’ awareness, and distress levels in cancer genetic counseling practice: a randomized controlled trial.
J Clin Oncol 32(27): 2998-3004, 2014

Comment entered 2015-10-05 12:15:14 by Kline, Bob (NIH/NCI) [C]

In answer to the first question, according to NLM's online documentation for PubMed:

PMIDs do not change over time or during processing....

That would imply that, since they don't believe they have changed the PMID, the answer to the second question is that they won't ever notify us that they have changed the PMID for an article. Presumably they believe that the article represented by the second PMID is a different article from the article represented by the first PMID.

Comment entered 2015-10-05 19:45:34 by alan

I'm guessing this is a case where the documentation is just wrong and Robin's example is not the only case where an article had a changed ID. Maybe the original download happened before the article was fully processed and "published" by Pubmed. Or maybe they entered the record twice, discovered that after the fact, and deleted one. Whatever the cause or their interpretation of it, we still need to do something about it.

The first thing to do is probably to study our software and find out whether anything breaks if we change a Pubmed ID. The only use of "source_id" outside of our ebms_article table is ebms_import_action - which might not be a problem for us, or maybe only a small one.

Comment entered 2015-10-05 19:50:45 by alan

Probably the hardest problem isn't the source_id but the fact that we have two article_ids, each with article states. If we have to merge the states, it's starting to get hard. I can think of multiple approaches to the problem, none of which is perfect.

Comment entered 2015-10-06 14:42:39 by alan

It's going to be valuable to find out the dimensions of the problem.

  • How often does this happen?

  • Has it happened with documents that went all the way through to be sent out for review?

  • What are the consequences of not fixing it?

  • What will we do with these docs if we don't fix it in software?

  • What does NLM say about why this happened?

  • Does (or can) NLM provide a mechanism to let us know when it happens?

Bob and I think there's a can (or cans) of worms here.

Comment entered 2015-10-07 14:21:11 by Juthe, Robin (NIH/NCI) [E]

I have reported the problem each time I have come across it, so about 1x every 3 months. Chances are this is an underestimate because it isn't easy to recognize when there are duplicate citations with different PMIDs in the system - each of these cases was identified by our remembering a very similar (it turns out, identical) citation that we had already seen. However, I just noticed an interesting point about the three examples above. The original PMID for each of these citations is very close - 25513469, 25513462, and 25513459. Maybe there was a "bad" batch??

To answer some of your other questions:
It has happened to citations that have gone through the process (sent for review).

The consequences of not fixing it include sending the same citation out twice and having an incomplete history stored in a citation.

When we identify such a problem, we try to add comments to alert us to the duplication, but this isn't a foolproof method (and we haven't been completely consistent about this, I see).

What happens when we go to PubMed for updates for a citation that has had a change in its PMID? I'm wondering if another citation is found or if it comes back saying the citation cannot be found? Do we get an error message of any kind?

Comment entered 2015-10-08 12:09:25 by alan

I did a little research. It turns out that there are more than 400 cases in the past where we attempted to import a record from Pubmed by its Pubmed ID and failed.

The errors are reported in the import batch reports, but the software only knows that it requested an article from Pubmed and didn't get it. It doesn't know why, and I'm not aware of NLM saying why. They just don't return a record for the request. If a user clicks on the Pubmed ID in the import batch report she is taken to Pubmed where, in the times I tried it, I got messages like this:

Error occurred: The following PMID is not available: 21680980
PMID: 21680980

or:

Wrong UID 25513462
No items found.

It's probably not the case that they are all replaced by duplicate records. Many could be records that are withdrawn from Pubmed for some other reason. However, it looks like the problem could be a serious one.

Maybe "Wrong UID" means it's a duplicate and "PMID is not available" means something else, but we'd have to get some info from NLM before deciding anything and it could be a fragile thing to rely on.

One thing to consider is whether we need to check incoming records for duplication with other records by some criteria other than PMID. For example, by title plus list of authors. That may not be foolproof either if, for example, a duplicate record is issued with a new Pubmed ID, but there is a single character change in the title or in one of the author's names.
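
A match key along these lines could be sketched as follows. The function and its normalization rules are hypothetical, not existing EBMS code; the last example shows exactly the single-character fragility described above.

```python
import re

def match_key(title, authors, journal_id):
    """Build a normalized duplicate-detection key from the title, the
    authors' last names, and the journal ID. Lowercasing and stripping
    punctuation absorb trivial differences, but a real one-character
    change in a title or author name still defeats the match.
    Author strings are assumed to be "Lastname Initials" as PubMed
    renders them."""
    norm_title = re.sub(r"[^a-z0-9]", "", title.lower())
    last_names = ",".join(a.split()[0].lower() for a in authors)
    return (norm_title, last_names, journal_id)

# Identical citations under two PMIDs produce the same key...
k1 = match_key("The Lugano classification.", ["Cheson BD", "Fisher RI"], "jco")
k2 = match_key("The Lugano Classification", ["Cheson BD", "Fisher RI"], "jco")
assert k1 == k2
# ...but one changed character in an author's name breaks the match.
k3 = match_key("The Lugano classification.", ["Chesen BD", "Fisher RI"], "jco")
assert k1 != k3
```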

It's not a simple problem, is it?

Comment entered 2015-10-14 16:24:16 by Juthe, Robin (NIH/NCI) [E]

Let's talk about this issue in tomorrow's meeting, if possible. The option to check for duplicates seems time consuming and unnecessary most of the time given that this doesn't happen too often. I'm wondering if it would make more sense to add the ability to block duplicate citations in order to remove them from search results, report results, packet pages, etc. Or maybe we could merge the two citation records in the EBMS? These solutions wouldn't be foolproof either, since they would require that we identify the two citations as duplicates, but they would be helpful nonetheless.

Comment entered 2015-10-15 20:07:45 by alan

I have attached the raw XML files for the two EBMS / Pubmed articles mentioned in the Description field for this issue. Surprisingly, the later record (i.e., the one with the higher Pubmed ID number - 255...) is not the one that is the official Pubmed record. It is not as complete as the earlier one; for example, it has no MeSH headings, shows only an abbreviated journal title in the Journal/Title field, and has less bibliographic and control information.

My initial assumption that the later record is a correction for an earlier one is false in this case. It looks like it is a duplicate that was mistakenly added and then removed.

Comment entered 2015-10-15 21:44:30 by alan

Here are some notes of ideas and questions pertaining to our discussion at
today's CDR/EBMS status meeting:

  • Can we determine our goals and assign priorities?
    Some of the things we might want to achieve include: not sending duplicate
    documents to board members for review, detecting duplicates automatically,
    providing consistent treatment for duplicate records, etc.
    If we can list and prioritize these we may find that some expensive and/or
    risky methods are needed only for low priority goals and can be sacrificed
    if necessary.

  • Can we detect duplicate documents using software?

    • By getting info from NLM?
      Is it possible to retrieve information from NLM that tells us when an
      article is added with a new Pubmed ID where the article is a duplicate or
      replacement of an earlier one?

    • By doing it ourselves by matching document sections?
      Example: author list + title + journal id

  • When two records refer to the same article, can we tell which is the right
    one?

    • By asking NLM?
      An obvious way would be to have a function that submits both PMIDs to NLM
      and, if one fails to retrieve a record, consider the other the official
      version. However this will only work after NLM has discovered the problem
      and selected the winning record before the time we make the check.

    • By looking for specific fields in the record?
      In one example, the "good" article had a MedlineCitation/@Status =
      "Medline", while the "bad" one had a Status of "In-Process". But we don't
      know how often that rule applies.

    • By EBMS staff judgment?
      We might create a comparison tool that allows a user to request a
      comparison of two records that appear to be the same. It could query NLM,
      look at MedlineCitation/@Status, and other fields and heuristics, perhaps
      offer a diff of the two XML files, and then prompt the user to choose a
      winner.

  • If we find a newer record duplicates an older one, what should we do?

    • Re-link data from the "bad" record into the "good" one?
      Important information linked to the bad record might be either copied or
      moved to link to the good one. Some information linked to the bad record
      might possibly just be deleted. We would need to study every possible
      link type and determine rules for each one.

    • Delete the bad record?
      Requires that any linked information not re-linked to the good record also
      be deleted.

    • Keep but reject the bad record?
      Could be done with a new state type, e.g. RejectInvalid. This requires
      less relinking and no deletions.

    • Create new tables to hold "bad" data, or new status columns to mark it?

    • Link duplicates together?

    • Tag duplicates?
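
The MedlineCitation/@Status idea from the list above could be sketched roughly as follows. The element and attribute names come from NLM's PubMed XML, but the ranking rule is only the one-example heuristic described here, not an established algorithm.

```python
import xml.etree.ElementTree as ET

def citation_status(xml_text):
    """Return the Status attribute of the MedlineCitation element, if any."""
    root = ET.fromstring(xml_text)
    cite = root if root.tag == "MedlineCitation" else root.find(".//MedlineCitation")
    return cite.get("Status") if cite is not None else None

def pick_winner(xml_a, xml_b):
    """Prefer the record whose Status looks finished ('MEDLINE') over one
    still 'In-Process'. Returns 'a', 'b', or None when the heuristic
    cannot decide; comparison is case-insensitive to cover both spellings
    seen in the discussion."""
    rank = {"medline": 2, "in-process": 1}
    sa = rank.get((citation_status(xml_a) or "").lower(), 0)
    sb = rank.get((citation_status(xml_b) or "").lower(), 0)
    if sa == sb:
        return None
    return "a" if sa > sb else "b"

good = '<MedlineCitation Status="MEDLINE"><PMID>24292815</PMID></MedlineCitation>'
bad = '<MedlineCitation Status="In-Process"><PMID>24234000</PMID></MedlineCitation>'
assert pick_winner(good, bad) == "a"
```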

Comment entered 2015-10-16 00:15:08 by alan

Just getting information about the problems in this issue is turning out to be a difficult and slippery problem.

In my comment of Thursday, 8 Oct 2015 12:09 PM, I wrote that "there are more than 400 cases in the past where we attempted to import a record from Pubmed by its Pubmed ID and failed."

I decided to try to find out why these failed. I extracted them from the error records in the database. There were 408 distinct IDs. Then I hacked together a program to submit all of them to NLM, thinking I'd find out more about why they failed. This time, only 8 of the 408 failed. The other 400 succeeded in retrieving Pubmed records! Digging further I discovered that every one of the 400 that succeeded this time was in a narrow PMID range: 18375141 - 18625716 and all belonged to a single batch launched at 2014-06-03 11:53:22.

So my 400+ errors were a red herring, or a wild goose, or whatever animal we like to symbolize my march in the wrong direction.

Then I tested the other 8. One was clearly a mistake: 79276158, a number too high for Pubmed. Of the other seven, two succeeded! Maybe there was a bug in my test program, or maybe Pubmed burped on those two in the first attempt and succeeded the next time.

However, there is good news in this story. We learned that the NLM error reports are not as frequent as I thought they were.

We also learned that the import failures aren't much help in measuring the problem. There could be many cases where Pubmed created a duplicate record and did not notice it, or noticed it and deleted one of a pair but we never downloaded it a second time, or we downloaded it but never happened to attempt to replace that record after the deletion and so never saw an error.

We'll need a better way to research the problem if we want to quantify the duplications.

Comment entered 2015-10-20 09:51:41 by alan

I have to work on some MyNCI tasks for a bit but I've thought more about this task and think, maybe, our number one priority should be to prevent duplicates from entering the system rather than trying to clean up afterward.

One possibility is to write a comparison routine that identifies likely duplicates. It might, for example, compare the article title, authors (or maybe just some number of authors, or maybe just their last names) and the journal ID against the last N (maybe N=6) months of data. If it finds a match, maybe it adds a button to the review screens that allows a user to see the match(es). The user can then take some appropriate action.

This might be particularly useful if it turns out that, for duplicate records, it's always, or almost always, the first record that is the authoritative one and the second is a mistake that NLM will probably delete when it is called to their attention.

Comment entered 2015-10-20 11:14:06 by Juthe, Robin (NIH/NCI) [E]

I think in each of the examples above it was the second citation that was the correct one; the first one was subsequently deleted from PubMed. So, I think either way we will have a clean-up problem on our hands.

While I agree that it would be ideal to eliminate the problem from the get-go (upon import), I'm not sure that that is the top priority at this time since we don't know how often this actually happens. It may actually be better to address the clean-up somehow and give us a little more time to determine the prevalence of the problem before we create a more complicated import utility.

Comment entered 2015-10-22 15:53:55 by Juthe, Robin (NIH/NCI) [E]

I discussed this with the other Board managers yesterday and we think the most important first step is to try to get a better handle on how often this happens. Margaret is reaching out to NLM to get more information from their perspective (if they say it doesn't happen, maybe these few examples really were just an anomaly? - they are all from the same journal, same volume, same issue...). On our end, we were wondering if it would be possible to query PubMed with all of the PMIDs we have in our system in order to determine which ones are no longer in PubMed?

Comment entered 2015-10-22 16:09:59 by alan

Yes, we can do that. I think the code I wrote for the Pubmed searches described in my comment of Friday, 16 Oct 2015 12:15 AM can be modified to do it, or Bob's database refresh code can probably also be modified to do it.

I've finished one of the two MyNCI tasks they've asked me to do and will try to finish the other one tonight. If possible, I'll set up a script to run overnight tonight to get the information for us.

Comment entered 2015-10-22 16:16:44 by Juthe, Robin (NIH/NCI) [E]

Thanks, Alan!

Victoria just sent me another example, which, unfortunately, bucks our trend of journal/volume/issue...but may add some new clues.

1: Chen X, Ye G, Zhang C, Li X, Chen Y, Xie X, Zheng H, Cao Y, Wu K, Ni D, Tang J, Wei Z, Shen K. Superior outcome after neoadjuvant chemotherapy with docetaxel, anthracycline, and cyclophosphamide versus docetaxel plus cyclophosphamide:
results from the NATT trial in triple negative or HER2 positive breast cancer.
Breast Cancer Res Treat. 2013 Dec;142(3):549-58. doi: 10.1007/s10549-013-2790-9.
PubMed PMID: 24292815.[PubMed - indexed for MEDLINE]

2: Chen X, Ye G, Zhang C, Li X, Chen Y, Xie X, Zheng H, Cao Y, Wu K, Ni D, Tang J, Wei Z, Shen K. Superior outcome after neoadjuvant chemotherapy with docetaxel, anthracycline, and cyclophosphamide versus docetaxel plus cyclophosphamide:
results from the NATT trial in triple negative or HER2 positive breast cancer.
Breast Cancer Res Treat. 2013 Nov 14. [Epub ahead of print] PubMed PMID:
24234000.[PubMed - as supplied by publisher]

Comment entered 2015-10-22 17:29:03 by alan

I've attached the raw XML files "24234000.xml" and "24292815.xml" extracted from the EBMS database for the two Pubmed records named in the last comment in case anyone wants to have a look.

Comment entered 2015-10-22 17:38:14 by alan

Looking at a diff between the two records, I can guess what kind of mistake occurred at NLM, but I can't see any way that it was anything other than a mistake. However, as you (Robin) said, and unlike in my earlier example, they chose the earlier record to throw into the furnace and kept the later one.

I hope Margaret can get some good info from NLM about this. I'd love to know how often it happens, how they detect the problems (wait for someone to call it to their attention perhaps?) and how they chose a winner and loser.

I'm still thinking it would be a good idea for us to do our own duplicate check if it's practical to do, but I can see that the human users will need some flexibility to choose what to do when a dupe is discovered.

Comment entered 2015-10-23 00:05:19 by alan

I have just finished my two MyNCI tasks and it's late enough that I could do more harm than good if I tried to write and run a program that communicates with NLM to check missing PMIDs.

I'm going to leave that until next Tuesday.

Comment entered 2015-10-27 12:30:16 by alan

I did a test to see if it was practical to try to find duplicates by searching for them.

I first tried matching the titles, journal IDs, and authors (in order) and got hits on 0.2% of the database. All the ones I happened to look at were updates of previous articles, i.e., not duplicate records of those articles. Restricting the search further to articles imported within six months of each other cut the hits down to 0.07%. Restricting import dates to within three months of each other cut it down to 200 hits = less than 0.01% of the database, or about one in 10,000 records. Looking at a few of these, they all looked like they could be mistakes from Pubmed.

I still need to do something to find out how many of these are recognized as mistakes.

Comment entered 2015-10-27 22:12:56 by alan

I wrote a program that did the following:

Find all pairs of article records that had identical:

    Article title
    List of author IDs
       Same names in the same order:
           "Smith J, Jones W" <> "Jones W, Smith J"
           "Smith J, Jones W" <> "Smith J, Jones W, Doe J"
    Journal ID

I told the program that the import dates had to be within 180 days of
each other.

For each of the retrieved PMIDs it then checks Pubmed to see if the PMID
has a record there.  The program then reports (plain text, I can make an
Excel sheet if this is actually useful) for each pair:

  PMID1  ImportDateTime  Good/Bad  <=>  PMID 2  ImportDateTime  Good/Bad

1 and 2 are the two Pubmed IDs that matched.  Good means it was found on
Pubmed and Bad means it was not.

A single Pubmed ID can occur more than once if it matches more than one
record, which sometimes happens.

I ran it on Dev and Prod and have attached the results:

    dupesDev180  = Results on Dev.
    dupesProd180 = Results on Prod.

There were 18 "Bad" records on Dev and 14 on Prod.  I expected the
opposite on the theory that Dev is older and some bad records at NLM
would have been found and deleted since then.  Perhaps some of the bad 
duplicates on Dev were edited so that an author or title was different,
or a journal ID changed, causing the pair not to trigger the problem on
Prod.

On Dev, 6 of the lower numbered PMIDs were bad and 12 higher numbered
were bad.  On Prod the numbers were 6 and 8.  Clearly, we can't assume
that the earlier (or the later) record will be picked by NLM as the
official one.

The fact that there were so many exact matches and so few records that
were not deleted at NLM makes me think that there are errors in the NLM
database (and ours), that they (and we) don't know about.  A great many
are accounted for by updates of earlier articles (e.g., new results from
an ongoing study), but I have seen some that just look like duplicate
records.

Does this help us?  Maybe.  It leads me to believe that checking for
PMIDs that have been deleted at NLM won't help us as much as we'd like
because:

    NLM corrections probably come too late in the process to save us
    from working on the same record twice under two PMIDs.

    Some duplicates may not be detected at all by NLM.

So we have multiple problems here:

    How big is the duplication problem?

    How do we find duplicates?

    Can we find them before we've done work on both?

    How do we determine which one is good and which bad?

    What do we do with a bad one?

Maybe we can discuss it at the status meeting.

Comment entered 2015-10-27 22:38:13 by alan

I've started a script running on my workstation to look up every single Pubmed ID in our database and check it against Pubmed. I estimate it should take about 18 hours to run. It could take longer, or totally crash, if the database, the network, NLM, or my workstation burps during that time. But maybe we'll be lucky.

I'll check it when I come in on Thursday, if not before.

Comment entered 2015-10-28 15:52:24 by Beckwith, Margaret (NIH/NCI) [E]

Response from Hilda Bastian at NLM:

G’day!

Yes, I do know the answer to this, although I don’t know how often it happens.

The PMIDs are generated automatically when the publishers submit the data – and some of them accidentally send in duplicates. In QA in the days following, when NLM staff detect it, they delete one. It’s utterly random and unpredictable because it’s publisher error.

No problem asking things like this – even if I don’t know the answer, I can always find someone who does.

Hilda

Comment entered 2015-10-29 11:49:15 by alan

As might be expected, the network outage stopped the Pubmed ID checks on Wednesday.

I've revised the program and restarted it, beginning just after the last Pubmed ID that was checked. I'm not sure how long it will take, but I'll post the results when they're available, or restart again if there is another failure.

Comment entered 2015-11-05 10:08:47 by alan

For future reference, I have removed the partial lists of bad PMIDs (i.e., PMIDs no longer recognized by Pubmed) that I attached to this issue and have instead attached a complete list generated by the program that completed last weekend.

I had modified the program to distinguish errors arising in the network and retry their retrievals so I believe that all of the "Bad" PMIDs in this list really are PMIDs reported as invalid on Pubmed, not retrieval failures due to connection problems.

There are 65 of them on Prod as of 10/30/2015 09:46 AM.
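
The retry behavior described here, telling transient network failures apart from PMIDs genuinely reported invalid by Pubmed, could be sketched like this; the fetch function is injected so nothing in the sketch actually touches NLM:

```python
import time

def check_pmid(pmid, fetch, retries=3, delay=0):
    """Classify a PMID as 'good' or 'bad'. `fetch(pmid)` is assumed to
    return True/False for found/not-found and raise OSError on network
    trouble; network errors are retried rather than being misreported
    as a bad PMID."""
    for attempt in range(retries):
        try:
            return "good" if fetch(pmid) else "bad"
        except OSError:
            if attempt == retries - 1:
                raise          # genuine outage: give up, don't guess
            time.sleep(delay)  # brief pause before retrying

calls = {"n": 0}
def flaky_fetch(pmid):
    """Fake fetch: fails once with a connection error, then answers."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("connection reset")
    return pmid != "25513462"  # pretend this PMID is gone from Pubmed

assert check_pmid("24292815", flaky_fetch) == "good"
assert check_pmid("25513462", flaky_fetch) == "bad"
```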

Comment entered 2015-12-07 17:24:12 by Juthe, Robin (NIH/NCI) [E]

In last week's CDR/EBMS meeting, we discussed several possible ways of handling and identifying these duplicate records. I'll summarize these briefly below. These aren't mutually exclusive; some could be done in combination. Alan, could you please give us a sense of the LOE for each of these approaches? I think I have them listed here roughly in order of most complicated to least complicated, but it would be helpful to get your assessment.

1. Merge duplicate records. The advantage to merging duplicate records is the preservation of all decisions made about a particular paper. It gets messy, however, when you consider what to do with multiple/disparate decisions at the same step. Which is the "right" decision? As Alan described earlier, we can't use the date/time the decision was entered to determine this.

2. Block/Delete the "wrong" record and record everything that is needed in the "right" record. This could be done by giving the "wrong" record a rejected decision, as Alan describes above. It may require additional tuning of reports/search results to weed out these duplicate records.

3. Check for duplicate records upon import using a complex match of author names, article title, date, journal, etc.

4. Run a periodic sweep of the system to look for additional "bad PMIDs" - PMIDs that are no longer valid in PubMed. This could be automated and/or run on an ad-hoc basis.

5. Create a tag to flag citations that have/are a duplicate.

Thank you!

Comment entered 2015-12-07 17:26:24 by Juthe, Robin (NIH/NCI) [E]

6. I should add that the cheapest option is to do what we're already doing - add comments to make a reference to the other record. I think this could be improved (without too much effort?) with the addition of #4 or #5 above, though.

Comment entered 2015-12-08 14:02:14 by alan

I agree with option 6, including options 4 and 5, which should not be expensive to do. I think we should also look into option 3 with the idea, not of taking direct action on the receipt of a duplicate, but rather of alerting the staff at the earliest possible time that there is a potential problem. If we can come up with an algorithm that gets the right balance of not too many false positives and not too many false negatives, we can save user time by preventing unnecessary reviews and problematic decisions.

Implementing 3, if it's not too expensive, will enable us to catch problems coming out of NLM that NLM itself might not detect until much later.

So I'm voting for 3, 4, and 5 all in support of 6.

Comment entered 2015-12-08 15:30:09 by alan

You also asked for level of effort information.

I estimate that LOE for points 3, 4, and 5 as follows:

#3 Check for duplicates on import. LOE=5.
There are three parts to implementing this.
First, write the code to do the comparisons on import.
Second, test the results against real data using the information about duplicates we already have.
Third, if and only if the second part is successful, modify the import program to enable it to issue warnings as well as errors.

#4 Sweep for duplicates. LOE=3.
I have a script that does this. It only needs to be modified to run regularly and email the report, assuming email is a good way to distribute it.

#5 Add a tag. LOE=1.
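
If email does turn out to be the right way to distribute the #4 report, the delivery piece could be as small as this stdlib sketch (the sender address, relay host, and subject line are placeholders, not actual EBMS configuration):

```python
import smtplib
from email.message import EmailMessage

def build_report_email(report_text, recipient):
    """Wrap the plain-text duplicate-sweep report in an email message."""
    msg = EmailMessage()
    msg["Subject"] = "EBMS weekly bad-PMID sweep"   # placeholder subject
    msg["From"] = "ebms-reports@example.gov"        # placeholder sender
    msg["To"] = recipient
    msg.set_content(report_text)
    return msg

msg = build_report_email("65 bad PMIDs found on Prod", "board-manager@example.gov")
assert msg["To"] == "board-manager@example.gov"
# Sending would then be something like:
# with smtplib.SMTP("smtp.example.gov") as s:   # hypothetical relay
#     s.send_message(msg)
```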

Comment entered 2015-12-10 10:22:05 by Juthe, Robin (NIH/NCI) [E]

Thanks for the estimates. We'd like to proceed with #4 and #5 for now.

#4 - Please run the report weekly and have it sent to Bonnie (bonnie.ferguson@nih.gov).

#5 - I will get you the name for the tag.

Comment entered 2015-12-17 10:14:31 by Juthe, Robin (NIH/NCI) [E]

How about "Related citation" for the tag name?

Comment entered 2015-12-17 11:24:50 by Kline, Bob (NIH/NCI) [C]

How is this going to mesh with OCEEBMS-349? Are we going to want two different ways to represent relationships between articles?

Comment entered 2015-12-21 21:18:02 by alan

The full report took about 12-1/2 hours to run. That's because it communicates with NLM individually for each record in our database.

It turns out that, of the 421,000+ records in our database, all but around 55,000 are in one or another of our rejected states. If we only process those non-rejected records, it should cut about 80% off the processing time.

I've added in code to enable that. Should I make that the default, or should the complete set of articles be the default?

Comment entered 2015-12-22 13:00:40 by alan


We'll need to discuss how and where to run the report program. It's written in Python and can read the production database from any computer that has Python installed, on any of the tiers, including Windows workstations.

There would be some advantages to running in some place other than on the production server. It would give the developers direct access to the program and log files and make it easy to modify the schedule and the list of email recipients, and to restart after network or other failures. Also, the installation and maintenance of the program would not be release dependent.

Comment entered 2015-12-22 13:40:44 by Juthe, Robin (NIH/NCI) [E]

Upon further reflection, we've decided we no longer need the tag for these citations. With the mechanism Bob proposed for linking related citations in OCEEBMS-349, a tag for related citations would be redundant. I think linking them is the way to go.

Alan - good idea to filter out the rejected citations. I think we should do this. However, we need to be careful to only filter out citations that have been rejected for ALL associated summary topics. It could be that a citation was rejected for one topic/Board combination but approved for another Board/topic, and it should still be checked by the sweep in that case.

Comment entered 2015-12-22 13:58:24 by alan

However, we need to be careful to only filter out citations that have been rejected for ALL associated summary topics

Right. I think I'm doing that correctly. Running separate queries to find all rejected articles (393,526) and all non-rejected articles (55,893) I get a number (449,419) that is much larger than the total number of articles (421,823). The reason is that there are some articles currently in both states, each for different topics.
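
The overlapping counts can be reproduced with sets: an article lands in both totals whenever different topics put it in different state families. A toy illustration (the topic and state data here are invented):

```python
# Each tuple is (article_id, topic, state_family).
states = [
    (1, "breast", "rejected"),
    (1, "lung", "active"),     # article 1 is counted in BOTH sets
    (2, "breast", "rejected"),
    (3, "lung", "active"),
]
rejected = {a for a, _, s in states if s == "rejected"}
active = {a for a, _, s in states if s == "active"}
total = {a for a, _, _ in states}
# 2 + 2 = 4 set memberships, but only 3 distinct articles,
# mirroring 393,526 + 55,893 = 449,419 > 421,823 on Prod.
assert len(rejected) + len(active) == 4
assert len(total) == 3
```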

I had just started looking at the tag issues. I'll stop and move on to other problems.

Incidentally, there are a few screwy cases. For example there are four articles that don't appear to have any state at all. They are all from 2002 and I imagine that errors in the old CMS database survived into the new one with no good way to fix them. EBMS IDs are 5063, 2386, 2433, 2435.

Comment entered 2015-12-22 16:05:39 by Juthe, Robin (NIH/NCI) [E]

Just looked at those records and none have associated Boards or topics. Very strange. They're so old and such edge cases that I don't think we need to worry about them. Four out of 421,000 isn't bad!

Comment entered 2015-12-31 20:02:38 by alan

I have put the following two scripts into version control in the ebms/branches/3.2/scripts directory.

findPubmedDrops.py
   Finds Pubmed IDs in our database that are no longer in the Pubmed database.
   This is the program that implements Robin's method 4 to find PMIDs no longer at NLM.
   findPubmedDrops.py --help for usage info.

findPubmedDupes.py
   Finds PMIDs with matching authors, titles, and journal IDs in our database.  This one
   produces false hits.  It needs further qualification to at least require matching 
   publication dates, volume and issue numbers, or something like that.  However we
   decided not to take that route so I stopped work on it but am preserving it in
   the version archive in case we change our minds again.

Comment entered 2016-01-21 14:30:18 by Kline, Bob (NIH/NCI) [C]

We're going to also include the "invalid PMID" report on the web Reports section. The link to the report will appear for users with the "Admin Assistant" or the "Site Administrator" role.

Comment entered 2016-01-25 17:09:37 by Kline, Bob (NIH/NCI) [C]

Installed as a web-based report and as a weekly cron job on DEV.

Comment entered 2016-02-01 11:39:00 by Juthe, Robin (NIH/NCI) [E]

We have a few questions about this report:

1) How will citations "fall off" the report? I thought they would fall off by ensuring they have a current "no" decision status for every associated topic, but I just tried adding a no decision to one of them (PMID: 15026919) and it still appears on the report. Is the query done in real time?

2) Could you please explain the "batch size" and "delay" fields? I am tempted to just leave the defaults as is, but I want to be sure I understand them in case I need to change them for any reason.

3) Will the weekly report be set up to run each week regardless of whether there are any active invalid PMIDs, or will it only run when there are citations that meet this criterion?

Thanks!

Comment entered 2016-02-01 12:15:50 by Kline, Bob (NIH/NCI) [C]

How will citations "fall off" the report?

If the current state for every topic associated with an article is one of the following, the article will not appear on the report:

  • RejectJournalTitle (journal was on the board's "not" list)

  • RejectInitReview (rejected by the librarians)

  • RejectBMReview (rejected by the board manager based on the abstract)

  • RejectFullReview (rejected after looking at the article's full text)
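
The rule above amounts to: an article drops off the report only when every associated topic's current state is in the rejection set. A minimal sketch (state names taken from the list above; the function itself is illustrative, not the actual EBMS code):

```python
# The four rejection states listed above.
REJECTION_STATES = {
    "RejectJournalTitle",
    "RejectInitReview",
    "RejectBMReview",
    "RejectFullReview",
}

def excluded_from_report(topic_states):
    """True when the current state for every topic associated with the
    article is a rejection state (so the article falls off the report).

    `topic_states` maps topic name -> current state string.
    """
    return bool(topic_states) and all(
        state in REJECTION_STATES for state in topic_states.values()
    )
```
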

I just tried adding a no decision to one of them (PMID: 15026919) and it still appears on the report.

The state into which you put the article was "No further action"; would you like that state added to the list we're using?

Is the query done in real time?

Yes.

Could you please explain the "batch size" and "delay" fields?

Sure. "Batch size" is the number of Pubmed IDs we ask NLM to look up in a single request. The lower the number, the longer the report will take (increasing the chances that the web server or browser will time out the request), and the more likely we'll catch NLM napping (causing the report to fail), because there will be more batches; on the other hand, smaller batches make it less likely we will run out of resources (memory, for example) either on our server or NLM's. I would guess that more batches also increases the chances that NLM will regard our requests as "flooding" their server, at which point they might cut us off temporarily. If you set the batch size to "1" the report would take roughly three and a half hours (though it would be timed out long before that).

The "delay" value specifies the number of seconds we wait between batches. Longer values reduce the chance that NLM's server will block us for overloading their service. I imagine it's only really significant (both for performance and for the risk of being blocked) when there are a large number of batches. We don't really know what NLM considers a "large number" in this context, but the default values haven't ever failed as far as I know. If they do, we can use the interface to tweak the values.
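
The batch/delay mechanics described above could be sketched like so (a hypothetical helper, with the actual NLM request abstracted behind a callable; all names are illustrative):

```python
import time

def chunked(pmids, batch_size):
    """Yield successive batches of at most batch_size PMIDs."""
    for i in range(0, len(pmids), batch_size):
        yield pmids[i:i + batch_size]

def find_dropped(pmids, lookup, batch_size=100, delay=2.0):
    """Return the PMIDs that NLM can no longer find.

    `lookup` stands in for one request to NLM (e.g. an E-utilities call)
    and returns the subset of a batch that NLM still has.  Fewer, larger
    batches finish faster; the `delay` seconds between batches reduce
    the risk of NLM treating the requests as flooding.
    """
    missing = []
    batches = list(chunked(pmids, batch_size))
    for n, batch in enumerate(batches):
        found = set(lookup(batch))
        missing.extend(p for p in batch if p not in found)
        if n + 1 < len(batches):  # no pause needed after the last batch
            time.sleep(delay)
    return missing
```
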

Will the weekly report be set up to run each week regardless of whether there are any active invalid PMIDs, or will it only run when there are citations that meet this criterion?

Well, it's a safe assumption that we'll never have the database in a condition in which the current state for every topic of every article is "rejected." We would have to do all the processing anyway to find out which articles NLM can't find, so skipping the report wouldn't save any work (other than the machine's). It also seems extremely unlikely that NLM will ever resurrect all of the missing articles. If a miracle happened and they actually did, we could have the software skip sending the report when it discovers there are no IDs on it. But if I were the report's regular recipient, I would be more inclined to believe that the report was broken that week and didn't run than that NLM had found all of its lost articles.

Comment entered 2016-02-01 16:04:29 by Juthe, Robin (NIH/NCI) [E]

Thanks, Bob. Could you please add "no further action" (Board manager action) and "not cited" (Editorial Board decision) to the list of rejected values that you are using for this query? Thank you.

Our plan is to review and address the citations each week, both linking them to the correct record and also giving them a "no" decision so that they fall off the report and we aren't seeing the same set of invalid PMIDs each week. It's fine to send the report each week, but I suspect if we are keeping up with this it will often have zero results.

Comment entered 2016-02-02 07:54:24 by Kline, Bob (NIH/NCI) [C]

I'll see if that's feasible. Those aren't states, though.

Comment entered 2016-02-03 10:51:00 by Kline, Bob (NIH/NCI) [C]

I think I have cracked the nut. The report now shows only 15 articles on QA-SG. Please give it a try.

Comment entered 2016-02-03 16:09:19 by Juthe, Robin (NIH/NCI) [E]

This looks great to me. Everything dropped off the report as expected. Thanks, Bob! I'll mark this as verified on QA.

Comment entered 2016-04-06 16:20:29 by Juthe, Robin (NIH/NCI) [E]

Bob, should this report be running automatically on a weekly basis on PROD now? Bonnie said she hasn't received it via e-mail yet, so we just wanted to check. The report (accessible under Citation Reports) is currently showing 16 citations on PROD with invalid PMIDs.

Comment entered 2016-04-26 14:06:40 by Juthe, Robin (NIH/NCI) [E]

Hi Bob, Bonnie still hasn't received this report. I think we were expecting it to be run over the weekend.

Comment entered 2016-04-26 18:29:29 by Kline, Bob (NIH/NCI) [C]

That's because she wasn't in the recipient list. I've fixed that, so she'll get it next week.

Comment entered 2016-04-27 17:53:55 by Juthe, Robin (NIH/NCI) [E]

Thanks, Bob. Could you please add me to the list too if it isn't too much trouble?

Comment entered 2016-04-27 18:14:21 by Kline, Bob (NIH/NCI) [C]

Done.

Comment entered 2016-05-02 13:41:35 by Juthe, Robin (NIH/NCI) [E]

Verified on PROD. We received this report yesterday. Closing issue.

Attachments
File Name Posted User
24234000.xml 2015-10-22 17:29:03
24292815.xml 2015-10-22 17:29:03
25049327.xml 2015-10-15 20:07:45
25513462.xml 2015-10-15 20:07:45
BadPmids.txt 2015-11-05 10:08:47
