EBMS Tickets

Issue Number 687
Summary XML refresh job doesn't find all the articles which have been updated
Created 2023-01-09 14:51:30
Issue Type Bug
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2023-02-04 14:44:48
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.335972
Description

JIRA won't let me create this ticket. So I have removed the description and will post it as a comment.

Comment entered 2023-01-09 14:55:06 by Kline, Bob (NIH/NCI) [C]

[Second attempt didn't work, either, so I am paring down the comment in the hopes that the JIRA monster will be appeased and accept my submission.]

I opened a ticket with NLM asking why we're not getting notified by PubMed for all of the articles which have been changed. It turns out that there is now a limit on the number of article IDs which NLM will return when we ask which articles were changed on a given day. The response is now limited to 10,000 IDs. You wouldn't think that limit would be a problem, would you? But it turns out that on December 22 (the date NLM gave me for the modification of an article for which I noticed we don't have the changes) NLM changed a whopping 42,051 articles! 🙁 And that's not a once-in-a-lifetime anomaly, either.

So I've been digging into the documentation, and sure enough, "For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved."

So I followed my nose to the EDirect tools, and I have them installed, and I'm trying to use them to get the IDs for ALL the articles that were changed on that day, but neither the documentation for the tools nor the command-line help for the esearch command explains how to do that. (I couldn't even see an obvious way to tell the tool I wanted it to give me article IDs instead of just a count, but I was able to figure out how to hack the code to force that option to be set.)

I asked Sarah Weis (who was responding to my ticket) what the secret sauce is, promising to be suitably embarrassed if she shows me that the documentation actually does contain the information about what ritual sacrifices must be performed "so that an arbitrary number can be retrieved," and I just missed it. Once they share the necessary information I'll modify the cron job to use the new approach.
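To make the shortfall concrete, here is a minimal Python sketch of how a job could notice that an ESearch response has been truncated. It assumes the response was requested as JSON (retmode=json) and already parsed, with the total number of matches in esearchresult.count (a string) and the returned IDs in esearchresult.idlist. The function name and the sample response are hypothetical and are not taken from the EBMS code.

```python
def missing_id_count(response):
    """Return how many matching article IDs an ESearch response left out.

    `response` is the parsed JSON from an ESearch request (assumed to have
    been made with retmode=json). ESearch reports the total number of
    matches as a string in `count`, while `idlist` holds only the IDs
    actually returned.
    """
    result = response["esearchresult"]
    total_matches = int(result["count"])
    returned = len(result["idlist"])
    # A positive difference means the ID list was cut short.
    return max(total_matches - returned, 0)


# Hypothetical response shaped like the December 22 case in this ticket:
# 42,051 articles matched, but only 10,000 IDs came back.
sample = {
    "esearchresult": {
        "count": "42051",
        "idlist": [str(n) for n in range(10000)],
    }
}
print(missing_id_count(sample))
```

A check like this would not recover the missing IDs, but it would at least let the scheduled job report that a day's list of changed articles is incomplete instead of silently skipping updates.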

Comment entered 2023-01-09 14:56:59 by Kline, Bob (NIH/NCI) [C]

Added watchers.

Comment entered 2023-01-19 16:44:26 by Kline, Bob (NIH/NCI) [C]

I have forwarded the correspondence with NLM to .

Comment entered 2023-01-30 09:58:29 by Kline, Bob (NIH/NCI) [C]

I have been implementing and testing an alternate solution for the requirement to identify PubMed articles which have been updated in the NLM's database since we last fetched their data. It takes the NLM under a second to give us the wrong answer to the question "which articles have been changed?" and I expect it would take under two seconds for them to give us the right answer. I have a solution which takes less than half an hour, and I'm confident that it's reliable. I'm going to recommend that we stop wasting time trying to plead our case and just go with an approach which we know will work. I have used the new tools to refresh the data for all of the articles on https://ebms.rksystems.com which had fallen behind the records at the NLM (several thousand), though there may be articles which have changed during the intervening hours and which will need refreshing the next time the scheduled job runs.

Comment entered 2023-02-04 14:44:48 by Kline, Bob (NIH/NCI) [C]

The scheduled job to refresh the XML has been rewritten to work around the bugs in the NLM APIs.
