Issue Number | 687 |
---|---|
Summary | XML refresh job doesn't find all the articles which have been updated |
Created | 2023-01-09 14:51:30 |
Issue Type | Bug |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2023-02-04 14:44:48 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/oceebms/issue.335972 |
JIRA won't let me create this ticket. So I have removed the description and will post it as a comment.
[Second attempt didn't work, either, so I am paring down the comment in the hopes that the JIRA monster will be appeased and accept my submission.]
I opened a ticket with NLM asking why we're not getting notified by
PubMed for all of the articles which have been changed. It turns out
that there is now a limit on the number of article IDs which NLM will
return when we ask which articles were changed on a given day. The
response is now limited to 10,000 IDs. You wouldn't think that limit
would be a problem, would you? But it turns out that on December 22 (the
date NLM gave me for the modification of an article for which I noticed
we don't have the changes) NLM changed a whopping 42,051 articles! 🙁
And that's not a once-in-a-lifetime anomaly either. So I've been digging
into the documentation,
and sure enough, "For PubMed, ESearch can only retrieve the first 10,000
records matching the query. To obtain more than 10,000 PubMed records,
consider using <EDirect> that contains additional logic to batch
PubMed search results automatically so that an arbitrary number can be
retrieved." So I followed my nose to the EDirect tools, and I have them
installed, and I'm trying to use them to get the IDs for ALL the
articles that were changed on that day, but neither the documentation
for the tools nor the command-line help for the esearch
command explains how to do that. (I couldn't even see an obvious way to
tell the tool I wanted it to give me article IDs instead of just a
count, but I was able to figure out how to hack the code to force that
option to be set.) I asked Sarah Weis (who was responding to my ticket)
what's the secret sauce, promising to be suitably embarrassed if she
shows me that the documentation actually does contain the information
about what ritual sacrifices must be performed "so that an arbitrary
number can be retrieved," and I just missed it. Once they share the
necessary information I'll modify the cron job to use the new
approach.
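For reference, the truncation described above can be detected from the ESearch response itself. The following is a minimal sketch, not the code used by the cron job: it assumes the JSON output of ESearch (`retmode=json`), where `count` is the total number of matches and `idlist` holds the IDs actually returned. The helper names `esearch_url` and `check_truncation` are hypothetical, and the query term is left to the caller because the ticket does not say which term the job sends.

```python
import json
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
ESEARCH_CAP = 10_000  # PubMed limit quoted from the NLM documentation above


def esearch_url(term, day, retmax=ESEARCH_CAP):
    """Build an ESearch URL restricted to one modification date.

    `term` is whatever query the caller uses (an assumption here);
    `day` is a date string in YYYY/MM/DD form.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "mdat",  # limit by modification date
        "mindate": day,
        "maxdate": day,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


def check_truncation(payload):
    """Return (ids, total, truncated) for an ESearch JSON response body."""
    result = json.loads(payload)["esearchresult"]
    ids = result.get("idlist", [])
    total = int(result["count"])  # ESearch reports the count as a string
    # Fewer IDs than the reported total means the list was cut off.
    return ids, total, len(ids) < total
```

With a response reporting 42,051 matches but carrying at most 10,000 IDs, `check_truncation` returns `truncated == True`, which is the signal that a single request is not enough for that day.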
Added watchers.
I have forwarded the correspondence with NLM to ~juther.
I have been implementing and testing an alternate solution for the requirement to identify PubMed articles which have been updated in the NLM's database since we last fetched their data. It takes the NLM under a second to give us the wrong answer to the question "which articles have been changed?" and I expect it would take under two seconds for them to give us the right answer. I have a solution which takes less than half an hour, and I'm confident that it's reliable. I'm going to recommend that we stop wasting time trying to plead our case and just go with an approach which we know will work. I have used the new tools to refresh the data for all of the articles on https://ebms.rksystems.com which had fallen behind the records at the NLM (several thousand), though there may be articles which have changed during the intervening hours and which will need refreshing the next time the scheduled job runs.
The scheduled job to refresh the XML has been rewritten to work around the bugs in the NLM APIs.