PDQ Issues

Issue Number	211
Summary	[Import Citations] article full text link lost when replacing xml for an article
Created	2014-06-17 21:45:00
Issue Type	Bug
Submitted By	alan
Assigned To	alan
Status	Closed
Resolved	2014-07-31 11:21:22
Resolution	Fixed
Path	/home/bkline/backups/jira/oceebms/issue.129746

Description

Bob discovered that many of the articles in the EBMS database have lost
their full text ID link, the link that connects the downloaded PDF for
the article to the bibliographic record.

After some searching we discovered a possible cause, now confirmed, in
some software that I wrote back at the beginning of the system.  When
the XML for an article is replaced in the database, the full_text_id
linking to the PDF is reset to its initial NULL value, causing the link
to be lost.
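
As an illustration only (this is not the EBMS code), the sketch below
shows one way a replacement routine can reset a column it never meant to
touch, and a version that leaves it alone.  The table name `article`,
the column `source_data`, and all three function names are hypothetical;
only `full_text_id` comes from this issue.  SQLite stands in for the
real database.

```python
import sqlite3

def make_db():
    # Hypothetical schema: one article row that already has a PDF link (42).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE article (article_id INTEGER PRIMARY KEY, "
        "source_data TEXT, full_text_id INTEGER)"
    )
    db.execute("INSERT INTO article VALUES (1, '<old/>', 42)")
    return db

def replace_xml_buggy(db, article_id, xml):
    # Rewrites the whole row; columns not listed fall back to NULL,
    # so the PDF link is silently lost.
    db.execute(
        "INSERT OR REPLACE INTO article (article_id, source_data) "
        "VALUES (?, ?)",
        (article_id, xml),
    )

def replace_xml_fixed(db, article_id, xml):
    # Updates only the XML column; full_text_id is left untouched.
    db.execute(
        "UPDATE article SET source_data = ? WHERE article_id = ?",
        (xml, article_id),
    )

def get_full_text_id(db, article_id):
    row = db.execute(
        "SELECT full_text_id FROM article WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    return row[0]
```

Whatever the actual mechanism was, the point is the same: the
replacement should change the XML and nothing else.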

Replacements occur as a result of two kinds of activity.

    Firstly, if a user re-imports a record, for example for a new topic,
    the article XML from NLM is compared to the XML we have in our
    database.  If it's different, the new XML from NLM is installed,
    losing any PDF link.

    Secondly, and more seriously, if a program is run to refresh data
    from NLM, the entire database is checked against NLM changes and all
    new XML is downloaded and stored.  This was done recently in
    production and resulted in the loss of some thousands of article PDF
    links, more than half of the total.

The fix involves two parts:

    Fixing the bug.

    Recovering the lost links.

                             Fixing the Bug
                             --------------

I believe that I have fixed the bug on DEV.  To test, I did the
following:

    Found a record with an intact link to a PDF document.

    Changed a few characters in the XML in order to force a replacement
    on a new import.

    Imported the article again, assigning a new topic to it.

I checked the results and, indeed, the full text link was lost.  I think
that confirms that we found where the problem was happening.

I then did the following:

    Installed the program fix on DEV.

    Restored the full text link, using a routine that can be called
    from anywhere in the software to insert the link.

    Modified a few characters in the record again to force an XML refresh.

    Re-imported for yet another new topic.

This time the check showed that the full text link was undisturbed.

I will install the fixed program in subversion and, after consulting
with Bob, create an issue for CBIIT to perform an emergency hotfix to
install it in Production.

                       Recovering the Lost Links
                       -------------------------

Bob and I have done some analysis of the data.  Luckily, it turns out
that the long-standing CMS and EBMS convention of including a PubMed ID
in the name of a PDF file for an article can give us the data to
programmatically restore links for all of the articles that have a
PubMed ID in the file name.  That will be the great majority of the
files.  There will be some others that will require more effort.
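
A minimal sketch of the file-name approach follows.  The exact naming
convention is not spelled out in this issue, so the sketch assumes that
the PubMed ID is the only run of five to eight digits in the file name;
anything ambiguous is left for manual review.  The function names and
the shape of the inputs are hypothetical.

```python
import re

# Assumption: a PubMed ID shows up as a stand-alone run of 5 to 8 digits.
PMID_IN_NAME = re.compile(r"(?<!\d)(\d{5,8})(?!\d)")

def pmid_from_filename(name):
    """Return the PubMed ID found in a PDF file name, or None if the
    name has no candidate or more than one."""
    stem = name.rsplit(".", 1)[0]
    candidates = PMID_IN_NAME.findall(stem)
    return candidates[0] if len(candidates) == 1 else None

def match_links(files, article_by_pmid):
    """files: iterable of (file_id, file_name) pairs.
    article_by_pmid: dict mapping PubMed ID string to article ID.
    Returns (links, unmatched): links maps article ID to file ID;
    unmatched lists the file IDs that need more effort."""
    links, unmatched = {}, []
    for file_id, name in files:
        article_id = article_by_pmid.get(pmid_from_filename(name))
        if article_id is None:
            unmatched.append(file_id)
        else:
            links[article_id] = file_id
    return links, unmatched
```

Files that land in `unmatched` correspond to the "some others" mentioned
above that cannot be restored from the file name alone.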

This issue is to get the bug fix installed everywhere and to fix the
data.

Comment entered 2014-06-17 21:50:13 by alan

I have added multiple users as watchers on this issue, including our medical librarians who do the bulk of our citation imports.

Please have a look at the description of the issue for information about this critical bug.

Comment entered 2014-06-17 21:51:33 by alan

As a further test of the fix I made, we might try the following:

Get a list of all article IDs and associated full text file IDs on DEV.

Perform a refresh of the data from NLM.

Get the list again. There should be no changes.
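
The before-and-after comparison could be sketched roughly like this. The table name `article` is a placeholder, not necessarily the real EBMS table, and SQLite stands in for the real database driver.

```python
import sqlite3

def snapshot_links(db):
    """Return {article_id: full_text_id} for every article."""
    rows = db.execute("SELECT article_id, full_text_id FROM article")
    return dict(rows.fetchall())

def changed_links(before, after):
    """Return {article_id: (old, new)} for every article whose link
    differs between the two snapshots (including articles that vanished)."""
    return {
        article_id: (old, after.get(article_id))
        for article_id, old in before.items()
        if after.get(article_id) != old
    }
```

Take one snapshot, run the refresh, take a second snapshot, and pass both to `changed_links`. An empty result means no links were disturbed.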

Comment entered 2014-06-17 21:51:54 by Kline, Bob (NIH/NCI) [C]

One possible approach to restoring at least some of the links to the full text would be to ask CBIIT for the backup of the database which was made immediately prior to the refresh of XML from NLM, and extract the missing full_text_id values for the articles which lost the value during that refresh.

Comment entered 2014-06-17 22:02:41 by alan

Excellent idea.

They can either give us the full backup or just an extract containing all of the article_id, source_id, and full_text_id tuples where the full_text_id is not null.
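
If we get an extract in that form, applying it might look something like the sketch below. It assumes a table named `article` with those three columns (the table name is a guess for illustration), and it only fills in links that are currently NULL, so any link created after the backup is never overwritten.

```python
import sqlite3

def restore_links(db, extract):
    """extract: iterable of (article_id, source_id, full_text_id) tuples
    taken from the backup.  Returns the number of links restored."""
    restored = 0
    for article_id, source_id, full_text_id in extract:
        cursor = db.execute(
            "UPDATE article SET full_text_id = ? "
            "WHERE article_id = ? AND source_id = ? "
            "AND full_text_id IS NULL",
            (full_text_id, article_id, source_id),
        )
        restored += cursor.rowcount
    db.commit()
    return restored
```

Matching on both article_id and source_id is a safety check against the unlikely case that an article ID in the backup no longer refers to the same citation.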

I think we should create the issue for this ASAP since CBIIT probably recycles backups after some time has passed.

If you know the relevant date, can you either create the issue for the database team or just send me the date so I can do it? It probably merits a call to Nana too, both to explain the criticality of the request and to discuss a full backup vs. an extract.

Comment entered 2014-06-17 23:44:44 by alan

The fix is in the 3.0 branch of subversion, is merged back into trunk, and is installed in DEV. It's not on QA yet, though maybe it should be.

Comment entered 2014-06-18 08:58:49 by Juthe, Robin (NIH/NCI) [E]

Alan and Bob, Thanks for the quick work to troubleshoot this bug and identify solutions.

Alan, is the fix on QA? I am trying to test your fix on DEV while using QA to compare what should NOT happen, but they appear to be behaving similarly.

Comment entered 2014-06-18 11:11:54 by alan

The fix is not on QA but it's difficult to test by comparing the two servers.

The import software only replaces the original data stored in EBMS if the new data from NLM is different from that. So importing an article to assign a new topic won't lose the PDF link if the XML from NLM is unchanged.
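
In rough terms the logic is like the sketch below. This is a simplification, not the actual import code: the record is modeled as a plain dict, the comparison is assumed to be a simple string comparison, and the function name is made up.

```python
def import_article(record, nlm_xml, topic):
    """Assign a topic and replace the stored XML only if NLM's differs.
    Returns True if the XML was replaced."""
    record["topics"].add(topic)
    if record["xml"] == nlm_xml:
        # Unchanged: the replacement path, where the bug lived, never runs.
        return False
    record["xml"] = nlm_xml
    return True
```

A test that re-imports an unchanged article always takes the early return, so it cannot tell a fixed server from an unfixed one.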

The way I tested was to go in by hand using a database client to modify the saved XML, forcing the import to re-import new XML, but you can't do that from inside the EBMS because we provide no interface for modifying XML.

If, in your test, you saw the PDF link disappear from both DEV and QA, then I haven't fixed the bug. But if you saw it remain OK on both servers, then the test probably didn't exercise the real problem.

We can find some records that have changed on PROD after the refresh of DEV and QA and you can test with them, but I think our best bet for a more general test is to have Bob re-run his refresh on one or both servers. Before doing that we'll need to make backups (we can do that ourselves on DEV and QA) so that we can quickly restore to the state before any damage done by the testing.

Comment entered 2014-06-18 11:38:25 by Kline, Bob (NIH/NCI) [C]

I dug into my notes, and it appears that I began the refresh of XML from NLM on PROD June 3, so I have requested the backup from June 2:

https://tracker.nci.nih.gov/browse/DBATEAM-1101

Comment entered 2014-07-01 16:13:32 by trivedim

I am ready to import July 2014 Review cycle citations starting tomorrow. This might result in a few citation replacements. Should I go ahead and import? I can make a note of PMIDs that get replaced.

Comment entered 2014-07-01 16:29:51 by alan

Hello Minaxi,

The bug fix has been installed on production and it should now be safe to import both new and replacement documents. Most of the lost links between articles and full text documents have been repaired, though there are still some that have not been. Replacing those documents via a new import won't fix any of them but won't make anything worse.

So, yes, go ahead and import. I don't think it's necessary to make a note of PMIDs that are replaced. We have that information in the database if we ever need it.

Thanks for checking before importing.

Comment entered 2014-07-31 11:21:22 by Kline, Bob (NIH/NCI) [C]

The fix for the bug is in production. The remaining data cleanup will take place under ticket OCEEBMS-210.
