EBMS Tickets

Issue Number 210
Summary [Literature] Articles in Packets Without Linked Full-Text
Created 2014-06-05 17:33:01
Issue Type Bug
Submitted By Juthe, Robin (NIH/NCI) [E]
Assigned To alan
Status Closed
Resolved 2014-08-26 08:42:47
Resolution Fixed
Path /home/bkline/backups/jira/oceebms/issue.129059
Description

Some articles are creeping into packets without a linked full-text PDF. The full-text should be required in order for the article to be eligible to be included in a packet.

Here's an example:
Genetics of Skin Cancer (March 2014) [Packet #13670]

This packet contains 3 articles that do not have full text.

Comment entered 2014-06-19 15:39:41 by Kline, Bob (NIH/NCI) [C]

The packets were created when the articles all had full text associated with them. A bug (tracked elsewhere) caused the links to the full text PDFs to be dropped. We're working on restoring as many of these links as we can.

Comment entered 2014-08-14 15:49:56 by Juthe, Robin (NIH/NCI) [E]

Erika, we discussed this issue in today's CDR/EBMS meeting and we would like to move it into release 3.1. Bob and Alan felt that they would be able to complete the work within the established timeline. (The Agile Board is not displaying at the moment so I am not able to move this into the release, so I thought I would post this as a comment for the time being.)

Comment entered 2014-08-14 17:01:26 by henryec

Apologies for the board not showing up. It is fixed now and I was able to move this into release 3.1.

Comment entered 2014-08-14 23:30:09 by alan
[Note: I forgot to set the formatting option the first time I saved this.]

I did some checking on this as follows:

    Searched for all articles associated with active packets that
    had no link to a full text PDF.

    For each article, searched for any file that we have that has
    the Pubmed ID for that article embedded in the filename.

There were a total of 60 unique Pubmed IDs found for articles
with no link to a full text file but with at least one link to an
active packet.

Of the 60:

    5 did not have a full text article with a matching Pubmed ID
    in the filename.  It is possible that some of these are in
    the database, but with the Pubmed ID missing or mistyped in
    the filename.

    1 had two full text articles with the same Pubmed ID in the
    filename.
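
[Illustrative sketch, not the code that was actually run.] The matching and tallying described above could look roughly like the following, assuming the Pubmed ID appears in the filename as a whole run of digits. The function names and sample filenames are made up for illustration; the real search ran against the EBMS database.

```python
import re
from collections import defaultdict


def match_files_by_pmid(pmids, filenames):
    """Map each Pubmed ID to the filenames that contain it as a whole number."""
    wanted = set(pmids)
    matches = defaultdict(list)
    for name in filenames:
        # Compare whole digit runs so that "111" does not match inside "1112".
        for digits in set(re.findall(r"\d+", name)):
            if digits in wanted:
                matches[digits].append(name)
    return {pmid: matches.get(pmid, []) for pmid in pmids}


def classify(matches):
    """Split Pubmed IDs into those with no file, exactly one file, or several."""
    missing = [p for p, files in matches.items() if not files]
    unique = [p for p, files in matches.items() if len(files) == 1]
    duplicate = [p for p, files in matches.items() if len(files) > 1]
    return missing, unique, duplicate


if __name__ == "__main__":
    # Hypothetical IDs and filenames, for demonstration only.
    sample = match_files_by_pmid(
        ["111", "222", "333"],
        ["Smith 111.pdf", "Jones_222.pdf", "Jones 222 copy.pdf", "Other 1112.pdf"],
    )
    print(classify(sample))
```

A filename with the Pubmed ID missing or mistyped would land in the "missing" group under this approach, which is consistent with the caveat above about the five unmatched articles.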

I'm attaching a spreadsheet with the results in three columns:

    Column A = Pubmed ID

    Column B = Internal Drupal unique identifier for the file.
               This is what we put in the full text links.

    Column C = Filename of the file.
               " - EXTRA FILE PDF" appended to the one duplicate.
               "NO FILE PDF FOUND" for each of the five missing files.

The first thing we need to do with this spreadsheet is to have
someone check that the results are accurate.  Checking a sample
of them may be enough.  I tried to get into the Prod EBMS tonight
to do some checking but had various connection problems.  I don't
know what's going on with that.

I'd also like to walk through what I did with Bob since he's
worked on this too and might spot anything I did wrong.

After that I can write a script to give to CBIIT to fix the 54
links that the search found, and then we should probably fix
the other six by hand.  That might be done by re-downloading PDFs
from Pubmed.  Alternatively, if we can figure out the right
document to link to (which we certainly can for the "EXTRA FILE
PDF" case and might also be able to do for some or all of the
others), we can hand CBIIT a script to make those links without
downloading anything again, though I don't know if it's
worth bothering with that.

Comment entered 2014-08-14 23:32:54 by alan

Spreadsheet of articles in packets without linked full-text.

Comment entered 2014-08-15 15:58:26 by Juthe, Robin (NIH/NCI) [E]

I checked several of the articles on the spreadsheet by looking them up in PubMed and seeing if the filename made sense (author, journal, etc). They looked good. However, a better test would probably be to confirm that the file is correct since it could be misnamed. Since there are only 60, this wouldn't be hard to do. Would it be possible to generate another version of this spreadsheet in which the filename is a link to the file? Or, let me know if there's a better way.

I think we should handle the 6 citations without a PDF manually. I will ask Bonnie to upload them.

Comment entered 2014-08-19 09:42:30 by alan

"Would it be possible to generate another version of this spreadsheet in which the filename is a link to the file?"

That sounds like a useful way to test. I'll set it up today.

Comment entered 2014-08-19 09:45:55 by Juthe, Robin (NIH/NCI) [E]

Thanks, Alan. I also wanted to mention that Bonnie has corrected the 6 citations that didn't have a one-to-one match. She uploaded the correct PDFs on prod.

Comment entered 2014-08-19 11:45:18 by alan

Revised version of the spreadsheet. The PMID and PDF columns are now hyperlinked. Records already fixed by Bonnie are not included.

Comment entered 2014-08-19 11:49:26 by alan

I'm thinking that, if we give this high priority and get the new spreadsheet checked in the next couple of hours, and assuming everything is okay, we can write a script to perform the fixes and add it along with the other changes that we plan to give to CBIIT tomorrow.

Comment entered 2014-08-19 14:19:41 by Juthe, Robin (NIH/NCI) [E]

All but one of these checked out - the one that didn't is a submitted manuscript as opposed to the published article. The PMID is 23764071. Would it be best to post the correct PDF manually for that one?

Comment entered 2014-08-19 15:49:20 by alan

I've generated an SQL script to fix all of the broken links that were not already fixed by Bonnie. The next step is to create a request to CBIIT in JIRA for them to install the software updates and run the SQL script.

If anyone fixes more of the links before the script is run, whatever is done by hand will be preserved and that article will be skipped in the scripted update. However, manual effort should be unnecessary unless someone spots an error in the spreadsheet (that is, a PMID and PDF that don't match).
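
[Illustrative sketch, not the script given to CBIIT.] The "skip anything already fixed by hand" behavior can be had by updating only rows whose link is still empty. The example below uses Python's sqlite3 module with an in-memory database; the table name `article` and the columns `pmid` and `full_text_id` are hypothetical stand-ins, not the real EBMS schema.

```python
import sqlite3


def apply_fixes(conn, fixes):
    """Set the full-text link for each (pmid, file_id) pair, but only where
    no link exists yet. Returns the number of rows actually changed."""
    updated = 0
    for pmid, file_id in fixes:
        cursor = conn.execute(
            "UPDATE article SET full_text_id = ? "
            "WHERE pmid = ? AND full_text_id IS NULL",
            (file_id, pmid),
        )
        updated += cursor.rowcount
    conn.commit()
    return updated


if __name__ == "__main__":
    # Hypothetical data: one article still broken, one already fixed by hand.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE article (pmid TEXT PRIMARY KEY, full_text_id INTEGER)")
    conn.executemany(
        "INSERT INTO article VALUES (?, ?)", [("100", None), ("200", 55)]
    )
    print(apply_fixes(conn, [("100", 10), ("200", 20)]))
```

Because the `WHERE` clause requires the link to be empty, running the same fixes a second time changes nothing, and a manually uploaded PDF is never overwritten.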

Comment entered 2014-08-21 11:41:07 by alan

Yeon Choi of CBIIT ran the script successfully yesterday.

Together with the manual fixes that Bonnie made, all of the articles in active packets should now have correct links to full-text PDFs.

If everything looks good, I think we can close the issue.

Comment entered 2014-08-21 18:51:46 by alan

> All but one of these checked out - the one that didn't is a submitted
> manuscript as opposed to the published article ...

I see that I missed this comment.

Yes, it would be best to fix it manually.

Comment entered 2014-08-25 18:37:38 by Juthe, Robin (NIH/NCI) [E]

This is on prod, so it is no longer part of the 3.1 release. Should it be moved to 3.0?

Bonnie is spot-checking several of the links. If everything looks good, I'll close the issue.

Comment entered 2014-08-26 08:23:26 by Kline, Bob (NIH/NCI) [C]

I moved this ticket to 3.0.

Comment entered 2014-08-26 08:43:08 by Juthe, Robin (NIH/NCI) [E]

Bonnie checked several of the links on prod, and everything looked good. Thanks!

Attachments
File Name Posted User
pmidsWithNoPDFs.xls 2014-08-14 23:32:54
pmidsWithNoPDFsHyperlinked.xls 2014-08-19 11:45:18
