CDR Tickets

Issue Number 3171
Summary [CITE MS] Report of PubMed journals
Created 2010-06-03 11:55:38
Issue Type Improvement
Submitted By Beckwith, Margaret (NIH/NCI) [E]
Assigned To alan
Status Closed
Resolved 2010-06-09 14:03:42
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107499
Description

BZISSUE::4858
BZDATETIME::2010-06-03 11:55:38
BZCREATOR::Margaret Beckwith
BZASSIGNEE::Alan Meyer
BZQACONTACT::William Osei-Poku

Below is an explanation of what Cynthia needs help with to compare pubmed and psychinfo journals and identify journals unique to psychinfo. I wasn't sure who to assign it to so you can feel free to reassign if necessary.

From Cynthia:

The attached file is an excel spreadsheet of all the full journal titles currently indexed in psychinfo. I would like to have a similar list for pubmed in excel so that I can compare the two lists and identify journals included in psychinfo that are not included in pubmed. Unfortunately I have not been able to figure out how to generate a list of pubmed journals that I can get into excel, or a way to do the comparison other than manually.
Pubmed journal lists are available at the following link. The files are zipped and in various formats, most of which do not include the full journal title.
http://www.ncbi.nlm.nih.gov/entrez/citmatch_help.html#JournalLists

However, the CMS database also includes a list of all the pubmed journals in both full and abbreviated title form. This list is required by several CMS features. It may be easier to pull this list from the CMS and import it into excel.
Once imported into excel, is it possible to remove duplicate titles programmatically?

Comment entered 2010-06-03 11:58:46 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-06-03 11:58:46
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::1

Comment entered 2010-06-03 11:58:46 by Beckwith, Margaret (NIH/NCI) [E]

Attachment PsychInfoJrnls.xls has been added with description: Excel spreadsheet of PsychInfo journals

Comment entered 2010-06-03 16:54:22 by alan

BZDATETIME::2010-06-03 16:54:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

I downloaded the Pubmed journals and did the following:

Extracted all the journal titles.

Removed leading "The " from titles.
Pubmed uses them, PsychInfo does not.

Removed leading text in brackets.
The ones I looked at had non-English transliterations in
the bracketed text. PsychInfo used the English
translation. For example:
"[Boei eisei] Japanese Defense Forces medical journal"
became
"Japanese Defense Forces medical journal"

Sorted the file with:
Ignore case.
Only use alphanumerics for sorting, no punctuation.

Loaded it into Excel.

Saved it as an Excel 97 spreadsheet.

The Excel is attached.

This got it as close as I could think of to the ordering of the
PsychInfo journals.
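
For the record, the clean-up and sort steps above can be sketched roughly as follows. This is a minimal Python illustration, not the program actually used. The function names and the sample titles are hypothetical, and the fallback that leaves a fully bracketed title untouched is an assumption that the steps above do not mention.

```python
import re

def normalize_title(title: str) -> str:
    """Strip a leading bracketed block and a leading 'The ' from a title."""
    # Drop leading "[...]" text, e.g. a transliterated non-English title.
    stripped = re.sub(r"^\s*\[[^\]]*\]\s*", "", title)
    # Drop a leading article "The " (matched case-insensitively).
    stripped = re.sub(r"^the\s+", "", stripped, flags=re.IGNORECASE)
    stripped = stripped.strip()
    # Assumption: if nothing is left (the whole title was bracketed),
    # keep the original title rather than an empty string.
    return stripped or title.strip()

def sort_key(title: str) -> str:
    """Sort key that ignores case and everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

# Hypothetical sample titles, for illustration only.
titles = [
    "The Lancet",
    "[Boei eisei] Japanese Defense Forces medical journal",
    "Brain & cognition",
]
cleaned = sorted((normalize_title(t) for t in titles), key=sort_key)
for title in cleaned:
    print(title)
```

The sort key drops spaces as well as punctuation, so "Brain & cognition" sorts as "braincognition".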

I can try to make a programmatic comparison of the two lists. To
do that I would convert all the text to lower case (PsychInfo
uses upper case in places where Pubmed does not.)

If I do that, I think there will be a fair number of match
failures where the two lists of titles have the same journal but
with variations in the title strings. However if you think it
will help I'll do it and upload difference reports:

What was found in both lists.
What was found in the Pubmed list but not PsychInfo.
What was found in the PsychInfo but not Pubmed.

If you can think of any more pre-processing to make it more
accurate, I'll do that before making the comparison.
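
A minimal sketch of what such a title comparison could look like, assuming each list has already been reduced to plain title strings. The function name and the sample titles are hypothetical. It matches on lower-cased titles only, so any other variation in the title strings will still show up as a mismatch, as noted above.

```python
def compare_titles(pubmed_titles, psychinfo_titles):
    """Compare two title lists case-insensitively and return three reports."""
    pubmed = {t.lower() for t in pubmed_titles}
    psychinfo = {t.lower() for t in psychinfo_titles}
    return {
        "both": sorted(pubmed & psychinfo),
        "pubmed_only": sorted(pubmed - psychinfo),
        "psychinfo_only": sorted(psychinfo - pubmed),
    }

# Hypothetical sample titles, for illustration only.
report = compare_titles(
    ["Lancet", "Journal of anxiety disorders"],
    ["Journal of Anxiety Disorders", "Dreaming"],
)
for name, found in report.items():
    print(name, found)
```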

Comment entered 2010-06-03 16:54:22 by alan

Attachment PubmedSortNoTheNoBracket.xls has been added with description: Sorted Pubmed title list in Excel spreadsheet.

Comment entered 2010-06-03 16:56:59 by alan

BZDATETIME::2010-06-03 16:56:59
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

Cynthia is not in our Bugzilla user list. I suggest we
add her to it - if that's okay with everyone.

In the meantime, I'll send the spreadsheet and the description
to her by email.

Comment entered 2010-06-03 17:55:50 by alan

BZDATETIME::2010-06-03 17:55:50
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I sent the following email today and am entering it here for
the Bugzilla record.

------------------------------------------------------------
It turns out that there is a way to make a programmatic
comparison of the Pubmed and PsychInfo title lists.

I found the page on the PsychInfo website that has the journal
title download. In addition to the Excel spreadsheet, it also
offers the option of browsing the entire list as a web page (see
the link to "browse the entire list"). That list has both the
title and ISSN and/or eISSN. Pubmed also has the title and the
two ISSN fields.

I can save that web page and write programs to do the following:

From PsychInfo:
Parse the web page, extracting the ISSN, eISSN and title.

From Pubmed:
Parse the journal title list, extracting the same
information.

Write out the two lists in consistent format.

Try to match ISSN where it exists, or eISSN where it doesn't.

Produce lists of:

Titles and ISSN info for journals in PsychInfo but not
Pubmed.

Same for journals in Pubmed but not PsychInfo.

Journals in both.

I'm not sure how long it would take Cynthia to make the
comparisons by hand, but I think I could probably do the above in
about a day's work, and it should be very accurate because the
ISSNs should be more accurate matching keys than the titles.

Comment entered 2010-06-04 00:53:14 by alan

BZDATETIME::2010-06-04 00:53:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

Cynthia sent me an email as follows:
--------------------------------------------------------
Alan,
You just made my day!!! This sounds great!
The excel spreadsheet you sent will work if I have to do the comparison
manually which would take at least 20 hours, probably more. If you can
run the comparison programmatically in a day, then this will save a lot
of time.
I'll be back in the office Tuesday.
Thanks!
Cynthia
--------------------------------------------------------

After making her day, I couldn't disappoint her.

I've done two of the three parts for it, parsing each of the two
journal title lists and producing normalized representations of
each one.

I now have to write the program to compare the two by ISSN or
eISSN, depending on what's present.

I should finish that on Tuesday. It shouldn't be very hard.

Doing this is so much more fun than debugging the Citation
Management System. But I will get back to that on Tuesday.

Comment entered 2010-06-04 08:59:47 by Beckwith, Margaret (NIH/NCI) [E]

BZDATETIME::2010-06-04 08:59:47
BZCOMMENTOR::Margaret Beckwith
BZCOMMENT::6

It's fine with me to add Cynthia as a cc, especially since there are several CiteMS issues.

Comment entered 2010-06-08 15:27:14 by alan

BZDATETIME::2010-06-08 15:27:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

I'm adding Cynthia to the list of users receiving copies of
entries on this issue. I'll add her to other Citation
Management System issues as I make entries in them.

Comment entered 2010-06-08 15:42:03 by alan

BZDATETIME::2010-06-08 15:42:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8

I believe that this list contains journal titles that are common
to both PsychINFO and Pubmed. There are 1,439 common journals.

The list contains three tab separated fields:

ISSN EISSN Title

I generated the list as follows:

If two titles share an ISSN:
    They match. Include in the list.
Else if they share an EISSN:
    They match. Include in the list.
Else:
    Do not include.

Some journals in Pubmed, and one in PsychINFO, have no ISSN or
EISSN information. I did not include them in the matching.
There probably weren't any matches there anyway. We could try
matching titles, but it would be difficult because many are
non-English and the two databases use different rules for
representing non-English titles.
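
The matching rule described above could be sketched in Python roughly like this. It is an illustration under stated assumptions, not the program that produced the attached list: the Journal record, the function name, and the placeholder ISSN values are all hypothetical. ISSN is compared only with ISSN and EISSN only with EISSN, and records with neither value are set aside rather than matched.

```python
from typing import NamedTuple, Optional

class Journal(NamedTuple):
    issn: Optional[str]
    eissn: Optional[str]
    title: str

def match_journals(pubmed, psychinfo):
    """Return (common, psychinfo_only, no_ids) for the PsychINFO records."""
    # Index the Pubmed records by each identifier that is present.
    by_issn = {j.issn: j for j in pubmed if j.issn}
    by_eissn = {j.eissn: j for j in pubmed if j.eissn}
    common, psychinfo_only, no_ids = [], [], []
    for j in psychinfo:
        if not j.issn and not j.eissn:
            no_ids.append(j)  # nothing to match on, so leave it out
            continue
        # Try the ISSN first, then fall back to the EISSN.
        hit = by_issn.get(j.issn) if j.issn else None
        if hit is None and j.eissn:
            hit = by_eissn.get(j.eissn)
        if hit is not None:
            common.append(hit)  # keep the Pubmed version of the record
        else:
            psychinfo_only.append(j)
    return common, psychinfo_only, no_ids

# Placeholder records with made-up identifiers, for illustration only.
pubmed = [
    Journal("0000-0001", "0000-0002", "Journal A"),
    Journal(None, "0000-0004", "Journal B"),
    Journal(None, None, "Journal C"),
]
psychinfo = [
    Journal("0000-0001", None, "JOURNAL A"),
    Journal("0000-0003", "0000-0004", "Journal B (Online)"),
    Journal("0000-0005", None, "Journal D"),
    Journal(None, None, "Journal E"),
]
common, psychinfo_only, no_ids = match_journals(pubmed, psychinfo)
print([j.title for j in common])
print([j.title for j in psychinfo_only])
print([j.title for j in no_ids])
```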

The titles I've printed are the Pubmed versions. The version in
PsychINFO may differ in case, character set, handling of initial
articles, transliteration, or other ways.

The program also generates lists of titles in one set that are
not in the other, and titles without ISSN or EISSN values. If
these are needed, or if this list is needed in a different order
(e.g., by ISSN), I can provide that.

It should be easy to import this list into Excel if desired by
just reading it as lines of tab separated columns.
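
If a different order is wanted without going through Excel, the tab separated lines can also be read and re-sorted with a few lines of Python. This is a sketch that assumes one record per line in the order ISSN, EISSN, Title, with no header row. The sample lines and the variable names are hypothetical.

```python
import csv
import io

# Stand-in for the contents of the attached file, with placeholder values.
sample = "0000-0003\t\tJournal B\n0000-0001\t0000-0002\tJournal A\n"

# QUOTE_NONE keeps any quotation marks in titles as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
records = [tuple(row) for row in reader]

# Re-sort by the first field (ISSN) instead of by title.
records_by_issn = sorted(records, key=lambda r: r[0])
for issn, eissn, title in records_by_issn:
    print(issn, eissn, title, sep="\t")
```

To read the real attachment, the StringIO object would be replaced by something like open("jrnCommonByTitle.txt", encoding="utf-8", newline=""). The UTF-8 encoding is an assumption.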

Comment entered 2010-06-08 15:42:03 by alan

Attachment jrnCommonByTitle.txt has been added with description: Titles in both PsychINFO and Pubmed

Comment entered 2010-06-08 15:56:31 by alan

BZDATETIME::2010-06-08 15:56:31
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

Perhaps Cynthia really needs to know what is in PsychINFO that's
not in Pubmed.

Here's the list of those, in the same format.

Unlike Pubmed, PsychINFO uses Unicode to represent characters
that aren't in the ASCII character set used by Pubmed. You'll
need a Unicode/UTF-8 aware text editor (Notepad will do) to view
it properly.

Comment entered 2010-06-08 15:56:31 by alan

Attachment jrnPsychInfoOnlyByTitle.txt has been added with description: Journal titles in PsychINFO but not Pubmed

Comment entered 2010-06-08 15:58:29 by alan

BZDATETIME::2010-06-08 15:58:29
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

I'm marking the issue as resolved-fixed.

Comment entered 2010-06-08 16:00:59 by Boggess, Cynthia (NIH/NCI) [C]

BZDATETIME::2010-06-08 16:00:59
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::11

This is exactly what I needed! Thanks Alan!
(In reply to comment #10)
> I'm marking the issue as resolved-fixed.

Comment entered 2010-06-09 14:03:42 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2010-06-09 14:03:42
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::12

(In reply to comment #11)
> This is exactly what I needed! Thanks Alan!
> (In reply to comment #10)
> > I'm marking the issue as resolved-fixed.

Issue Closed!

Attachments
File Name Posted User
jrnCommonByTitle.txt 2010-06-08 15:42:03
jrnPsychInfoOnlyByTitle.txt 2010-06-08 15:56:31
PsychInfoJrnls.xls 2010-06-03 11:58:46
PubmedSortNoTheNoBracket.xls 2010-06-03 16:54:22
