Issue Number | 3154 |
---|---|
Summary | [CiteMs] CMS Import Utility NOT List error |
Created | 2010-05-19 14:25:09 |
Issue Type | Improvement |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2011-01-25 10:16:54 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.107482 |
BZISSUE::4841
BZDATETIME::2010-05-19 14:25:09
BZCREATOR::William Osei-Poku
BZASSIGNEE::Alan Meyer
BZQACONTACT::William Osei-Poku
This problem was first reported by Minaxi. Cynthia also tried it and got the same results as Minaxi. Here is Minaxi's initial email:
....Email from Minaxi begins......
There is a problem with the NOT journal filter in CMS import utility. When I add a new title to the list, it works fine if the journal is already in the CMS. But if I add a new journal title to CMS and then flag it for NOT filter, it does not work.
Recently I have added the following two new journals to CMS and then to adult NOT list. But when I run the “Find Record” report by selecting March 2010 review cycle, Adult treatment Board and check NOT List box, they do not show up:
Transfus Apher Sci
East Afr J Public Health
Thanks,
Minaxi
......Email of email.....
BZDATETIME::2010-05-19 14:29:32
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::1
(In reply to comment #0)
> ......Email of email.....
Sorry. I should have written ...End of email....
The attached document contains the steps (including screen shots) explaining the problem.
Attachment NOTjournal_5-17-2010.doc has been added with description: Not Journal list problem steps
BZDATETIME::2010-05-24 10:21:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2
Adding this to my queue.
BZDATETIME::2010-07-22 22:48:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3
I've been looking at this problem and am seeing some entries in
the journal title table in the database that surprised me.
In both of the cases cited in the descriptions of this problem,
there are duplicate entries in the journal title list. Here are
the duplicates:
------------------------------------------------------------------
ID: 6498
FullTitle: Transfusion and apheresis science : official journal
of the World Apheresis Association : official journal
of the European Society for Haemapheresis.
ShortTitle: Transfus Apheresis Sci
ID: 18853
FullTitle: Transfusion and apheresis science : official journal
of the World Apheresis Association : official journal
of the European Society for Haemapheresis.
ShortTitle: Transfus Apher Sci.
------------------------------------------------------------------
ID: 18851
FullTitle: East African journal of public health
ShortTitle: East Afr J Public Health
ID: 18852
FullTitle: East African journal of public health.
ShortTitle: East Afr J Public Health.
------------------------------------------------------------------
In each case, the difference between the two records appears to
be in the short title. For the East African title, the only
difference I see is the period at the end of one of them.
I don't know if this is involved in the problem Minaxi saw, but
it looks suspicious to me.
There are many other cases like these.
So, before digging further into the code I have some questions:
1. Are these almost duplicate entries legitimate, or are they
mistakes?
2. From a user's point of view, does it look like this might
account for the problem? Can Minaxi or someone look at the
problem again and see if there might be a confusion between
the two entries that is causing the apparent error?
3. If the duplicates are mistakes, should we try to prune the
database of such duplicates?
Identifying duplicates will be fairly easy. Removing them may be
considerably harder since, for any duplicate (or triplicate etc.)
title, there may be some articles linked to each of the
duplicated titles. I would probably need to write a program to
fix up the data, relinking all articles and all other entries in
the database from the titles we pick to delete to the titles we
pick to retain.
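A minimal sketch of how near-duplicate titles might be identified, assuming the journal rows can be read as (id, short title, full title) tuples. The function name and the normalization rule (ignore case and a trailing period) are illustrative assumptions, not the query actually run against the database, and the third sample row uses a shortened full title.

```python
from collections import defaultdict


def find_duplicate_titles(rows):
    """Group journal rows whose short titles differ only by case or a final period.

    rows: iterable of (journal_id, short_title, full_title) tuples
          (hypothetical layout, not the real table definition).
    Returns {normalized_short_title: [journal_id, ...]} for groups of two or more.
    """
    groups = defaultdict(list)
    for journal_id, short_title, _full_title in rows:
        key = short_title.strip().rstrip(".").strip().lower()
        groups[key].append(journal_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Sample rows based on the entries listed above.
rows = [
    (18851, "East Afr J Public Health", "East African journal of public health"),
    (18852, "East Afr J Public Health.", "East African journal of public health."),
    (6498, "Transfus Apheresis Sci", "Transfusion and apheresis science"),
]
duplicates = find_duplicate_titles(rows)
print(duplicates)
```

Relinking articles from the discarded title to the retained one would still need a separate step for every table that refers to a journal title.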
I don't know what the effects of such a change would be. I can
imagine that there would be some unintended consequences.
Perhaps there are historical reports, for example counts of
articles linked to a journal, that would become invalid -
although another way to think about this is that such counts are
already invalid and at least we could make the future counts
accurate. There might also be archived data somewhere that would
no longer accurately match what's in the pruned database.
It might be worth trying to put some kind of safeguard in place
to prevent duplicates in the future - though I'm not sure what
such a safeguard should do. If one of the entries has a one
character change in spelling or punctuation, that might be
perfectly legitimate and might justify two entries.
Or maybe I'm barking up the wrong tree. Perhaps what I'm seeing
in the database is legitimate and not involved in this problem
and I need to look elsewhere.
I've added Cynthia and William to the CC list for this issue.
We might want to add Minaxi too, but I would first need to get
Volker (who has the Bugzilla administrative privileges) to add
her as a Bugzilla user.
BZDATETIME::2010-07-26 11:08:20
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::4
Alan, Minaxi would be the best person to contact with questions regarding this issue. She identified this problem and maintains the NOT Jrl Lists.
A comment from Minaxi:
"We deliberately create two entries, one with period and one without
period.
Earlier there was no period at the end of journal title in PubMed and
later they added period - and that created confusion in the tables and
our retrieval of correct nos. in the Not List Report & Yes List
Report."
BZDATETIME::2010-07-27 18:51:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5
(In reply to comment #4)
> Alan, Minaxi would be the best person to contact with questions regarding this
> issue. She identified this problem and maintains the NOT Jrl Lists.
>
> A comment from Minaxi:
> "We deliberately create two entries, one with period and one without period.
> Earlier there was no period at the end of journal title in PubMed and later
> they added period - and that created confusion in the tables and our retrieval
> of correct nos. in the Not List Report & Yes List Report."
I'm pretty confused.
Maybe the best thing would be for me to go over to CIAT one day and work with Minaxi and William for an hour or so to get a better understanding of the NOT journal list and filter, the duplicate titles, and the filter problem.
If Minaxi has a remote terminal client (all Windows computers would normally have one) and the ability to make a VPN connection to NCI, I'll be able to also look at the database while we're doing that.
Is that a reasonable idea?
BZDATETIME::2010-07-28 10:23:57
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::6
That sounds like a good plan. Please let me know when you want to come.
Thanks, Minaxi
BZDATETIME::2010-07-28 10:34:49
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::7
(In reply to comment #5)
> Maybe the best thing would be for me to go over to CIAT one day and work with
> Minaxi and William for an hour or so to get a better understanding of the NOT
> journal list and filter, the duplicate titles, and the filter problem.
>
> If Minaxi has a remote terminal client (all Windows computers would normally
> have one) and the ability to make a VPN connection to NCI, I'll be able to also
> look at the database while we're doing that.
>
> Is that a reasonable idea?
Sounds good to me also, and our laptops do have remote terminal clients but may not have been configured. We can do that when you come over. We do have the ability to make VPN connections with the CISCO VPN client.
BZDATETIME::2010-07-29 11:34:40
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::8
Alan:
I will be able to demonstrate the problem to you after the meeting this
afternoon. If you still need to come here after that, we can then make
arrangements for Minaxi to demonstrate it to you as well.
BZDATETIME::2010-09-21 19:24:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9
I've been looking at this issue.
I discovered that there are 24 journals in the journal table that
are duplicates of each other. This duplication does not appear
to have anything to do with the issue with the NOT lists, but I
thought it ought to be reported anyway in case no one was aware
of the problem.
These duplicates are not caused by missing final periods. All of
the pairs of duplicate records have identical strings, character
for character, including final periods, in both the short and
full titles in each member of the pair. I would have thought we
wouldn't allow that to occur but, apparently, the software
doesn't forbid it.
I've attached a spreadsheet with the data. The first two columns
are the journal title ids for the matching records.
Attachment titles.xls has been added with description: Duplicate titles in the Citation Management System
BZDATETIME::2010-09-21 23:04:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10
Here is a progress report on this bug. I've added more people to
the CC list for this so that everyone will know what is going on.
I haven't got the ultimate cause yet but I have found the
proximate cause of the NOT list search failure. Some data
required for NOT list searches is missing from the database.
Although I don't know the full story by any means, I'm recording
what I found so far both as a way of keeping track of what I've
discovered, and just in case Minaxi, Cynthia, or someone else
will look at these notes and say "Aha! I see what happened!"
The NOT list search is retrieving information from the "bib"
table, using a number of search criteria to find them, including
that the "full_journal" title in the bib table matches the
"full_journal" title in the "lu_not_lists" table, which is the
table containing all of the NOT listed journals with associated
editorial boards.
The Word document from Minaxi noted three bib citations that
should have displayed in the search results but did not. The
three were:
202026
201890
201891
It turns out that all of these have null (i.e., empty) values for
"full_journal" title in the bib records. Because they have no
full_journal title there, they can't match on the full_journal
title stored in the NOT list table.
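A small sketch of why a null full_journal can never match, using an in-memory SQLite database as a stand-in for SQL Server. Only the table names and the full_journal column come from the notes above; the other columns, the ids 1 and 2, and the board value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bib (id INTEGER PRIMARY KEY, full_journal TEXT);
    CREATE TABLE lu_not_lists (full_journal TEXT, board TEXT);
    INSERT INTO bib VALUES (1, 'East African journal of public health');
    INSERT INTO bib VALUES (2, NULL);  -- imported without a full title
    INSERT INTO lu_not_lists
        VALUES ('East African journal of public health', 'Adult Treatment');
""")

# An equality join silently drops rows whose full_journal is NULL,
# because NULL = anything is never true in SQL.
matched = [row[0] for row in conn.execute(
    "SELECT b.id FROM bib b "
    "JOIN lu_not_lists n ON b.full_journal = n.full_journal "
    "ORDER BY b.id")]

# The affected rows can be listed directly with IS NULL.
missing = [row[0] for row in conn.execute(
    "SELECT id FROM bib WHERE full_journal IS NULL ORDER BY id")]

print("matched:", matched)
print("missing full_journal:", missing)
```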
Here are some statistics concerning the dimensions of the
problem, taken from the production database.
Total number of entries in the bib table: 208,672
Number of those with null journal_title: 13,339
That's 6.4% of the whole bib table. It's a lot of missing data.
I wondered if maybe this started happening recently, but it
didn't.
Earliest date_input with null journal_title: 2002-07-12
When I looked at the earliest 10 entries, they were all from
2002.
I wondered if maybe the problem occurred in the past and is not
an ongoing problem, but it appears that it is ongoing. Checking
the production database I see:
Number of bib records with date input in Sept. 2010: 2,513
Number of those that have null full_journal fields: 204
None of the 13,000+ entries in the bib table that have no
journal_titles can ever appear on a NOT list search. They will
never match a NOT list table entry. It's therefore very possible
that this problem dates all the way back to 2002, or else the
database was corrupted at some time since then, causing
journal_title fields that used to exist to disappear from some
bib table rows, as well as new entries to be created improperly.
I wondered if certain journals were all right or all wrong, and I
think some may be. But the problem can affect individual bib
entries differently for the same journal. For example, of 30
entries in the bib table for "Transfus Apher%", six had
full_journal titles and 24 did not. All six of the bib entries
entered from 2002 through 2003 had full_journal titles. All of
the next 24, entered from 2005-2010 did not. However another
journal, "Medical Anthropology" has two null full_journal fields
in among non-null entries entered before and after. So the
problem is not date related in any obvious way.
It might be possible to fix the NOT list searching problem by
populating the full_journal fields for all rows in the bib table.
However it would be a complicated process because:
1. Unless we discover the cause of how this happens, it will
happen again.
We know that whatever bug caused this to happen is still
working in September, 2010. Data errors of this type are
ongoing.
Knowing that the problem is still occurring on a large
scale, we really can't afford to fix the data without first
fixing the bug that causes the data loss.
If we do fix the bug and the data, I'd be inclined to put a
SQL Server enforced constraint on the data to require that a
bib.full_journal field be non-null. If we do that, any
program that tries to create a new bib entry without a title
will cause SQL Server to raise an error.
I'd have to check to be sure that the original programmers
don't ignore SQL Server errors. With the constraints I have
in mind, the program will fail with an error message if a
null bib.full_journal is created. If the programmers ignore
errors, then the program will still fail, but in less
obvious and predictable ways.
2. There is no unique key linking the bib row to the
corresponding journal table row.
The link is made on the bib.full_journal field linking to
the lu_journalTitles.full_title field, but 13,000+ bib rows
don't have full_journal fields, and 24 of the full_titles are
non-unique (see comment #9).
In order to populate the full_journal field we'd have to
figure out which journal record has the data for it. The
only clue in most cases would be the leading substring of
the bib.journal field, up to the beginning of the date. But
that runs afoul of the 24 duplicate journal entries and
another 201 duplicate entries with different full_title and
same short_title. We might have to make some arbitrary
assignments.
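To illustrate the constraint idea from point 1, here is a sketch using SQLite in place of SQL Server. The table is reduced to two columns and insert_bib is a hypothetical helper, not a function in the import program. It shows the database rejecting a row with no full_journal and the caller noticing the error rather than ignoring it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL makes the database itself refuse rows without a full title.
conn.execute(
    "CREATE TABLE bib (id INTEGER PRIMARY KEY, full_journal TEXT NOT NULL)")


def insert_bib(conn, bib_id, full_journal):
    """Insert one bib row; return True on success, False if the constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bib (id, full_journal) VALUES (?, ?)",
                (bib_id, full_journal))
        return True
    except sqlite3.IntegrityError as exc:
        print(f"rejected bib {bib_id}: {exc}")
        return False


ok = insert_bib(conn, 1, "American journal of therapeutics.")
rejected = insert_bib(conn, 2, None)
row_count = conn.execute("SELECT COUNT(*) FROM bib").fetchone()[0]
print(ok, rejected, row_count)
```

Whether the existing import code surfaces such an error or swallows it would still have to be checked.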
Using the full journal title as a linking field between tables
was a most unfortunate design decision. It violates the standard
database principles of storing long and/or variable data in only
one place and using short, unique keys for linking - with the
database itself enforcing uniqueness. However, the decision was
made long ago and changing it now would require very significant
changes to many parts of both the web and the import system.
I'm worried that NOT list search failure may be just the tip of
an iceberg created by the large numbers of missing and non-unique
journal titles. What else isn't working because of these
problems? However the system seems to be meeting people's needs.
Maybe we can patch it up and keep it working for some time to
come until it can be rewritten and integrated into other OCE
systems.
Those are my discoveries so far. If anyone has any ideas or
additional clues, please add them to this Bugzilla issue.
I'll keep working on it. My next step might be to do some
experiments with the dev database. I could load dummy data into
the null bib.full_journal fields, then install a database
constraint to prevent new null entries from being created, then
import a month's data and see where it breaks. But I'd like to
brainstorm with anyone who has any other ideas before I try that.
BZDATETIME::2010-09-22 09:04:17
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11
(In reply to comment #10)
> Those are my discoveries so far. If anyone has any ideas or
> additional clues, please add them to this Bugzilla issue.
Just wondering: The titles that are missing don't contain any special characters, do they?
BZDATETIME::2010-09-22 09:25:28
BZCOMMENTOR::Bob Kline
BZCOMMENT::12
(In reply to comment #11)
> Just wondering: The titles that are missing don't contain any special
> characters, do they?
Doesn't seem likely, since he says the same journal title is present sometimes and absent others.
BZDATETIME::2010-09-22 12:02:34
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13
(In reply to comment #11)
> (In reply to comment #10)
> > Those are my discoveries so far. If anyone has any ideas or
> > additional clues, please add them to this Bugzilla issue.
>
> Just wondering: The titles that are missing don't contain any special
> characters, do they?
I didn't see any special characters, and the same titles are stored
in other tables which, I presume, are configured for the same
character set.
But it was a good idea.
BZDATETIME::2010-09-22 17:45:06
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::14
The missing full journal titles and the duplicates may be related. Users noticed the absence of the full journal title for some of the records and reported the issue so that it could be fixed (at Lockheed). Some of the tables were updated at that time, and it is possible that the update created the duplicates. I agree with you that duplicates should not have been allowed in the first place.
In terms of the problem of the missing journal titles, our guess is that it has to do with the import process or a system error of some kind, because for some of them we know that the full journal titles exist in PubMed but when they were imported into the CMS, the full title might not have been imported.
BZDATETIME::2010-09-23 11:12:52
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::15
I have attached a proposed solution from one previous programmer to the missing journal titles problem
Attachment MissingFullJounalTitle (2).doc has been added with description: Missing full journal title proposed solution
BZDATETIME::2010-09-23 18:54:38
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16
I reviewed the previous programmer's notes. They were of some use.
At our CDR status meeting we decided not to address the
underlying problems in the system. The previous programmer
wasn't really proposing to solve them either.
I will concentrate on two changes to ameliorate the NOT list
problem, and maybe other problems if they depend on the
full_journal column being present in a bib row. These are
similar to some of the previous programmer's proposals.
1. I'll write a program to try to populate null full_journal
fields.
I don't know yet how many I can populate accurately or with
what degree of confidence, but I'll experiment to find out.
2. I'll modify the program that currently populates the
full_journal field during an import to make it more
forgiving.
At a minimum, I should be able to make it insensitive to
final periods.
Beyond that, I'm not sure yet what can be achieved, but
hopefully, we can reduce the dimensions of the problem.
I'll work using the dev database.
At some point, it will be helpful to me if someone can send me an
import Medline text file, preferably from September, to use in
testing. Maybe someone at CIAT can attach one to this issue.
BZDATETIME::2010-09-24 11:20:12
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::17
This is a CAM editorial board file for summary Acupuncture.
Attachment acupunt_TEST_Sept10.txt has been added with description: Medline text file
BZDATETIME::2010-10-05 18:42:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18
I've done some more research on the problem and have more
information and ideas to report.
How the System Currently Works
------------------------------
The import system reads files of Pubmed records in Medline format
for import into the database. Each imported record has a short
journal title that identifies the journal from which the article
is cited. These look something like this in the Medline
formatted data:
"SO - Am J Ther. 2010 Sep-Oct;17(5):476-86."
The import program reads that line, chops off the leading tag
prefix and everything from the first period onward to get the
short title. In the above case this would be:
"Am J Ther"
It then searches for that short title in the journals table
(named "lu_journalTitles").
If it finds a match in that table, it extracts the full journal
title from the journals table (in this case it is "American
journal of therapeutics.") and adds it to the corresponding bib
article record so that the bib record contains both a short and
long title for the journal of the cited article.
If there is no match, the full journal title field in the bib
record is empty (null), and no match can be made on the NOT list,
which requires a full journal title for matching.
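The lookup described above can be sketched roughly as follows. This is not the import program's code: the function names are hypothetical, and a plain dict stands in for the lu_journalTitles table.

```python
def short_title_from_so(so_line):
    """Return the short journal title from a Medline 'SO - ...' line.

    Drops the tag prefix (everything up to the first hyphen) and
    everything from the first period onward.
    """
    _tag, _sep, value = so_line.partition("-")
    return value.split(".", 1)[0].strip()


def lookup_full_title(so_line, journals):
    """Return the full title for the SO line, or None if there is no match.

    journals: dict mapping short title -> full title (stand-in for the
    journals table). None models the null full_journal field.
    """
    return journals.get(short_title_from_so(so_line))


journals = {"Am J Ther": "American journal of therapeutics."}

good = "SO - Am J Ther. 2010 Sep-Oct;17(5):476-86."
print(short_title_from_so(good))
print(lookup_full_title(good, journals))

# With no period after the title, the "short title" swallows the issue
# information and the lookup fails.
bad = "SO - Curr Urol Rep 2001 Jun;2(3):253-8."
print(lookup_full_title(bad, journals))
```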
Why Matching Fails
------------------
I've found two reasons why matching fails:
1. The journal is simply not in our database.
It appears that journal titles are not added automatically to
the database, it requires a human being to add them, one at a
time, using an interface in the web based system, not the
import system. If an import file contains a journal title
that's not yet in our database, there is no match and the NOT
list will not work for that article.
I looked at the code in the web based system to see whether
adding a new journal title causes the program to go back and
find any articles that need updating. I saw no evidence that
it did that. Once the article record is incorrect, it
appears to stay incorrect even after the underlying problem
is fixed.
2. Matching fails due to improper formatting of the bib journal
short title.
For some number of records, the short journal title stored
in the bib record for an article has no period separating the
title proper from the issue information. For example:
"Curr Urol Rep 2001 Jun;2(3):253-8."
This should look like the following, but doesn't:
"Curr Urol Rep. 2001 Jun;2(3):253-8."
^
I don't know whether this problem was caused by a bug or
format change in Pubmed or by a bug in the import program.
However, wherever the problem was, it must have been fixed by
now. The last article loaded with this error was from Dec. 21, 2005.
There may be other causes, but those are the two that I have
found. All the examples I looked at were caused by one of those
two.
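For the second cause, one possible way to detect and repair the missing period is sketched below. It assumes that the first " 19" or " 20" in the field marks the start of the publication year, which could misfire on a journal whose own title contains such a substring. The function name is hypothetical.

```python
import re

# First occurrence of a space followed by "19" or "20" (assumed start of the year).
_YEAR_START = re.compile(r" (?:19|20)")


def add_missing_period(journal_field):
    """Insert a period before the year if the title is not already followed by one."""
    match = _YEAR_START.search(journal_field)
    if match is None or match.start() == 0:
        return journal_field  # nothing recognizable to fix
    if journal_field[match.start() - 1] == ".":
        return journal_field  # already correctly formatted
    return journal_field[:match.start()] + "." + journal_field[match.start():]


print(add_missing_period("Curr Urol Rep 2001 Jun;2(3):253-8."))
print(add_missing_period("Am J Ther. 2010 Sep-Oct;17(5):476-86."))
```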
Fixes
-----
I'm inclined to make the following fixes:
1. Load indexed Pubmed journal titles into the database.
I downloaded the list from Pubmed and examined it. If I did
it right, there are 5486 separate journals that are currently
indexed.
We have only a fraction of these in our journal title list.
Most of the rest probably don't have to do with cancer. So
we have two choices if we adopt this fix:
a. I can add just those titles that are cited in the bib
article table but not found in our journal title table.
See 3. below.
b. I can add all of them.
I don't know if there is a downside to loading all of
them. It might reduce the number that we have to add by
hand later. Or it might have unintended side effects.
2. Fix all of the article journal titles that do not have
periods in the right place.
In the example above I would change:
"Curr Urol Rep 2001 Jun;2(3):253-8."
to:
"Curr Urol Rep. 2001 Jun;2(3):253-8."
In other words, I'd put a period in the right place, just
before the first substring beginning " 19" or " 20".
3. Run a script to match everything and update the bib table.
The program would find every article that lacks a full
journal title. For each one it would attempt the following:
Isolate the short title string up to the first period.
Try to find a match in the journal table.
If we've loaded all Pubmed journals into the journal
table then:
We either find a match or don't. Either way,
we're done.
Else:
If no match found:
Look in the list of indexed Pubmed titles for
a match on a title that isn't in our
database.
If a match is found:
Add a journal title to our database for
that journal.
Update the bib article record to include
the full journal title.
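The matching script in fix 3 could look roughly like this. It is a sketch over in-memory structures only: the dict layouts, the function name, and the "Example J" entry are made up for illustration, and a real version would read and update the bib and journal tables instead.

```python
def backfill_full_titles(bib_rows, journals, pubmed_titles):
    """Fill in missing full_journal values; return the rows that stay unmatched.

    bib_rows:      list of dicts with 'journal' (short title plus issue info)
                   and 'full_journal' (None when missing).
    journals:      dict short title -> full title for our journal table;
                   new titles found in the Pubmed list are added to it.
    pubmed_titles: dict short title -> full title from the Pubmed indexed list.
    """
    unmatched = []
    for row in bib_rows:
        if row["full_journal"] is not None:
            continue  # already has a full title
        short = row["journal"].split(".", 1)[0].strip()
        full = journals.get(short)
        if full is None and short in pubmed_titles:
            # Title is indexed in Pubmed but not yet in our database: add it.
            full = pubmed_titles[short]
            journals[short] = full
        if full is None:
            unmatched.append(row)
        else:
            row["full_journal"] = full
    return unmatched


journals = {"Am J Ther": "American journal of therapeutics."}
pubmed_titles = {"Example J": "Example journal."}  # hypothetical entry
bib_rows = [
    {"journal": "Am J Ther. 2010 Sep-Oct;17(5):476-86.", "full_journal": None},
    {"journal": "Example J. 2010;1(1):1-2.", "full_journal": None},
    {"journal": "Unknown J. 2010;1(1):1-2.", "full_journal": None},
]
unmatched = backfill_full_titles(bib_rows, journals, pubmed_titles)
print([row["journal"] for row in unmatched])
```

If all Pubmed indexed titles have already been loaded into the journal table, the second lookup simply never finds anything new.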
When all that is done, I believe there will still be some
articles with short journal titles that don't match anything. I
don't know the exact numbers, but I expect there could be around
250 short titles involved that are not in the Pubmed indexed
list.
I can think of various possible reasons why these articles exist:
Maybe the journals are no longer indexed in Pubmed.
Maybe they were never indexed in Pubmed.
Maybe they were from journals that underwent a name change,
or a change of Pubmed abbreviation.
Once we've fixed everything that can be fixed programmatically, I
can get an accurate list of the remaining articles for further
research.
Future Issues
-------------
The "Fixes" I've suggested above are a one time fix. They will
fix thousands of records, but the problem can gradually get worse
again. To make a more permanent fix requires a lot more work.
I've thought of several ways to reduce the number of errors in
the future. These include:
1. Load all Pubmed indexed titles, not just the ones we use.
This may reduce the problem in the future of adding an
article from Pubmed for a journal that isn't yet in our
database. It won't help with new journals added by Pubmed
after we load the data.
2. Load all Pubmed titles, indexed or not.
The total journal file at NLM is many times larger than the
indexed list. I have no idea if it is appropriate to add
these or if it will help. Cynthia or Minaxi may know.
3. Combine all of the above steps into a repeatable program.
I could write a program that would do all of the above
automatically, but that could be much more work if it has to
automate everything from downloading the Pubmed journal list
onward.
4. Periodically re-do the fixes by hand.
For example if, after a year, we haven't started on a new
system to replace the Citation Management System, I could
re-do the fixes by hand.
Alternatively, if we build a new system, I could re-run fixes
before we convert the database for loading into the new
system - though possibly a new system won't require this at
all.
5. Institute new processes to try to find the problems for hand
updates.
I could write a program that would find all of the SO titles
in an input file that aren't in our database. The import user
could then add those titles first before running the import.
6. Fix the import program.
We've already decided not to do major fixes. I could make a
smaller fix so that, when a new journal title is added, the
system tries to find any bib articles that have the short
title but not the long one and updates them.
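The pre-import check in option 5 might be sketched as below. The function name is hypothetical, the set of known short titles would in practice come from the journal table, and "Example J" is a made-up title.

```python
def missing_so_titles(medline_text, known_short_titles):
    """List the short titles in SO lines that are not yet in our journal table.

    medline_text:       contents of a Medline-format import file.
    known_short_titles: set of short titles already in the database.
    Returns the missing titles in order of first appearance, without repeats.
    """
    missing = []
    for line in medline_text.splitlines():
        tag, sep, value = line.partition("-")
        if not sep or tag.strip() != "SO":
            continue  # not a source line
        short = value.split(".", 1)[0].strip()
        if short and short not in known_short_titles and short not in missing:
            missing.append(short)
    return missing


sample = "\n".join([
    "PMID- 1",
    "SO  - Am J Ther. 2010 Sep-Oct;17(5):476-86.",
    "PMID- 2",
    "SO  - Example J. 2010;1(1):1-2.",
    "PMID- 3",
    "SO  - Example J. 2010;1(2):3-4.",
])
to_add = missing_so_titles(sample, {"Am J Ther"})
print(to_add)
```

The import user could add the listed titles through the web interface and then run the import.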
I'm not sure yet what would be best, and others may well suggest
approaches that I haven't thought of. I'm hoping users have some
ideas.
I won't do any more work until we've discussed all of this and
all of the users have had a chance to think about the problems.
BZDATETIME::2010-10-06 10:46:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::19
(In reply to comment #18)
> I've done some more research on the problem and have more
> information and ideas to report.
> ....
Good analysis work.
> 5. Institute new processes to try to find the problems for hand
> updates.
>
> I could write a program that would find all of the SO titles
> in an input file that aren't in our database. The import user
> could then add those titles first before running the import.
This seems like low-hanging fruit (a "quick win").
BZDATETIME::2010-10-06 17:06:28
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::20
Minaxi and I have discussed your analysis and possible solutions. We
have not identified any potential problems with the solutions presented
but have two comments:
#1. The missing period at the end of selected abbrev journal titles was
due to a previous pubmed format change. Several years ago this same
missing period was the cause of a citation importing problem. The
problem with importing was resolved but at the time we were unaware of
any problem this missing period was causing with the journals in our NOT
lists.
#2. We only need titles that are indexed in PubMed because
our searches are limited to citations in the PubMed database.
BZDATETIME::2010-10-12 15:56:44
BZCOMMENTOR::Alan Meyer
BZCOMMENT::21
Question:
What should I load into the journal titles table in the Citation
Management System:
1. All titles indexed in Pubmed?
or
2. Only those titles indexed in Pubmed that are referenced by
articles in our database but aren't yet in the journal titles
table?
Thanks.
BZDATETIME::2010-10-14 10:39:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::22
(In reply to comment #21)
> Question:
>
> What should I load into the journal titles table in the Citation
> Management System:
Based on emails from Cynthia and Minaxi, the solution I'm working
on is to load all titles indexed in Pubmed.
What I plan to do is write some programs to match the current
list of Pubmed indexed titles against the list of titles found in
the Citation Management System. I'll output a number of
different lists of titles based on what kind of match is found:
short titles match but not long
long titles match but not short
If there are any in the above categories, we'll need to
figure out why and decide what to do about them.
short and long titles both match
These are good. No action is required.
no match at all
These are the ones I'll load into the Citation Management
System.
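A rough sketch of the classification, assuming exact string comparison and CiteMS records held as (id, short title, long title) tuples. The function name, the category labels, and the "other" bucket for combinations not covered by the four lists above are illustrative assumptions.

```python
def classify_title(pubmed, citems):
    """Classify one Pubmed title against the CiteMS journal titles.

    pubmed: (short_title, long_title) from the Pubmed indexed list.
    citems: list of (rec_id, short_title, long_title) from the CiteMS.
    Returns 'both', 'short_only', 'long_only', 'none', or 'other'.
    """
    p_short, p_long = pubmed
    short_hit = any(s == p_short for _rec_id, s, _l in citems)
    long_hit = any(l == p_long for _rec_id, _s, l in citems)
    both_hit = any(s == p_short and l == p_long for _rec_id, s, l in citems)
    if both_hit:
        return "both"        # good, no action required
    if short_hit and long_hit:
        return "other"       # short and long match on different records
    if short_hit:
        return "short_only"  # needs investigation
    if long_hit:
        return "long_only"   # needs investigation
    return "none"            # candidate for loading into the CiteMS


citems = [
    (18851, "East Afr J Public Health", "East African journal of public health"),
]
print(classify_title(
    ("East Afr J Public Health", "East African journal of public health"), citems))
print(classify_title(("East Afr J Public Health", "Some other long title"), citems))
print(classify_title(("Example J", "Example journal"), citems))
```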
BZDATETIME::2010-10-22 00:21:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::23
I wrote some programs to do the comparisons between the Citation
Management System journal titles and the Pubmed indexed journals
list. For better or worse, I disregarded final periods on all of
the matches. I spent a little time trying to figure out whether
it was better to match with them or not, or to try both ways, but
it made my brain hurt and I'm not sure how much more or less
information it would give us.
The output results are pretty complicated. Matching can occur on
the short title or the long title. There can be multiple matches
on one or the other, for example two records with the same short
title or two with the same long title. This can happen in Pubmed
as well as in the CiteMS. It's even possible for one record to
match on the short title but not on the long, and a different
record to match on the long but not the short, etc.
The outputs are in five files which I will attach in subsequent
comments. The form of the outputs is as follows. I've put in
the line numbers here so I can explain each line below. There
are no line numbers in the files.
1 'short title' = 'long title'
2 short CiteMS: (RecID, 'short title', 'long title')
3 long CiteMS: (RecID, 'short title', 'long title')
4 full CiteMS: (RecID, 'short title', 'long title')
Explanation:
Line 1: Pubmed record.
The short title comes from the "MedAbbr" value in the
downloadable version of the Pubmed indexed title list.
The long title comes from the "JournalTitle" value.
Line 2: CiteMS record matching on short title.
Some of the output files have these, some don't. If
present, the three fields in parentheses are:
RecId: Internal CiteMS unique ID for the title.
short title: CiteMS version of the short title.
long title: CiteMS version of the long title.
The CiteMS short title matches the Pubmed short title.
Line 3: CiteMS record matching on long title.
Same format as line 2, but the match is on the long title
only.
Line 4: Match on full title.
Same format, both long and short titles match.
These outputs only exist where there is a multiple match.
In other words, I found a record that exactly matches
both short and long titles in Pubmed and CiteMS but, in
addition, I found a different CiteMS record that matches
one or the other of the titles. There are 79 of these,
all in the multi-match output file.
Some of the files have no CiteMS record shown. They are:
The unique matches file - the best possible outcome.
Matching CiteMS records aren't shown because they're
identical to the Pubmed records.
The no match file. There are no matching CiteMS records to
show.
Some files have just short or just long, or both short and long
matches. Only the multi-match output has all three kinds of
output, short, long and full. The multi-match output file wasn't
mentioned in comment #22. I didn't discover that I needed it
until I had written the programs.
Here are the statistics on the output file sizes. The output
counts shown are counts of Pubmed records:
Input counts:
Titles read from CiteMS = 18838
Titles read from Pubmed = 22993
Output counts:
Full matches = 17619
Match on short title only = 545
Match on long title only = 240
No match on either title = 4217
Multi-matches = 372
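As a quick consistency check, the five output counts add up to the number of titles read from Pubmed, which suggests that each Pubmed record landed in exactly one output file. The dictionary keys below are just labels for this check.

```python
pubmed_titles_read = 22993

output_counts = {
    "full matches": 17619,
    "short title only": 545,
    "long title only": 240,
    "no match": 4217,
    "multi-matches": 372,
}

total = sum(output_counts.values())
print(total, total == pubmed_titles_read)
```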
I propose to load the "No match..." file (there is no match on
either short or long Pubmed title in the CiteMS) into the CiteMS,
but will wait until everyone has had a chance to look at the
results and think about them. I don't know whether we want to
attempt any cleanup on the others or not.
Output files will be attached in the following comments.
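A minimal sketch of the title comparison described in this comment. It is not the program that produced the attached files. The function names, the tuple layout, and the handling of more than one full match are assumptions made for illustration; the only rules taken from the thread are that titles are compared exactly, disregarding a final period, and that a Pubmed title lands in the multi-match file when different CiteMS records match it in different ways.

```python
def normalize(title):
    """Compare titles exactly, ignoring surrounding spaces and one final period."""
    title = title.strip()
    return title[:-1] if title.endswith(".") else title


def classify(pubmed_short, pubmed_long, citems_rows):
    """Return the output category for one Pubmed title.

    citems_rows is an iterable of (rec_id, short_title, long_title) tuples
    (a hypothetical layout for rows read from the CiteMS journal table).
    """
    p_short, p_long = normalize(pubmed_short), normalize(pubmed_long)
    full, short_only, long_only = [], [], []
    for rec_id, c_short, c_long in citems_rows:
        same_short = normalize(c_short) == p_short
        same_long = normalize(c_long) == p_long
        if same_short and same_long:
            full.append(rec_id)
        elif same_short:
            short_only.append(rec_id)
        elif same_long:
            long_only.append(rec_id)
    if not (full or short_only or long_only):
        return "no match"
    if len(full) == 1 and not (short_only or long_only):
        return "full match"
    # Assumption: anything mixing a full match with other matches, or a
    # short match on one record with a long match on another, is "multi".
    if full or (short_only and long_only):
        return "multi-match"
    return "short only" if short_only else "long only"
```

Called once per Pubmed record, a function like this would yield the five counts listed above, one category per record.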
BZDATETIME::2010-10-22 00:23:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::24
These are exact, unique matches, disregarding any final period.
Attachment CiteMS_PubMed_Matches.txt has been added with description: Exactly matching Pubmed / CiteMS journal titles
BZDATETIME::2010-10-22 00:24:45
BZCOMMENTOR::Alan Meyer
BZCOMMENT::25
These are the ones we expect to load into the CiteMS.
Attachment CiteMS_PubMed_NoMatches.txt has been added with description: Pubmed journals with no matches at all in the CiteMS
BZDATETIME::2010-10-22 00:26:25
BZCOMMENTOR::Alan Meyer
BZCOMMENT::26
Sometimes there is more than one CiteMS match for a Pubmed
short title.
Attachment CiteMS_PubMed_ShortMatches.txt has been added with description: Journals matching on the short title but not long
BZDATETIME::2010-10-22 00:28:19
BZCOMMENTOR::Alan Meyer
BZCOMMENT::27
Again, there can be more than one match here.
Attachment CiteMS_PubMed_LongMatches.txt has been added with description: Journals matching on the long title but not the short
BZDATETIME::2010-10-22 00:29:51
BZCOMMENTOR::Alan Meyer
BZCOMMENT::28
These are the ones that may have a match on one record on
short title and a different one on long title, or one on both
titles and a different one on a short or a long only.
Attachment CiteMS_Pubmed_Multi.txt has been added with description: Journals with complex matching matrices
BZDATETIME::2010-10-26 10:13:49
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::29
I agree with loading the "No Match.." list.
As for the clean up of the short and long match lists, Minaxi and I will
be discussing this today.
BZDATETIME::2010-10-29 00:56:14
BZCOMMENTOR::Alan Meyer
BZCOMMENT::30
We discussed the issues concerning the cleanup of the journal
titles at our CDR status meeting today.
Here are my suggestions for what to consider regarding the
short, long, and multi-match lists.
I think the first thing to do is to figure out what is to be
gained by a cleanup, and whether it is worthwhile. Here are some
considerations that I have thought of in that regard. Cynthia
and Minaxi and the other users probably have more ideas.
If we clean up data, will it be possible to keep it clean?
Or will we have to do it again?
What problems will the system have if we don't clean up all
the errors? We know that some NOT list entries won't work,
and I think some reports that group information by journal
title will be inaccurate. I don't know whether those issues
are serious or if there are any other implications.
If we delete duplicates, what will we do with bib records
that use the deleted short titles or full titles? Will we
find them and replace them with the strings that were not
deleted? That might be much easier to do with the full
titles than the short titles, which are incorporated in the
article citations, and which are significant in Pubmed
searches.
Are there any other tables in the system besides the article
table ("bib") and the journal title table
("lu_journalTitles") that will need to be updated?
Where NLM has multiple records for what is apparently the
same journal (e.g., same ISSN), do we want to mirror NLM or
try to do something on our own?
What rules will we follow in deciding what to delete and what
to retain? How will a user (or a program) decide?
If we do decide that a cleanup is warranted, what parts of
the cleanup should be done by a program and what parts by a
human? If there are any parts that have to be done by a
human, is there a way to use programs to make it easier?
One example I've thought of is to have a program produce
a spreadsheet containing listings of all the duplicates.
A human could simply mark the ones to be deleted on the
spreadsheet. Another program could then read the revised
spreadsheet and perform the edits in the database.
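The spreadsheet round trip suggested above could look roughly like the following sketch. It uses CSV as the spreadsheet format and stops at returning the marked record IDs; the column names, the "x" marker, and both function names are hypothetical, and nothing here touches the database.

```python
import csv
import io


def write_duplicates(rows, out):
    """Write duplicate journal records with an empty 'delete' column for the reviewer."""
    writer = csv.writer(out)
    writer.writerow(["rec_id", "short_title", "long_title", "delete"])
    for rec_id, short_title, long_title in rows:
        writer.writerow([rec_id, short_title, long_title, ""])


def read_marked_ids(inp):
    """Return the record IDs the reviewer marked with 'x' in the 'delete' column."""
    return [int(row["rec_id"]) for row in csv.DictReader(inp)
            if row["delete"].strip().lower() == "x"]


if __name__ == "__main__":
    buf = io.StringIO()
    write_duplicates([(1, "Jrnl A", "Journal A"), (2, "Jrnl A.", "Journal A")], buf)
    print(buf.getvalue())
```

A second program would then take the IDs from `read_marked_ids` and decide what to do with the bib records that still use the deleted titles, which is the open question raised earlier in this comment.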
It is possible that we'll be able to replace the Citation
Management System in the future, incorporating all of the lessons
learned from the old one, updating the technology, and also
preparing the way for coming changes like electronic
communications with board members and distribution of electronic
copies (e.g. PDFs) of documents, and perhaps closer links to
other systems at OCE such as Summary maintenance systems (CDR)
and possible future Board Management systems. If so, one of the
important issues we'll need to think about is what we want to do
with journals.
Perhaps the most basic questions are:
What is one journal?
Is it what is assigned one ISSN, or some other identifier
that retains its identity across title changes?
Is it what NLM calls a journal?
Is it just whatever is printed (or the electronic
equivalent for e-journals) on the cover?
Will there need to be journals in the system of the future
that don't appear in the NLM database?
Having worked as a professional librarian myself and worked on
systems that managed journals, I'm aware of how complex they are.
They change titles, they can have specially titled parts and
supplements, they can fork into two titles, each with the same
parent title or coalesce two titles into one. They can disappear
and re-appear with ambiguous links between old and new. It's not
an easy problem by any means. In designing a new system we'll
probably have to compromise between producing a super clean
database that can absorb too much labor to keep it clean, and a
database with too much ambiguity to serve our real needs. We'll
have to decide how much we want to rely on NLM or some other
journal authority that has different needs than ours and can
change things without consulting us, and how much we want to do
on our own, with all of the attendant labor involved.
If we build a new system, we'll also have to move data from the
old system to the new one. That also affects cleanup issues.
When, how, and how thoroughly the cleanup needs to be done before
moving the data will have to be part of the new system plan.
Another issue has come up with the CDR that has a higher
priority at the moment, so I have not loaded the journal titles
from NLM yet. I hope to finish the other problem some time next
week and then get back to this issue.
BZDATETIME::2010-10-29 11:01:48
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::31
NLM created a journals database years ago. You can access it from the
link below:
http://www.ncbi.nlm.nih.gov/journals
Each journal record includes one or more ISSN numbers. For example,
journals with electronic versions will have two ISSN numbers, one for
print and one for electronic. NLM assigned each of the journals in the
database a unique ID called NLM ID. And each journal record includes
title changes, and any other information on the journal if it has merged
with another or split, etc.
Maybe we could link up to this journals database somehow and not have
to make a new one of our own or clean up the existing one. Not sure if
this is even possible but I thought I'd toss the idea out there. If
anything this is a good example of how NLM is defining "what is one
journal?" and may help answer some of your other questions.
BZDATETIME::2010-10-29 14:00:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::32
(In reply to comment #31)
...
> http://www.ncbi.nlm.nih.gov/journals
...
Yes, this is the same file that I used to compare the Pubmed list to the journals in the Citation Management System and generate the differences.
It makes sense to me that the NLM listing could be used as an authority file for our system. They already do a lot of the work.
The current software still needs to have the journal title data loaded locally and, based on my limited understanding of the functionality of the system, we would probably need that in the future as well, even if we built a new system. However it might be possible to make links, either through ISSN or NLM ID, that would enable automatic updates of our system and keep us in sync with them. We'd have to think about the implications of this if we ever include non-NLM journal articles in the system.
At some point I'll write this up for the other issue #4937, the notes on a new system.
BZDATETIME::2010-11-11 15:39:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::33
I plan to do the following today:
Backup the test database.
Perform some test inserts of new journals.
If all goes well, insert the rest of the new journal titles
(4217 total new titles.)
If I mess everything up, I'll restore the backup.
It will probably be best if everyone stays out of the test
database for the duration of these changes. I'll start the
process at 4pm to give everyone time to get out if they're
logged in to the test system now.
The production database should be unaffected. People working
in that can continue.
BZDATETIME::2010-11-11 17:29:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::34
I've inserted all of the titles from Pubmed not found in the
CiteMS journal title table.
The next step is to update the article ("bib") table. I'm
working on a program to do that.
I'll examine every journal article record in the bib table that
has a null 'full_journal' column value. For each one, I'll do
some processing to isolate the short title part of the 'journal'
column value and match that against the 'short_title' taken from
the CiteMS journal title table. That table will have everything
it had before plus new entries from the journal title update from
NLM that I just did.
As before, I'll output information about what happened so we
can analyze the results. There will be three outputs:
1. A list of 1 -> 1 matches from an article table journal
value to the journal table short_title.
2. A list of article entries with no match in the journal table.
3. A list of article entries with more than one match in the
journal table.
For all matches in category 1, I'll insert the full title from
the journal table into the article table row for the matching
article. Some of these will be journal titles newly loaded from
NLM but some may also be for journal titles imported by hand from
NLM using the web based system, but done after the article record
was already created.
For category 2 there's nothing I can do.
For category 3, it might be a mistake to let the program choose
which journal title to use, so I won't do anything with those
either. We can try an automated cleanup later if we think there
is a good way to recognize the right match programmatically.
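The three categories above amount to counting how many journal table records share the short title parsed from each article. A minimal sketch, assuming the rows have already been read into Python tuples (the function name and tuple layouts are hypothetical, and this is not the program described in the thread):

```python
from collections import defaultdict


def sort_articles(articles, journals):
    """Split articles into unique, unmatched, and ambiguous groups.

    articles: iterable of (ref_id, short_title) for rows with no full_journal.
    journals: iterable of (short_title, long_title) from the journal table.
    """
    by_short = defaultdict(list)
    for short_title, long_title in journals:
        by_short[short_title].append(long_title)

    unique, unmatched, ambiguous = {}, [], []
    for ref_id, short_title in articles:
        long_titles = by_short.get(short_title, [])
        if len(long_titles) == 1:
            unique[ref_id] = long_titles[0]   # category 1: safe to update
        elif not long_titles:
            unmatched.append(ref_id)          # category 2: nothing to do
        else:
            ambiguous.append(ref_id)          # category 3: left alone
    return unique, unmatched, ambiguous
```

Only the `unique` mapping would feed the updates; the other two lists correspond to the report files for review.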
Of course we still don't know how useful any of this is. I'm
hopeful that it will help with the NOT list problem and improve
the accuracy of some reports, but we already know it won't
totally solve the problems and we don't know yet what the extent
of the help will be.
I'm trying to keep notes on each of the things I'm doing so
that if we decide we need to do this again a year from now, I'll
know what I did.
BZDATETIME::2010-11-12 00:41:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::35
I've gotten pretty far down the road, writing the program to
populate missing full title fields in the article ("bib") table,
but I haven't finished it.
I wrote most of the program and was working on the part that
isolates the short title from the citation, for example:
Crit Rev Oral Biol Med. 2002;13(1):62-70.
or
Breast Cancer Res Treat 2002 Mar;72(2):131-7.
But it's an error prone process. In the first example, the
short title is everything up to the first ".". But the second one
doesn't have a period in the same place, so it's everything up
to the space before the year.
Searching for one or the other mostly works. Most citations
follow one or the other of these patterns and the CiteMS appears
to try both of these techniques. Unfortunately however there are
a few citations that have a date as part of the title before the
period.
J Am Pharm Assoc (2003). 2008 Nov-Dec;48(6):784-92.
Spine (Phila Pa 1976). 2009 Apr 20;34(9):901-4.
Maybe I can isolate these by noting the parentheses. I'm not
sure.
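One way the parentheses idea might be tried is sketched below. This is an assumption-laden illustration, not the CiteMS logic and not the program described here: the pattern treats each parenthesised group as part of the title, then looks for an optional period, white space, and a four-digit year beginning with 19 or 20. Citations with no such year, including the garbage citations, are simply not parsed.

```python
import re

# Title = plain characters or whole parenthesised groups (taken lazily),
# then an optional period, white space, and a year starting with 19 or 20.
CITATION = re.compile(
    r"(?P<title>(?:[^()]|\([^()]*\))*?)\.?\s+(?:19|20)\d{2}\b"
)


def short_title(citation):
    """Return the short title part of a citation, or None if no year is found."""
    m = CITATION.match(citation.strip())
    return m.group("title") if m else None


if __name__ == "__main__":
    for cite in (
        "Crit Rev Oral Biol Med. 2002;13(1):62-70.",
        "Breast Cancer Res Treat 2002 Mar;72(2):131-7.",
        "J Am Pharm Assoc (2003). 2008 Nov-Dec;48(6):784-92.",
        "Spine (Phila Pa 1976). 2009 Apr 20;34(9):901-4.",
    ):
        print(short_title(cite))
```

Because a parenthesised group can only be consumed whole, a year inside parentheses is never mistaken for the publication year. A title containing an unparenthesised year would still be cut short.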
There are also a few garbage citations, for example there are
seven of these:
Entrez PubMed My NCBI [Sign In] [Register] All
DatabasesPubMedNucleotideProteinGenomeStructureOMIMPMCJournalsBooks
Search PubMed Protein Nucleotide Structure Genome Books
CancerChromosomes Conserved Domains 3D Domains Gene Genome
Project GENSAT GEO
At least those shouldn't match anything in the journal table.
I thought maybe I could overcome all of these pattern matching
problems by using ISSNs. I did some research on that since both
article citations and journal records in Pubmed can have ISSNs.
I even came up with a plan for using the Pubmed unique ID (PMID)
for linking the two, i.e.:
Use the PMID in CiteMS to find the article in Pubmed.
Get the ISSN from the article.
Use it to get the journal title from Pubmed.
However it turns out that doesn't work any better. Of the
22,993 records in the Pubmed journal title list, 3,505 don't have
an ISSN. Obviously, there must be a lot of articles that don't
have them either. I suspect there will also be problems of
multiple title strings matching one ISSN and multiple ISSNs
matching one title.
However we look at it, it's a hard problem.
I fear I am at some risk here of slipping into analysis
paralysis. So I'm going to stop overthinking this and continue
on the path I was on. When I'm done it won't be perfect but it
shouldn't make anything worse and may make things better. I'm
trying to be careful to only update records where all the matches
are unambiguous.
I'm hoping that by doing all of this analysis I'm at least
learning more about the problems for our next go round.
I'll try to finish the program next Monday and post the results.
BZDATETIME::2010-11-15 21:33:12
BZCOMMENTOR::Alan Meyer
BZCOMMENT::36
I finished a program to do the updates. I've run it to generate
all of the match information and SQL update strings, but haven't
applied any of the updates to the database. We should probably
review the results before I do that, even in the test database.
The final report from the program is:
INPUTS:
Articles with no full title = 13070
Total journals in database = 23055
CITATION OUTPUTS:
No short title parsed = 0
No matching short title = 941 - CiteMS_Internal_NoMatch.txt
Multiple short title matches = 273 - CiteMS_Internal_MultiShort.txt
Multiple long title matches = 80 - CiteMS_Internal_MultiLong.txt
Unique matches = 11776 - CiteMS_Internal_MatchUpdates.sql
Here's what these numbers mean. All numbers reference the test
database:
INPUTS:
There are 13,070 articles in the test database with no full
journal title.
That's out of 205,082. About 6.4% have no journal titles.
There are now 23,055 journals in the journal table after
loading journals from NLM.
CITATION OUTPUTS:
I was able to parse out a short title from every one of the
articles that had no full title.
I used the technique in the CiteMS plus one small
modification that picked up a handful that wouldn't be picked
up in the CiteMS.
941 of the articles that didn't have a full journal title
could not be matched against the journal table.
This is even after loading the full NLM title list. I don't
know why, but I'm guessing that at least part of the problem
is at the NLM end, i.e., that they have citations that don't
exactly match their own journal list. But we'd need to do
more research to find out if that's a possible cause or the
main cause.
273 of the articles match more than one short title in the
title table.
This is because there are duplicates in that table. If I did
everything right, these duplicates were there before I loaded
the full NLM list because I tried not to load any that
duplicated ones we have already.
80 of the articles match more than one long title.
I parsed a short title from the citation, found a match for
it in the journal table, then checked to see if the long
title happens to match any other short titles. 80 of them
did.
11,776 updates would be performed.
Of the 13,070 articles in the test database that don't have
full journal titles, 11,776, or about 90%, can be assigned an
unambiguous long title by this program.
By "unambiguous" I mean that I'm disregarding any match that
gets more than one hit on a short or a long title.
We could include ambiguous matches. If we did, the number of
updates would go up to 12,129. That would bring us up to
almost 93%, but at the cost of making some possibly funky
matches - maybe good for some purpose, maybe not so good for
others.
In terms of the total journal matched articles, our
percentage would go from 93.6% journal matched articles, up
to almost 99.4% using only the unambiguous matches.
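The percentages in this comment follow directly from the report counts. A short check, using only the numbers quoted above (the variable names are just labels chosen for this sketch):

```python
total_articles = 205082   # all articles in the test database
no_full_title = 13070     # articles with no full journal title
unique_matches = 11776    # unambiguous updates
ambiguous = 273 + 80      # multiple short + multiple long title matches

share_without = no_full_title / total_articles
share_updatable = unique_matches / no_full_title
with_ambiguous = unique_matches + ambiguous
share_with_ambiguous = with_ambiguous / no_full_title
matched_before = 1 - share_without
matched_after = 1 - (no_full_title - unique_matches) / total_articles

print(f"no full title:        {share_without:.1%}")
print(f"updatable (unique):   {share_updatable:.1%}")
print(f"updates incl. ambig.: {with_ambiguous} ({share_with_ambiguous:.1%})")
print(f"matched before/after: {matched_before:.1%} / {matched_after:.1%}")
```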
I think the improvement is enough to make the database a little
less error prone, though there will still be articles that are
not linked to any journal title, and articles linked to the wrong
titles. It looks to me like it's worth doing the updates.
I'll attach output files if anyone wants to see them.
BZDATETIME::2010-11-16 10:45:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::37
I was about to post attachments with the output files from my
journal matching when I decided to investigate some of the match
failures.
It turned out that we have citations in the "bib" (article)
table that don't have any journal information except a short
title.
For example:
"Am J Med."
The bib table does have a "date_published" value for this
article:
"2005 Oct".
Checking Pubmed for the PMID associated with this article I find:
"Am J Med. 2005 Oct;118(10):1078-86."
That would be the value that should appear in the "journal"
column for this record.
Presumably the record was read in correctly from NLM but
something happened to it, possibly caused by a bug in the import
software, or possibly caused by something that happened later,
that corrupted the citation.
There were some modifications made to the software at the end
of 2005. Perhaps some bugs were introduced at that time and then
fixed, but the database was never corrected.
I'm going to modify my software to handle this kind of problem and
re-run. We should be able to improve our match rate.
I'll report back soon.
BZDATETIME::2010-11-16 14:19:57
BZCOMMENTOR::Alan Meyer
BZCOMMENT::38
The change in short title construction helped. Here are the
new results:
INPUTS:
Articles with no full title = 13070
Total journals in database = 23055
CITATION OUTPUTS:
No short title parsed = 0
No matching short title = 320 - CiteMS_Internal_NoMatch.txt
Multiple short title matches = 291 - CiteMS_Internal_MultiShort.txt
Multiple long title matches = 80 - CiteMS_Internal_MultiLong.txt
Unique matches = 12379 - CiteMS_Internal_MatchUpdates.sql
Applying the updates would now bring us up to almost 99.7% of
all articles having a full journal title, with 691 still
unmatched (the sum of the three categories of failed or
ambiguous matches.)
I'll post the output files as attachments.
BZDATETIME::2010-11-16 14:21:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::39
The format is:
Internal CiteMS unique ID for the article (not the Pubmed ID): Citation
Attachment CiteMS_Internal_NoMatch.txt has been added with description: Article citations with no match in the journal table
BZDATETIME::2010-11-16 14:26:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::40
The format here is
UI: Citation
Match in journal table
Another match in journal table
etc.
Some of these would also appear in the file containing multiple
long matches, but I have not reported them in both places.
Where both the short and long title are the same in two
different journal records, the CiteMS has duplicate journal data.
Attachment CiteMS_Internal_MultiShort.txt has been added with description: Cites with two or more short title matches (after updating journals)
BZDATETIME::2010-11-16 14:28:32
BZCOMMENTOR::Alan Meyer
BZCOMMENT::41
These don't match on the short title, but the short
title they match on points to a long title that is a duplicate
of a long title associated with a different short title.
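The condition described above, one long title shared by more than one short title, can be found by grouping journal records on the long title. A minimal sketch with hypothetical names and a hypothetical (short_title, long_title) tuple layout:

```python
from collections import defaultdict


def shared_long_titles(journals):
    """Map each long title used by more than one short title to those short titles."""
    shorts_by_long = defaultdict(set)
    for short_title, long_title in journals:
        shorts_by_long[long_title].add(short_title)
    return {long_title: sorted(shorts)
            for long_title, shorts in shorts_by_long.items()
            if len(shorts) > 1}
```

Any article whose short title appears in one of the returned lists would fall into this file rather than into the unique matches.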
It might be safe to just update these citations to point to
the long title, even though we don't have a unique long title
for them.
"Safe" is a relative term of course.
Attachment CiteMS_Internal_MultiLong.txt has been added with description: Cites with two or more long title matches (after updating journals)
BZDATETIME::2010-11-16 14:31:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::42
These are in SQL format, ready to be applied to the database.
That's probably not the most useful format to read since it
doesn't include the citation but, trust me (hah!), they're
okay 🙂
Attachment CiteMS_Internal_MatchUpdates.sql has been added with description: Article long title updates (after updating journals)
BZDATETIME::2010-11-16 14:33:36
BZCOMMENTOR::Alan Meyer
BZCOMMENT::43
> ... trust me (hah!), they're okay
Don't trust me. I spoke too soon.
There are duplicates that shouldn't be there.
BZDATETIME::2010-11-16 14:55:39
BZCOMMENTOR::Alan Meyer
BZCOMMENT::44
Here's the corrected one.
Same titles, but I had mistakenly put in unique identifiers
for the journals instead of the articles to be updated.
That's now fixed.
I also added a copy of the citation above the SQL update
statement, e.g.,
Arch Neurol. 2005 Dec;62(12):1904-8.
UPDATE bib SET full_journal='Archives of neurology' WHERE ref_id = 91684
If anyone wants to review the updates, that should make it
practical.
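For illustration, statements in this shape could be generated as in the sketch below. The function name is hypothetical, and writing the citation as an SQL comment (`--`) is an assumption; the attached file may format it differently. Embedded single quotes are doubled, the standard SQL way to keep a title such as one containing an apostrophe from ending the string literal early.

```python
def update_statement(ref_id, citation, full_journal):
    """Return the citation as an SQL comment followed by the UPDATE statement."""
    escaped = full_journal.replace("'", "''")  # double embedded single quotes
    return (f"-- {citation}\n"
            f"UPDATE bib SET full_journal='{escaped}' WHERE ref_id = {int(ref_id)}")


if __name__ == "__main__":
    print(update_statement(91684, "Arch Neurol. 2005 Dec;62(12):1904-8.",
                           "Archives of neurology"))
```

If the updates were executed directly from a program instead of from a generated file, a parameterized query would be the safer choice.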
Attachment CiteMS_Internal_MatchUpdates.sql has been added with description: Article long title updates (after updating journals)
BZDATETIME::2010-11-16 17:08:56
BZCOMMENTOR::Alan Meyer
BZCOMMENT::45
I propose the following steps in the resolution of this task.
1. Review the outputs from the most recent program.
These are in the attachments created today.
I'm hoping the review will be fairly easy to do, mainly
looking for surprises.
Who: Minaxi and/or Cynthia.
2. Prepare some test cases for NOT list searching.
These should be searches that don't work in the existing
database.
Who: Minaxi and/or Cynthia.
3. Update the article entries in the test database.
Done by applying the SQL in CiteMS_Internal_MatchUpdates.sql.
Who: Alan
4. Re-test.
Are we doing better with the NOT lists?
If something that didn't work before still doesn't work, is
it because it's in one of the files of article citations that
were not updated?
Did anything else break (or get fixed)? Most likely
candidates would be anything that involves full journal
titles. I'm not sure what that would be. If we discover
something that looks not right (or better than before), it's
probably possible to compare it to the production system to
see if there was an actual change.
Who: Minaxi and/or Cynthia.
If something is newly broken, I'll diagnose the problem, fix
it, restore the test database from before all this work, and
restart. We have outputs from the data preparation stored in
Bugzilla attachments and I can probably redo step one by an
automated compare of pre and post data, possibly allowing us
to go right back to step 3.
If it looks like we have alleviated the problem and made
progress then:
5. Re-do the steps for updating the production database.
This requires:
a. Downloading the latest Pubmed indexed journal list.
b. Locating journals in the Pubmed list that are not in the
production CiteMS journal table.
c. Searching the production database for citations that are
missing full journal titles.
d. Preparing input files from b. and c. for the next steps.
e. Generating the update SQL.
f. Backing up the production database.
g. Applying the updates.
h. Testing.
If there is a problem, we need to immediately restore
the backup and diagnose it, then re-run whatever steps
are required.
j. Cleanup.
I have written three programs used in the above steps
that I will put under version control.
I will also need to write a new document, also placed
into version control, that explains everything that was
done. We may want to do it all again in the future if
we don't have a new Citation Management System before
things get too far out of sync.
Who: Alan for most of the steps.
Minaxi, Cynthia and Alan for step h.
The first step I need to perform is step 3 above. I will wait
for a go ahead from Minaxi and/or Cynthia before I update the
test database.
BZDATETIME::2010-11-17 09:57:56
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::46
Minaxi has been having problems accessing Bugzilla. I am posting her comments for her.
Comments from Minaxi Trivedi 11/16/2010 11:06pm:
I have reviewed some titles in Internal No Match List.
Here are my comments:
1. There is a strange character in the journal title for the
following:
155205: J Clin Oncol. 2008 Apr 20;26(12):2027-33.
87086: Curr Med Chem Anticancer Agents
You can see these in Word format.
2. All titles with Entrez PubMed My NCBI [Sign In] [Register] All DatabasesPubMedNucleotideProteinGenomeStructureOMIMPMCJournalsBooks Search PubMed Protein Nucleotide CoreNucleotide GSS EST Structure Genome Books CancerChromosomes Conserved Domains 3D Domains
need to be corrected manually as they were not imported correctly.
3. 87086: Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12.
This does not match as it is no longer indexed in PubMed
4. 51178: Curr Drug Target CNS Neurol Disord 2004 Feb;3(1):27-37.
Journal title Not in PubMed
5. All journal titles with parentheses do not match
6. All journal titles with Parts do not match
7. 87778: Hematology Am Soc Hematol Educ Program. 2005:491-7.
Does not match as title has changed and
Continues: Education Program. American Society of Hematology. Education
Program.
8. There are still some titles which are a mystery; I cannot find a reason
for them not to match.
I also tested NOT journal retrieval for the following two titles and it is still not working.
Transfus Apher Sci
East Afr J Public Health
I added the title in journal title field, checked NOT journals box, selected adult treatment board and March 2010 review cycle.
BZDATETIME::2010-11-17 11:26:10
BZCOMMENTOR::Alan Meyer
BZCOMMENT::47
(In reply to comment #46)
> Minaxi has been having problems accessing Bugzilla. I am posting her comments for
> her.
Minaxi,
When I'm in the office tomorrow I can call you on the phone and step you through getting into Bugzilla. We'll see if we can make it work for you.
> I have reviewed some titles in Internal No Match List.
>
> Here are my comments:
> 1., 2. ...
I'll check all of these out and get back to you.
I didn't modify any data in the database other than loading journal titles from NLM that weren't in the database already. I believe that all of the dirty data you see in the output files is actually there in the original CiteMS data, and it's in production too, which I haven't touched at all.
But I'll look at them to be sure that I understand what's there.
> 3. 87086: Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12.
> This does not match as it is no longer indexed in PubMed
>
> 4. 51178: Curr Drug Target CNS Neurol Disord 2004 Feb;3(1):27-37.
> Journal title Not in PubMed
Does that mean that, even though an article is indexed in Pubmed, the journal list doesn't include the journal title if the current issues are no longer indexed, or perhaps no longer exist?
That has implications for the analysis I did of journal management in Issue #4937 Comment #6.
> 5. All journal titles with parenthesis do not match
> 6. All journal titles with Parts do not match
> 7. 87778: Hematology Am Soc Hematol Educ Program. 2005:491-7.
> Does not match as title has changed and
> Continues: Education Program. American Society of Hematology. Education
> Program.
> 8. There are still some titles which are mystery and cannot find reason for
> them not to match.
Can you give me some citations for 5, 6, and 8? I'll check them out.
Internally, every article in CiteMS has a unique identifier. Do users
have access to them, or do you only have the citation string as an
article identifier? It's useful if simple, numeric unique identifiers
are made available to users like CDR IDs in the CDR.
>
> I also tested NOT journal retrieval for the following two titles and it is
> still not working.
Yes. Nothing should work yet because I haven't run the updates yet. I just want us to identify some test cases. We'll see if they work after the updates of the article table.
> Transfus Apher Sci
> East Afr J Public Health
That particular case isn't going to work after the update either. The problem can be seen in the file "CiteMS_Internal_MultiLong.txt". There are two different entries in the journal title file that match the long title, each with a different short title:
Transfus Apher Sci
Transfus Apheresis Sci
We may want to match these and include them in the updates. See comment #41. If we do, I believe this one will work with the NOT list. However I don't know if that will solve more problems than it creates, or vice versa.
If you can identify other cases that don't work in NOT list processing, please do.
Thanks for all of the checking.
BZDATETIME::2010-11-17 11:47:20
BZCOMMENTOR::Volker Englisch
BZCOMMENT::48
(In reply to comment #46)
> Minaxi has been having problems accessing Bugzilla. I am posting her comments for
> her.
What kind of problems? Maybe I can help?
BZDATETIME::2010-11-17 12:10:59
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::49
I had a problem logging in to Bugzilla, but it seems to be resolved.
Thanks, Minaxi
BZDATETIME::2010-11-17 13:51:40
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::50
Alan,
Here is my reply to your questions:
> 3. 87086: Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12.
> This does not match as it is no longer indexed in PubMed
When I searched the PubMed journals database, I noticed the
following:
Current Indexing Status: Not currently indexed for MEDLINE.
So I assume it may not have been added to the list in CiteMS.
> 4. 51178: Curr Drug Target CNS Neurol Disord 2004 Feb;3(1):27-37.
> Journal title Not in PubMed
> Does that mean that, even though an article is indexed in Pubmed, the
> journal list doesn't include the journal title if the current issues are
> no longer indexed, or perhaps no longer exist?
Yes, it seems to be true. I have a list of other journals that were
indexed in PubMed and were part of our NOT list, but no longer indexed
in Medline. They all have
Current Indexing Status: Not currently indexed for MEDLINE.
These are:
Physiol Mol Plant Pathol=Physiological and molecular plant pathology
Biological control: theory and applications in pest management
Int J Angiol=The International journal of angiology
Integr Med=Integrative medicine
> 5. All journal titles with parenthesis do not match
> 6. All journal titles with Parts do not match
> 7. 87778: Hematology Am Soc Hematol Educ Program. 2005:491-7.
> Does not match as title has changed and
> Continues: Education Program. American Society of Hematology.
> Education Program.
> 8. There are still some titles which are mystery and cannot find
> reason for them not to match.
> Can you give me some citations for 5, 6, and 8? I'll check them out.
Titles with parentheses:
212405: Aging (Albany NY). 2009 Dec 31;1(12):988-1007.
197252: Cancer Prev Res (Phila Pa). 2009 Oct;2(10):868-78. Epub 2009 Sep 29.
134943: J Am Pharm Assoc (2003). 2007 May-Jun;47(3):425-6, 428-30.
Titles with parts:
75254: Birth Defects Res A Clin Mol Teratol 2004 Nov;70(11):889-91.
Title in PubMed:
Title:
Birth defects research. Part A, Clinical and molecular teratology
118357: J Environ Sci Health A Tox Hazard Subst Environ Eng. 2006;41(10):2399-428.
Title in PubMed:
Title:
Journal of environmental science and health. Part A, Toxic/hazardous substances & environmental engineering
Examples of 8:
206925: Surg Innov. 2009 Dec;16(4):306-12. Epub 2009 Dec 22.
210970: S D Med. 2010;Spec No:68.
Some other observations:
67593: Ther Apher Dial 2004 Dec;8(6):468-73.
Does not match because title has changed and continues as Therapeutic
Apheresis
Int J Med Inform—Search in PubMed Journals database retrieves 3
journal titles
202772: Acta Ophthalmol. 2009 Nov;87(8):914-6. Epub .– Search in PubMed
Journals database retrieves 7 journal titles
143738: Arch Womens Ment Health. 2007;10(5):189-97. Epub 2007 Aug
7.
Full Title has an apostrophe—can this be an issue?
Title:
Archives of women's mental health
152569: Acta Neuropathol. 2007 Dec;114(6):641-50. Epub 2007 Oct
3.
Previously Acta Neuropathol (Berl)
There are 45 citations in CiteMS. The first 28 matched. The rest, which did
not match, are in the report.
12594: Congenit Anom Kyoto 2002 Mar;42(1):10-4.
PubMed Journals database abbreviated title is Congenit Anom (Kyoto)
209725: Stat Appl Genet Mol Biol. 2010;9(1):Article 17.
Version Currently Indexed: Electronic
Not all journal titles have this entry. Could this be an issue?
BZDATETIME::2010-11-17 13:59:57
BZCOMMENTOR::Alan Meyer
BZCOMMENT::51
Thanks. I'll look at the examples tomorrow.
BZDATETIME::2010-11-17 15:59:03
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::52
I have reviewed the CiteMS_Internal_MultiLong file and am attaching the file
with my comments in red and wrong links highlighted in yellow.
Minaxi
Attachment CiteMS_Internal_MultiLong[1].doc has been added with description: Review of the CiteMS_Internal_MultiLong file
BZDATETIME::2010-11-17 18:28:35
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::53
I have gone through the duplicate titles file. In most cases the reason for the duplicate, I think, is because of the addition of a period at the end of the journal title. Many of the duplicate titles do not retrieve any citations so they are not a problem. I am attaching a copy of the duplicate titles with my comments for each title.
BZDATETIME::2010-11-17 18:30:52
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::54
Duplicate titles list and comments reviewed by Cynthia Boggess.
Attachment duplicates_reviewed_CBoggess_11-17-2010.xls has been added with description: Reviewed list of Duplicate titles
BZDATETIME::2010-11-17 22:39:32
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::55
I have partially reviewed the CiteMS_Internal_MultiShort file and am attaching it.
BZDATETIME::2010-11-17 22:44:26
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::56
I have partially reviewed the CiteMS_Internal_MultiShort file. I have
made comments in red and highlighted wrong title IDs in yellow.
Here are my comments:
1. There are some duplicate titles, probably created manually by
mistake. Some seem to be deliberately created duplicates, as their IDs
are consecutive; these must be pairs in which one title has a final
period and the other does not.
2. Some titles have multiple entries due to title changes, mergers,
etc. I checked them against the Journals database and highlighted the
wrong ones.
3. While trying to find the correct match, because one abbreviation can
retrieve multiple journals in PubMed, I tried taking the ISSN from the
citation and searching on it in the Journals database, and that seems to
work better.
I am assuming that this can be done programmatically, so stopped
reviewing further.
Please give me your thoughts.
4. After reviewing all three files, I have one question about the
matching process:
Was the full journal title matched letter by letter, without
stripping non-alphanumeric characters?
Attachment CiteMS_Internal_MultiShort_MinaxiReview.doc has been added with description: partially reviewed CiteMS_Internal_MultiShort file
BZDATETIME::2010-11-18 11:50:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::57
(In reply to comment #50)
> Alan,
> Here is my reply to your questions:
> 3. 87086: Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12.
> > This does not match as it is no longer indexed in PubMed
...
This comment is just about some of the problems reported in the
log files, not all of them. I'll continue researching the
others.
It turns out that, at least for the two examples Minaxi found,
even though the journals are no longer indexed in PubMed, they
continue to appear in the downloadable listing that I used to
update the CiteMS journal table.
I traced the one above (87086) through all of the input and log
files to figure out why there was no match. Here is the answer
for this particular case:
The CiteMS journal table has this entry:
Short: Curr Med Chem Anti-Canc Agents
Full: Current medicinal chemistry. Anti-cancer agents.
The current PubMed journal list has this one:
MedAbbr: Curr Med Chem Anticancer Agents
JournalTitle: Current medicinal chemistry. Anti-cancer agents
The citation in the CiteMS bib table had this:
"Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12."
When I loaded the PubMed titles into the CiteMS, I took the very
conservative position that, if either the short or long title in
PubMed matched an existing entry in the CiteMS, don't do
anything. My thinking was, "First, do no harm." Not fully
understanding everything in the CiteMS, I didn't want to do
anything that would introduce more multiple versions of the data
than were already there.
I performed all of my matches without regard to the final period,
so there was a match between the above CiteMS full title, and the
PubMed JournalTitle. So I didn't load this PubMed record.
As a result, the CiteMS journal table still had the right long
title to match our citation, but not the right short title.
When I looked up "Curr Med Chem Anticancer Agents" I got no match
in the CiteMS journal table.
We can probably make progress with these kinds of problems if I
ignore all full title matches and just load the PubMed journals
if there is no match on the short title. If, in effect, the
short title in the CiteMS journal table is just being used as an
index to the long title, it may not matter that there are
duplicate long titles.
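The difference between the two loading rules can be sketched as follows. This is only an illustration, not the actual cleanup program: the function names and the tuple layout are assumptions, and only the title strings come from the example above.

```python
def norm(title: str) -> str:
    """Trim surrounding whitespace and trailing periods before comparing."""
    return title.strip().rstrip(".").strip()

# One row already in the CiteMS journal table: (short title, full title).
citems_rows = [
    ("Curr Med Chem Anti-Canc Agents",
     "Current medicinal chemistry. Anti-cancer agents."),
]

# The corresponding entry in the PubMed journal list: (MedAbbr, JournalTitle).
pubmed_row = ("Curr Med Chem Anticancer Agents",
              "Current medicinal chemistry. Anti-cancer agents")

def should_load(pubmed, existing, ignore_full_title_matches):
    """Return True if the PubMed row would be inserted into the journal table."""
    short, full = norm(pubmed[0]), norm(pubmed[1])
    shorts = {norm(s) for s, _ in existing}
    fulls = {norm(f) for _, f in existing}
    if ignore_full_title_matches:
        # Proposed rule: only the short title decides.
        return short not in shorts
    # Original conservative rule: a match on either title blocks the load.
    return short not in shorts and full not in fulls

conservative = should_load(pubmed_row, citems_rows, ignore_full_title_matches=False)
proposed = should_load(pubmed_row, citems_rows, ignore_full_title_matches=True)
print(conservative, proposed)
```

Under the conservative rule the PubMed row is skipped because the full titles match once the final period is ignored; under the proposed rule it is loaded, so a lookup on "Curr Med Chem Anticancer Agents" would find a row.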
I've mentioned the possibility of doing this in previous
comments, but I didn't want to do it on my own authority.
Incidentally, it appears that PubMed actually goes back and
revises the citations for earlier articles when the MedAbbr value
changes. At any rate, I found some where this must have happened
and the PubMed citation does not match ours.
I'll make a note of this in issue #4937.
BZDATETIME::2010-11-18 21:33:35
BZCOMMENTOR::Alan Meyer
BZCOMMENT::58
In response to comment #50:
I looked at the examples of short titles with parentheses.
Here's what I found:
Aging (Albany NY)
We have no short title in the CiteMS that matches this,
but we do have a long title "Aging." that matches this in
the PubMed journal list. Because I didn't load any
duplicate long titles (i.e. PubMed journals with long
titles that matched a long title already in CiteMS), I
didn't load this one and no match was made on the PubMed
short title.
Cancer Prev Res (Phila Pa)
There's no exact match on this either in PubMed or
CiteMS. The short titles in the databases are:
CiteMS: Cancer Prev Res (Phila)
Cancer Prev Res
PubMed: Cancer Prev Res (Phila)
J Am Pharm Assoc (2003)
CiteMS: J Am Pharm Assoc Am Pharm Assoc
J Am Pharm Assoc Am Pharm Assoc (Baltim)
J Am Pharm Assoc (Wash DC)
J Am Pharm Assoc (Wash)
J Am Pharm Assoc
PubMed: J Am Pharm Assoc (2003)
PubMed has an exact match but, like with Aging, it was
not loaded because the full title in PubMed was an exact
match for the full title in CiteMS for
J Am Pharm Assoc (Wash DC)
In all three cases, the presence of parentheses is a
coincidence. The problem is either a failure to find an
exact match, or a duplicate match on the full title where
CiteMS has a different short title.
Here are some titles with parentheses where update statements
were generated:
Journal of women''s health (2002)
Forschende Komplementarmedizin (2006)
Hybridoma (2005)
International journal of obesity (2005)
etc.
I am more and more thinking that we should load the PubMed
titles without regard to duplication of full titles. That
would fix two of these three problems. Fixing the third one
("Phila Pa" when the exact match was "Phila") would require
some sort of fuzzy match - which would require a great deal
of programming to make plausibly accurate.
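To illustrate why a fuzzy match is not a quick fix, here is a small sketch using Python's standard difflib module. It is not something the cleanup program does; the 0.9 cutoff is an arbitrary assumption chosen for the demonstration.

```python
import difflib

# Short titles present in the CiteMS journal table for this journal.
citems_short_titles = ["Cancer Prev Res (Phila)", "Cancer Prev Res"]

# Short title as it appears in the unmatched citation.
cited = "Cancer Prev Res (Phila Pa)"

# Best candidate at or above the similarity cutoff.
candidates = difflib.get_close_matches(cited, citems_short_titles, n=1, cutoff=0.9)

# The same cutoff also pairs two separate rows of the journal table.
ratio = difflib.SequenceMatcher(
    None, "J Am Pharm Assoc (Wash)", "J Am Pharm Assoc (Wash DC)"
).ratio()
print(candidates, ratio)
```

The cutoff that rescues "Phila Pa" also treats "J Am Pharm Assoc (Wash)" and "J Am Pharm Assoc (Wash DC)" as the same string, and nothing in the similarity score says whether those rows are one journal or two. Making that decision reliably is where the programming effort would go.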
The problem with parts also seems not to be part specific, but
with the fact that the short title strings differ between CiteMS
and PubMed for the same journal. For example in the citation:
Birth Defects Res A Clin Mol Teratol 2004 Nov;70(11):889-91.
The short title portion of the citation in PubMed matches the
MedAbbr from the PubMed journal list, but the CiteMS has the
short title differently in its journal file:
Birth Defects Res Part A Clin Mol Teratol
----
The underlined part above is not in PubMed. Since the long
titles match in the two databases, my program would not create a
new entry.
Minaxi's comment that "... Does not match because title has
changed ..." is a key to many of these problems. I'll say more
about the alternative solutions to these problems in a subsequent
comment.
BZDATETIME::2010-11-18 23:41:08
BZCOMMENTOR::Alan Meyer
BZCOMMENT::59
In reply to comment #52 and #58:
I wrote:
> Minaxi's comment that "... Does not match because title has
> changed ..." is a key to many of these problems. I'll say more
> about the alternative solutions to these problems in a
> subsequent comment.
In comment #52 Minaxi looked at the duplicate long titles and
discovered, among other things, that
Some of the duplicates are different journals.
Some are the same.
It was because of this that I decided not to update the CiteMS
journal table with anything from PubMed for which the full title
matched a full title in CiteMS.
However, we have seen that this decision causes some citations to
have no association at all with a full title. We could increase
the percentage of citations associated with a title by adding
duplicate long titles to the CiteMS journal table where the short
titles differ.
The title table has multiple uses in the CiteMS. They include:
1. Act as an authoritative journal list.
For reasons we have seen, it's not very authoritative.
Adding duplicates makes it worse in this regard.
2. Map journal titles to citations.
This enables us to find citations by title, for searching, for
reports, for NOT lists, and probably other uses.
Adding duplicates increases our "recall" of citations by
title. Some citations that have no full title will get them,
making them searchable by title.
It will decrease our "precision", causing multiple citations
to be found by a title that really belongs to two or more
separate journals.
3. Map citations with embedded short titles to full titles.
Adding duplicates increases the number of mappings.
For this direction of mapping, precision seems less
important. We will map multiple citations to the same title
string, but that doesn't hurt anything if we're just finding
the correct title (not the correct journal!) for a citation.
I don't know whether it's right to add more duplicates to the
journal table. It has advantages and disadvantages that can't be
easily weighed against each other.
My intuition is that it just doesn't matter. Whatever we do, we
aren't going to solve all the problems by updating the database.
To really solve them requires a redesign of some important parts
of the system, plus a lot of database cleanup beyond the simple
stuff that I have worked on.
I think we should execute the updates that my program identified
and put the bulk of our time into thinking about a new system.
We've gotten good use out of the existing system for ten years.
We've learned a lot about what works and what doesn't. We're
offered new opportunities now that the original system wasn't
designed for, like electronic communications with board members,
machine readable articles, and direct board member user
interfaces. I think it's time to move to the next generation if
we can convince management to do so.
I'll wait for someone to tell me what to do now.
1. I can apply the updates I identified and move on.
2. I can apply the updates I identified plus updates for all of
the duplicate full titles I identified.
3. Or I can keep analyzing the specifics of journal title and
NOT list issues and do more research.
If we think we're going to be able to work on a new system, then
I recommend 1 or 2. I can make 2 work with only a small amount
of additional programming above what I've already done. But I
just don't know whether 1 or 2 is best.
BZDATETIME::2010-11-18 23:57:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::60
(In reply to comment #56)
> Created attachment 2043 [details]
> partially reviewed CiteMS_Internal_MultiShort file
>
> I have partially reviewed CiteMS_Internal_MultiShort file. I have made
> comments in Red and highlighted wrong title IDs in Yellow.
> Here are my comments:
> 1. There are some Duplicate Titles, probably created manually by mistake.
> Some seem to be deliberately created duplicates as their IDs are
> consecutive and these must be the ones with period and the other without
> period.
> 2. Some Titles have multiple entries due to change in titles, merger etc.
> I checked them with the Journals database and highlighted the wrong ones.
My programs aren't going to be able to solve these problems. It
will require user effort - as you have done, together with new
programs that enable a user to fix the errors in the database
after they have been discovered.
> 3. While trying to find the correct match, because of multiple retrieval
> for one abbreviation in PubMed, I tried to find the ISSN no. from the
> citation and then search in Journals database and that seems to work
> better.
> I am assuming that this can be done programmatically, so stopped
> reviewing further.
> Please give me your thoughts.
I believe that we can go back to PubMed programmatically and
re-download citations using the PMID, then fix the data by using
the newly downloaded citations. However, if we do that, I think
it should be done as part of a new system so that the fixes are
part of a design that has controls that link it much more tightly
to PubMed, rather than stopgap fixes to a design that will
drift out of sync with PubMed over time.
There are many things we could do if we match up our data with
PubMed again. For example, we can download complete article
records. We can switch to XML and use the full Unicode character
set that PubMed uses for correct representation of foreign
languages. We can distinguish journals that are different
journals but have the same titles. We can update citations to
the latest forms that NLM uses to cite a particular article,
reflecting changes, for example, when an abbreviated title changed.
But I think we should do all this in a new system.
> 4. After reviewing all three files, I have one question about
> matching process. Whether the full journal title was matched
> letter by letter without stripping non-alphanumeric characters?
I stripped the following before matching:
Leading and trailing whitespace (spaces, tabs, newlines).
Trailing periods.
This stripping was done on both short and full titles.
Nothing whatever was modified in the interior of a short or full
title string.
Strict letter by letter matching was done on the strings after
the ends were normalized as above.
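The normalization described above amounts to something like the following sketch. The function names are made up for illustration, and treating the comparison as case-sensitive is an assumption based on "strict letter by letter".

```python
def normalize_ends(title: str) -> str:
    """Strip leading/trailing whitespace and trailing periods only.

    The interior of the string is left exactly as it was.
    """
    return title.strip().rstrip(".").rstrip()

def titles_match(a: str, b: str) -> bool:
    """Strict character-by-character comparison after the ends are normalized."""
    return normalize_ends(a) == normalize_ends(b)

# A final period or stray whitespace does not prevent a match.
same = titles_match("East Afr J Public Health.", "  East Afr J Public Health\n")

# Interior differences always do, including parentheses and extra words.
kyoto = titles_match("Congenit Anom Kyoto", "Congenit Anom (Kyoto)")
phila = titles_match("Cancer Prev Res (Phila Pa)", "Cancer Prev Res (Phila)")

# Interior periods are preserved; only the trailing one is removed.
kept = normalize_ends("Current medicinal chemistry. Anti-cancer agents.")
print(same, kyoto, phila, kept)
```

So parentheses, apostrophes and the like are compared as-is. They cause a mismatch only when the two strings actually differ.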
BZDATETIME::2010-11-19 13:27:20
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::61
Alan,
Minaxi and I have discussed all your comments and detailed analysis. We
have come to the conclusion that we do in fact need to move on with
this. We agree with your suggestion to:
2. I can apply the updates I identified plus updates for all of
the duplicate full titles I identified.
All of your research will definitely be useful when it is time to plan the next generation CiteMS. But for now we do not know when such planning will even begin.
Thanks for all your hard work,
Cynthia & Minaxi
BZDATETIME::2010-11-19 16:24:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::62
(In reply to comment #61)
> Alan,
> Minaxi and I have discussed all your comments and detailed analysis. We
> have come to the conclusion that we do in fact need to move on with
> this. We agree with your suggestion to:
> 2. I can apply the updates I identified plus updates for all of
> the duplicate full titles I identified.
I'll be in again on Monday and will do it in the test database then.
> All of your research will definitely be useful when it is time to plan
> the next generation CiteMS. But for now we do not know when such
> planning will even begin.
>
> Thanks for all your hard work,
>
> Cynthia & Minaxi
You two have had to work pretty hard at this too. Thanks for your prompt reviews of test results.
BZDATETIME::2010-11-22 15:03:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::63
(In reply to comment #61)
> Alan,
> Minaxi and I have discussed all your comments and detailed analysis. We
> have come to the conclusion that we do in fact need to move on with
> this. We agree with your suggestion to:
> 2. I can apply the updates I identified plus updates for all of
> the duplicate full titles I identified.
I've applied the updates to the test database. I included the 80 citations where two or more short titles matched the same full title.
I think the NOT list problem should be alleviated, but possibly not completely solved.
I'll update the production database when given the go ahead. Before I do it, I'll download the latest journal list from NLM to pick up any updates since my last download on October 5, 2010.
BZDATETIME::2010-11-22 22:07:16
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::64
I tested the NOT journal list for the title "East Afr J Public Health".
It still does not appear as a NOT journal in the search.
BZDATETIME::2010-11-23 14:42:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::65
(In reply to comment #64)
> I tested for NOT journal list for the title "East Afr J Public Health"
> It still does not appear as a NOT journal in the search.
It looks like my updates were too conservative. This is one of
those cases where we have two short titles, one with and one
without a period at the end, both pointing to the same full
title, one with and one without a period. I detected that
everything was the same, but because we had two separate rows in
the journal table I decided that the program should leave it
alone.
There aren't very many of these. Here's the complete list of
journals with short titles that end in '.':
Ethnographie.
Atlanta Hist J.
J Egypt Natl Canc Inst.
Anticancer Agents Med Chem.
Dev Disabil Res Rev.
Prog Neurol Surg.
Turk Neurosurg.
Dtsch Arztebl Int.
Dig Endosc.
Photodiagnosis Photodyn Ther.
East Afr J Public Health.
Transfus Apher Sci.
B-ENT.
Of these, nine have duplicates without a final period.
Anticancer Agents Med Chem
Dev Disabil Res Rev
Prog Neurol Surg
Turk Neurosurg
Dtsch Arztebl Int
Dig Endosc
Photodiagnosis Photodyn Ther
East Afr J Public Health
B-ENT
The other four don't, which I think is why "Transfus Apher
Sci." now works but "East Afr J Public Health" doesn't.
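The split between the nine and the four can be reproduced with a few lines. This is a sketch over the titles listed above, held in a plain Python set rather than read from the journal table.

```python
# Short titles ending in '.', as listed above.
dotted_titles = {
    "Ethnographie.", "Atlanta Hist J.", "J Egypt Natl Canc Inst.",
    "Anticancer Agents Med Chem.", "Dev Disabil Res Rev.",
    "Prog Neurol Surg.", "Turk Neurosurg.", "Dtsch Arztebl Int.",
    "Dig Endosc.", "Photodiagnosis Photodyn Ther.",
    "East Afr J Public Health.", "Transfus Apher Sci.", "B-ENT.",
}
# Rows that exist separately without the final period.
undotted_titles = {
    "Anticancer Agents Med Chem", "Dev Disabil Res Rev", "Prog Neurol Surg",
    "Turk Neurosurg", "Dtsch Arztebl Int", "Dig Endosc",
    "Photodiagnosis Photodyn Ther", "East Afr J Public Health", "B-ENT",
}
short_titles = dotted_titles | undotted_titles

dotted = sorted(t for t in short_titles if t.endswith("."))
# A "twin" is the same short title stored a second time without the period.
with_twin = [t for t in dotted if t[:-1] in short_titles]
without_twin = [t for t in dotted if t[:-1] not in short_titles]
print(len(with_twin), without_twin)
```

"East Afr J Public Health." lands in the group with a twin, and "Transfus Apher Sci." in the group without one.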
I can modify the program to treat these cases as non-duplicates
and go ahead and perform the updates. I think that will fix East
Afr J Public Health.
I thought about just deleting the duplicates, but that opens a
new can of worms. There are articles in the database that have
a period at the end of the full journal title for the article.
If I delete a row in the journal table that has those titles, I'm
not sure that there aren't any implications for other parts of
the software. There probably aren't, but I don't know any way to
tell without a major research project.
Unless I hear an objection, I'll modify the cleanup program as
described above.
BZDATETIME::2010-11-23 16:05:09
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::66
I have a question about NOT List journals. When you updated the journals in the CiteMS test db, the journal list in the Edit NOT feature should also have been updated. But the NOT list does not show it.
I tried to see the following journals and could not find them:
Egypt J Immunol
Aging Male
Asia Pac J Clin Oncol
B-ENT
BZDATETIME::2010-11-23 16:51:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::67
(In reply to comment #66)
> I have a question about NOT List journals. When you updated the
> journals in CiteMS test db, the journal list in the Edit NOT feature
> should also have been updated. But the NOT list does not show it.
>
> I tried to see the following journals and could not find them:
>
> Egypt J Immunol
> Aging Male
> Asia Pac J Clin Oncol
> B-ENT
I may be doing the wrong thing, but when I looked for them, all four
of those were there on the left side of the screen. Is that what
should have happened?
It's only the full title that appears in the left pane, so finding
the first three is tricky. A search on "Egypt" or "Aging Male" will
locate the journals but you find them under "The Egyptian..." and
"The Aging Male...".
A search on "Asia Pac" won't work because the actual title is
"Asia-Pacific..." You need the hyphen to get the journal.
B-ENT was right there however. What was the problem with that one?
Am I missing something?
BZDATETIME::2010-11-23 17:18:57
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::68
(In reply to comment #67)
> (In reply to comment #66)
> > I have a question about NOT List journals. When you updated the
> > journals in CiteMS test db, the journal list in the Edit NOT feature
> > should also have been updated. But the NOT list does not show it.
> >
> > I tried to see the following journals and could not find them:
> >
> > Egypt J Immunol
> > Aging Male
> > Asia Pac J Clin Oncol
> > B-ENT
> I may be doing the wrong thing, but when I looked for them, all four
> of those were there on the left side of the screen. Is that what
> should have happened?
> It's only the full title that appears in the left pane, so finding
> the first three is tricky. A search on "Egypt" or "Aging Male" will
> locate the journals but you find them under "The Egyptian..." and
> "The Aging Male...".
> A search on "Asia Pac" won't work because the actual title is
> "Asia-Pacific..." You need the hyphen to get the journal.
> B-ENT was right there however. What was the problem with that one?
> Am I missing something?
No, I think I am losing my mind. I did not see Egypt or Aging either, because of the "The" in front, which I have become so used to ignoring. Now I see them in the list, as well as the other two.
BZDATETIME::2010-11-30 22:31:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::69
I made all of the planned modifications to the program to populate the journal table and the full_journal fields in the article ("bib") table. I planned to run the program this week, but issue #4961 intervened. I'll have to run it when I come back.
I think what I should do at that time is:
Download the latest PubMed journal title list.
Copy the latest production database to test.
Process them in the test database.
Ask users to test.
If all is okay, we can do it in production.
BZDATETIME::2011-01-10 22:22:28
BZCOMMENTOR::Alan Meyer
BZCOMMENT::70
I have completed all of the steps described in comment #69, to wit:
Copied the production database as of this morning to test.
Downloaded the latest PubMed journal list J_Medline.txt.
Extracted all titles from that list that were not in the CiteMS.
Inserted those titles in the CiteMS test database.
Located all of the CiteMS citation records that do not have
full titles.
Matched the citations against the newly updated journal title
file within CiteMS.
Updated all matching citations, i.e., copied the full title
into the citation record where a match could be found, using
rules discussed elsewhere in this Bugzilla record.
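The citation-matching steps can be sketched roughly as below. This is not the actual program: the regular expression, the function names, the outcome labels and the in-memory dictionary are all hypothetical, and the real parsing rules may differ. Only the citation strings and titles are taken from earlier comments.

```python
import re

# Hypothetical parsing rule: the short title is whatever precedes the first
# four-digit year that is followed by a space, ';' or ':'.
CITATION_RE = re.compile(r"^(?P<short>.*?)\.?\s+(?P<year>(?:19|20)\d{2})(?=[ ;:])")

def parse_short_title(citation):
    """Return the short title embedded in a citation, or None if not found."""
    m = CITATION_RE.match(citation)
    return m.group("short").strip() if m else None

def classify(citation, journals):
    """Classify one citation against a dict of short title -> set of full titles."""
    short = parse_short_title(citation)
    if short is None:
        return ("no_short_title", None)
    fulls = journals.get(short, set())
    if not fulls:
        return ("no_match", None)
    if len(fulls) > 1:
        return ("ambiguous", None)
    # Exactly one full title: this citation would get an update statement.
    return ("update", next(iter(fulls)))

journals = {
    "Curr Med Chem Anticancer Agents":
        {"Current medicinal chemistry. Anti-cancer agents"},
    "Birth Defects Res A Clin Mol Teratol":
        {"Birth defects research. Part A, Clinical and molecular teratology"},
}

r1 = classify("Curr Med Chem Anticancer Agents. 2005 Nov;5(6):603-12.", journals)
r2 = classify("Birth Defects Res A Clin Mol Teratol 2004 Nov;70(11):889-91.", journals)
r3 = classify("Cancer Prev Res (Phila Pa). 2009 Oct;2(10):868-78. Epub 2009 Sep 29.", journals)
t4 = parse_short_title("J Am Pharm Assoc (2003). 2007 May-Jun;47(3):425-6, 428-30.")
print(r1, r2, r3, t4)
```

The first two citations resolve to a single full title, while the "Phila Pa" citation still finds nothing, which is the kind of case left among the unmatched citations.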
Statistics from the operation are:
Number of PubMed journals added to CiteMS: 4,725
Number of citations without titles before update: 13,909
Number of citation records updated: 13,559
Remaining citations without titles after update: 350
Note that some of the updated records were assigned titles that I
added to the database today, and some were assigned titles that
had already been added by users, but only after some
corresponding citations were imported into the database without
them. In effect, this went out and rounded up all the horses
that got out before we closed the barn door.
This should ameliorate the NOT list problem but, because we still
have 350 citations without titles, it won't fully resolve it.
Still, we have assigned titles to 97.5% of the citations that
were missing them, and we have made it more likely that new
citations coming in will be matched. So it looks like a big
improvement even if it's not total or permanent victory.
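For anyone checking the figures, the arithmetic behind the 350 and the 97.5% is simply the following (variable names are only for illustration):

```python
without_titles_before = 13_909   # citations without titles before update
updated = 13_559                 # citation records updated

remaining = without_titles_before - updated
share_fixed = 100 * updated / without_titles_before
print(remaining, round(share_fixed, 1))
```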
I'd like the users to poke around in the test database and try to
determine if anything is broken. If it looks good, give the word
and I will do the same things to the production database.
BZDATETIME::2011-01-10 22:26:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::71
Here is the list of updates that were performed to the citations, in case anyone wants to know exactly what was done.
Attachment CiteMS_Internal_MatchUpdates.sql has been added with description: Complete list of all updated citations, test database, 2011-01-10
BZDATETIME::2011-01-12 16:01:32
BZCOMMENTOR::Cynthia Boggess
BZCOMMENT::72
Minaxi and I are unable to login to the test database.
BZDATETIME::2011-01-12 16:16:27
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::73
(In reply to comment #72)
> Minaxi and I are unable to login to the test database.
It looks like the problem has to do with the refresh in OCECDR-3271. The test database is accepting the password for the live database instead of the password for the test database which is preceded by 'test'.
BZDATETIME::2011-01-12 16:31:03
BZCOMMENTOR::Alan Meyer
BZCOMMENT::74
(In reply to comment #73)
> (In reply to comment #72)
> > Minaxi and I are unable to login to the test database.
>
> It looks like the problem has to do with the refresh in OCECDR-3271.
> The test database is accepting the password for the live database
> instead of the password for the test database which is preceded by
> 'test'.
Oops, sorry.
Exactly right.
I forgot to run the script to prepend "test" to the passwords after the last restore of the test database.
It's fixed now.
BZDATETIME::2011-01-12 16:35:36
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::75
I have added the following titles to Adult NOT list:
Kathmandu Univ Med J (KUMJ)
JACC Cardiovasc Imaging
Jpn J Radiol
Med Res Rev
I tested them for retrieval as NOT journals for the Adult Board and the May 2010 review cycle, and it works fine.
BZDATETIME::2011-01-12 16:40:22
BZCOMMENTOR::Minaxi Trivedi
BZCOMMENT::76
Both Cynthia and I are able to log in to test database with test prefix to the password.
BZDATETIME::2011-01-14 14:20:53
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::77
We are happy with the current solution to the problem. Please proceed to update the live database.
BZDATETIME::2011-01-17 16:06:11
BZCOMMENTOR::Alan Meyer
BZCOMMENT::78
(In reply to comment #77)
> We are happy with the current solution to the problem. Please proceed
> to update the live database.
I'll do this tomorrow, sometime after 5pm so as not to interrupt anyone's work. I think it's probably safe for me to do even if people are working but, since I don't know the internals of the system very well, I'd rather not take a chance.
I'll post a message when it's done.
BZDATETIME::2011-01-18 21:01:52
BZCOMMENTOR::Alan Meyer
BZCOMMENT::79
I made a backup of the production database and then updated it,
inserting 4,777 titles from PubMed and updating 13,668 citations
to add titles to them.
Here are some statistics from the processing.
------------------------------------------
Title comparisons of PubMed / CiteMS:
Input counts:
Titles read from CiteMS = 18838
Titles read from PubMed = 23563
Output counts:
Full matches = 17616
Match on short title only = 548
Match on long title only = 252
No match on either title = 4777
Multi-matches = 370
------------------------------------------
Results of matching Cites without journals:
INPUTS:
Articles with no full title = 14019
Total journals in database = 23615
CITATION OUTPUTS:
No short title parsed = 0
No matching short title = 351 - CiteMS_Internal_NoMatch.txt
Multiple short title matches = 291 - CiteMS_Internal_MultiShort.txt
Multiple long title matches = 82 - CiteMS_Internal_MultiLong.txt
SQL updates generated = 13668 - CiteMS_Internal_MatchUpdates.sql
------------------------------------------
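As a sanity check on the title comparison counts above, the five output categories add up to the number of titles read from PubMed, and the CiteMS titles plus the inserted ones give the journal total used in the citation run. A small sketch (the dictionary keys are just labels):

```python
titles_read_from_citems = 18838
titles_read_from_pubmed = 23563

title_outcomes = {
    "full matches": 17616,
    "match on short title only": 548,
    "match on long title only": 252,
    "no match on either title": 4777,
    "multi-matches": 370,
}

# Every PubMed title appears to fall into exactly one outcome category.
outcome_total = sum(title_outcomes.values())

# Titles with no match on either title are the ones inserted into CiteMS.
journals_after_insert = (
    titles_read_from_citems + title_outcomes["no match on either title"]
)
print(outcome_total, journals_after_insert)
```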
The numbers are slightly larger than the last run on the test
database because:
PubMed has added some new journals since then.
CiteMS has added some new citations without journals since then.
I have attached the file of title updates from PubMed and will
attach the citation updates in the next posting.
Attachment InsertPubMedTitles.sql has been added with description: PubMed titles inserted into the production database
BZDATETIME::2011-01-18 21:04:24
BZCOMMENTOR::Alan Meyer
BZCOMMENT::80
Here are the updates made to the citation ("bib") table in the production database.
Attachment CiteMS_Internal_MatchUpdates.sql has been added with description: Complete list of all updated citations, production database, 2011-01-18
BZDATETIME::2011-01-25 10:16:17
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::81
There have not been any problems since the live database was updated so I am closing this issue. Thanks Alan!
Marking as Resolved
BZDATETIME::2011-01-25 10:16:54
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::82
Issue closed.