Issue Number | 3651 |
---|---|
Summary | External Refs Report |
Created | 2013-08-28 13:24:05 |
Issue Type | New Feature |
Submitted By | Osei-Poku, William (NIH/NCI) [C] |
Assigned To | alan |
Status | Closed |
Resolved | 2013-10-24 11:07:53 |
Resolution | Fixed |
Path | /home/bkline/backups/jira/ocecdr/issue.112536 |
We need a new CDR report to be able to quickly identify external refs
and changes to the page titles of the links. This will affect mostly
summaries and glossaries but we want to be able to use it for other
document types in the future. It will be similar to the URL check report
so let me know if it will be better to modify the URL report to show
only external refs.
Report details:
1. The report will be run by document type, audience and language.
2. The results should display the following info.
-CDR ID
-DOC TITLE
-URL
-CDR TITLE (From new attribute or element)*
-WEB PAGE TITLE (Indicate changes with different font color?)**
-CHANGES? (YES/NO) or just highlight changes in page title (WEB PAGE
TITLE) **
We will need to modify the external ref element to add a new attribute or element called Page Title (I will create a new ticket for the schema changes after we’ve discussed this report). We will copy the page title from the Cancer.gov source page into this attribute whenever we add a new external link. This can then be used to check for changes in the page title when the report is run. Then display any title changes in the report display. If an attribute cannot be used for this purpose then a new element is the obvious alternative.
If you can display changes in a different font color, there will be no need for the CHANGES? column.
3. FORMATS of Report: HTML and EXCEL
This request sounds very similar to an earlier request we discussed (OCECDR-3547), where we concluded that it would be very difficult to reliably identify the title of a web page.
In that issue (OCECDR-3547), we said that checking for title changes would be difficult because we were not storing the text of the page's html title tags in the CDR. In this issue, I have therefore proposed capturing that information in the new attribute or element. As stated above, we intend to copy the text from the title tags in the page source on Cancer.gov and store it in the new attribute/element so that you can use it to check for changes. If that is also not possible, then I guess a simple ad-hoc query identifying external refs will do.
Displaying the value you enter for the new attribute is easy. Detecting a change in what the user considers to be the title of an external web page (remember, we're not talking about the html/head/title element) faces the same obstacles identified in OCECDR-3547. It's not possible in the general case to reliably identify the portion of an HTML document which the user will perceive to be the title, as there are countless ways to give a string sufficient relative prominence (including the use of images with embedded text) that the user will mentally associate the string with the role of "title."
What I am proposing is that users will look at the html source page of the link on Cancer.gov, copy the text in the html/head/title element and store it in the new attribute so that the program can use that information to check for and report possible changes. Isn't it possible to match the text in the new attribute with the text in the html/head/title of the link for possible changes?
That's certainly possible, but my recollection of the discussions for OCECDR-3547 was that CIAT had decided that element wasn't a reliable source for the "title."
I've taken over this issue from Volker. He has more on his plate than I
do.
Reading over the issue, the comments, and the earlier issue OCECDR-3547,
I'm not certain we're all on the same page with the requirements.
Here's an example of a case where the title of a document is not
obtainable from the <html><head><title> element:
html title:
"Bladder Cancer Home Page - National Cancer Institute"
Title that appears to the user on the page display:
"Bladder Cancer"
The title that appears to the user comes from a deeply nested inner
element as follows:
<html>
  <body>
    <div>
      <div>
        <div>
          <div class="document-title-block">
            <h1>
Here's another case with a little different title variation:
html title:
"How to Find a Cancer Treatment Trial - National Cancer Institute"
Title that appears to the user:
"How to Find a Cancer Treatment Trial: A 10-Step Guide"
If we store the html title in a CDR document that has an xref url to
these documents, and have the program validate that they match, that
won't tell us if the displayed title string stored in the
document-title-block changed. If we store the display string in the CDR
document, the report program will report an error every time it is run.
I would guess that there are many more cases where the html title does
not exactly match the display title than there are cases where it does.
It might even be true in every single case.
Given that the <html><head><title> element is not displayed to the user
and might not be the best choice of text to put in the cdr:xref, would
it still be helpful to check if the html/head/title element has changed?
Incidentally, this report will be tricky to test in the CBIIT
environment because we can't test anything in DEV or QA that accesses
cancer.gov. For the same reason, the existing URL check report also
won't run on DEV or QA.
It won't be impossible to test this, but it won't be easy.
[I mentioned this to Volker and he said (passing on a comment from
Brett), "Yeah, that sucks."]
If we decide to proceed with this report anyway, I don't think we can
get it done in time for this coming CDR release.
As we decided at the status meeting today, I spoke to Bryan about this
issue. He wanted to better understand the use of the report. He
thought that, if he knew exactly what CIAT was trying to accomplish, he
might be able to find a better way to do it, possibly using some logic
in Percussion and/or Gatekeeper to do things that just can't be done in
the CDR.
It may be practical, for example, to have Percussion automatically
generate some kind of notice when something changes that CIAT wants to
be notified about.
That makes sense to me. I suggest we write up the requirements in a
more generic way, show them to Bryan and Blair, and then decide whether
this is best implemented in the CDR, in Percussion, or in some
combination of systems.
That means it won't be ready for this CDR release, but we may get a more
useful outcome by waiting.
Although this may not be a bad idea, since it would cover the majority
of the links, we shouldn't forget that the ExternalRef element isn't only
linking to Cancer.gov and we have no influence on those "external
ExternalRefs".
Given William's statement that CIAT would be very selective about adding
the new attribute, your proposed approach might not have any effect if
the attribute is only included for non-Cancer.gov links.
We create links to resources that are outside of the CDR, in Summaries
and Glossary terms, by copying the URLs of the web pages and storing
them in the CDR. At the same time, when creating the links in the CDR,
we also enter the web page display titles in the CDR. These stored
titles and URLs in the CDR need to be maintained as the pages and titles
change. Sometimes it becomes necessary to remove links (URLs) from the
CDR when they no longer exist or when they have been comprehensively
updated. This report is expected to help us easily identify CDR records
containing links that need to be removed or updated because the
underlying web pages have been removed or updated. The majority of the
links we create go to non-PDQ pages on Cancer.gov. Since we record and
store the web page display titles and URLs in the CDR, we want to be
able to know when the titles have been updated in the target pages so
that we can also update the recorded display titles in the CDR
accordingly.
Currently, there is no way to know if the title of a web page we've
linked to or recorded in the CDR has changed. We either have to be
informed by email or be lucky enough to come across it by chance.
We are hoping to run this report periodically to retrieve a list of
pages whose titles have changed so we can update the corresponding CDR
records.
Clearly, the straightforward approach would be to try to match the
display title on the web page with the title that is stored in the CDR
since they match one to one. But as we've noticed with some of the web
pages, it is impossible to extract text from images that are embedded in
the web pages. So, the option left to us is to try and find a match
between the display title and the title text in the <html><head><title>
tag. While researching this, we noticed that for most of the web pages,
the display title is included in the <html><head><title> text, so a
change in the display title will probably be accompanied by a matching
update to the <html><head><title> text. If this assumption is true, then
reporting the changes in the <html><head><title> in the selected or
target web pages should be sufficient. We do recognize that this is not
a perfect solution but it seems to be a better solution than what we are
currently doing.
Thanks for the analysis, William.
It looks like we've got two problems we'd like to solve:
1. Are any of the external links (cdr:xref attribute values) broken,
i.e. the linked-to web page does not exist?
2. Have any of the linked-to documents changed in such a way that the
content of the element with the xref needs to change? For example,
we might have an element with the content="Radiation therapy" but
that document was split up so that there are now separate documents
for separate types of radiation therapy and so the links need to be
updated and the text of the reference to the document changed.
For the first goal, we don't need to use stored titles at all. We can
just write a program that reads our query_term table to find xrefs,
check each one, and report any that timed out or returned error codes.
That's not a bad idea, though I'd want to check with cancer.gov to see
if they already use a program that does that and we just need them to
send us the reports that pertain to CDR documents. If they're not using
such a program it might be better to write or procure one for them in
order to cover all of the cancer.gov documents, not just the ones
originating in the CDR.
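As a rough illustration of the first goal (checking each link and reporting the ones that time out or return error codes), here is a minimal Python sketch. It is not the existing URL Check report. The function name `check_links`, the 30-second default timeout, and the injectable `opener` parameter are all assumptions made for this example, and the list of URLs is assumed to have been pulled from the query_term table separately.

```python
import urllib.error
import urllib.request


def check_links(urls, timeout=30, opener=urllib.request.urlopen):
    """Return a dict mapping each problem URL to a short description.

    `opener` defaults to urllib.request.urlopen; it is a parameter only
    so that the function can be exercised without network access.
    """
    problems = {}
    for url in urls:
        try:
            # urlopen raises HTTPError for 4xx/5xx responses.
            with opener(url, timeout=timeout):
                pass
        except urllib.error.HTTPError as exc:
            problems[url] = f"HTTP error {exc.code}"
        except OSError as exc:
            # URLError and timeouts are both subclasses of OSError.
            problems[url] = f"unreachable: {exc}"
    return problems
```

URLs that respond normally are simply left out of the result, so an empty dict means nothing needs attention.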
The second goal is harder to achieve. I think the approach suggested by
William won't find all of the problematic links and may find some that
are not problematic, but it might help and I can't think of anything
better to implement purely in the CDR. We'd want to study some sample
documents that we link to to see if other elements like meta tags, or
other heuristics like document size, are worth checking.
For links to articles on cancer.gov stored in Percussion, there might be
something better. From my memory of a conversation with Bryan he
suggested exploring the possibility of storing something in Percussion
that would help out. We could, for example:
a. Store an optional suggested title in Percussion, a change to which
would trigger a report to owners of documents that link to this
document.
b. Alternatively, we could store an optional date element in
Percussion that records the date on which a major change was made
that could affect documents that link to this document. A change
to the date could trigger a report.
These approaches would not help with external references outside
cancer.gov, and they would require cancer.gov work. However, they would
benefit other content providers in addition to the CDR.
If we do implement William's suggestion in the CDR, without bothering
cancer.gov, we have two ways to acquire the titles:
a. Have CIAT staff capture the titles of linked documents that the
staff thinks need to be checked and add them to the documents
containing the links.
b. Have a batch job that reads all of our xrefs (which can be done
just using our query_term table), and gather all of the html head
titles from the target links, reporting any that are new or
changed.
The second approach would cover more documents than the first and
require much less effort by CIAT. If we only need to check specific
links, we could do that by adding a generic attribute, e.g.,
checkTitle="Y" to each such link, without requiring CIAT to actually go
and check anything - again saving some labor.
I've added Bryan as a watcher on this issue to get his opinion about it.
Any thoughts?
We have an existing program that currently handles problem 1. It is called the URL check report. Here is the path to the report:
CIAT/OCCM Staff/Reports/General Reports/23.URL Check (Batch job - runs ~15 min)
I looked at a number of alternative ways to handle this but finally
decided that the best thing to do is to implement it as William
specified it. If it turns out not to be perfect, we'll change it
afterwards.
It will be a partial modification of the URL Check report. I'll either
use the same user interface and add some radio buttons to enable a check
of URLs or of page titles, or I'll just create a new, similar
interface that looks the same.
One trivial modification I propose to William's requested user interface
is a checkbox that says whether to include everything on the report, or
only include external reference page titles that do not match the title
stored in our documents.
I would also need a pair of radio buttons to specify either html or
Excel. However, I propose to implement the html output first, get it
working, and then add the Excel output later if we still want it. I think I can get
it working faster that way, and it might turn out that the mismatch-only
checkbox will make the report small enough in the most useful case that
Excel is superfluous.
The internal logic will reuse some of the logic that Bob used to find
web pages corresponding with URLs, but with different selection
criteria using a new index (see below for ExRefPageTitle) and different
actions to take when an externally referenced page is found.
I haven't found any evidence that PageTitle has been added to the
schemas, so I propose to add a new optional attribute "ExRefPageTitle"
to the ExternalRef element defined in CdrCommonBase.xml. I like the
"ExRef" prefix because I want to index it with a general rule (see
below) and not have any danger of conflicting with some future attribute
named "PageTitle".
The general indexing rule I would add to the query_term_def table is:
"//@ExRefPageTitle".
That will index every occurrence of this attribute in any document type
where it is added. We'll be able to run the report on new document
types or new elements within a document type without having to change
any indexing definitions.
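To show what a rule like "//@ExRefPageTitle" selects, here is a small Python sketch that collects every occurrence of a named attribute anywhere in a document, regardless of which element carries it. This is not how the CDR indexer works internally. The helper name `find_attribute_values` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET


def find_attribute_values(xml_text, attr_name):
    """Return (element tag, attribute value) pairs for every element
    in the document that carries the named attribute."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, elem.attrib[attr_name])
        for elem in root.iter()
        if attr_name in elem.attrib
    ]


# Invented sample document; real CDR documents are structured differently.
SAMPLE = """<Summary>
  <Para>See <ExternalRef ExRefPageTitle="Bladder Cancer">this page</ExternalRef>.</Para>
  <Note><ExternalRef>no stored title</ExternalRef></Note>
</Summary>"""

matches = find_attribute_values(SAMPLE, "ExRefPageTitle")
```

Because the match depends only on the attribute name, adding the attribute to a new element or document type needs no change to the rule.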
A user will start the report the same way the URL Check report is
started, entering the parameters and submitting the job.
When the report runs it will work (at a high level) as follows:
Select all of the ExRefPageTitle values that meet the user's
selection criteria.
For each one found:
    Fetch the corresponding external page.
    If it can't be retrieved:
        Add an error message to the report output.
    Else:
        Parse the page and extract the <html><head><title> content.
        Normalize the content:
            Replace each html tag inside the title with a single
            space. This eliminates the problem of super and
            subscripts, bold, or other tags inside the title that
            cannot be easily stored in an attribute, and are
            probably not relevant to what we are trying to do.
            Normalize each run of whitespace characters (spaces,
            tabs, newlines, carriage returns) to a single space.
            Trim away leading and trailing whitespace.
        Normalize the content of the ExRefPageTitle using the exact
        same algorithm.
        Perform a case insensitive comparison of the two strings.
        If the two strings differ:
            Add a row to the output reporting the difference.
        Else:
            If a full output is requested, i.e., not just
            differences:
                Add a row to the output reporting that there is no
                change.
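The normalization and comparison steps above could be sketched in Python roughly as follows. This is an illustration only, not the code that went into CdrLongReports. The function names are hypothetical, and extracting the title with a regular expression (rather than a full HTML parser) is a simplification chosen to keep the example short.

```python
import re

# First <title> element; DOTALL lets the content span several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>",
                      re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def extract_title(page_html):
    """Return the raw content of the first <title> element, or None."""
    match = TITLE_RE.search(page_html)
    return match.group(1) if match else None


def normalize_title(text):
    """Apply the three normalization steps from the outline."""
    text = TAG_RE.sub(" ", text)    # each embedded tag becomes one space
    text = SPACE_RE.sub(" ", text)  # each whitespace run becomes one space
    return text.strip()             # trim leading and trailing whitespace


def titles_match(stored_title, page_html):
    """Case-insensitive comparison of the stored title with the page's
    <title>; a page with no <title> counts as a mismatch."""
    found = extract_title(page_html)
    if found is None:
        return False
    return (normalize_title(stored_title).casefold()
            == normalize_title(found).casefold())
```

Both strings go through the same `normalize_title` call, so a stored title that differs from the page title only in case, spacing, or embedded markup is not reported as a change.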
I'm not sure how I'll test it yet. The DEV server has no access to
external web pages, even to cancer.gov. What I'll probably do is pick a
document and add or modify some cdr:xref elements to refer to a page in
the CDR itself.
That's my plan. Unless I hear otherwise, I'll start implementation some
time this afternoon.
I've completed a draft of the report and gotten a clean compile.
Tomorrow I plan to make the required schema change on DEV, create some test data, and start testing.
I made a small modification to the user interface for the URL Check report. The same user interface will launch either report, depending on the user selection of a radio button. The generation of the report itself is new code integrated into the CdrLongReports module and conforming to the interfaces and reporting conventions established there.
I have completed initial testing of the new report and it seems to
work. However, I have only a trivial amount of totally artificial
data with which to test.
The DEV server is completely isolated from the rest of the world, even
from other computers at NIH. That makes it impossible for me to
construct realistic test data that refers to pages on cancer.gov.
In lieu of that I made some modifications to the Summary document
CDR258032 as follows:
Changed /Summary/SummaryMetaData/SummaryURL/@cdr:xref
  from:
    http://cancer.gov/cancertopics/pdq/screening/prostate/Patient
  to:
    https://localhost/CdrAdmin.html
    SourceTitle = "Cdr Administration"
Changed /Summary/SummaryMetaData/MobileURL
  from:
    http://m.cancer.gov/topics/testing-screening/bycancer/prostate/patient
  to:
    https://localhost/CdrAdmin.html
    SourceTitle = "CDR Administration is right, but this ain't it"
The two references that would have gone to prostate cancer pages on
cancer.gov now point instead to the CDR Admin page on the DEV server
itself (https://localhost/CdrAdmin.html).
The title of that page is "CDR Administration".
I stored two SourceTitle attributes in the two elements:
One is correct except for having the case of some letters changed ("Cdr"
instead of "CDR"). This difference does not cause a mismatch in the
title comparisons.
The other element has additional characters that do cause a mismatch
in title comparisons.
The match outcome is reported using colors. A red title indicates a
mismatch between the stored title and the actual title found on the web
page. If the appropriate radio button is selected on the user interface,
the report only shows Errors and Mismatches, not successful matches.
Here is some more information about the program:
Errors, i.e., inability to connect to a server or to retrieve a web
page in a reasonable time (30 seconds), are reported in the place
where a title would normally be reported. Page retrieval errors are
in dark red. (This is not tested yet - I'll work on that.)
My reporting of retrieval errors is simpler than the reporting Bob
did in the URL Check report. I don't analyze why it didn't work to
the depth that he did. We already have his report for that and it
seemed like an unnecessary duplication to reproduce it here.
The program is designed to work with any of the document types that
can contain an ExternalRef type element, but it has only been tested
so far with this one Summary.
Audience is only checked if the document type is Summary. Language
is only checked if it's Summary or a GlossaryTerm type. If nothing
is selected for those, then all audiences and languages are
included.
The program only looks at the query_term table, not the
query_term_pub table. It is therefore only checking current working
documents not publishable versions of documents. That seemed like
what we would want but, if not, it's easy to change.
I had said in an earlier comment that I would add an
"ExRefPageTitle" attribute to the ExternalRef element type. However
I see Bob has already added "SourceTitle", so that's what I used.
It's possible that I can test this in a more realistic way from my
workstation instead of on DEV or PROD, but it will require some effort
to do that. There are many issues.
When we are back at work, I'll talk to CBIIT about it. If they open a
single hole in the firewall just to go out to cancer.gov it will make
testing much easier.
I don't know if this makes sense but would it be any easier for a "pseudo test" to run this on bastion-2 which has Internet but not DB access?
Volker wrote:
> I don't know if this makes sense but would it be any easier for a "pseudo test" to run this on
> bastion-2 which has Internet but not DB access?
That's an interesting idea, but I think the work involved would turn out to be greater than doing what I did.
I've modelled successful page title comparisons and several different types of failures using DEV alone,
so hopefully everything will work on PROD.
We'll find out.
I've tested what I can think to test and it looks like everything is working.
I've put the modified code into svn.
Testing on DEV is possible. All document types except Summary will report that no documents were found matching the input criteria. Testing with English language Patient Summaries will show the test data that I created.
Implementing the changes in QA and production requires the following steps:
1. Update the CdrCommonbase.xml schema to the latest version.
2. Update CheckUrls.py in the CGI directory to the latest version.
3. Update CdrLongReports.py in the Lib/Python directory to the latest version.
Is QA set up the same way as Dev? I was just wondering why we can't test on QA before putting it into production.
Unfortunately, QA is also unable to see cancer.gov.
When the craziness subsides and we're back at work I'll talk to CBIIT to see if we can open a hole in the firewall for either DEV, QA, or both to at least communicate with cancer.gov.
I'm not very knowledgeable about security issues, but I would think that opening a hole that allows programs on DEV or QA to initiate a connection to cancer.gov and receive responses from it would be reasonably secure. We don't need to enable cancer.gov or any other servers other than the bastion hosts to initiate a connection to DEV or QA.
> When the craziness subsides and we're back at work I'll talk to CBIIT to see if we can open a hole in the firewall for either DEV, QA, or both to at least communicate with cancer.gov.
I know that you can connect to www-qa.cancer.gov (or whatever that machine is called). I had them poke a hole in the firewall in order to access the CSS for publish preview.
> ... I had them poke a hole in the firewall in order to access the CSS for publish preview.
That's a good precedent. Maybe they'll do the same for us for this problem.
I don't think so. I wanted them to let me connect to Cancer.gov but only after intense negotiations did they finally agree to let me go to qa.cancer.gov. Maybe you're more persuasive than I am.
I consulted with Wenling Bao at CBIIT and she suggested using the DEV and QA versions of cancer.gov - which looks like an excellent idea.
What I therefore plan to do is modify the program so that if it's running on DEV it will change the URLs
from:
    http://cancer.gov/ ....
to:
    http://www-blue.dev.cancer.gov/ ...
and something similar for QA.
The data in the documents will still point to cancer.gov. The report will just modify where the references go when it is not running in production.
The test won't be 100% perfect. I could make an error in the URL rewrite that causes it to fail on PROD, but I think the required modification should be simple enough that it will be fairly easy to guard against that kind of bug.
I'll post when it's ready.
To make this work I'm going to put two new entries in cdrapphosts.rc:
CBIIT:DEV:CG:www-blue:dev.cancer.gov
CBIIT:QA:CG:www.qa.cancer.gov
I'll probably have to lengthen the timeouts too. The report will take a lot longer to run than in PROD.
Looks like the number of fields is mismatched between the two entries. Is "www-blue" supposed to be part of the last field?
Good catch.
Actually I think it may be the other way around, i.e., a colon, not a period, after www in the QA entry. The fields are org:tier:use:name:domain. So I really want:
CBIIT:DEV:CG:www-blue:dev.cancer.gov
CBIIT:QA:CG:www:qa.cancer.gov
"CG" is a new use, not currently included in our .rc file. I'll be using "CG" to stand for "cancer.gov".
The way I plan to implement the changes in the code, I won't need CG entries for PROD and maybe not for stage, if and when we get one of those. I haven't decided yet whether to make the entries anyway.
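The field-count mismatch discussed above is the kind of error a simple parser can catch. The Python sketch below assumes only the org:tier:use:name:domain layout described in this thread. The names `HostEntry`, `parse_host_entry`, and `full_host_name` are hypothetical, and joining name and domain with a period is an inference from the host names mentioned in this issue, not a documented rule.

```python
from collections import namedtuple

# Field order as described in the comment: org:tier:use:name:domain
HostEntry = namedtuple("HostEntry", "org tier use name domain")


def parse_host_entry(line):
    """Split one cdrapphosts.rc-style line into its five fields,
    rejecting lines with the wrong number of fields."""
    fields = line.strip().split(":")
    if len(fields) != 5:
        raise ValueError(
            f"expected 5 colon-separated fields, got {len(fields)}: {line!r}"
        )
    return HostEntry(*fields)


def full_host_name(entry):
    """Assumed convention: host name is '<name>.<domain>'."""
    return f"{entry.name}.{entry.domain}"
```

With this check, the original QA entry `CBIIT:QA:CG:www.qa.cancer.gov` raises a `ValueError` because it has only four fields, while the corrected entries parse cleanly.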
I think I've got things working on DEV using the new technique. When testing
on DEV or QA, any url pointing to www.cancer.gov or m.cancer.gov will be
transformed on the fly to point to one of the accessible sites:
www-blue.dev.cancer.gov
m-blue.dev.cancer.gov
www.qa.cancer.gov
m.qa.cancer.gov
The report displays the original urls from the url attribute in the
ExternalRef in the CDR document, not the transformed one. For example, if a
url has the value:
http://www.cancer.gov/cancertopics/types/bladder
That's what the report will show, and not the transformed url that was tested,
i.e.:
http://www-blue.dev.cancer.gov/cancertopics/types/bladder
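The on-the-fly transformation can be pictured with the following Python sketch. It is not the MutateCGUrl class in cdr.py. The function name `mutate_cg_url` and the hard-coded `TIER_HOSTS` table are assumptions for illustration (the real code reads host names from cdrapphosts.rc), and the pairing of each tier with its two hosts is inferred from the list of accessible sites above.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed mapping, inferred from the accessible sites listed above.
TIER_HOSTS = {
    "DEV": {
        "www.cancer.gov": "www-blue.dev.cancer.gov",
        "m.cancer.gov": "m-blue.dev.cancer.gov",
    },
    "QA": {
        "www.cancer.gov": "www.qa.cancer.gov",
        "m.cancer.gov": "m.qa.cancer.gov",
    },
}


def mutate_cg_url(url, tier):
    """Return the URL to fetch on the given tier.

    URLs are returned unchanged on tiers with no mapping (e.g. PROD)
    and for hosts other than www.cancer.gov and m.cancer.gov.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    replacement = TIER_HOSTS.get(tier, {}).get(host)
    if replacement is None:
        return url
    if parts.port is not None:
        replacement = f"{replacement}:{parts.port}"
    return urlunsplit(parts._replace(netloc=replacement))
```

The caller keeps the original URL for display and uses the returned value only for the fetch, which matches the behavior described above where the report shows the URL stored in the document.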
Bear in mind that data on DEV and QA is likely to be both volatile and out of
date, so testing may show errors that will not appear on PROD.
My testing is very limited. I modified two more Summaries to test links to
cancer.gov and mobile.cancer.gov in addition to the other links I had.
I have not modified the old URL Check report. It still won't work on DEV or
QA. However, since no changes were made to it, I presume no new testing is
required.
The latest notes on deploying the changes in QA and production are:
1. Update the CdrCommonbase.xml schema to the latest version.
2. Update CheckUrls.py in the CGI directory to the latest version.
3. Update CdrLongReports.py in the Lib/Python directory to the latest version.
4. Update cdr.py (in order to get the new MutateCGUrl class.)
5. For QA only, not needed in production:
Update d:\etc\cdrapphosts.rc to include the entries for CG and
CGMOBILE.
I ran the report a couple of times on DEV and each run completed successfully, but there was no output to review, which I believe is to be expected. We may have to wait a few weeks after the changes are put in production to see anything show up in the report with regard to the new changes.
All of the test data I created was with the following parameters
Doc Type: Summary
Language: English
Audience: Patients
If you try that one you should see 7 lines of output from 4 CDR documents for all titles.
I can see the test data now. It looks good to me.
Marked it as Resolved.
Verified on DEV.
I did some more testing of this and found that I was looking at the wrong field (language) instead of the right field (UseWith) for finding GlossaryTermConcept titles. I fixed that.
I also made a trivial change to the user interface. It had validation code that required a user to choose audience and language if the document type is Summary or GlossaryTermConcept. I removed that validation if the user is trying to run our new report. Audience is irrelevant to the new report and I am allowing a user to pick a language or not as he or she pleases. If no language is selected I'll include everything in the output - both English and Spanish.
This one is proving difficult to test. We need to wait for changes on Cancer.gov before we will be able to see the changes in the report, and we don't know when those changes will happen. I tried to test by entering the source title in the document and then modifying it to see if the mismatch would show, but it didn't work when I tried it. Do you have any ideas about how to test this? Or do we just have to wait until there is an update?
Let's talk about this at the status meeting today. I think what you did should work but I may be misunderstanding your explanation - or maybe I understand it fine but there is a bug in the code.
The problem was caused by my forgetting to add a new index term to the query term definitions on Prod for the SourceTitle attribute. The query that selects documents searches the query_term table for docs that meet the doctype and other criteria, and have a SourceTitle attribute. But there were none because I hadn't specified that attribute for indexing.
I did that and re-indexed Summaries and it now appears to work. It should also work for documents of all other types in which we create SourceTitle attributes because the query term definition applies to all doctypes, not just summaries. I only re-indexed Summaries on the assumption that no other documents have had SourceTitles inserted.
Yes. It worked for summaries but not GTCs. I actually added the source title to at least one of the GTCs so please re-index when you get the chance.
Alan did mention that he only re-indexed the summaries, so unless you had added the title to the GTC document today you wouldn't have gotten a result for any other document type.
I'm in the process of running the re-index for GTCs this morning. Please double-check that GTCs are working in a couple of minutes.
Sure I will check later to see if it works.
worked for GTCs. Thanks!
Sorry, I forgot that you said you had also modified some Glossary docs.