CDR Tickets

Issue Number 3547
Summary URL Check report not working
Created 2012-10-01 14:37:04
Issue Type Improvement
Submitted By Osei-Poku, William (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2012-12-13 11:57:26
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.107875
Description

BZISSUE::5244
BZDATETIME::2012-10-01 14:37:04
BZCREATOR::William Osei-Poku
BZASSIGNEE::Volker Englisch
BZQACONTACT::William Osei-Poku

The URL Check report under CIAT/OCCM Staff > Reports > General Reports is not working as expected. I tested this on Mahler and Bach and in both cases, it did not send out the email it's supposed to send after it has completed running.

Comment entered 2012-10-01 14:57:03 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-10-01 14:57:03
BZCOMMENTOR::Volker Englisch
BZCOMMENT::1

It appears that this report last ran successfully in Feb. 2009, when there were around 12,000 URLs to be checked. The next run was started in Oct. 2010, with 22,000 URLs to be checked.
Every run started since 2009 appears to have failed after checking around 13,000 URLs.

Comment entered 2012-10-04 18:09:09 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-10-04 18:09:09
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

Attached is a sample of the URL Check report (limited to 500 entries).

As discussed at our status meeting, please take a look at this and identify the changes you'd like to have implemented for the report and/or the user interface. We had talked about running this report by document type only. Currently, the report runs against all document types.

Comment entered 2012-10-04 18:09:09 by Englisch, Volker (NIH/NCI) [C]

Attachment UrlCheck-4456.html has been added with description: URL Check Report (Mahler)

Comment entered 2012-10-17 12:39:24 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-10-17 12:39:24
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::3

Question:
Could you modify the report to also check for changes in the titles of the pages? We normally copy the title of the page and enter it in the URL element in the CDR exactly as it appears on the page. The URL itself goes into the attribute inspector.

Here are the changes we would like to make:
1. Run the report by document type
2. For summaries and Glossaries, we want to be able to run it by the Audience and Language
3. Instead of displaying the results of the report in the browser, we want it to be emailed to the user.

Comment entered 2012-10-23 09:43:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-10-23 09:43:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

As discussed at last week's status meeting, the document title (as specified in the Title element of the Header block and displayed in the browser toolbar) could be checked. The title within the text cannot be checked automatically, because it could be displayed within H1, H2, P, or B tags or even be an image.
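
The document-title check described here can be sketched as follows. This is only an illustration, assuming the page's HTML has already been retrieved as a string; the names TitleParser, page_title, and title_changed are hypothetical and do not come from the report's code.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so line breaks do not cause mismatches.
        return " ".join("".join(self._chunks).split())


def page_title(html_text):
    """Return the normalized <title> text, or '' if the page has none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


def title_changed(stored_title, html_text):
    """True if the page's <title> differs from the title stored in the CDR."""
    return " ".join(stored_title.split()) != page_title(html_text)
```

Only the title element is compared; headings inside the page body are deliberately left out, for the reason given above.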

Comment entered 2012-11-14 19:42:45 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-14 19:42:45
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

(In reply to comment #3)
> 3. Instead of displaying the results of the report in the browser, we want it
> to be emailed to the user.

Are you saying instead of receiving the email with a link to the report you would like the content of the report included in the email body?

In what format would you like the report to be displayed? ASCII, CSV, HTML, other? I'm asking because I'm not sure that a tabular report within an email message will be more convenient than a link to the report.

Comment entered 2012-11-15 11:19:42 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-15 11:19:42
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::6

(In reply to comment #5)
> (In reply to comment #3)
> > 3. Instead of displaying the results of the report in the browser, we want it
> > to be emailed to the user.
>
> Are you saying instead of receiving the email with a link to the report you
> would like the content of the report included in the email body?
>

An email with the link to the report should be fine.

Comment entered 2012-11-26 14:55:32 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-26 14:55:32
BZCOMMENTOR::Volker Englisch
BZCOMMENT::7

The report has been modified as requested and is ready to be tested on MAHLER.

Comment entered 2012-11-26 17:41:55 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-26 17:41:55
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::8

(In reply to comment #7)
> The report has been modified as requested and is ready to be tested on MAHLER.

I tested this on Mahler and it is working pretty well and it is very fast also. Could you limit the doc types in the drop down menu to only the following?

1. Glossary Term Concept
2. Summary
3. InScopeProtocol
4. CTGovProtocol
5. Drug Information Summary
6. Person
7. Clinical Trials Search String
8. Citation
9. Miscellaneous Documents
10. Organization

However, you can let the All Types run against all document types. The list appears to be too long and there are some we know don't have URLs.

Also, in the user interface, could you make the default Language and Audience selections blank?

Comment entered 2012-11-27 09:37:36 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-27 09:37:36
BZCOMMENTOR::Volker Englisch
BZCOMMENT::9

(In reply to comment #8)
> I tested this on Mahler and it is working pretty well and it is very fast

Unfortunately, it is only this fast because I limited the SQL query to a few hundred documents for testing. I've removed that limitation now.

> However, you can let the All Types run against all document types.

Sorry, but the 'All Types' was a left-over from some pasted code. For this report there is no 'All Types' since we've decided to run this by single document type.

The additional changes are ready for review on MAHLER.

Comment entered 2012-11-27 11:15:22 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-27 11:15:22
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::10

(In reply to comment #9)
> (In reply to comment #8)

> The additional changes are ready for review on MAHLER.

The interface 'forces' me to choose the Language and Audience before it will run without an error even though some of the document types don't have Language and Audience attributes. Could you allow the program to run successfully for document types that don't have the attributes without requiring the Language and Audience?

Also, for those that need to use the Language and Audience attributes, a CGI error is displayed if they are not selected. Could you prompt users to make a selection instead of displaying the CGI script error?
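
The behavior requested in this comment can be sketched as a small validation step that runs before the report is started. This is a minimal sketch under assumptions: the set of document types that carry Language and Audience, the type names, and the function name validate_selection are all hypothetical.

```python
# Assumption: only these document types carry Language and Audience.
# The ticket names summaries and glossaries; the exact type names are guesses.
TYPES_WITH_LANGUAGE_AND_AUDIENCE = {"Summary", "GlossaryTermConcept"}


def validate_selection(doc_type, language="", audience=""):
    """Return messages to show the user; an empty list means the form is valid."""
    problems = []
    if not doc_type:
        problems.append("Please select a document type.")
    elif doc_type in TYPES_WITH_LANGUAGE_AND_AUDIENCE:
        # Language and Audience are required only where they apply.
        if not language:
            problems.append("Please select a language.")
        if not audience:
            problems.append("Please select an audience.")
    return problems
```

The form would be displayed again with these messages whenever the list is not empty, so the user sees a prompt rather than a script error.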

Comment entered 2012-11-27 14:34:08 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-27 14:34:08
BZCOMMENTOR::Volker Englisch
BZCOMMENT::11

The additional changes are ready for review on MAHLER.

Comment entered 2012-11-27 15:03:27 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-27 15:03:27
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::12

(In reply to comment #11)
> The additional changes are ready for review on MAHLER.

Thanks! It is working as expected. I will do a few more tests and have it promoted to Bach tomorrow.

Comment entered 2012-11-28 10:59:34 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-28 10:59:34
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::13

(In reply to comment #12)
> (In reply to comment #11)
> > The additional changes are ready for review on MAHLER.
>
> Thanks! It is working as expected. I will do a few more tests and have it
> promoted to Bach tomorrow.

Verified on Mahler. Please promote to Bach.

Comment entered 2012-11-28 11:51:03 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-11-28 11:51:03
BZCOMMENTOR::Volker Englisch
BZCOMMENT::14

The following programs have been copied to FRANCK and BACH:
CheckUrls.py - R10842
CdrLongReports.py - R10843

Please verify on BACH and close this bug.

Comment entered 2012-11-29 14:23:43 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-11-29 14:23:43
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::15

As reported in the meeting this afternoon, some of the URLs are coming up as inactive but when copied from the report page and pasted in the address bar, they come up as active. Here are a few of them:

62959 http://m.cancer.gov/topics/treatment/bycancer/rectal/Patient 404: Not Found MobileURL
62960 http://m.cancer.gov/topics/treatment/bycancer/esophageal/Patient 404: Not Found MobileURL
62961 http://m.cancer.gov/topics/treatment/bycancer/cervical/Patient 404: Not Found MobileURL
62962 http://m.cancer.gov/topics/treatment/bycancer/child-brain-stem-glioma/Patient 404: Not Found MobileURL

446198 http://www.cancer.gov/cam/clinicaltrials_intro.html 404: Not Found ExternalRef
446574 http://www.cancer.gov/cam/clinicaltrials_pdq.html 404: Not Found ExternalRef
446574 http://www.cancer.gov/cam/bestcase_intro.html 404: Not Found ExternalRef
446574 http://nccam.nih.gov/research/clinicaltrials 404: Not Found ExternalRef
446574 http://www.cancer.gov/cam/clinicaltrials_intro.html 404: Not Found ExternalRef
446574 http://nccam.nih.gov/health/decisions/consideringcam.htm 404: Not Found ExternalRef
446574 http://nccam.nih.gov/health/decisions/practitioner.htm 404: Not Found ExternalRef

Comment entered 2012-12-03 16:29:27 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-12-03 16:29:27
BZCOMMENTOR::Volker Englisch
BZCOMMENT::16

(In reply to comment #15)
> As reported in the meeting this afternoon, some of the URLs are coming up as
> inactive but when copied from the report page and pasted in the address bar,
> they come up as active.

I have made some minor changes, namely switching to a newer class of Python's httplib module. This seemed to resolve the problem for one of the mobile URLs I've used for testing.

Please rerun the report and let me know if this change is solving the problem.
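
The ticket does not show the code in CheckUrls.py. For illustration only, a status check of this kind could look like the sketch below. It is written against http.client, the Python 3 successor of the httplib module mentioned in this comment. The function names request_target and check_url are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def request_target(url):
    """Return (scheme, host, path_with_query) for an absolute http(s) URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path is requested as "/"
    if parts.query:
        path += "?" + parts.query
    return parts.scheme, parts.netloc, path


def check_url(url, timeout=10):
    """Send a HEAD request and return (status, reason), e.g. (404, 'Not Found').

    If the request itself fails, return (None, error message).
    """
    scheme, host, path = request_target(url)
    conn_class = (http.client.HTTPSConnection if scheme == "https"
                  else http.client.HTTPConnection)
    conn = conn_class(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.reason
    except (OSError, http.client.HTTPException) as exc:
        return None, str(exc)
    finally:
        conn.close()
```

A HEAD request asks only for the response headers, which is enough to read the status line without downloading the page.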

Comment entered 2012-12-05 10:56:25 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-12-05 10:56:25
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::17

(In reply to comment #16)

> Please rerun the report and let me know if this change is solving the problem.

It is a lot better. Nearly all the mobile URLs which are live are gone from the report. But there are still some live links that are being reported as "Bad Request". For example:

http://www.cancer.gov/cancertopics/druginfo/colorectalcancer#dal1 400: Bad Request

Also, I am not sure why the ones designated as "302:Found" are being reported as inactive. For example:

62955 http://www.cancer.gov/cancertopics/factsheet/therapy/sentinel-node-biopsy 302: Found

Comment entered 2012-12-05 11:50:30 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2012-12-05 11:50:30
BZCOMMENTOR::Volker Englisch
BZCOMMENT::18

Please keep in mind that the report doesn't just list pages that are inaccessible (status code 404, Not Found); it lists every page whose response is anything other than OK (status code 200).

302 Moved Temporarily
The page has been redirected, which is why the URL appears to be valid in the browser.
The URL you're following is
http://www.cancer.gov/cancertopics/factsheet/therapy/sentinel-node-biopsy
but the page presented is
http://www.cancer.gov/cancertopics/factsheet/detection/sentinel-node-biopsy
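
A browser follows the Location header of a redirect response automatically, while a checker that looks only at the first response reports the 302. As a rough sketch, the redirect target could be worked out as shown below so that it can be listed next to the status. The function name redirect_target is hypothetical, and the ticket does not say whether the report does this.

```python
from urllib.parse import urljoin

# Status codes that carry a Location header pointing to another URL.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(requested_url, status, location):
    """Return the absolute URL a redirect points to, or None if not a redirect."""
    if status not in REDIRECT_STATUSES or not location:
        return None
    # Location may be relative; resolve it against the requested URL.
    return urljoin(requested_url, location)
```

Called with the URL above, status 302, and a Location value of /cancertopics/factsheet/detection/sentinel-node-biopsy (an assumed value for illustration), the function returns the second URL shown above.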

400 Bad Request: The request cannot be fulfilled due to bad syntax.
It is possible that another client (a.k.a. browser) tries to retrieve this page and even succeeds, but that doesn't change the fact that there's a problem with the URL. In the case of the URL
http://www.cancer.gov/cancertopics/druginfo/colorectalcancer#dal1
the link target 'dal1' must be listed in the document as a unique ID attribute value but it exists multiple times. Your browser probably picks the first one which may or may not be the correct target.
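
Whether a link target such as 'dal1' is unique can be checked mechanically once the page has been retrieved. The following is a minimal sketch, assuming the page's HTML is available as a string. The names IdCounter and count_link_targets are hypothetical, and the ticket does not say whether CheckUrls.py performs such a check.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class IdCounter(HTMLParser):
    """Count the elements whose id attribute equals the wanted value."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Only id attributes are considered here, not <a name="..."> anchors.
        for name, value in attrs:
            if name == "id" and value == self.wanted_id:
                self.count += 1


def count_link_targets(url, html_text):
    """Return how many elements in html_text match the fragment of url."""
    _, fragment = urldefrag(url)
    if not fragment:
        return 0
    counter = IdCounter(fragment)
    counter.feed(html_text)
    counter.close()
    return counter.count
```

A result of 1 means the target is unique, 0 means it is missing, and anything greater than 1 means the target is ambiguous.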

Comment entered 2012-12-05 17:07:35 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-12-05 17:07:35
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::19

Thanks for the clarification. I will close this bug after we've done a few more reviews.

Comment entered 2012-12-13 11:57:26 by Osei-Poku, William (NIH/NCI) [C]

BZDATETIME::2012-12-13 11:57:26
BZCOMMENTOR::William Osei-Poku
BZCOMMENT::20

Verified on Bach. Thank You! Closing issue.

Attachments
File Name Posted User
UrlCheck-4456.html 2012-10-04 18:09:09 Englisch, Volker (NIH/NCI) [C]
