Issue Number | 4854 |
---|---|
Summary | Create Sitemap for www.cancer.gov Dictionaries |
Created | 2020-07-22 09:24:19 |
Issue Type | Task |
Submitted By | Kline, Bob (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | On Hold |
Resolved | |
Resolution | |
Path | /home/bkline/backups/jira/ocecdr/issue.270843 |
Given an input CSV file containing a CDRID, a key representing the glossary name or drug dictionary, and a language, the CDR needs to create a sitemap file following the https://www.sitemaps.org/protocol.html format. This sitemap needs to be uploaded to an Akamai NetStorage location (TBD). The sitemap should be updated at least weekly.
Info:
Current XML Sitemap on www.cancer.gov -- https://www.cancer.gov/sitemaps/dictionaries.xml
Dictionary keys (i.e. the second column in the CSV):
'term' – this maps to the 'Cancer.gov' glossary.
English URL Pattern - https://www.cancer.gov/publications/dictionaries/cancer-terms/def/<pretty_url>
Spanish URL Pattern - https://www.cancer.gov/espanol/publicaciones/diccionario/def/<pretty_url>
'genetic' – this maps to the 'genetic' glossary. (Note: the glossary may be called 'genetics'...)
English URL Pattern - https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/<pretty_url>
Spanish - Not yet released, unknown path.
'drug' – This maps to the drug dictionary/terminology.
English URL Pattern - https://www.cancer.gov/publications/dictionaries/cancer-drug/def/<pretty_url>
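The URL patterns above could be captured in a small lookup table. This is only a sketch: the `URL_PATTERNS` name, the `build_url` helper, and the `en`/`es` language codes are assumptions for illustration, since the ticket does not say how the language is encoded in the CSV.

```python
# Hypothetical lookup table keyed by (dictionary key, language code).
# The "en"/"es" codes are assumed; the ticket does not specify them.
URL_PATTERNS = {
    ("term", "en"): "https://www.cancer.gov/publications/dictionaries/cancer-terms/def/{}",
    ("term", "es"): "https://www.cancer.gov/espanol/publicaciones/diccionario/def/{}",
    ("genetic", "en"): "https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/{}",
    ("drug", "en"): "https://www.cancer.gov/publications/dictionaries/cancer-drug/def/{}",
}


def build_url(key, language, pretty_url):
    """Return the public URL for a term, or None if no pattern is defined."""
    pattern = URL_PATTERNS.get((key, language))
    if pattern is None:
        # For example, the Spanish genetics dictionary has no known path yet.
        return None
    return pattern.format(pretty_url)
```

Combinations without a released path (such as Spanish genetics) are simply left out of the table, so the helper returns `None` for them.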
Some Rules -
Pretty URL Generation -
For glossaries, this follows exactly the same rules used to generate the prettyUrlName for the glossary API.
For the Drug Dictionary, this follows exactly the same rules used to generate the prettyUrlName for the Drug Dictionary API.
What to do if a term in the input file does not exist?
Log an error (include the CDRID, Dictionary Key and Language) and exclude from the sitemap
What to do if a term in the input file does not have a pretty URL?
Log an error (include the CDRID, Dictionary Key and Language) and exclude from the sitemap?
What to do with the log file?
TBD
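The log-and-exclude rules above might be sketched as follows. This is not the CDR implementation: `build_sitemap` and the `lookup_url` callable are hypothetical names, the three-column row layout is assumed from the description, and the two failure cases (missing term, missing pretty URL) are collapsed into a single check for brevity. The `urlset`/`url`/`loc` elements and the namespace come from the sitemaps.org protocol.

```python
import logging
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
logger = logging.getLogger("sitemap")


def build_sitemap(rows, lookup_url):
    """Serialize a sitemap from (cdr_id, key, language) rows.

    lookup_url is a hypothetical callable which returns the public URL
    for a row, or None when the term or its pretty URL cannot be found.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for cdr_id, key, language in rows:
        url = lookup_url(cdr_id, key, language)
        if not url:
            # Log the CDRID, dictionary key, and language, then skip the row.
            logger.error("no URL for CDR%s (dictionary=%s, language=%s)",
                         cdr_id, key, language)
            continue
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Passing the lookup in as a callable keeps the sketch independent of how the CDR actually resolves pretty URLs.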
Sitemap and log attached for review. Of the 6,757 entries in the CSV file, fewer than 1% (66) could not be found in the pub_proc_cg table. Pretty URLs were able to be constructed for all of the other entries. Two pairs of the generated URLs were duplicates (see the end of the log file).
I have decided to use the standard approach of logging to the file system, rather than create a special database table for that purpose.
The scheduled job has been implemented and installed on DEV. I have generated and loaded the sitemap from that tier using data from the production CDR to static-resources-dev.ssh.upload.akamai.com. The shell I get when I ssh into that host is so broken that I can’t really do any analysis of the results in place, so I had the scheduled job leave the temporary XML file in the file system so I could look over the URLs on the CDR server. Five of the URLs fell back on the CDR ID.
https://www.cancer.gov/espanol/publicaciones/diccionario/def/368445
https://www.cancer.gov/espanol/publicaciones/diccionario/def/368447
https://www.cancer.gov/publications/dictionaries/cancer-drug/def/761606
https://www.cancer.gov/publications/dictionaries/cancer-drug/def/782063
https://www.cancer.gov/publications/dictionaries/cancer-drug/def/788964
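A check like the one below could be used to spot such fallbacks in a generated sitemap. It is a sketch under the assumption that a fallback URL is one whose final path segment consists only of digits; the `find_cdr_id_fallbacks` name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_cdr_id_fallbacks(sitemap_xml):
    """Return the sitemap URLs whose last path segment is all digits.

    Accepts bytes, or a str without an XML encoding declaration.
    """
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)
            if loc.text]
    # Assumption: a purely numeric final segment means the CDR ID was used.
    return [u for u in urls if re.fullmatch(r"\d+", u.rsplit("/", 1)[-1])]
```

A pretty URL that happens to be entirely numeric would be reported as a false positive, so the output is a list of candidates to review rather than a definitive result.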
On the dev Akamai server, the generated sitemap can be retrieved at https://www-dev-acsf.cancer.gov/sitemaps/dictionaries.xml.
This was closed without any explanation. I'm re-opening it until we get definitive instructions on what should happen with the scheduled job, which is running on DEV, but not in production. The most recent previous comment reported that it was running on DEV, so presumably this never got any testing.
File Name | Posted | User |
---|---|---|
sitemap.log | 2020-07-22 18:59:07 | Kline, Bob (NIH/NCI) [C] |
sitemap.xml | 2020-07-22 18:59:08 | Kline, Bob (NIH/NCI) [C] |