CDR Tickets

Issue Number 3613
Summary Modify FTP Processing Scripts
Created 2013-07-16 17:09:55
Issue Type Improvement
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2013-09-25 18:22:46
Resolution Fixed
Path /home/bkline/backups/jira/ocecdr/issue.110364
Description

In the new CBIIT environment all of the PDQ data processing is done on an NFS share under
/u/ftp/cdr
and I've noticed that the processing of the data is extremely slow. So, as a test, I ran one of our processing steps manually, once on the NFS share and once on a local directory under /home/cdroperator. The process, untarring a Terminology directory, took 1 sec on the local directory but almost 8 minutes (!) on the NFS share.
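
A minimal sketch of how such a comparison could be timed. This is not the command that was actually run for the test; the function name and the archive name in the usage comment are hypothetical, and only the two directories come from the ticket.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def time_untar(archive, target_dir):
    """Extract archive into a fresh subdirectory of target_dir.

    Returns the elapsed wall-clock time in seconds.
    """
    # A fresh subdirectory keeps repeated runs from overwriting each other.
    target = Path(tempfile.mkdtemp(dir=target_dir))
    start = time.perf_counter()
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    return time.perf_counter() - start


# Hypothetical usage ("Terminology.tar.gz" is an assumed file name):
#   for directory in ("/u/ftp/cdr", "/home/cdroperator"):
#       print(directory, time_untar("Terminology.tar.gz", directory))
```

Running the same archive against both directories isolates the file system as the only variable, which is what the manual test above did.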

With small adjustments to the location of some of the processing steps I should be able to significantly decrease the time it takes to process data for the FTP server.

Comment entered 2013-08-20 09:04:42 by Kline, Bob (NIH/NCI) [C]

I'm hoping you meant "... I should be able to significantly decrease the time it takes ...." :-)

Comment entered 2013-09-05 15:52:49 by Englisch, Volker (NIH/NCI) [C]

CTRP noticed that the Terminology file isn't being updated properly, and I can see from the log files that the Terminology update on the FTP server has been failing since Aug. 26th.
While going through the logs I've also seen failures on DEV and QA which need to be addressed. However, since testing on the FTP server is extremely slow due to the NFS performance bug, we will be better off changing the FTP processing scripts first and then testing, rather than the other way around.

I've modified the scripts on the CDR server

  • FtpExportData.py

  • FtpOtherData.py

Comment entered 2013-09-06 10:59:18 by Englisch, Volker (NIH/NCI) [C]

Modified and tested one step in the process (Linux server). This step is now running about twice as fast (4 minutes compared to 7.5 minutes):

  • R12008: extractOtherData.py

Comment entered 2013-09-06 17:58:19 by Englisch, Volker (NIH/NCI) [C]

I versioned the changes on the CDR server:

  • R12009: FtpExportData.py

  • R12009: FtpOtherData.py

The data files are now copied to the local directory
/usr/local/cdr/
instead of copying to the NFS share
/u/ftp/cdr
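
The general idea (do the write-heavy work on local disk and touch the share only once, with the finished result) can be sketched as follows. This is an illustration under assumptions, not the code in FtpExportData.py or FtpOtherData.py: the function name and its parameters are hypothetical, and the actual scripts may move data differently.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def process_locally_then_publish(archive, local_root, publish_dir):
    """Unpack on local disk, then copy the finished tree to publish_dir once."""
    local_root = Path(local_root)
    local_root.mkdir(parents=True, exist_ok=True)
    work = Path(tempfile.mkdtemp(dir=local_root))
    # mkdtemp creates the directory with mode 0700; relax it so the
    # published copy does not inherit owner-only permissions.
    work.chmod(0o755)
    try:
        with tarfile.open(archive) as tar:
            tar.extractall(work)  # the many small writes stay on local disk
        # Any further per-file processing would also happen here, locally.
        shutil.copytree(work, publish_dir, dirs_exist_ok=True)  # Python 3.8+
    finally:
        shutil.rmtree(work, ignore_errors=True)
    return Path(publish_dir)
```

With `local_root` under /usr/local/cdr/ and `publish_dir` under /u/ftp/cdr, only the final copy would cross the NFS mount.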

Comment entered 2013-09-06 18:27:49 by Englisch, Volker (NIH/NCI) [C]

The following programs have been updated on the Linux server:

  • R12010: copyData2Pub.py

  • R12010: countXmlFiles.py

  • R12010: createStatFiles.py

  • R12010: createTarFiles.py

  • R12010: extractExportData.py

  • R12010: getSummaryLanguage.py

Comment entered 2013-09-16 13:05:19 by Englisch, Volker (NIH/NCI) [C]

I'm not clear on what has happened but last weekend's licensee data processing finished in less than half the time. The processing which starts at 10am Sundays finished in 3h 18min instead of the previous 7h 17min.
I'm guessing that the changes the CBIIT storage team had put in place may be responsible for this performance increase.

Comment entered 2013-09-18 10:43:54 by Englisch, Volker (NIH/NCI) [C]

Yesterday's test on DEV ran in 1h 40min. It's still a good idea to go forward with this ticket as it's still twice as fast even after whatever changes CBIIT did. I will create the setup documentation while moving things to QA (Linux).

Comment entered 2013-09-25 14:49:33 by Englisch, Volker (NIH/NCI) [C]

I'm a bit stuck with this issue since the system ran out of disk space overnight and the Jira ticket I submitted this morning has not been picked up yet.

Comment entered 2013-09-25 18:22:46 by Englisch, Volker (NIH/NCI) [C]

Changes are complete. Running test job on QA tier.

Comment entered 2013-09-27 14:36:30 by Englisch, Volker (NIH/NCI) [C]

Everything has been set up on QA and the individual steps ran without problems. The final test will be this weekend's publishing job.

On the PROD FTP server the following steps will have to be performed:
As cdroperator, create the directories (on Linux):

$ mkdir -p /usr/local/cdr/test/pdq/incoming/full
$ mkdir -p /usr/local/cdr/test/pdq/incoming/partial
$ mkdir -p /usr/local/cdr/qa/pdq/incoming/full
$ mkdir -p /usr/local/cdr/qa/pdq/incoming/partial
$ mkdir -p /usr/local/cdr/qa/pdq/full
$ mkdir -p /usr/local/cdr/qa/pdq/partial
$ mkdir -p /usr/local/cdr/test/ncicb/incoming/full
$ mkdir -p /usr/local/cdr/test/ncicb/incoming/partial
$ mkdir -p /usr/local/cdr/qa/ncicb/incoming/full
$ mkdir -p /usr/local/cdr/qa/ncicb/incoming/partial
$ mkdir -p /usr/local/cdr/qa/pdq/full/pubtar

As cdroperator, create the files (on Linux):

$ touch /usr/local/cdr/qa/ncicb/incoming/full/sending
$ touch /usr/local/cdr/qa/ncicb/incoming/partial/sending
$ touch /usr/local/cdr/qa/pdq/incoming/full/sending
$ touch /usr/local/cdr/qa/pdq/incoming/partial/sending
$ touch /usr/local/cdr/test/pdq/incoming/full/sending
$ touch /usr/local/cdr/test/pdq/incoming/partial/sending
$ touch /usr/local/cdr/test/ncicb/incoming/full/sending
$ touch /usr/local/cdr/test/ncicb/incoming/partial/sending
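
The same setup could be scripted so that it is repeatable on each tier. The sketch below is an assumption-laden convenience, not part of the released scripts: `prepare_tree` is a hypothetical helper, and it assumes that every "incoming" directory, including qa/ncicb/incoming/partial, should receive an empty "sending" file.

```python
from pathlib import Path

# Directories from the list above, relative to /usr/local/cdr.
DIRECTORIES = [
    "test/pdq/incoming/full",
    "test/pdq/incoming/partial",
    "qa/pdq/incoming/full",
    "qa/pdq/incoming/partial",
    "qa/pdq/full",
    "qa/pdq/partial",
    "test/ncicb/incoming/full",
    "test/ncicb/incoming/partial",
    "qa/ncicb/incoming/full",
    "qa/ncicb/incoming/partial",
    "qa/pdq/full/pubtar",
]


def prepare_tree(base="/usr/local/cdr"):
    """Create the directory tree and the empty 'sending' files."""
    base = Path(base)
    for relative in DIRECTORIES:
        directory = base / relative
        directory.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        # Assumption: each "incoming" directory gets a "sending" file.
        if "incoming" in Path(relative).parts:
            (directory / "sending").touch()
```

Both `mkdir(..., exist_ok=True)` and `touch()` are safe to repeat, so the helper can be rerun without disturbing an existing tree.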

Comment entered 2013-09-30 19:00:47 by Englisch, Volker (NIH/NCI) [C]

The weekly publishing job ran successfully over the weekend on QA. It ran about 1 hour faster than production (DEV runs about 100 minutes faster than production).

Comment entered 2013-12-03 14:09:47 by Englisch, Volker (NIH/NCI) [C]

Comparing the FTP processing times before and after the release on PROD:
Old: 145 min total processing time
New: 85 min total processing time

The new process is running one hour (around 40%) faster than the old process.
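
For reference, the percentage follows from the two totals above. The helper name below is hypothetical and the snippet only restates the arithmetic.

```python
def percent_faster(old_minutes, new_minutes):
    """Reduction in processing time as a percentage of the old time."""
    return 100 * (old_minutes - new_minutes) / old_minutes


saved = 145 - 85                        # 60 minutes saved
reduction = percent_faster(145, 85)     # 60 / 145 of the old run time
print(saved, round(reduction))          # prints "60 41"
```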

Comment entered 2014-01-23 11:54:22 by Englisch, Volker (NIH/NCI) [C]

The createTarFiles.py program ran into an error when it was implemented due to missing documentation files. The process is currently running successfully but the location of the documentation files needs to be adjusted as these files are located in the pdq/docs directory and should be linked to the pdq/full directory. I will make this fix as part of the issue OCECDR-3689.
