CDR Tickets

Issue Number 3177
Summary Extract Data from PDQ-C Application
Created 2010-06-11 16:12:44
Issue Type Improvement
Submitted By Englisch, Volker (NIH/NCI) [C]
Assigned To Kline, Bob (NIH/NCI) [C]
Status Closed
Resolved 2011-12-22 09:28:54
Resolution Won't Fix
Path /home/bkline/backups/jira/ocecdr/issue.107505
Description

BZISSUE::4864
BZDATETIME::2010-06-11 16:12:44
BZCREATOR::Volker Englisch
BZASSIGNEE::Bob Kline
BZQACONTACT::Volker Englisch

In order to fulfill FOIA requests for data prior to the CDR/Gatekeeper, we have to run the old PDQ-C application - a DOS-based program - and extract the documents in question one-by-one.
In the foreseeable future we may not be able to run this application anymore (i.e. the developers' desktops are 64-bit machines that can only run the application in a virtual machine running Windows XP) and it had been discussed that we should try to extract the data and store it as individual documents that are independent of any particular application.

Comment entered 2010-06-14 10:18:17 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-06-14 10:18:17
BZCOMMENTOR::Bob Kline
BZCOMMENT::1

Have begun work on this task. I've run into a "version incompatibility" error trying to open the database files. Do we have the link library the original pdq/c programmer used?

Comment entered 2010-06-14 10:35:02 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-14 10:35:02
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2

Everything that I could find on the UNIX systems has been copied to the group drive:
L:\OCE_CROSS\FOIA

I'm not sure if this helps, though.

Comment entered 2010-06-23 16:19:44 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-06-23 16:19:44
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

I've made some progress on this task, but I've hit a brick wall. I corresponded with Faircom to see if we could get a replacement for the c-tree 4.3 package used to build PDQ/C. They said it would cost around $1000. So I figured out how to get the data out of the files myself. Unfortunately, what I have is not text, but binary information. I can't find any indication in the original PDQ/C developer's source code that he used compression or encryption, but none of the data files contain more than a tiny bit of random ASCII text. One of the three index files does contain plain ASCII key values, but those values are names of terms and organizations, not the text for protocols or summaries. As far as I know (and this appears to be borne out by the available c-tree documentation, as well as my own memory of working with c-tree in the past), c-tree itself doesn't do any mangling of the values stored in the data files. Unless Alan can find something I'm not seeing, I'm out of options short of contacting the original developer. I'm confident that with enough time I could reverse-engineer the abstruse code and figure out how to get the plain text protocols and summaries, but it would probably take quite a while.
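
Note: one way to probe the compression-or-encryption question is to measure, for each file, how many bytes fall inside printable ASCII runs and how close the byte distribution comes to uniform. The sketch below is only an illustration of that general technique, not anything taken from the PDQ/C source. The six-byte run threshold and the function names are assumptions made for the example. An entropy near 8 bits per byte would merely be consistent with compressed or encrypted data; it would not prove it.

```python
import math
import re
from collections import Counter

# Assumption for illustration: a "text run" is 6 or more consecutive
# printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def printable_fraction(data: bytes) -> float:
    """Return the fraction of bytes that lie inside printable ASCII runs."""
    if not data:
        return 0.0
    covered = sum(len(m.group()) for m in PRINTABLE_RUN.finditer(data))
    return covered / len(data)


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte values, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def summarize(path: str) -> dict:
    """Read a file in binary mode and report both measurements."""
    with open(path, "rb") as fh:
        data = fh.read()
    return {
        "size": len(data),
        "printable_fraction": printable_fraction(data),
        "entropy_bits_per_byte": shannon_entropy(data),
    }
```

Calling `summarize()` on each data and index file would give a rough side-by-side comparison. A file of plain ASCII keys should show a high printable fraction and comparatively low entropy.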

Comment entered 2010-06-23 17:37:40 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-23 17:37:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4

(In reply to comment #3)
> I've made some progress on this task, but I've hit a brick wall.

I heard the "thump" when you hit that wall and felt really bad. So I started to look around to see if I could find some additional information about the system. I've found an old binder with documentation from CTIS from around 2003 (although I believe CTIS just collected the information and it is in fact much older) that has a chapter about PDQ-C. I don't know if the document has the information you are looking for but I have the feeling it might help a bit either way.

I will put it on your desk so you can do some light reading before I come in tomorrow.

Comment entered 2010-06-23 17:39:04 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-23 17:39:04
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5

Adding Alan as a CC so he can read Comment #3.

Comment entered 2010-06-23 18:37:07 by Englisch, Volker (NIH/NCI) [C]

BZDATETIME::2010-06-23 18:37:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6

I found some more data and scripts on UNIXSAG under
/pdqc/pdqc99/can/dbfull

I don't suppose what's there will help much with getting the data out, since it's the directory where we loaded the data, but again: more information is better.

Comment entered 2010-06-24 10:40:13 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-06-24 10:40:13
BZCOMMENTOR::Bob Kline
BZCOMMENT::7

(In reply to comment #4)

> I will put it on your desk so you can do some light reading before I come in
> tomorrow.

Thanks. You don't have a version which has the figures (for things like record structures and file relationships), do you? Also, I noticed that this version appears to be somewhat out of date: there's a reference to a pdqdtl.dat file, which according to the manual contains "[b]locks of detail information for specific fields or subfiles," but that file isn't present on any of the CDs. I do see reference to data compression buffers in the document which don't appear to be present in the version of the source code I have, which makes me think that version doesn't match the binaries on the PC CDs.

Comment entered 2010-07-06 12:03:05 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-07-06 12:03:05
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

In order to proceed with the extraction of the PDQ-C data from the c-tree files it would be necessary to purchase replacement libraries from Faircom, for which the vendor has quoted a price close to $1,000. Other options would include:

  • continue extracting the data by hand for each request, using
    a virtual 32-bit machine for as long as feasible, and if that
    becomes impossible, inform FOIA requesters that the older data
    is no longer available

  • attempt to script PDQ-C to extract the data into logs, and then
    parse those logs into a format suitable for future retrieval,
    such as XML

  • hire someone to run PDQ-C by hand, generating logs for each
    month

I did some preliminary investigation into the second option, and I have come to the tentative conclusion that it would probably be more expensive than purchasing the Faircom libraries and building the software to extract the data from the c-tree files, largely because the number of options at each level of each menu varies unpredictably from month to month.

How would you like to proceed?

Comment entered 2010-07-06 12:05:14 by Grama, Lakshmi (NIH/NCI) [E]

BZDATETIME::2010-07-06 12:05:14
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::9

In terms of cost it is fairly minimal and so I would rather we purchased what is needed. You may need to work with Trisha Laut and Jonathan would need to approve. Would this need to go through Sapient?

Comment entered 2010-07-06 15:22:20 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-07-06 15:22:20
BZCOMMENTOR::Bob Kline
BZCOMMENT::10

(In reply to comment #9)
> In terms of cost it is fairly minimal and so I would rather we purchased what
> is needed. You may need to work with Trisha Laut and Jonathan would need to
> approve. Would this need to go through Sapient?

I spoke with Cheryl, who said that she doesn't have the c-tree libraries. I then spoke with David Campbell, the developer of PDQ/C. He said that there were two copies of the c-tree package, one given to him, and the other to CIPS. Since my inquiries with Cheryl and with John Rehmert haven't found anything, I'm assuming that the copy which CIPS had is gone. David said he'll dig back and see what he has, if anything, at this point. If that doesn't pan out I'll start the ball rolling with Ravee for acquiring a replacement set.

Comment entered 2010-07-12 15:26:39 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-07-12 15:26:39
BZCOMMENTOR::Bob Kline
BZCOMMENT::11

P.O. process initiated with Nikki Eiland.

Comment entered 2010-07-16 15:01:06 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-07-16 15:01:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::12

(In reply to comment #11)
> P.O. process initiated with Nikki Eiland.

Just found out that she disappeared (anyone know what happened?), leaving requests to fall on the floor. Resubmitted the request.

Comment entered 2010-08-17 09:05:06 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-08-17 09:05:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::13

The c-tree package has arrived; resumed work on the task.

Comment entered 2010-09-30 14:02:07 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2010-09-30 14:02:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::14

Lowering priority.

Comment entered 2011-12-22 09:28:34 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-12-22 09:28:34
BZCOMMENTOR::Bob Kline
BZCOMMENT::15

Lakshmi said we shouldn't be working on this.

Comment entered 2011-12-22 09:28:54 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2011-12-22 09:28:54
BZCOMMENTOR::Bob Kline
BZCOMMENT::16

Closing issue.
