Issue Number | 3177 |
---|---|
Summary | Extract Data from PDQ-C Application |
Created | 2010-06-11 16:12:44 |
Issue Type | Improvement |
Submitted By | Englisch, Volker (NIH/NCI) [C] |
Assigned To | Kline, Bob (NIH/NCI) [C] |
Status | Closed |
Resolved | 2011-12-22 09:28:54 |
Resolution | Won't Fix |
Path | /home/bkline/backups/jira/ocecdr/issue.107505 |
BZISSUE::4864
BZDATETIME::2010-06-11 16:12:44
BZCREATOR::Volker Englisch
BZASSIGNEE::Bob Kline
BZQACONTACT::Volker Englisch
In order to fulfill FOIA requests for data that predates the
CDR/Gatekeeper, we have to run the old PDQ-C application - a DOS-based
program - and extract the documents in question one by one.
In the foreseeable future we may no longer be able to run this application
(i.e. the developers' desktops are 64-bit machines that can only
run the application in a virtual machine running Windows XP), and it has
been discussed that we should try to extract the data and store it as
individual documents that are independent of any particular
application.
BZDATETIME::2010-06-14 10:18:17
BZCOMMENTOR::Bob Kline
BZCOMMENT::1
Have begun work on this task. I've run into a "version incompatibility" error trying to open the database files. Do we have the link library the original pdq/c programmer used?
BZDATETIME::2010-06-14 10:35:02
BZCOMMENTOR::Volker Englisch
BZCOMMENT::2
Everything that I could find on the UNIX systems has been copied to
the group drive:
L:\OCE_CROSS\FOIA
I'm not sure if this helps, though.
BZDATETIME::2010-06-23 16:19:44
BZCOMMENTOR::Bob Kline
BZCOMMENT::3
I've made some progress on this task, but I've hit a brick wall. I corresponded with Faircom to see if we could get a replacement for the c-tree 4.3 package used to build PDQ/C. They said it would cost around $1000. So I figured out how to get the data out of the files myself. Unfortunately, what I have is not text, but binary information. I can't find any indication in the original PDQ/C developer's source code that he used compression or encryption, but none of the data files contain more than a tiny bit of random ASCII text. One of the three index files does contain plain ASCII key values, but those values are names of terms and organizations, not the text for protocols or summaries. As far as I know (and this appears to be borne out by the available c-tree documentation, as well as my own memory of working with c-tree in the past) c-tree itself doesn't do any mangling of the values stored in the data files. Unless Alan can find something I'm not seeing, I'm stuck, short of contacting the original developer. I'm confident that with enough time I could reverse-engineer the abstruse code and figure out how to get the plain text protocols and summaries, but it would probably take quite a while.
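For reference, the kind of check described above (looking for readable text inside the binary data files) can be sketched in Python. This is only a minimal sketch, not the script actually used for this task: the function names and the four-byte minimum run length are assumptions chosen for illustration, and the sample bytes are made up.

```python
import re


def ascii_runs(data: bytes, min_len: int = 4) -> list[bytes]:
    """Return every run of at least min_len printable ASCII bytes."""
    # 0x20 (space) through 0x7e (tilde) is the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return pattern.findall(data)


def printable_fraction(data: bytes) -> float:
    """Return the share of bytes in data that are printable ASCII."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 0x20 <= b <= 0x7E)
    return printable / len(data)


if __name__ == "__main__":
    # Made-up sample bytes; a real check would read a data or index file
    # in binary mode, e.g. open(path, "rb").read().
    sample = b"\x00\x01ASPIRIN\x00\xff\x10NCI\x00Mayo Clinic\x02"
    for run in ascii_runs(sample):
        print(run.decode("ascii"))
    print(f"printable fraction: {printable_fraction(sample):.2f}")
```

A low printable fraction with only short, scattered runs is consistent with packed or compressed records, while long runs of names (as in the index file) point to plain stored key values.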
BZDATETIME::2010-06-23 17:37:40
BZCOMMENTOR::Volker Englisch
BZCOMMENT::4
(In reply to comment #3)
> I've made some progress on this task, but I've hit a brick wall.
I heard the "thump" when you hit that wall and felt really bad. So I started to look around to see if I could find some additional information about the system. I've found an old binder with documentation from CTIS from around 2003 (although I believe CTIS just collected the information and it is in fact much older) that has a chapter about PDQ-C. I don't know if the document has the information you are looking for, but I have the feeling it might help a bit either way.
I will put it on your desk so you can do some light reading before I come in tomorrow.
BZDATETIME::2010-06-23 17:39:04
BZCOMMENTOR::Volker Englisch
BZCOMMENT::5
Adding Alan as a CC so he can read Comment #3.
BZDATETIME::2010-06-23 18:37:07
BZCOMMENTOR::Volker Englisch
BZCOMMENT::6
I found some more data and scripts on UNIXSAG under
/pdqc/pdqc99/can/dbfull
I don't suppose what's there is helping much in terms of getting the data out since it's the directory where we loaded the data but again: more information is better.
BZDATETIME::2010-06-24 10:40:13
BZCOMMENTOR::Bob Kline
BZCOMMENT::7
(In reply to comment #4)
> I will put it on your desk so you can do some light reading before I come in tomorrow.
Thanks. You don't have a version which has the figures (for things like record structures and file relationships), do you? Also, I noticed that this version appears to be somewhat out of date: there's a reference to a pdqdtl.dat file, which according to the manual contains "[b]locks of detail information for specific fields or subfiles," but that file isn't present on any of the CDs. I do see references in the document to data compression buffers which don't appear to be present in the version of the source code I have, which makes me think that the source code doesn't match the binaries on the PC CDs.
BZDATETIME::2010-07-06 12:03:05
BZCOMMENTOR::Bob Kline
BZCOMMENT::8
In order to proceed with the extraction of the PDQ-C data from the c-tree files it would be necessary to purchase replacement libraries from Faircom, for which the vendor has quoted a price close to $1,000. Other options would include:
1. continue extracting the data by hand for each request, using a virtual 32-bit machine for as long as feasible, and if that becomes impossible, inform FOIA requesters that the older data is no longer available
2. attempt to script PDQ-C to extract the data into logs, and then parse those logs into a format suitable for future retrieval, such as XML
3. hire someone to run PDQ-C by hand, generating logs for each month
I did some preliminary investigation into the second option, and I have come to the tentative conclusion that it would probably be more expensive than purchasing the Faircom libraries and building the software to extract the data from the c-tree files, largely because of the unpredictability of the number of options for each level of each menu from month to month.
How would you like to proceed?
BZDATETIME::2010-07-06 12:05:14
BZCOMMENTOR::Lakshmi Grama
BZCOMMENT::9
In terms of cost it is fairly minimal and so I would rather we purchased what is needed. You may need to work with Trisha Laut and Jonathan would need to approve. Would this need to go through Sapient?
BZDATETIME::2010-07-06 15:22:20
BZCOMMENTOR::Bob Kline
BZCOMMENT::10
(In reply to comment #9)
> In terms of cost it is fairly minimal and so I would rather we purchased what is needed. You may need to work with Trisha Laut and Jonathan would need to approve. Would this need to go through Sapient?
I spoke with Cheryl, who said that she doesn't have the c-tree libraries. I then spoke with David Campbell, the developer of PDQ/C. He said that there were two copies of the c-tree package, one given to him, and the other to CIPS. Since my inquiries with Cheryl and with John Rehmert haven't found anything, I'm assuming that the copy which CIPS had is gone. David said he'll dig back and see what he has, if anything, at this point. If that doesn't pan out I'll start the ball rolling with Ravee for acquiring a replacement set.
BZDATETIME::2010-07-12 15:26:39
BZCOMMENTOR::Bob Kline
BZCOMMENT::11
P.O. process initiated with Nikki Eiland.
BZDATETIME::2010-07-16 15:01:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::12
(In reply to comment #11)
> P.O. process initiated with Nikki Eiland.
Just found out that she disappeared (anyone know what happened?), leaving requests to fall on the floor. Resubmitted the request.
BZDATETIME::2010-08-17 09:05:06
BZCOMMENTOR::Bob Kline
BZCOMMENT::13
The c-tree package has arrived; resumed work on the task.
BZDATETIME::2010-09-30 14:02:07
BZCOMMENTOR::Bob Kline
BZCOMMENT::14
Lowering priority.
BZDATETIME::2011-12-22 09:28:34
BZCOMMENTOR::Bob Kline
BZCOMMENT::15
Lakshmi said we shouldn't be working on this.
BZDATETIME::2011-12-22 09:28:54
BZCOMMENTOR::Bob Kline
BZCOMMENT::16
Closing issue.