CDR Tickets

Issue Number 2785
Summary Port of CDR Server to .NET (estimate and plan)
Created 2009-01-28 11:52:54
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To
Status Closed
Resolved 2014-03-14 14:12:43
Resolution Won't Fix
Path /home/bkline/backups/jira/ocecdr/issue.107113
Description

BZISSUE::4460
BZDATETIME::2009-01-28 11:52:54
BZCREATOR::Bob Kline
BZASSIGNEE::Alan Meyer
BZQACONTACT::Bob Kline

Reza has asked us to come up with a plan/estimate for porting the CDR Server to .NET and C#. I'll feed you my own estimate of how much time it would take me to do a simple port on my own (with no ramp-up time to learn a new language, since I've worked on other .NET projects), which you can fold into a more nuanced analysis that possibly considers other enhancements you feel we should implement as part of this effort.

Volker:

Would you please add "2.0" to the version dropdown, and reassign this issue to that version? Thanks.

Comment entered 2009-01-28 15:20:34 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-28 15:20:34
BZCOMMENTOR::Bob Kline
BZCOMMENT::1

Alan:

As promised, here's a first cut at my numbers. I believe I have over-estimated. I expect you will disagree.

Comment entered 2009-01-28 15:20:34 by Kline, Bob (NIH/NCI) [C]

Attachment cdr-server-modules.xls has been added with description: Estimate for conversion to .NET

Comment entered 2009-01-28 15:46:24 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-28 15:46:24
BZCOMMENTOR::Bob Kline
BZCOMMENT::2

I had omitted a module.

Comment entered 2009-01-28 15:46:24 by Kline, Bob (NIH/NCI) [C]

Attachment cdr-server-modules.xls has been added with description: Estimate for conversion to .NET

Comment entered 2009-01-29 16:57:05 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-29 16:57:05
BZCOMMENTOR::Bob Kline
BZCOMMENT::3

When Alan and I discussed the port he identified the custom validation processing as one of the trickier pieces to convert. I had promised to pick a chunk to implement to give myself a reality check on my estimates. The attached code reproduces the essentials of what we currently do in the CDR server with custom rule validation.

I threw in the code to populate the cdr-eid attributes, which Alan had to do with string manipulation in C++ in order to keep the performance within acceptable limits. In this implementation I was able to add the attributes by manipulating the objects representing the document, and the performance was more than acceptable. I wrapped that step in a loop of 1000 iterations (not retained in the posted code) so I could get an average over a statistically large enough sample of passes, and for the largest document in the system (CDR478976) I was able to populate the attributes in 17,163 elements in an average elapsed time of 14.969 milliseconds (.0000008 seconds per element). For a more representative document (CDR69400) the time to populate the attributes was under a millisecond for the entire document.
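
For reference, here's a minimal sketch of the object-model approach, using System.Xml's XmlDocument; the class name and the cdr-eid value format are illustrative, not lifted from the posted attachment:

    using System;
    using System.Xml;

    public class EidSketch
    {
        // Assign a sequential cdr-eid attribute to every element in
        // the document by walking the node tree (no string surgery).
        public static int PopulateEids(XmlDocument doc)
        {
            int nextId = 1;
            Walk(doc.DocumentElement, ref nextId);
            return nextId - 1;  // number of elements tagged
        }

        private static void Walk(XmlElement element, ref int nextId)
        {
            // Value format ("_1", "_2", ...) is illustrative only.
            element.SetAttribute("cdr-eid", "_" + nextId++);
            foreach (XmlNode child in element.ChildNodes)
            {
                XmlElement childElement = child as XmlElement;
                if (childElement != null)
                    Walk(childElement, ref nextId);
            }
        }
    }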

Alan:

Do you recall what the numbers were when you were wrestling with the performance for this task?

I did all my XML document building by constructing a tree, rather than by direct serialization, taking another hint Alan threw out on Tuesday. I don't see that this will be a big deal either from a performance point of view or for programmer productivity.

Comment entered 2009-01-29 16:57:05 by Kline, Bob (NIH/NCI) [C]

Attachment CustomValidationDemo.cs has been added with description: Proof of concept code for CDR custom rule validation

Comment entered 2009-01-29 22:36:01 by alan

BZDATETIME::2009-01-29 22:36:01
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I think the code I wrote was in the subsecond range, but I don't
recall the actual numbers. I may be able to find them. However
there's no doubt that 15 ms in the worst case is more than fast
enough.

Comment entered 2009-01-30 11:48:27 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-30 11:48:27
BZCOMMENTOR::Bob Kline
BZCOMMENT::5

We need to think carefully about schema support. We have identified giving the users full schema validation capability as the most clearly visible benefit of a port to .NET, and it's true that we could leave behind a lot of code in the current implementation and still get all the hard work of implementing full XML Schema validation provided out of the box. However, we have to decide how XMetaL determines what can go where in the documents. Off the top of my head, here are the approaches we might take:

(1) Install the schemas used in the CDR on the client machine and tell
XMetaL to use them. We'd need to figure out how that's done, including
making sure XMetaL can recursively find imported/included schemas
(this part shouldn't be hard). We'd need to do some testing to
make sure schema support in XMetaL actually works (I expect it does).
This seems to be the most attractive option, as it avoids some (but
not all, since we'll still have custom validation rules and link
validation) disconnect between what XMetaL thinks is valid and
what the CDR thinks is valid, and it avoids the problem of generating
DTDs from the schemas.

(2) Expand our software which currently generates the DTDs we provide
to XMetaL so that it's capable of dealing with anything that might
go into a schema which is allowed by the recommendation. Not a
trivial task, I'm guessing.

(3) Variation of (2), with some restrictions imposed on which features
of XML Schema are allowed in our schemas, in order to make the
task of enhancing the DTD generation software easier. I still
like (1) better.

(4) See if we can find a third-party tool to generate DTDs from
schemas.

Alan:

Please include testing, analysis, decision-making, and implementation for this aspect of the rewrite in what you come up with. My own inclination would be to jump on the task of determining if we can use option (1) above, because if we can, the issue goes away, and we end up with the best of all possible validation worlds (delta schema validation bugs in XMetaL that we decide we'll live with).

We still have to make some modifications to the client refresh software and what we put in the documents where we currently reference the DTD.

And we'd also have to figure out how to perform the on-the-fly tweaks we currently make to the DTDs to allow Insertion and Deletion elements and cdr-eid attributes, but do it for schemas instead. We might also find that we need to merge included schemas into a single one-per-document-type schema (I'd guess not, but it's possible).
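
For what it's worth, here's roughly what the "out of the box" validation looks like in .NET, using XmlReader with ValidationType.Schema (a sketch only; the paths and error handling are illustrative):

    using System;
    using System.Xml;
    using System.Xml.Schema;

    public class SchemaValidationSketch
    {
        // Validate one document against a schema set and print each
        // violation; returns the number of problems found.
        public static int Validate(string docPath, string schemaPath)
        {
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.ValidationType = ValidationType.Schema;
            settings.Schemas.Add(null, schemaPath);  // null: use the schema's targetNamespace
            int errors = 0;
            settings.ValidationEventHandler +=
                delegate(object sender, ValidationEventArgs e)
                {
                    errors++;
                    Console.WriteLine("{0}: {1}", e.Severity, e.Message);
                };
            using (XmlReader reader = XmlReader.Create(docPath, settings))
            {
                while (reader.Read())
                    ;  // validation fires as the document is read
            }
            return errors;
        }
    }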

Comment entered 2009-01-30 13:40:56 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-30 13:40:56
BZCOMMENTOR::Bob Kline
BZCOMMENT::6

We also need to make sure we can extract valid value lists from schemas which are more complex than we have supported in the past, because some of our admin pages (as well as our XMetaL DLL customizations) depend on this capability.
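
Here's a rough sketch of what that extraction might look like with the .NET schema object model; it collects xsd:enumeration facets keyed by simple type name, though the real code would need to key the lists by element path:

    using System.Collections.Generic;
    using System.Xml.Schema;

    public class ValidValueSketch
    {
        // Collect enumeration values for every named simple type
        // defined in a schema (and whatever it includes/imports).
        public static Dictionary<string, List<string>> GetValidValues(string schemaPath)
        {
            XmlSchemaSet schemas = new XmlSchemaSet();
            schemas.Add(null, schemaPath);  // null: use the schema's targetNamespace
            schemas.Compile();

            Dictionary<string, List<string>> values =
                new Dictionary<string, List<string>>();
            foreach (XmlSchemaType type in schemas.GlobalTypes.Values)
            {
                XmlSchemaSimpleType simpleType = type as XmlSchemaSimpleType;
                if (simpleType == null || simpleType.Name == null)
                    continue;
                XmlSchemaSimpleTypeRestriction restriction =
                    simpleType.Content as XmlSchemaSimpleTypeRestriction;
                if (restriction == null)
                    continue;
                List<string> list = new List<string>();
                foreach (XmlSchemaObject facet in restriction.Facets)
                {
                    XmlSchemaEnumerationFacet enumFacet =
                        facet as XmlSchemaEnumerationFacet;
                    if (enumFacet != null)
                        list.Add(enumFacet.Value);
                }
                if (list.Count > 0)
                    values[simpleType.Name] = list;
            }
            return values;
        }
    }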

Comment entered 2009-01-30 14:47:24 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-01-30 14:47:24
BZCOMMENTOR::Bob Kline
BZCOMMENT::7

(In reply to comment #5)

> (1) Install the schemas used in the CDR on the client machine and tell
> XMetaL to use them. We'd need to figure out how that's done, including
> making sure XMetaL can recursively find imported/included schemas
> (this part shouldn't be hard). We'd need to do some testing to
> make sure schema support in XMetaL actually works (I expect it does).
> This seems to be the most attractive option, as it avoids some (but
> not all, since we'll still have custom validation rules and link
> validation) disconnect between what XMetaL thinks is valid and
> what the CDR thinks is valid, and it avoids the problem of generating
> DTDs from the schemas.

I figured out how to get XMetaL to load a schema for a document (wasn't documented anywhere, but it wasn't hard to reverse-engineer what XMetaL itself was doing for a document newly created in the client), and verified that nested inclusion of schemas works. We don't even have to rename the schema files from --.xml to --.xsd; both work just fine. The product documentation is vague about how complete support for schemas is in XMetaL [1] but some preliminary testing appears to show that it's at least as good as we get from DTD validation, and we do get at least some of the additional capabilities provided by schemas. I've written to Derek (tech support at Just Systems) and asked him for any further elaboration he can provide here.

> And we'd also have to figure out how to perform the on-the-fly tweaks we
> currently make to the DTDs to allow Insertion and Deletion elements and
> cdr-eid attributes, but do it for schemas instead.

Closer inspection of the programmer documentation appears to indicate that, at least for the version we're using, it's not possible to do that on the fly from within XMetaL for schemas, so we'd need to modify the schema documents we install on the client machines themselves. I'm pretty sure that will be possible, and not too difficult (though I need to look at that note in the footnote below about restrictions on wildcard usage, which makes me a little nervous), but it's something we need to remember to include on the task list. Again, I've passed the question on to Derek, who may tell us that newer versions of XMetaL support what we need to do on the fly with macros, but I'm not holding my breath.

> We might also find that we need to merge included schemas into a single
> one-per-document-type schema (I'd guess not, but it's possible).

Not necessary; see above.

[1] From the help file:

Schemas

XMetaL Author supports schemas for XML documents. Schemas provide restrictions and allow you to edit structured documents without having to rely on a separate DTD. Schemas also expedite the exchange of information with common sets of rules, so that information is more quickly and easily passed between systems, people, client and servers, etc.

There are, however, certain limitations to using schemas with XMetaL Author:

  • Identity-constraint definitions are ignored.

  • The <redefine> tag is not supported.

  • The instance attributes xsi:nil and xsi:type are ignored, and cannot be
    edited in Normal or Tags On view.

  • Checking an XML schema (.xsd file) for errors is limited in XMetaL
    Author. It is recommended that you use third-party tools, such as the
    ones available from the W3C web-site.

Wildcard limitations

Wildcards are not fully supported in XMetaL:

  • xsd:anyAttribute is not supported

  • the xsd:any processContents=skip and the xsd:any processContents=lax
    process contents controls are not supported

  • xsd:any processContents=strict is supported, but the used elements
    must be imported using <xsd:import>.

Comment entered 2009-02-02 16:40:37 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-02-02 16:40:37
BZCOMMENTOR::Bob Kline
BZCOMMENT::8

Derek has confirmed that XMetaL (including versions later than the one we're running) does not have a schema equivalent for the ability to modify the DTD validation so that Insertion and Deletion elements are allowed anywhere in the document. So unless (a) we convince the users that they really don't need revision markup (and that Hell is going to freeze over); or (b) we convince the users to accept a significantly restricted use of Insertion and Deletion elements (in effect, they would have to provide us with a finite list of places where these elements would be allowed, without nesting; I would expect this to be even less successful than convincing them to live with restrictions on table complexity), we're going to have to stick with DTDs within XMetaL, using schema validation in the CDR.

I've done some experimenting with the .NET schema object model, and I believe we should be able to reproduce the code which generates DTDs for XMetaL. We might implement that part to handle just the schema features we're currently using and expand the support as we go along to pick up additional complexities as we need them.
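
As a rough illustration of a starting point, here's a sketch that emits a DTD element declaration from a compiled schema element; it handles only flat xsd:sequence content models, and everything else (choices, nested groups, mixed content) is left for the real implementation:

    using System.Text;
    using System.Xml.Schema;

    public class DtdSketch
    {
        // Emit <!ELEMENT ...> for one global element. Assumes the
        // element comes from an XmlSchemaSet that has been Compile()d.
        public static string ElementDecl(XmlSchemaElement element)
        {
            XmlSchemaComplexType type =
                element.ElementSchemaType as XmlSchemaComplexType;
            if (type == null)
                return "<!ELEMENT " + element.Name + " (#PCDATA)>";
            XmlSchemaSequence sequence =
                type.ContentTypeParticle as XmlSchemaSequence;
            if (sequence == null)
                return "<!ELEMENT " + element.Name + " ANY>";  // punt in this sketch
            StringBuilder model = new StringBuilder();
            foreach (XmlSchemaObject item in sequence.Items)
            {
                XmlSchemaElement child = item as XmlSchemaElement;
                if (child == null)
                    continue;  // nested groups not handled in this sketch
                if (model.Length > 0)
                    model.Append(", ");
                model.Append(child.Name).Append(Occurs(child));
            }
            return "<!ELEMENT " + element.Name + " (" + model + ")>";
        }

        // Map schema occurrence constraints to DTD occurrence indicators.
        private static string Occurs(XmlSchemaParticle particle)
        {
            if (particle.MinOccurs == 0)
                return particle.MaxOccurs > 1 ? "*" : "?";
            return particle.MaxOccurs > 1 ? "+" : "";
        }
    }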

Comment entered 2009-02-02 18:24:18 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-02-02 18:24:18
BZCOMMENTOR::Bob Kline
BZCOMMENT::9

Got another message from Derek. He now says that he will put in a feature request to support what's needed for the old-style revision markup in schemas (I say "old-style" because the vendor abandoned the approach they were recommending when we first launched in favor of a processing-instruction-based approach). He has no idea what the timetable would be (or, I assume, whether it would ever be implemented), so I assume we'll need to continue to generate DTDs for XMetaL in the short term.

Comment entered 2009-02-13 00:48:58 by alan

BZDATETIME::2009-02-13 00:48:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

(In reply to comment #3)
> Created an attachment (id=1619)
> Proof of concept code for CDR custom rule validation

I have read through the code in the proof of concept. My C#
experience is still limited but I think I was able to follow
it reasonably well.

> When Alan and I discussed the port he identified the custom validation
> processing as one of the trickier pieces to convert. I had promised to pick a
> chunk to implement to give myself a reality check on my estimates. The
> attached code reproduces the essentials of what we currently do in the CDR
> server with custom rule validation.

In some ways I like this implementation better than the C++
implementation. It is more readable, more straightforward,
and I like the use of XML node processing better than the
string processing that we have throughout the C++ program.

> I threw in the code to populate the cdr-eid attributes, which Alan had to do
> with string manipulation in C++ in order to keep the performance within
> acceptable limits. In this implementation I was able to add the attributes by
> manipulating the objects representing the document,

This is far superior to the approach I took. The XSLT code
I wrote was clean and simple, but the performance was
unacceptable. The C++ code was very fast, but not clean
or simple.

> ... and the performance was
> more than acceptable. I wrapped that step in a loop of 1000 iterations (not
> retained in the posted code) so I could get an average over a statistically
> large enough sample of passes, and for the largest document in the system
> (CDR478976) I was able to populate the attributes in 17,163 elements in an
> average elapsed time of 14.969 milliseconds (.0000008 seconds per element).
> For a more representative document (CDR69400) the time to populate the
> attributes was under a millisecond for the entire document.
>
> Alan:
>
> Do you recall what the numbers were when you were wrestling with the
> performance for this task?

I don't recall the numbers. I believe that the C++ version was
in the low millisecond range. It may have been substantially,
but not significantly, faster than the C#. Even if the difference
is substantial, e.g., a factor of 10, it's not significant because
15 milliseconds is only a tiny part of the time spent in validating
a document. Speeding that up cannot make a user-perceptible
difference.

> I did all my XML document building by constructing a tree, rather than by
> direct serialization, taking another hint Alan threw out on Tuesday. Don't see
> that this will be either a big deal from a performance point of view, nor for
> programmer productivity.

Excellent. That's clearly the best way to do it.

Comment entered 2009-02-13 01:28:04 by alan

BZDATETIME::2009-02-13 01:28:04
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

(In reply to comment #5)

> We need to think carefully about schema support.
> ...

Here are my current thoughts on this topic. They're not fixed in
stone. Some of what I say here takes into account things that
you posted in comments later than comment #5, but I wanted to
discuss the four options laid out in #5, so I've created a
response to that comment.

> (1) Install the schemas used in the CDR on the client machine and tell
> XMetaL to use them. ...

I can see the reasoning in calling this the most attractive
option, but I'm inclined to vote against it on the following
grounds:

  • Assuming that we can get the existing schema -> DTD
    software working in .NET, we aren't required to get XMetal
    to use schemas. The switch to XMetal schema support can be
    treated as a separate task later. The only saving in doing
    it now is that we don't have to port the translator.

If porting the translator is hard, then my take on this
point is wrong.

  • We have no current substitute for the Insertion / Deletion
    technology that is not supported with XMetal schema.

  • Full schema validation on the server side is required
    anyway and must be implemented whether or not XMetal has
    schema validation built in.

  • There may be editing advantages to using schema on XMetal
    (i.e., immediate support for stopping some kinds of errors
    right in the editor), but it's not clear how much of that
    would occur.

> (2) Expand our software which currently generates the DTDs we provide
> to XMetaL so that it's capable of dealing with anything that might
> go into a schema which is allowed by the recommendation. Not a
> trivial task, I'm guessing.

I'm against that one too on the grounds that it's unnecessary.
We have gotten along fine while staying away from DTD ambiguous
names in the past and I can't think of any actual practical
problem that avoiding such ambiguity poses for us.

> (3) Variation of (2), with some restrictions imposed on which features
> of XML Schema are allowed in our schemas, in order to make the
> task of enhancing the DTD generation software easier. I still
> like (1) better.

I appreciate the advantages of (1), but I lean towards (3).

The biggest issue for me currently is the Insertion / Deletion
problem. We have a lot of software and a lot of user capability
that depends on that. I wouldn't want to commit to abandoning it
as a requirement of moving the CdrServer to .NET. I'd rather
treat that as a separate task that we can work on when the rest
is done, and only if it turns out to have a favorable
cost/benefit ratio.

> (4) See if we can find a third-party tool to generate DTDs from
> schemas.

I think this is worth looking for. A Google search on "schema to
dtd" translation produces a number of interesting hits, for
example: http://norman.walsh.name/2005/01/18/hardcider.
One guy presents his design for a translator and another says he
once did the same thing in XSLT.

However, even if we had a translator that handles more cases, we
don't have a solution for Insertion and Deletion.

>
> Alan:
>
> Please include testing, analysis, decision-making, and
> implementation for this aspect of the rewrite in what you come
> up with. My own inclination would be to jump on the task of
> determining if we can use option (1) above, because if we can,
> the issue goes away, and we end up with the best of all
> possible validation worlds (delta schema validation bugs in
> XMetaL that we decide we'll live with).

If we had no custom schema assertion rules, and if XMetal handles
putting the cursor in the right place when an error occurs, and
we can make Insertion / Deletion work right, then yes, the issue
goes away. But that's a lot of ifs.

> We still have to make some modifications to the client refresh
> software and what we put in the documents where we currently
> reference the DTD.
>
> And we'd also have to figure out how to perform the on-the-fly
> tweaks we currently make to the DTDs to allow Insertion and
> Deletion elements and cdr-eid attributes, but do it for schemas
> instead. We might also find that we need to merge included
> schemas into a single one-per-document-type schema (I'd guess
> not, but it's possible).

Basically, what I'm arguing for here is that we not make the .NET
port depend on changing the way we use XMetal. Porting the
server is a pretty major task. Switching to schema in XMetal is
a pretty major task. I'd like the two tasks to be separate and
not depend on each other.

If Insertion / Deletion worked with schemas, we could convert to
using schemas in XMetal first, before we did the .NET port. Then
we wouldn't have to port the schema conversion software. But if
we can't do that, I'm inclined towards sticking with DTDs for
now.

Comment entered 2009-02-18 05:15:52 by alan

BZDATETIME::2009-02-18 05:15:52
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

Here are some preliminary thoughts on Bob's time estimates for a
.NET port of the CdrServer.

In order to get a better gauge of how hard it will be, I've been
working on porting the CdrDoc module, starting with the QueryTerm
maintenance software.

I didn't finish the query term generator. I'm spending lots and
lots of time reading the MSDN documentation to find the various
System classes, methods, and properties that I need, but I think
I've done enough to feel like I know what I'm doing in C#. I've
written some SQL stuff (following the ADO model that you used
Bob), some XML stuff, some generic collection stuff, and played
with lots of C# constructs.

If I were asked right now whether I'd prefer to write a new
server in C# or C++ I'd pick C# hands down. I'm fairly confident
that, even though I have hundreds of times as many hours writing
C++, I could do it at least twice as fast in C# as in C++. I
always felt that we could have written the CdrServer at least
twice as fast in Java and it looks to me like C# will be even
easier.

I'm inclined to think Bob's CdrDoc estimate is about right.
Assuming the other estimates are in the same ballpark, then I'm
leaning towards agreement with them.

One place however where I would recommend a lot more time is
testing. The spreadsheet shows 120 hours for that. I'd be
inclined to triple that. I'd rather ask for too much time and
finish early than ask for too little and have to go back for
more.

I can also still hear the voice of Mike Rubenstein reverberating
in my head insisting that, "It will take us three years to build
this system", when Bob and I thought it would take six, or maybe
conservatively, nine months.

I also remember Michael Arluk saying about Mike's estimate, "I
can't see how it could take that long, but on the other hand, I
know he's always right."

Comment entered 2009-02-20 00:55:09 by alan

BZDATETIME::2009-02-20 00:55:09
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

One of the decisions we have to make in a port is how to handle
namespaces.

There are lots of issues, but one I've encountered immediately in
the CdrDoc module is what to put in query_term paths.

I believe there may be hundreds of queries that assume the paths
use prefixes, not fully qualified names. I therefore propose
that we keep the prefixes in the queries and translate parsed
data from the fully qualified paths back to prefixed paths before
inserting entries in the query_term tables. That will enable all
of our existing queries to continue to work and will also make
queries easier to read.

I'm doing this in my code. It's easy to do and centralized at
one spot.
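
Here's a minimal sketch of the translation, assuming the parser
reports names in Clark notation ("{uri}local"); the namespace URI
in the table is a placeholder, not pulled from our schemas:

    using System.Collections.Generic;

    public class QueryTermPaths
    {
        // Namespace URI -> prefix table; the URI below is an assumed
        // placeholder for the real CDR namespace declaration.
        private static readonly Dictionary<string, string> Prefixes =
            new Dictionary<string, string>();

        static QueryTermPaths()
        {
            Prefixes.Add("cips.nci.nih.gov/cdr", "cdr");
        }

        // Translate one path step from Clark notation ("{uri}local")
        // back to the prefixed form ("cdr:local") stored in query_term.
        public static string ToPrefixed(string step)
        {
            if (!step.StartsWith("{"))
                return step;  // no namespace; leave it alone
            int close = step.IndexOf('}');
            string uri = step.Substring(1, close - 1);
            string local = step.Substring(close + 1);
            string prefix;
            if (Prefixes.TryGetValue(uri, out prefix))
                return prefix + ":" + local;
            return local;  // unknown namespace: fall back to local name
        }
    }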

Comment entered 2009-02-20 08:18:43 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-02-20 08:18:43
BZCOMMENTOR::Bob Kline
BZCOMMENT::14

(In reply to comment #13)

> I therefore propose that we keep the prefixes in the queries and
> translate parsed data from the fully qualified paths back to
> prefixed paths before inserting entries in the query_term tables.

Agreed.

Comment entered 2009-02-23 20:16:54 by alan

BZDATETIME::2009-02-23 20:16:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

Here are some more thoughts on the C# port.

One of the things I'd like to do with the port to C# is perform
an upfront review of a number of system wide issues.

The spreadsheet has some excellent ideas about this in providing
for time to determine the order of implementation, and an entire
week for looking at the issues related to DOM and serialization.

I'd like to add some more time for additional upfront analysis
including:

  • Review of namespace management.

  • Review of communications, especially switching to port 80 and
    consideration of using Microsoft tools for session and
    communication management (a sketch of one option follows
    this list).

  • Testing, including:
      - Development of a CdrServer test plan.
      - Development of a test plan for our existing filters using
        the new XSLT engine.
      - Possibly adding some unit tests, perhaps with NUnit (see
        the sketch at the end of this comment).
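
For the communications piece, one Microsoft-provided option would
be System.Net.HttpListener. The sketch below is only the shape of
the idea; the endpoint URL and response body are made up, not the
actual CDR protocol:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    public class CdrCommandListener
    {
        // Accept CDR client command sets over HTTP on port 80 in
        // place of the current custom socket protocol.
        public static void Run()
        {
            HttpListener listener = new HttpListener();
            listener.Prefixes.Add("http://+:80/cdr/");  // assumed endpoint
            listener.Start();
            while (true)
            {
                HttpListenerContext context = listener.GetContext();
                string commandSet;
                using (StreamReader reader =
                       new StreamReader(context.Request.InputStream))
                    commandSet = reader.ReadToEnd();
                // ... dispatch commandSet to the command processor ...
                byte[] response = Encoding.UTF8.GetBytes("<CdrResponseSet/>");
                context.Response.ContentType = "text/xml";
                context.Response.OutputStream.Write(response, 0, response.Length);
                context.Response.Close();
            }
        }
    }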

I'll try to come up with a third draft of the spreadsheet
tomorrow, reviewing the existing estimates and adding some time
for the above, and some additional time for testing at the end.
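
Here's a minimal NUnit sketch of the kind of unit test I have in
mind, exercising the illustrative QueryTermPaths helper sketched
under comment #13 above:

    using NUnit.Framework;

    [TestFixture]
    public class QueryTermPathTests
    {
        [Test]
        public void ClarkNotationTranslatesBackToPrefix()
        {
            Assert.AreEqual("cdr:ref",
                QueryTermPaths.ToPrefixed("{cips.nci.nih.gov/cdr}ref"));
        }

        [Test]
        public void StepsWithoutNamespacesPassThrough()
        {
            Assert.AreEqual("Title", QueryTermPaths.ToPrefixed("Title"));
        }
    }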

Comment entered 2009-02-24 15:55:49 by alan

BZDATETIME::2009-02-24 15:55:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::16

I have reviewed the estimates line by line. I've made very
few changes to the individual lines, but did add some time
for a bit of work at the beginning, and for additional
testing.

Comment entered 2009-02-24 15:55:49 by alan

Attachment cdr-server-modules-Draft3.xls has been added with description: Estimate for conversion to .NET

Comment entered 2009-03-05 15:27:51 by alan

BZDATETIME::2009-03-05 15:27:51
BZCOMMENTOR::Alan Meyer
BZCOMMENT::17

Bob and I reviewed the revisions and agreed on them.

I have sent the estimated level of effort spreadsheet to
Reza.

Comment entered 2009-10-27 22:26:07 by alan

BZDATETIME::2009-10-27 22:26:07
BZCOMMENTOR::Alan Meyer
BZCOMMENT::18

I propose to lower the priority of this task from P4 to P8
and do no more work until we are so instructed.

Comment entered 2009-10-29 21:45:58 by alan

BZDATETIME::2009-10-29 21:45:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::19

Lowering priority until we are directed to resume this task.

Comment entered 2014-03-14 13:48:00 by Englisch, Volker (NIH/NCI) [C]

This task is 5 years old. Can we assume that it will not be resurrected any time soon (if ever)? In that case we should just close it.

Comment entered 2014-03-14 14:01:04 by Englisch, Volker (NIH/NCI) [C]

Looks like nobody paid attention to my question, so I've added you two as watchers.

Attachments
File Name Posted User
cdr-server-modules.xls 2009-01-28 15:46:24
cdr-server-modules.xls 2009-01-28 15:20:34
cdr-server-modules-Draft3.xls 2009-02-24 15:55:49
CustomValidationDemo.cs 2009-01-29 16:57:05
