PDQ Issues

Issue Number	2788
Summary	Port of CDR Server to .NET (SQL Server usage)
Created	2009-01-30 20:18:54
Issue Type	Improvement
Submitted By	alan
Assigned To
Status	Closed
Resolved	2014-11-05 11:08:45
Resolution	Won't Fix
Path	/home/bkline/backups/jira/ocecdr/issue.107116

Description

BZISSUE::4463
BZDATETIME::2009-01-30 20:18:54
BZCREATOR::Alan Meyer
BZASSIGNEE::Alan Meyer
BZQACONTACT::Bob Kline

As part of the process of converting the CdrServer to C# and
.NET (see Issue #4460), we should consider using the XML related
features present in the recent versions of SQL Server.

These features were not in the version of SQL Server available to
us when we designed the CdrServer, and therefore all of the XML
awareness of the CdrServer is in our own application code. There
is none that is managed by SQL Server.

SQL Server 2005 and 2008 have added a number of XML aware
capabilities. We should examine these capabilities, determine
whether they are useful to the CDR, and consider whether to take
advantage of them in a re-implementation of the CdrServer.

Comment entered 2009-02-06 00:37:53 by alan

BZDATETIME::2009-02-06 00:37:53
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

I haven't scratched too deeply beneath the surface of SQL Server
2008 XML capabilities, but what I'm finding so far is that there
don't appear to be big advantages to us in the CDR in using those
capabilities.

A particularly instructive, short document can be found at:

"Understanding XML in SQL Server"

http://msdn.microsoft.com/en-us/library/bb522493.aspx

Here are some conclusions I draw, in part from that document:

1. The stored "xml" data type is a parsed InfoSet, not serial
XML.

Storage and retrieval requires parsing and subsequent
serialization. XML services depend on this InfoSet so,
obviously, malformed data cannot be stored in an xml column.

That may not be a big problem for us, though we currently
will store a document if it is malformed.

2. Schemas can play a significant role in stored xml.

If a schema is declared for an xml column, SQL Server will
use the information in the schema to type the data stored in
the indexes, leading to significantly faster searching for
some data (e.g., integers). This sounds like what we do with
the int_val column in the query_term table, but is not
dumb, simple, and automatic the way our approach is. On the
other hand it does offer more types of optimizations than
just integer indexing.

However, with a schema, the document will be automatically
validated by SQL Server. That doesn't work for us. We must
be able to store and index invalid data. I don't see any way
around that requirement.

3. Query performance is faster on relational columns than on XML
element content.

Microsoft recommends hoisting out high use query data from
the XML to store in relational columns - i.e., what we are
doing with the query_term table.

4. The "xml" data type allows software to update individual XML
elements and attributes using SQL.

This is an interesting capability, but our entire current
design is predicated on a totally different approach. We
always completely replace a document for any update, using
XSLT for almost all changes.

It was our intent that document specific update logic would
be simpler and safer to maintain in XSLT than in C++ (or C#)
code. It has turned out that the amount and complexity of
our XSLT is pretty extensive, but I'm not sure that it
wouldn't be more so in T-SQL code.

The issues here could probably only be resolved after
considerable experience with both approaches.

It appears that the key questions for us at this point are:

1. What is the cost of storing "xml" data types?

a. Additional space for documents and indexes.

b. Time to parse and index the data when we store or update a
document.

c. Time to serialize a document when we retrieve it.

d. Query time executing native XML typed queries as compared
to the SQL query style we currently use.

2. What additional functionality does searching provide?

We currently have the ability to index on absolute paths from
the root, and on paths specified only at the lowest level
(e.g., "//cdr:id"). This actually allows some very complex
queries. I can't think of any query we have been unable to
perform if we add the required paths to our query_term_def
table.

Are there any significant searches we'll be able to perform
that we cannot now perform?

3. What impact is there on software complexity?

I think this divides into two sub-questions:

a. If we were starting a new application from scratch, would
it be simpler, requiring significantly less software
development and/or maintenance, if we stored xml in "xml"
data typed columns?

The answer seems to me that it could be simpler, but I'm
not certain it would be. If we were able to place a
document in an xml column and allow SQL Server to do all
of the parsing, indexing, schema validation, and searching
using the xml query extensions, then the answer is
probably "Yes, that would be simpler than what we have
done".

However if that didn't work, if we were required to
implement our own query_term parsing for performance
reasons, as now, and turn off automatic SQL Server schema
validation, as I think we would have to do, then the
software complexity benefits are less clear. We might
wind up with much of the same software complexity we have
with our current design plus additional complexity in
maintaining some software that uses the SQL Server XML
capabilities.

b. Given that we already have our current design working,
and we spend very little on maintaining it, would there be
software complexity benefits in a re-design?

The answer to that one is clearer, and I think it's "No".
The reasons are as follows:

1) We would need to redesign deep internals of the CDR.
This is not just a C# / .NET port, but a major
refactoring.

2) We would need to do extensive proof-of-concept
implementation and testing before we would know what
native SQL Server xml capabilities would work for us.

It would be very dangerous to implement a significant
redesign without doing this.

3) Development would be expensive.

4) Testing would be expensive.

Testing a redesign would necessarily require much more
effort than testing a port.

I'm not ready by any means to declare this task done.

I'm certainly not ready to say that future applications should
avoid SQL Server built-in XML capability. The answer to my
question 3.a. above could well be "Yes". But I'm currently
leaning against redesigning the CdrServer to use SQL Server
native XML capability.

Comment entered 2009-10-27 22:28:22 by alan

BZDATETIME::2009-10-27 22:28:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

If the CDR continues for long in its current form, we might
want to upgrade to the latest SQL Server, perhaps waiting
until our next computer upgrade.

However, with respect to SQL Server 2008 and a port to
.NET, we have suspended work for a long time now and should
probably lower the priority to 8 until we are directed to work
on this again.

Comment entered 2009-10-29 21:46:40 by alan

BZDATETIME::2009-10-29 21:46:40
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

Lowering priority until we are directed to resume this task.

Comment entered 2014-03-14 13:47:28 by Englisch, Volker (NIH/NCI) [C]

This task is 5 years old. Can we assume that it will not be resurrected any time soon (if ever)? In that case we should just close it.

Comment entered 2014-03-14 14:01:55 by Englisch, Volker (NIH/NCI) [C]

Looks like nobody paid attention to my question, so I've added you two as a watcher.

Comment entered 2014-11-05 11:08:45 by Kline, Bob (NIH/NCI) [C]

This task is superseded by Dave Vismer's project to determine the future direction for the CDR.

Comment entered 2014-11-05 11:11:29 by Englisch, Volker (NIH/NCI) [C]

I would check with Reza first if it's OK to close this issue. :-)

Elapsed: 0:00:00.000485

CDR Tickets