CDR Tickets

Issue Number 2441
Summary Schema validation packages
Created 2008-01-08 19:27:43
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To alan
Status Closed
Resolved 2013-07-11 19:04:19
Resolution Won't Fix
Path /home/bkline/backups/jira/ocecdr/issue.106769
Description

BZISSUE::3818
BZDATETIME::2008-01-08 19:27:43
BZCREATOR::Bob Kline
BZASSIGNEE::Alan Meyer
BZQACONTACT::Lakshmi Grama

Please take a look at the available options for replacing our home-grown schema validation software with an off-the-shelf solution.

Comment entered 2008-01-11 00:50:54 by alan

BZDATETIME::2008-01-11 00:50:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

The first thing I did, before looking at any schema validation
packages, was to take a quick look at alternative methods of
validation.

Some alternatives to schema that have been proposed are:

Schematron.
TREX (now part of RELAX NG).
RELAX NG (REgular LAnguage for XML, Next Generation).
Examplotron.

I saw no indication that any of these are going anywhere. A
couple of them had some sort of attempted commercial
implementation, but it's not clear if these are actively marketed
or supported or that anyone is using them.

It doesn't look to me like there would be any benefit in pursuing
them. I'll confine my further investigation to schema validation
only.

Comment entered 2008-01-11 00:52:46 by alan

BZDATETIME::2008-01-11 00:52:46
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

The next thing I think I need to do is to develop a list of
requirements and desirable features in a schema validator. We
need to know what we're looking for before we can say that one
package is better than another.

Comment entered 2008-01-18 01:00:17 by alan

BZDATETIME::2008-01-18 01:00:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

The most promising off-the-shelf validator I have found so
far is sold by Intel. It looks superficially similar to the
validator in the Java SDK. It costs $5,000 per server plus
$1,250 per year. The cost doesn't seem justified to me unless
it has significant advantages over our home-grown program,
which remains to be seen.

I've downloaded an evaluation copy, good for 30 days, and will
try to work with it.

Bob suggested checking the Apache / Jakarta / Xerces projects
for validators - which I will do.

Comment entered 2008-01-23 00:23:17 by alan

BZDATETIME::2008-01-23 00:23:17
BZCOMMENTOR::Alan Meyer
BZCOMMENT::4

I spent a half hour or so today with Bryan Pizzillo looking at
the schema validator in C#. It looks flexible and powerful.
It does not appear to be based on the MSXML ActiveX control
distributed with Internet Explorer - which also has schema
validation built in.

It looks like there would be a steep learning and experimentation
curve to climb if we want to invoke the C# validator from our
server written in native C++.

To begin with, it's not easy to get information on how to do
this. I have pursued a number of leads on the net only to
discover that the "unmanaged C++" in the author's example was
still written to generate .NET common language runtime
"intermediate language", not native Windows executable code.

Getting past that, we'll have to master "Interop Pinvoke",
"marshaling" (serializing data for transmission between
programs; there are about 8 ways to do this), management of
object lifetimes, prevention of inappropriate garbage collection,
dealing with multi-threading in our own application with possible
resource pooling, and of course, avoiding the evils of double
thunking. My inner geek is reeling from all of the possibilities
that this presents.

We could take a shot at it, though if we really wanted to do this
we should reconsider rewriting the CDR server in C# - especially
since it would also allow us to eliminate separate packages for
XSLT processing, regular expressions, DOM and SAX parsing, and
maybe some other things. For better or worse, we could put all
our eggs in the Microsoft basket.

Comment entered 2008-02-14 15:17:15 by alan

BZDATETIME::2008-02-14 15:17:15
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

I looked at the problem of implementing a new schema validator. In
order to find out what was involved, and how good it would be, I thought
I might write a test program that uses a different validator in order to
see what the difficulties and benefits might be. The one I chose was
the open-source Xerces-C++ 2.8 XML parser/validator.

I spent a couple of hours looking at this package in order to understand
it. It turns out that even writing a test program will require some
effort. We'll have to write the program and modify some schemas to
conform to the W3C Schema recommendations that emerged after Bob wrote
our original schema validator, and which handle namespaces and
required/optional characteristics differently. So, while not extremely
difficult, there is some work involved in writing a test program.

I have suspended that effort. See the next comment for more reasons
why.

Comment entered 2008-02-14 15:35:48 by alan

BZDATETIME::2008-02-14 15:35:48
BZCOMMENTOR::Alan Meyer
BZCOMMENT::6

My original idea for program packaging of our schema error reporting was
to produce an error handler object that could be passed in to the schema
validator. I planned to modify our existing schema validation program
to accept such an error handler and, if we switched to a new schema
validator, to re-use it as-is.
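
To make the idea of a pluggable error handler concrete, here is a minimal
sketch in Python (not the CDR server's C++). Every name in it
(ValidationError, Validator, collect_errors, the "required children" rule
table) is hypothetical and invented for illustration; it does not reflect
the actual CDR validator or the interface of any off-the-shelf package.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationError:
    """One problem reported by the validator (hypothetical structure)."""
    path: str     # slash-separated location of the offending element
    message: str  # human-readable description of the problem


# An error handler is any callable that accepts a ValidationError.
ErrorHandler = Callable[[ValidationError], None]


class Validator:
    """Toy validator that only checks for required child elements."""

    def __init__(self, required: Dict[str, List[str]], handler: ErrorHandler):
        self.required = required  # parent tag -> required child tags
        self.handler = handler    # injected, so it can be swapped or re-used

    def validate(self, root: ET.Element) -> None:
        self._check(root, "/" + root.tag)

    def _check(self, elem: ET.Element, path: str) -> None:
        present = {child.tag for child in elem}
        for tag in self.required.get(elem.tag, []):
            if tag not in present:
                self.handler(
                    ValidationError(path, f"missing required element <{tag}>")
                )
        for child in elem:
            self._check(child, f"{path}/{child.tag}")


def collect_errors(xml_text: str, required: Dict[str, List[str]]):
    """Validate xml_text and return the errors gathered by a list handler."""
    errors: List[ValidationError] = []
    Validator(required, errors.append).validate(ET.fromstring(xml_text))
    return errors
```

Because the validator only ever calls the handler, the same handler could in
principle be handed to a different validator, provided that validator can
supply the information the handler expects.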

Deeper investigation shows that this isn't as good an idea as I
originally thought.

In the first place, our error handling plan will require information to
be communicated from the schema validator to the error handler for which
no provision is made in the default interfaces for off-the-shelf schema
validator software. So it's not clear that the plan we have now will be
that useful later on.

In the second place, the current schema validation software does not
have a validation object into which we can pass an error handler.
Modifying it to contain our validation in such an object requires more
changes than we would like - both in terms of the time involved and in
terms of the reliability risk involved in modifying working code. This
risk would be incurred for no particular benefit except to make our
software more compatible with a package that we might never use.

Finally, it now appears to me that the server side software needed for
better error reporting is not trivial, but it does not require a major
effort. The effort required is small enough that it no longer looks
appealing to add significant extra work to it to make it re-usable. If
we switch to another validator and throw away the new changes we plan to
implement in our current validator, the loss will not be that great.
Furthermore, the concepts embedded in our new software (attaching an
error ID attribute to every element) may not translate well to a new
validator, which may not be compatible with that approach, so the
software may not be re-usable anyway.
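
As a rough illustration of what "attaching an error ID attribute to every
element" could look like, here is a minimal Python sketch. The attribute
name "eid", the "_N" ID format, and both function names are assumptions
made for this example, not the names used in the CDR software.

```python
import xml.etree.ElementTree as ET

EID_ATTR = "eid"  # hypothetical attribute name, chosen for illustration


def add_error_ids(root: ET.Element) -> int:
    """Give every element a unique ID attribute, in document order.

    Returns the number of elements that were tagged.
    """
    count = 0
    for count, elem in enumerate(root.iter(), start=1):
        elem.set(EID_ATTR, f"_{count}")
    return count


def find_by_error_id(root: ET.Element, eid: str):
    """Return the element carrying the given error ID, or None."""
    for elem in root.iter():
        if elem.get(EID_ATTR) == eid:
            return elem
    return None
```

An error report can then carry just the ID, and a client can use it to find
the offending element. A validator that cannot be told to ignore or pass
through the extra attribute would reject the tagged document, which is the
compatibility concern raised above.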

For all those reasons I have removed the block that said issue #3637
depended on this one.

Comment entered 2008-04-10 14:17:46 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-04-10 14:17:46
BZCOMMENTOR::Bob Kline
BZCOMMENT::7

Lowering priority at Alan's suggestion.

Comment entered 2013-07-11 19:04:19 by Englisch, Volker (NIH/NCI) [C]

As discussed at the weekly status meeting, this issue won't be addressed in the foreseeable future.
