CDR Tickets

Issue Number 2676
Summary Create classes for person and organization contact information
Created 2008-10-21 16:17:21
Issue Type Improvement
Submitted By Kline, Bob (NIH/NCI) [C]
Assigned To Englisch, Volker (NIH/NCI) [C]
Status Closed
Resolved 2014-03-14 14:12:26
Resolution Won't Fix
Path /home/bkline/backups/jira/ocecdr/issue.107004
Description

BZISSUE::4339
BZDATETIME::2008-10-21 16:17:21
BZCREATOR::Bob Kline
BZASSIGNEE::Alan Meyer
BZQACONTACT::Bob Kline

We need Python classes which extract contact information from Person and Organization documents using the lxml module. We have existing classes (see the cdrdocobject module) which use XSL/T filters, but these are not as flexible or efficient as we need. If it makes sense to re-implement the existing classes, replacing the filtering scripts with direct parsing, but preserving the existing functionality and interface, then go ahead and do it that way. You might find it preferable to keep the existing classes as they are for compatibility with existing reports that depend on them, but use a different API for the new classes. One nice feature (whichever way you do it) would be to have the constructors take flags (possibly in the form of dictionary parameters) to control which elements are needed in the objects. For example, if all that's needed are the name, email address, and country, it would be much more efficient to skip all of the other address parsing. We also discussed Person and Organization classes which could be used to collect all of the contact blocks for an individual or an org, not just one of them.
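The flag idea above could look roughly like the following sketch. The class name `ContactInfo`, the `fields` parameter, and the element names in the sample document are all hypothetical; none of them come from cdrdocobject or the real Person schema. The standard library's ElementTree stands in for lxml so the example runs without extra packages.

```python
import xml.etree.ElementTree as etree

# Hypothetical, simplified Person document (not the real CDR schema).
SAMPLE = """
<Person>
  <Name>Jane Doe</Name>
  <Contact>
    <Street>1 Main St</Street>
    <City>Springfield</City>
    <Country>U.S.A.</Country>
    <Email>jane@example.org</Email>
  </Contact>
</Person>
"""

class ContactInfo:
    """Extract only the requested pieces of contact information."""

    # Logical field name -> path relative to the document root (assumed paths).
    PATHS = {
        "name": "Name",
        "street": "Contact/Street",
        "city": "Contact/City",
        "country": "Contact/Country",
        "email": "Contact/Email",
    }

    def __init__(self, xml, fields=None):
        root = etree.fromstring(xml)
        # With no flags, collect everything; otherwise only what was asked for.
        wanted = self.PATHS if fields is None else fields
        self.values = {}
        for field in wanted:
            self.values[field] = root.findtext(self.PATHS[field])

contact = ContactInfo(SAMPLE, fields=("name", "email", "country"))
print(contact.values)
```

In this sketch the document is still parsed in full; only the per-field lookups are skipped. Whether that saves enough to matter is the kind of question the comments below try to measure.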

Comment entered 2008-11-04 11:08:25 by alan

BZDATETIME::2008-11-04 11:08:25
BZCOMMENTOR::Alan Meyer
BZCOMMENT::1

I'll get to this when higher priority tasks are done.

Comment entered 2008-12-16 23:13:06 by alan

BZDATETIME::2008-12-16 23:13:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::2

Bob,

I've been thinking about this issue and reading existing code.
I've got some questions:

1. Are there specific performance issues that I should think
about?

Can you give me some specific cases where a speed up is
desirable?

2. Is there a specific program that is an ideal test case for
optimization?

Is there a program that you recommend I run with the old and
new modules to test whether there is a useful speedup and how
large it is?

3. Is there specific functionality you'd like to have that's not
currently present in the existing cdrdocobject module?

4. Is there anything in the existing cdrdocobject interfaces that
seem objectionable?

I'm leaning towards a straight conversion of the cdrdocobject to
use a different underlying method for extracting data from the
Person and Organization documents, rather than a new interface.
Unless the answers to questions 3 and/or 4 indicate that there's
something bad or suboptimal about the existing interface,
re-implementing that interface will give us some advantages,
including:

Existing code can use the new module without change.

Performance comparisons are easy to make and should be very
accurate.

The design and implementation of a new module will be faster
because there is less analysis and fewer design decisions to
make.

On a separate matter, I've done a little research on the net to
learn about lxml performance. I came across a set of benchmarks
of lxml, ElementTree, and cElementTree.

ElementTree and cElementTree are implementations of the Element
Tree interface that are distributed as part of Python and don't
require separate installation. They can be imported as:

xml.etree.ElementTree
or
xml.etree.cElementTree

The benchmark page is at:

http://codespeak.net/lxml/performance.html.

It turns out that, if I understand the benchmarks correctly,
cElementTree is equal to or faster than lxml for parsing, but
slower for serialization - something we don't do a lot of. Here
are some quotes from the document:

"For parsing, on the other hand, the advantage is clearly with
cElementTree. The (c)ET libraries use a very thin layer on
top of the expat parser, which is known to be extremely fast"

...

"While about as fast for smaller documents, the expat parser
allows cET to be up to 2 times faster than lxml on plain
parser performance for large input documents"

...

"For applications that require a high parser throughput of
large files, and that do little to no serialization, cET is
the best choice. Also for iterparse applications that extract
small amounts of data from large XML data sets that do not
fit into the memory. If it comes to round-trip performance,
however, lxml tends to be multiple times faster in total. So,
whenever the input documents are not considerably larger than
the output, lxml is the clear winner."

I wouldn't care much about any of this except that cElementTree
is distributed with Python and requires no additional package
installation. That means that upgrading may be a bit easier, and
it may mean that the cElementTree package will be more
supported - or maybe not.

I don't know how cElementTree and lxml compare on functionality
and compatibility. It looks like lxml is considerably richer.
I'm not seeing any XPath or XSLT functionality in cElementTree.
However the functions that they have in common appear to have the
same interfaces. So when all we want to do is parse documents,
we might use cElementTree, without having to learn two separate
interfaces, and with an easy conversion from cElementTree to lxml
if and when a program needs lxml functionality. To simplify
things slightly we could have a simple convention that makes one
of the following imports:

import xml.etree.cElementTree as etree
or
import lxml.etree as etree

Then conversion from one to the other just changes the import
statement - unless of course we think there is a reason to use
both in one module.
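One way to sketch that convention is a guarded import, so the rest of the module only ever refers to `etree`. This is an illustration, not the code in any existing CDR module. Note that `xml.etree.cElementTree` was removed in Python 3.9, where `xml.etree.ElementTree` uses the C accelerator automatically, so the fallback below imports the latter.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard library.
try:
    import lxml.etree as etree
except ImportError:
    # On Python 2-era installs this would be xml.etree.cElementTree.
    import xml.etree.ElementTree as etree

# Only the API common to both packages is used below.
root = etree.fromstring("<Person><Email>someone@example.org</Email></Person>")
email = root.findtext("Email")
print(email)
```

Code written this way has to stay within the shared ElementTree API; anything that needs `xpath()` or XSLT would require lxml unconditionally.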

Comment entered 2008-12-16 23:27:49 by alan

BZDATETIME::2008-12-16 23:27:49
BZCOMMENTOR::Alan Meyer
BZCOMMENT::3

I see that you (Bob) have already considered some of these
issues in cdr.importEtree().

Comment entered 2008-12-17 09:48:04 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-12-17 09:48:04
BZCOMMENTOR::Bob Kline
BZCOMMENT::4

One comparison I'd like to make is between the current approach, which collects everything about a person or organization, and a more surgical approach, which (for example) just gets an email address or a phone number. If we find that the performance advantages are sufficiently compelling, then we would be looking at a modification to the interface (perhaps plugged in behind the existing interface to preserve backward compatibility) to allow the programmer to specify which piece(s) he needs. A good test might be to create 1,000 person contact objects, reporting the CIPS-contact email addresses for each.

1. As currently done (get everything with XSL/T)
2. Getting everything, but with lxml
3. Getting everything, but with cElementTree
4. Extract just the email address with lxml
5. Extract just the email address with cElementTree

I agree that the performance advantage of the more surgical approach would need to be significant to justify the additional API complexity. It would be nice if we find out that lxml or cElementTree gives us overwhelmingly better performance than XSL/T, but that it doesn't make much difference whether we avoid collecting contact information we don't need.
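A timing harness for the parsing variants in that list might be sketched as follows. It covers only the "get everything" versus "just the email address" comparison using the standard library parser; it does not reproduce the XSL/T filtering done in the CDR server, and the sample document and function names are made up for illustration.

```python
import time
import xml.etree.ElementTree as etree

# Hypothetical, simplified Person document (not the real CDR schema).
SAMPLE = ("<Person><Name>Jane Doe</Name><Contact><Street>1 Main St</Street>"
          "<City>Springfield</City><Email>jane@example.org</Email></Contact></Person>")

def get_everything(xml):
    """Collect the text of every leaf element, keyed by tag."""
    root = etree.fromstring(xml)
    return {elem.tag: elem.text for elem in root.iter() if len(elem) == 0}

def get_email_only(xml):
    """Parse the document but look up just one value."""
    root = etree.fromstring(xml)
    return root.findtext("Contact/Email")

def time_it(func, xml, loops=1000):
    """Return elapsed seconds for `loops` calls of func(xml)."""
    start = time.perf_counter()
    for _ in range(loops):
        func(xml)
    return time.perf_counter() - start

for func in (get_everything, get_email_only):
    elapsed = time_it(func, SAMPLE)
    print("%-16s %.6f sec for 1000 docs" % (func.__name__, elapsed))
```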

Comment entered 2008-12-18 19:31:32 by alan

BZDATETIME::2008-12-18 19:31:32
BZCOMMENTOR::Alan Meyer
BZCOMMENT::5

I ran a simple test on Mahler that created a CipsContact object
for each of 1,000 records, then asked the object for an email
address and a city. I ran it a few times using the first
thousand Person docs and again with the last thousand Person
docs.

The test ran in 71 or 80 seconds when first run, then 55 seconds
when run again, presumably showing the effects of caching of the
documents in the server.

Bob and I discussed this. The savings to be gained from a
re-implementation of the logic using parsing with lxml or
cElementTree might be relatively large, but wouldn't necessarily
result in an overall benefit that justified the cost. Saving one
to 1.5 minutes per thousand records would only be significant in
a batch job if the number of Person or Organization documents was
large. If we ran a nightly job on the full 34,000 Person set, it
would be large. But we don't do that now.
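The arithmetic behind that estimate can be checked directly. The sketch below scales the measured 55 to 80 seconds per thousand records up to the 34,000 Person documents mentioned above; it gives the total current cost, which is an upper bound on what any rewrite could save.

```python
# Figures taken from the comment above.
seconds_per_thousand = (55, 80)   # observed range for 1,000 Person docs
person_docs = 34000

# Total minutes for the full set at each observed rate.
totals = [secs * person_docs / 1000 / 60 for secs in seconds_per_thousand]

for secs, minutes in zip(seconds_per_thousand, totals):
    print("%d sec per 1,000 docs -> about %.0f minutes for the full set"
          % (secs, minutes))
```

That works out to roughly 31 to 45 minutes for a full pass over the Person set.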

At Bob's suggestion, I will spend some time reviewing filters
that can be invoked in the existing cdrdocobject. There are many
possibilities and it's possible that an existing one would
provide more throughput than the one that is now the default. If
I find such a filter, I will either make it the default or at
least document it in the cdrdocobject code so that any of us who
uses the object will know what filters are fast for what
operations.

The test program is attached.

Comment entered 2008-12-18 19:31:32 by alan

Attachment TimeAddr.py has been added with description: Python program to test performance of cdrdocobject

Comment entered 2008-12-18 19:46:58 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2008-12-18 19:46:58
BZCOMMENTOR::Bob Kline
BZCOMMENT::6

(In reply to comment #5)

> If I find such a filter, I will either make it the default or at
> least document it in the cdrdocobject code so that any of us who
> uses the object will know what filters are fast for what
> operations.

Best not to change the default (unless you find all of the code which invokes the constructor and fix any that would be broken by a change in default).

Comment entered 2008-12-22 20:38:45 by alan

BZDATETIME::2008-12-22 20:38:45
BZCOMMENTOR::Alan Meyer
BZCOMMENT::7

I started my examination of Person address filters by searching
for all filters that had the string "Address" in them. There
were a lot. After looking through them, I decided not to further
examine any filters that:

Were marked "OLD" or "TEST".
Output HTML instead of XML.

Although we have a lot of filters that process Address
elements, our modularization is still surprisingly good. There
are a number of modules that handle address processing centrally
for different purposes. It would be ideal if there were only
one, but that doesn't appear to be practical. I think we could
get closer to that than we are, but only with a great deal of
work with very little payoff.

After reading all of these filters, I didn't spot any advantage
that any of them would have over the ones we are using in
cdrdocobject.py (CDR0000000072 and CDR0000000084). I don't see
any that would be faster or more general or more useful for going
after a few high use elements - though perhaps by listing them
here Bob and Volker will see one that rings a bell for further
investigation.

CDR0000000052 – Person Address Fragment With Name
Used in mailers

CDR0000000069 – Person Locations Picklist
Extract address info for protocol update person contact info
picklist.
Produces one for each PrivatePracticeLocation, OtherPracticeLocation
and Home if requested.
Format is a list of lines with no internal XML structure.

CDR0000000072 – Person Address Fragment
Produces one single address when given a fragment ID.
This is the one currently used by cdrdocobject.py for Person
addresses.
It is also imported into a half dozen other filters.

CDR0000000084 – Organization Address Fragment
Produces one single address when given a fragment ID.
This is the one currently used by cdrdocobject.py for
Organization addresses.

CDR0000000087 – CIPS Contact Address
Looks like it's supposed to get the name and address
information for the practice location identified as the
CIPSContact.
However it might have bugs.
It looks for a cdr:id matching the CIPSContact in
"//PrivatePractice[@cdr:id = $fragId]".
(I think) It should be looking in:
"//PrivatePracticeLocation[@cdr:id = $fragId]".
Also does not actually bring back location info even when
it finds the right address.
This was last edited in 2002. I suspect it's matched to
older schemas and is, in fact, no longer used.
If so, we should probably mark it as "OLD" in the title.

CDR0000000098 – InScopeProtocol Status and Participant Mailer
Returns name and address info for protocol update people for
a passed lead org for a specific protocol.
Gets the same kind of information as other address filters,
but uses elements in the protocol to override elements
from the Person and Organization records, so the logic is
much more complex. It's not clear whether incorporating a
call to a more general filter would be of significant help
or significantly reduce the complexity. The comments
describing how the logic works may indicate that it would
be dangerous to use a more general filter because the
specific logic could change and it's probably better to be
completely free to do whatever is required.

CDR0000000128 – Vendor Filter: Person
Publishing filter.
Includes address information.
Uses CDR0000315447 "Vendor Address Templates" for some of the work.

CDR0000000169 – Contact Detail Formatter
This is an import module (it should probably be named
"Module:...") used to format Contact information for three
Protocol Review reports.
It outputs lines of data formatted for human readable display
Individual elements are concatenated with labels and/or
other elements to make up the lines.

CDR0000271413 – Approved Not Yet Active Protocols Report Filter
Formats XML for a report, including person info for the
protocol update person.
Has code that is similar to several other filters.

CDR0000315446 – Module: PersonLocations Denormalization
Extracts Person address information for persons found in
protocols.
Most of the logic appears to be protocol rather than Person
specific.

CDR0000315447 – Module: Vendor Address Templates
This is a generic module incorporated in six vendor filters,
for getting address information.
It is a possible candidate for use elsewhere.

CDR0000393218 – InScopeProtocol Status and Participant eMailer
Processes protocol information for emailers.
Uses generic "Module: Emailer Common" for the address
elements.

CDR0000393219 – Module: Emailer Common
Extracts Person address information on behalf of calling
filters that process protocols.
The code looks generic with the protocol specific
information processed by the calling filters and this
module just used for Person/Organization specific info.

CDR0000555446 – Board Member Address Fragment With Name
Calls "Person Address Fragment With Name" to do the address
management.

Comment entered 2008-12-23 21:22:58 by alan

BZDATETIME::2008-12-23 21:22:58
BZCOMMENTOR::Alan Meyer
BZCOMMENT::8

I missed one filter in my last comment that should have been
included:

CDR000000426518 – Person Contact Fragment With Name
This filter runs "Person Address Fragment With Name" and adds
contact information, phone, fax and email.

I've tested a number of the possible candidate filters for Person
contact information for both speed and to show what address
elements they actually retrieve.

To find out whether the filter works with cdrdocobject, and
whether it produces useful information, I ran it using Document
ID 375 - a Person document that has a complete set of information
elements. If there was any useful output, I also ran a speed
test on the same set of 1000 records. I ran tests enough times
to eliminate the effects of caching.

Three of the Person filters I tested produced useful results, as
follows:

Person Address Fragment
46 seconds for 1000 records

Street lines = (u'P.O. Box 647', u'601 Elmwood Avenue')
City = Rochester
State = New York
Zip = 14642
Country = U.S.A.
Addressee = None
Phone = 585-275-5622
Fax = 585-275-1531
Email = louis_constine@urmc.rochester.edu

Person Address Fragment With Name
48 seconds for 1000 records

PersonalName = Louis S. Constine, MD
Street lines = (u'P.O. Box 647', u'601 Elmwood Avenue')
City = Rochester
State = NY
Zip = 14642
Country = U.S.A.
Addressee = Louis S. Constine, MD
Phone = None
Fax = None
Email = None

Person Contact Fragment With Name
53 seconds for 1000 records

PersonalName = Louis S. Constine, MD
Street lines = (u'P.O. Box 647', u'601 Elmwood Avenue')
City = Rochester
State = NY
Zip = 14642
Country = U.S.A.
Addressee = Louis S. Constine, MD
Phone = 585-275-5622
Fax = 585-275-1531
Email = louis_constine@urmc.rochester.edu

I also tested a number of other filters that produce contact
information, but they were of no use in cdrdocobject because they
don't produce the same XML output structure the cdrdocobject
requires. For all of them the output was:

Street lines = ()
City = None
State = None
Zip = None
Country = None
Addressee = None
Phone = None
Fax = None
Email = None

There was no point testing speed.

The filters that were of no use in this context included:

CIPS Contact Address
Vendor Filter: Person
Contact Detail Formatter

"Person Contact Fragment With Name" was the only filter to
retrieve everything.

"Person Address Fragment" was a little faster and produced
everything except the name.

If we want to make some small rationalizations here, I think we
could get rid of all three of these filters and create one new
one that does everything that "Person Contact Fragment With Name"
does, but does it in one filter without having to include "Person
Address Fragment With Name". That would probably shave a couple
of seconds off the 53 seconds.

Or we could just make "Person Contact Fragment With Name" the
universal default for all uses and not worry about a few seconds
per thousand records.

Comment entered 2008-12-29 17:07:43 by alan

BZDATETIME::2008-12-29 17:07:43
BZCOMMENTOR::Alan Meyer
BZCOMMENT::9

Well, after deciding this wasn't worth writing new code for,
Volker got me thinking about it some more. Having per document
type Python classes was sounding attractive to him.

I'm wondering if we could write a generic class for document
objects that would do a lot of useful things, including having a
generic way of parsing a document using
xml.etree.cElementTree.iterparse(), which is apparently the very
fastest way to get data out of an XML document. It both creates
an ElementTree and provides sax like callbacks during the process -
supposedly much faster even than xml.parsers.expat. For
performance data see:

http://effbot.org/zone/celementtree.htm

The basic concept is that we could create an abstract base class
that will parse a document and do things with the elements as
specified by a dictionary in the subclass. Each subclass would
add special logic, like resolving links that aren't just simple
denormalizations.

The model that I'm thinking of here is something like a generic
Object Relational Mapping module, only this would be a generic
Object XML Mapping module. The basic stuff that's the same for
all doc objects, like mapping an XML element or attribute into an
object attribute (maybe as a single value, maybe as an array of
values), denormalizing a link (probably only when the content of
the link is actually requested), etc. would be in the abstract
superclass.

The subclass would add specialized logic like finding the
CipsContact address link for a Person, or putting address
elements together into a mailing label.
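A minimal sketch of the base-class idea, assuming a simple tag-to-attribute dictionary and using the standard library's `iterparse()`. The class names, the `tag_map` attribute, and the element names are hypothetical. Link denormalization and the specialized subclass logic described above are left out.

```python
import io
import xml.etree.ElementTree as etree

class DocObject:
    """Base class: map element tags to object attributes in one parse."""

    # Subclasses override this: element tag -> attribute name (hypothetical).
    tag_map = {}

    def __init__(self, xml):
        # Every mapped attribute is a list, since elements may repeat.
        for attr in self.tag_map.values():
            setattr(self, attr, [])
        # By default iterparse reports each element when its end tag is seen.
        for event, elem in etree.iterparse(io.BytesIO(xml.encode("utf-8"))):
            attr = self.tag_map.get(elem.tag)
            if attr is not None:
                getattr(self, attr).append(elem.text)

class PersonDoc(DocObject):
    tag_map = {"SurName": "surnames", "Email": "emails"}

doc = PersonDoc("<Person><SurName>Doe</SurName><Email>a@example.org</Email>"
                "<Email>b@example.org</Email></Person>")
print(doc.surnames, doc.emails)
```

Matching on the bare tag ignores where an element sits in the document, so a real mapping would probably need to track the parent path as well.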

Will this be worthwhile?

Maybe.

Here are the issues to consider that I think might make it
worthwhile.

1. Performance.

It should definitely be faster than the existing method using
XSLT. It's not clear to me however that the extra speed
matters.

2. Centralization of functions.

I think this is the key thing. If we can implement a lot of
functionality in a base class so that writing the one per
document type subclasses is easy, then it could be worth it.

If that turns out to be too tricky, or if the amount of
functionality we can push into the base class is not
significant, then we're going to have to write a lot of
custom code for each object type. That may not be worth it.

I propose to set this aside for now and have the three of us
discuss it when Bob gets back.

Comment entered 2008-12-29 21:07:06 by alan

BZDATETIME::2008-12-29 21:07:06
BZCOMMENTOR::Alan Meyer
BZCOMMENT::10

I know that I proposed to set this aside, but I couldn't leave it
alone.

I did bunches of tests using cElementTree both to learn the
functionality and to measure the performance.

It appears to be dramatically faster than using XSLT inside the
server - though I did not attempt to do the same thing that my
tests of XSLT did, so I haven't got a direct apples to apples
comparison yet. However I learned enough to believe that there
is at least a good chance that we'll see much faster performance,
possibly in the ballpark of 5-10 times faster, and that if we
proceed with this approach we aren't likely to be disappointed
in the performance.

In particular, iterparse() enables us to perform an event
callback parse and a tree building parse both at the same time,
and still perform very fast. It appears that we can perform
event driven parsing as we go along and, at any point during the
parse access the tree as it is being built! Even without lxml,
which adds a lot of power, it's still very powerful.
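The combination described above can be sketched like this, using the standard library's `iterparse()` and a made-up document. The loop receives start and end events as a SAX-style parser would, but it can also navigate the partly built tree from the root while the parse is still running.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical, simplified document (not the real CDR schema).
xml = (b"<Person><Name>Jane Doe</Name>"
       b"<Contact><City>Springfield</City></Contact></Person>")

root = None
seen_at_city = None
for event, elem in etree.iterparse(io.BytesIO(xml), events=("start", "end")):
    if event == "start" and root is None:
        root = elem  # the first start event is the document root
    elif event == "end" and elem.tag == "City":
        # Name has already been closed, so it is reachable from the root.
        seen_at_city = root.findtext("Name")

print(seen_at_city)
```

Only elements whose end event has already been seen are guaranteed to be complete; on a start event the element's text may not have been read yet.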

I also figured out what I was doing wrong with lxml.xpath(). I
had thought maybe it didn't work right with namespaces but I now
think it does, it was just a misunderstanding in my code. If we
need something more powerful than cElementTree, the xpath
function in particular adds great power to the interface and
is probably intermediate in performance between cElementTree
and XSLT.

I also now have a better understanding of the API, which is very
poorly documented in the Python docs, but not badly documented
elsewhere.

Comment entered 2009-01-06 23:50:20 by alan

BZDATETIME::2009-01-06 23:50:20
BZCOMMENTOR::Alan Meyer
BZCOMMENT::11

I've been thinking about whether there is a good way to build a
generalized replacement for cdrdocobject.py.

Here's one strawman approach to creating document type specific
objects to get discussion started. It suffers from my usual
tendency to over abstraction, but I thought I ought to try for a
highly general approach and then back down from it if it doesn't
work rather than give up the goal before trying.

Concept
-------

Create a base class (let's call it DocObj for now) for managing
XML documents. Put as much common logic into it as possible.

Create a collection of subclasses of the base class, one for each
specific document type for which we want to do custom Python
processing. Put all the specific data and specific methods in
the subclasses.

Design the base and subclasses in an extensible way to allow us
to add more functionality over time.

Here are some more specific ideas for how this might be done.

Base Class
----------

We might have a base class with the following fields and methods:

Class Fields:

A dictionary of control information, possibly completely
empty. Let's call it "docTypeData". It will be overridden
by the subclasses.

e.g., docTypeData = {}

I'm envisioning this dictionary to have something like the
following structure:

{name : control_object,
name : control_object,
...
}

where:

name = a logical name of some data

control_object = An object (I've called it DocData below)
with various fields that tell how to retrieve the
data corresponding to the name, together with a flag
indicating which field to use.

Possible fields in the DocData object could be:

An xpath to retrieve data.
A reference to a named or unnamed function to
retrieve the data.
An xslt script identified by CDR ID or perhaps
stored directly as a string.

    class DocData:
        def __init__(self, xpath=None, func=None, xslt=None):

            # Must supply exactly 1 of the parms
            count = 0
            if xpath: count += 1
            if func: count += 1
            if xslt: count += 1
            if count != 1:
                raise ...

            # Save them
            self.__xpath = xpath
            self.__func = func
            self.__xslt = xslt

Instance Fields:

Common kinds of fields found in all document types:

CDR ID
Version number
Validation status
Active status
Document type
Serial XML representation of the document
ElementTree representation of the document
etc.

Also maybe some processing related info like a database
connection, an indicator of whether the document is
locked, maybe a session, a list of error msgs, etc.

Methods:

The base class could have something like the following
methods common to all doctypes:

    # Constructor
    def __init__(
        self,
        docId=None,   # Load from the CDR by this ID
        verNum=None,  # Version number or CWD
        xml=None,     # Or from this string
        fname=None,   # Or from this file
        conn=None)

The constructor creates the object, loads a document
into it, and possibly parses it (or sets a flag to
say that a parse is needed - if we want to get clever
and possibly outsmart ourselves - I've done it
before.)

    # Get something out of the document
    def getValue(
        self,
        name,     # Search a dictionary with this
        parms,    # Pass this sequence to custom func
        default)  # If data not found

getValue() would look up the name in the docTypeData.
If it found something, it would execute the
appropriate logic to retrieve the data - search via
an xpath expression, evaluate a function, run a
script, etc.

If it got back data, it would return it. Otherwise
it would return the default.

We could have some other common methods depending on how we
envision use of the objects.

Examples might include:

Validate the doc (invoking cdr.valDoc())

Store it (cdr.repDoc()).

Modify it.

etc.

I'd be inclined to start with a read-only implementation and
add store and modify operations only if we need them.

Subclasses
----------

The subclasses would be defined for each document type for which
we want one of these, for example:

class Person(DocObj)

class Organization(DocObj)

class CTGovProtocol(DocObj)

etc.

Each subclass would begin with a class variable consisting of the
control dictionary, the docTypeData. This dictionary never
varies, so it's initialized only once for each document type for
the life of the program, no matter how many documents of the type
are processed.

    class Person(DocObj):

        # Class control info
        docTypeData = {
            "fullname":
                DocData(func=Person.getFullName),
            "lastname":
                DocData(xpath="/Person/PersonNameInformation/SurName"),
            "givenname":
                DocData(xpath="/Person/PersonNameInformation/GivenName"),
            "address":
                DocData(func=Person.getAddress),
            # etc.
        }

        # Function invoked by getValue("fullname")
        def getFullName(self, parms=None):
            nameStr = self.getValue("lastname") + ", " + \
                      self.getValue("givenname") + " " + \
                      self.getValue("midinit", default="")
            return nameStr

        # Function invoked e.g., by
        # getValue("address", (('addr','CipsContact')))
        def getAddress(self, parms):
            """
            Do a bunch of things to search the object, find the
            address signified in the parms, construct a new
            Organization argument if needed, etc. Then put
            together and return the data.
            """

etc.
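A runnable reduction of the strawman, under several assumptions, might look like the sketch below. The xslt option and CDR loading are omitted, `parms` and `default` are given default values, and the standard library's ElementTree is used, so paths are written relative to the root element. The control dictionary is placed after the methods so it can refer to them, which is one way around referencing `Person` inside its own class body.

```python
import xml.etree.ElementTree as etree

class DocData:
    """Says how to retrieve one named value: by path or by function."""
    def __init__(self, xpath=None, func=None):
        # Exactly one of the two must be supplied.
        if (xpath is None) == (func is None):
            raise ValueError("supply exactly one of xpath or func")
        self.xpath = xpath
        self.func = func

class DocObj:
    docTypeData = {}  # overridden by subclasses

    def __init__(self, xml):
        self.root = etree.fromstring(xml)

    def getValue(self, name, parms=None, default=None):
        control = self.docTypeData.get(name)
        if control is None:
            return default
        if control.func is not None:
            value = control.func(self, parms)
        else:
            value = self.root.findtext(control.xpath)
        return default if value is None else value

class Person(DocObj):
    def getFullName(self, parms=None):
        return "%s, %s" % (self.getValue("lastname"),
                           self.getValue("givenname"))

    # Built after the method exists so it can be referenced directly.
    docTypeData = {
        "lastname": DocData(xpath="PersonNameInformation/SurName"),
        "givenname": DocData(xpath="PersonNameInformation/GivenName"),
        "fullname": DocData(func=getFullName),
    }

person = Person("<Person><PersonNameInformation><GivenName>Jane</GivenName>"
                "<SurName>Doe</SurName></PersonNameInformation></Person>")
print(person.getValue("fullname"))
print(repr(person.getValue("midinit", default="")))
```

Because the dictionary is a class variable, it is built once per document type, as the comment above intends.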

Comment entered 2009-01-08 22:43:25 by alan

BZDATETIME::2009-01-08 22:43:25
BZCOMMENTOR::Alan Meyer
BZCOMMENT::12

I wrote a test program to call the same iterparse function in
lxml, ElementTree and cElementTree. I took one callback on each
element, performing a trivial task in each. I counted the nodes
with each method to confirm how many nodes were actually visited.

The results surprised me.

lxml and cElementTree are comparable in speed. lxml was
actually trivially faster on two of the tests. The touted
superiority of cElementTree has apparently been overcome in
whatever release of the two packages we have on Mahler.

Plain old ElementTree is out of the running. It's an order
of magnitude slower than the other two.

So, on my tests at least, there was no penalty for using the more
functional lxml. We should use it.

The program is attached.

RESULTS:

10,000 InScopeProtocol parses

Length of xmldoc=34295
lxml: nodes=662 time=28.812000 sec 0.002881 per parse
ElementTree: nodes=662 time=295.963000 sec 0.029596 per parse
cElementTree: nodes=662 time=21.313000 sec 0.002131 per parse

1,000 InScopeProtocol parses, much bigger protocol
(Ran only 1,000 loops to avoid long running time)

Length of xmldoc=177551
lxml: nodes=2218 time=13.343000 sec 0.013343 per parse
ElementTree: nodes=2218 time=110.811000 sec 0.110811 per parse
cElementTree: nodes=2218 time=13.593000 sec 0.013593 per parse

10,000 Person parses

Length of xmldoc=2781
lxml: nodes=39 time=20.906000 sec 0.000209 per parse
ElementTree: nodes=39 time=217.980000 sec 0.002180 per parse
cElementTree: nodes=39 time=21.172000 sec 0.000212 per parse

Comment entered 2009-01-08 22:43:25 by alan

Attachment speed1.py has been added with description: Performance tester for 3 ElementTree implementations

Comment entered 2009-01-09 00:22:54 by alan

BZDATETIME::2009-01-09 00:22:54
BZCOMMENTOR::Alan Meyer
BZCOMMENT::13

I did some more tests comparing the following approaches:

1. Looking for simple tags using iterparse().

I captured end element events and looked to see if the tag
name for the element is in a set of names.

I did not look at parent paths, so this was oversimplified.

2. Using xpath.

I parsed the same document with fromstring() and used xpath
to find the elements by fully qualified path, for example:

'/Person/AdministrativeInformation/Directory/Date'

3. Using xpath with simple names.

I parsed the same document with fromstring() and used xpath
to find the elements by simplified paths, for example:

'//Date'

I ran this code in two different ways, once to check performance
and the other to prove that I was finding the same elements with
all three techniques.

All three techniques did produce the same expected results, so
all of the techniques appear to be correct and the speed
comparisons appear to be valid.
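The three techniques can be sketched side by side as follows. This is not the test program used for the timings. It uses the standard library's ElementTree rather than lxml, so the paths are written relative to the root element instead of as absolute XPath expressions, and the sample document is made up apart from the element path quoted above.

```python
import io
import xml.etree.ElementTree as etree

xml = ("<Person><AdministrativeInformation><Directory>"
       "<Date>2008-01-02</Date></Directory></AdministrativeInformation>"
       "<Status>Active</Status></Person>")

# 1. iterparse: take end-element events and test the tag against a set.
wanted = {"Date"}
by_events = [elem.text
             for _, elem in etree.iterparse(io.BytesIO(xml.encode("utf-8")))
             if elem.tag in wanted]

root = etree.fromstring(xml)

# 2. Full path from the root element down to the target.
by_full_path = [e.text for e in
                root.findall("AdministrativeInformation/Directory/Date")]

# 3. Simple name, searched anywhere below the root.
by_simple_name = [e.text for e in root.findall(".//Date")]

print(by_events, by_full_path, by_simple_name)
```

Techniques 1 and 3 match on the tag alone, so they would also pick up a `Date` element under a different parent; only the full path is specific about location.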

Here is an output from a search for 5 occurrences of four fields
in a Person record, tested 10,000 times:

Length of xmldoc=2781
Sax like parse: found 5 time=2.094000 sec 0.000209 per parse
fromstring+xpath: found 5 time=3.140000 sec 0.000314 per parse
fromstring+//xpath: found 5 time=3.078000 sec 0.000308 per parse

Not surprisingly, SAX style parsing is faster. However xpath is
certainly reasonable in Person type documents.

I then tried the same test using a large protocol. I didn't
change the paths I was searching for, so there were no hits. In
this one, xpath is faster than the SAX like parse.

Length of xmldoc=177551
Sax like parse: found 0 time=13.234000 sec 0.013234 per parse
fromstring+xpath: found 0 time=10.250000 sec 0.010250 per parse
fromstring+//xpath: found 0 time=10.875000 sec 0.010875 per parse

I thought the cause was that I had no real protocol paths, but
when I repeated the experiment with real paths, xpath was still
faster. I believe that finding a small number of elements in a
very large document is faster with xpath than taking callbacks on
a large number of elements with iterparse.

I got some inconsistent results when I did this which I'll have
to track down. It is probably due to identical names at the
bottom of a path with different intermediate names. But the
timings were consistent. I'll attach code when I've checked it
out more thoroughly.

Comment entered 2009-04-30 13:22:40 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-04-30 13:22:40
BZCOMMENTOR::Bob Kline
BZCOMMENT::14

Bob (or Alan) needs to pick a program to convert.

Comment entered 2009-05-14 23:25:22 by alan

BZDATETIME::2009-05-14 23:25:22
BZCOMMENTOR::Alan Meyer
BZCOMMENT::15

Bob proposed the CTGovExport program as a candidate for
conversion to the new approach to document type classes.

That's what I'll work on when higher priority tasks are
taken care of.

Comment entered 2009-10-08 13:56:38 by Kline, Bob (NIH/NCI) [C]

BZDATETIME::2009-10-08 13:56:38
BZCOMMENTOR::Bob Kline
BZCOMMENT::16

Lowered priority at Alan's suggestion.

Comment entered 2014-03-14 13:53:06 by Englisch, Volker (NIH/NCI) [C]

This ticket is 5 1/2 years old. Is it still relevant or should it be closed?

Attachments
File Name Posted User
speed1.py 2009-01-08 22:43:25
TimeAddr.py 2008-12-18 19:31:32
