Today, more and more disparate IT systems exchange data via XML. While much work has been invested in designing protocols such as SOAP and REST to post XML data in realtime, few approaches have been published for bulk-importing data from xml-encoded files with arbitrary schemas. Here comes one solution I proposed for a client recently.
Nota bene: This is not about storing whole XML documents in PostgreSQL, or any XML-specific feature of the database. While the code is specific to Postgres, the concepts can be equally implemented for Oracle SQL*Loader or any other table-based bulk-import facility.
I picked Python as the implementation language for the excellent psycopg2 database driver which comes with an API for PostgreSQL's COPY command. It copies table data to and from files in a tab-delimited format, bypassing the SQL engine.
The predominant methods for processing XML is via the SAX and DOM APIs. The Simple API for XML (SAX) offers an event-based interface for reading arbitrarily large structures, whereas the Document Object Model (DOM) present a fully parsed document in a tree structure for random access. Looking at a high volume scenario, I went with SAX.
In Python, SAX is implemented in the xml.sax package. We instantiate a parser, install our own document handler callback, and let the parser parse an XML document:
parser = xml.sax.make_parser()
parser.setContentHandler(docHandler())
parser.parse(file('a.xml'))
Our document handler needs to implement the three functions startElement(), endElement() and characters() which will be called by the parser on the occurrence of an opening tag, closing tag, or character data respectively:
class docHandler(xml.sax.ContentHandler): def startElement(self, name, attributes): ... def endElement(self, name): ... def characters(self, ch): ...
Let's assume the XML data to be imported looks like this:
<book isin="1234-1"> <title>Title</title> <author firstname="First" lastname="Last" /> </book>
Our strategy is to build a stack of domain objects on startElement, hand it XML attributes and characters as well as references to enclosed objects to be set as data members, and save it on endElement():
class docHandler(xml.sax.ContentHandler):
def __init__(self):
self.stack = []
self.ch = ''
def startElement(self, name, attributes):
if name == 'book':
self.stack.append(Book(attrs['isin']))
elif name == 'author':
book = self.stack[-1]
self.stack.append(Author(attrs['first'], attrs['last'], book)
def endElement(self, name):
if name == 'title':
book = self.stack[-1]
book.setTitle(self.ch)
else
obj = self.stack.pop()
if self.ch:
obj.characters(self.ch)
obj.save()
self.ch = ''
def characters(self, ch):
self.ch = self.ch + ch
The domain objects simply collect their member data and save themselves to the copy store:
class Book:
def __init__(self, isin):
self.isin = isin
self.id = store.getId()
def setTitle(self, title):
self.title = title
def setAuthor(self, author):
self.author = author
def save(self):
store.add('books', self.id, self.title, self.author)
class Author:
def __init__(self, first, last, book):
self.first = fist
self.last = last
book.setAuthor(self)
self.id = store.getId()
def save(self):
store.add('authors', self.id, self.first, self.last)
The copy store holds an array for every database table to be filled, organized in a dictionary keyed by table name. It also hands out object ids:
class copyStore:
def __init__(self):
self.tables = {}
self.id = 0 # TODO: get next value from database sequence
def add(self, table, *columns):
if not table in self.tables:
self.tables[table] = []
self.tables[table].append(list(map(str, columns)))
def getId(self):
self.id = self.id + 1
return self.id
Furthermore the copy store can flush each table to the Postgres database via psycopg's copy_from() method. The necessary file-like object is implemented in the class copyFile, avoiding having to read and write real files.
class copyStore:
...
def flush(self):
for table in TABLES:
curs.copy_from(self.getFile(table), table)
#TODO flush sequence value
def getFile(self, table):
return copyFile(self.tables[table])
class copyFile:
def __init__(self, table):
self.table = table
def readline(self, size = -1): #TODO: implement proper size handling
row = self.table.pop(0)
return flatten(row)
def read(self, size = -1): #TODO: implement proper size handling
data = ''.join(map(flatten, self.table))
self.table = []
return data
def flatten(row):
return '\t'.join(row) + '\n'
Relational data models often contain reference data that enumerates possible values of certain data items, such as a list of countries, languages or the like. In order to accomodate for that let's assume we have a language tag in the book XML and we want to add a foreign key database field in the books table that points to the correct row in a languages reference table. We implement a lookupStore that caches the reference table and look up the correct value when saving a book:
class Books:
...
def __init__(..., language):
...
self.language = language
...
def save():
store.add(..., lookup.lookup('languages', self.language), ...)
...
class lookupStore:
def __init__(self):
self.tables = {}
for table in LOOKUPS:
curs.execute('select * from %s' % table)
self.tables[table] = invert(curs.fetchall())
def lookup(self, table, name):
return self.tables[table][name]
def invert(rows):
dict = {}
for row in rows:
dict[row[1]] = row[0]
return dict
The main routine connects to the database, creates a cursor, instantiates the copy store and lookup store, and sets up the XML parser as explained at the beginning:
import xml.sax
import psycopg2
DSN='dbname=postgres user=postgres'
TABLES = ['books', 'authors']
LOOKUPS = ['languages']
...
if __name__ == '__main__':
conn = psycopg2.connect(DSN)
curs = conn.cursor()
lookup = lookupStore()
store = copyStore()
parser = xml.sax.make_parser()
parser.setContentHandler(docHandler())
parser.parse(file('a.xml'))
store.flush()
conn.commit()
Now we can parse the above example XML, extract the data into the different tables, look up reference data and load the table data directly into Postgres. This concludes our little tour around the Python XML and Postgres APIs.
The author acknowledges that the implementation is simplistic in a number of ways. This helps emphasizing the concepts by not hiding them in implementation details, and can be fixed with reasonable effort:
File-like objects have to honour the optional size parameter when returning a chunk of data through read() and readline(). The current implementation always returns all the data, leading to problems when the data exceeds a certain size. This can be fixed by flattening the array into a data string row-by-row and counting in the process in order to return the exact number of requested bytes.
The implementation hands out IDs from a single, volatile counter. This needs to be extended with a non-volatile solution such as a database sequence. Note that you need not implement one for every table: If you choose int8/bigint as the datatype a single sequence gives you over a quintillion (10^18) rows per table in a schema with 1000 tables...
Here we read the whole reference data table into memory and put it in a dictionary keyed by the lookup name. Needless to say this only works for data sets of a couple of megabytes, beyond a certain size we would need to implement a least-recenty-used (LRU) cache.
The stack solution allows for access to other domain objects only within the current branch of the XML document. If this is too limited add a context dictionary to the document handler where arbitrary data such as sibling nodes could be passed on. If context handling becomes to cumbersome revert to the DOM model where full random acces to the whole document is possible at the expense of memory limitations and execution speed.
In this example only a single XML file is read, feel free to modify this to a work queue model featuring 'in' and 'out' directories.
The domain objects only differ in their data members and which reference values ned to be looked up. If anybody has an idea on how to elegantly instantiate this code as would be possible in Ruby and implemented in the Ruby-on-Rails ActiveRecord framework, drop me a line.
mclassen consults on various IT topics such as
in and around Munich, Germany.