Ten years is a long time in IT. But in the world of database systems, a ten-year-old technology is barely out of kindergarten.
Relational databases took a decade to move from the labs to commercialisation, and another decade before they unseated hierarchical databases as the most prevalent data management systems.
XML celebrated its tenth birthday in 2006, but database environments designed to hold XML data are still only just establishing their credibility. But even around its birth, a small number of prescient software engineers and entrepreneurs saw that the set of definitions and rules for adding structure and metadata to text documents had even more potential.
They noted that XML had a lot of the attributes of an entire database and programming environment: a basic storage mechanism (a document); low level and customisable schemas for defining the meaning of the tags inside the documents; several query languages for extracting data from collections of tags and documents; and even some programming interfaces.
Certainly, XML did not initially have all the sophisticated functions of a commercial database management system. There were no agreed methods for how collections of documents should be organised, there was no proper indexing, security, transaction management, data integrity, multi-user access, triggers or a host of other functions. But these could be added later: almost from the start, it was clear that XML documents could be gathered together, and managed and queried like a database.
There were, and still are, at least three good reasons for storing XML documents in native XML databases rather than in relational databases. First, XML, even in its short life, has become a de facto means of describing the meaning of data in web pages, and a simple and flexible means of moving data and documents between applications. Why not store these in native XML, rather than translating them for storage in relational databases?
“There is a large and growing amount of XML data in the world,” says Bernie Spang, director of IBM data servers. “We need to secure it, to be able to query and access it, to back it up and so on – just like all other data in the world”. Relational databases are not the ideal way of storing this data, because it needs to be translated from XML before going in and out of the database, adding a big performance overhead.
A second critical advantage that XML databases have over relational databases is that XML schemas can be changed or extended without recasting the entire organisation of the rows and fields in a database. This means that, for example, if a tax or a mortgage form changes from one year to the next, the schema can be updated in a few minutes. With relational databases, schema changes like this require planning and can take days or weeks.
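The difference can be sketched in a few lines of Python. This is an illustration only: the tax form, its element names and the table layout are invented for the example, and SQLite stands in for a relational database. On a toy in-memory table the schema change is instant; the point is that it is an explicit step that every production table, and every application reading it, has to go through.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two hypothetical tax forms: the 2007 form adds a field the 2006 form lacks.
FORM_2006 = "<taxForm year='2006'><income>52000</income></taxForm>"
FORM_2007 = ("<taxForm year='2007'><income>54000</income>"
             "<greenCredit>300</greenCredit></taxForm>")


def read_form(xml_text):
    """Return the form's fields as a dict, whatever elements it contains."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# XML side: the same reader handles both versions with no migration step.
forms = [read_form(FORM_2006), read_form(FORM_2007)]

# Relational side: the new field needs an explicit schema change first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax_form (year INTEGER, income INTEGER)")
conn.execute("INSERT INTO tax_form VALUES (2006, 52000)")
conn.execute("ALTER TABLE tax_form ADD COLUMN green_credit INTEGER")
conn.execute("INSERT INTO tax_form VALUES (2007, 54000, 300)")
rows = conn.execute(
    "SELECT year, income, green_credit FROM tax_form ORDER BY year"
).fetchall()
```

The older row simply carries an empty value in the new column, whereas the older XML document is left exactly as it was written.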
How XML is stored
Pure XML database (‘binary XML’)
This is a “native” store of documents, managed by specially developed XML tools and usually organised in a hierarchical fashion. This approach is fast and offers great flexibility, but database administrators have proved wary of it because of the immaturity of the products; the lack of sophistication and resilience; the need to manage two entirely different technologies; and possible data integrity issues.
CLOB (character large object)
The XML document is stored as one large character object inside a single relational cell. Retrieval is fast and integrity is good. The problem is that the document is stored as a whole – there is almost no re-use of tags and schema, and there is no ability to query the metadata that has been so carefully planned in. That can only be done by using external tools – a much slower and more expensive process. Effectively, the XML is buried.
Relational table
The XML document is ‘decomposed’ or parsed, and the data stored in the rows and columns of relational tables – a technique known as ‘shredding’. This enables re-use of schema and querying of data, but adds a big performance overhead. Shredding can also destroy ‘fidelity’ – the exact document is not re-created, but only the data. Relationships between the data may be lost unless further tables are created, which adds complexity.
Hybrid databases
Hybrid databases are seen as the future of XML data storage. Effectively, there are two databases under the hood – one storing binary XML and one storing tables and rows that are queried by SQL. This provides the flexibility that XML brings, but also provides enterprise-class database management. Microsoft, IBM, Oracle and Sybase all now support hybrid databases, although only IBM has tightly integrated the data types (in DB2 9) so that SQL and XML queries can be made across both types, with the results appearing seamlessly to the applications.
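The trade-off between the CLOB and shredded approaches can be shown with a short Python sketch. It assumes an invented order document and table names, and uses SQLite purely as a stand-in for a relational database; it is not how any particular product implements either technique.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical order document used only for this illustration.
DOC = ("<order id='17'><customer>Acme</customer>"
       "<item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>")

conn = sqlite3.connect(":memory:")

# CLOB-style: the whole document goes into one text cell, untouched.
conn.execute("CREATE TABLE order_clob (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute("INSERT INTO order_clob VALUES (17, ?)", (DOC,))

# Shredded: the document is parsed and its values spread across tables.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
root = ET.fromstring(DOC)
order_id = int(root.get("id"))
conn.execute("INSERT INTO orders VALUES (?, ?)",
             (order_id, root.findtext("customer")))
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order_id, item.get("sku"), int(item.get("qty")))
     for item in root.iter("item")],
)

# The CLOB returns the exact original text, but SQL sees only an opaque string.
stored = conn.execute("SELECT doc FROM order_clob WHERE id = 17").fetchone()[0]

# The shredded form answers plain SQL questions, but the markup itself is gone.
total_qty = conn.execute(
    "SELECT SUM(qty) FROM order_items WHERE order_id = 17"
).fetchone()[0]
```

The CLOB keeps fidelity and loses queryability; the shredded tables do the reverse, and need one extra table for every repeating element in the document.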
And there is a third advantage: hundreds of thousands of web specialists and other developers have learned how to use XML, but have no inclination to learn SQL or the more technical aspects of relational databases. Often, they are using document creation tools like Microsoft Office. “We are trying to remove the chasm between humans and machines, so that people can use computers as they want to use them,” says Hideki Hiura, CTO and chief scientist at the North America subsidiary of JustSystems, the major Japanese office software company that is providing an enterprise-class application development, runtime and visualisation framework for XML applications.
Five-year hiatus
With these advantages – and there are many others – the prospects for the early pioneers of the XML database looked highly promising in the late 1990s. Companies such as Software AG, Excelon and others developed products that put some structure and resilience into the way that XML documents are stored, added query tools, and waited for the sales to come in.
But it didn’t turn out that way: instead, for the past ten years most businesses have ignored XML databases and simply stored XML data in relational formats, using translation tools that strip XML of much of its power and add a performance overhead (see box, ‘How XML is stored’). Some commentators filed XML databases in the same folder as object databases – technically interesting but commercially insignificant.
Are XML databases, then, another false turn in software development? Kazunori Ukigawa, CEO & founder, JustSystems, points out that this view is too simple. “XML is just ten years old. It’s really young compared to relational databases. It’s too early to judge the XML database a ‘failure’.”
Even Oracle sees a rosy future for native XML storage. Mark Townsend, senior director of database products for Oracle, says, “XML data storage is of the utmost importance. Usage of XML is growing around the tagging of documents and a lot of development tools use XML. All that is mandating the use of XML.”
That is true. But why, then, do most applications use XML translators and relational databases rather than XML databases?
Ukigawa argues that two things needed to happen before XML databases could really take off. The first is a gradual process of education and market development – people needed to properly understand the technology and learn how to get the best out of it: “People aren’t utilising the power of XML yet. That’s one reason people don’t view XML as attractive as it could be.”
For example, many XML applications and schemas are used as if the data will be stored in relational databases, which are designed for tightly structured information that rarely changes. “People use XML as something similar to the simple schemas of relational databases. But XML should be free from those limitations.”
In other words, they don’t make use of its ‘extensibility’ – the ability to easily define and add new ‘fields’ and ‘tags’. In relational databases, developers have to decide what can be stored, and then only things that conform to that can be stored. With XML, anything can be stored. Schemas or predetermined templates of metadata can then be applied where needed to provide interoperability and the ability to query the data.
A further problem that has slowed take-up is that many XML schemas are much too large and ambitious, making re-use of data and schemas too difficult. “One of the benefits of XML is the reusability. All of the data expressed in XML should be reusable,” says Ukigawa.
JustSystems’ xfy XML application framework enables users to create and re-use simple granular documents, and then to build up composite documents and schema. “You can really recycle the components that you build. It’s in a really practical form,” says Hiura. The company is one of many now building tools and applications that run on top of XML databases, just as many tools are designed to exploit relational databases.
All these functions provide powerful reasons for using XML databases – the extensibility, flexibility, and portability of XML are lost in a relational database, leaving developers with something that, in the words of one analyst, “looks like XML, smells like XML, but doesn’t walk like XML”. But even this has not been enough to encourage widespread take-up – users also want reliable, enterprise-class products from leading suppliers.
This is why a second and equally important development is likely to prove critical: the introduction of hybrid relational and XML databases. These are databases that store native XML documents alongside SQL data, rather than attempting to ‘squeeze’ the XML data into relational formats (see box).
“The hybrid database completely changes the story. The user doesn’t have to choose between relational or XML. This will change the way people look at the XML database,” says Ukigawa.
Gartner, the analyst group, sees this as an important development. “The ability to store XML data in a standard DBMS enables XML to have the same data persistence, reliability, security and stability as data stored in a central data storage facility.”
The leading database suppliers now view this as an important opportunity, with all the key suppliers claiming to have native XML capability in their products and more facilities on the way.
Some suppliers view IBM’s DB2 9 ‘Viper’ database, delivered in late 2006, as an important turning point (although Oracle claims it offered many of the facilities before IBM). IBM claims it is able to offer seamless integration between relational and XML data: “If I do a query, I’ll get results from both the relational and the XML database,” says Bernie Spang of IBM. “Users can mix SQL and XQuery (the XML query language) together.”
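The idea of a single query spanning both kinds of data can be approximated in Python. The sketch below is not DB2 and does not use XQuery: it registers a hypothetical helper, xml_value, as a SQLite function so that one SQL statement can filter on a path inside an XML column while also returning an ordinary relational column. The customer table and profile documents are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def xml_value(doc, path):
    """Return the text of the first element matching `path`, or None."""
    return ET.fromstring(doc).findtext(path)


conn = sqlite3.connect(":memory:")
# Expose the hypothetical helper to SQL under the same name.
conn.create_function("xml_value", 2, xml_value)

conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, profile TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Acme", "<profile><region>EMEA</region><tier>gold</tier></profile>"),
        (2, "Globex", "<profile><region>APAC</region></profile>"),
    ],
)

# One statement mixes a relational column with paths into the XML column.
rows = conn.execute(
    "SELECT name, xml_value(profile, 'region') FROM customers "
    "WHERE xml_value(profile, 'tier') = 'gold'"
).fetchall()
```

Note that the second profile has no tier element at all and is simply filtered out, with no schema change needed. The sketch reparses each document on every call; a native XML store would instead keep the document in parsed, indexed form, which is where the performance gains claimed for hybrid databases would have to come from.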
To help speed up its system IBM called on the help of engineers working on its IMS database – the long standing hierarchical database that predates DB2, and which still runs on many mainframes across the world. IBM claims it can run a query across half a million XML documents in 5% of the time it took using relational storage methods.