In depth: Data virtualisation

For many years, business intelligence systems have relied on extract, transform and load (ETL) technology to take data from transactional systems and place it in a data warehouse, where it can be safely sliced and diced without affecting operational performance.

There is, however, an alternative to the dominant data-warehousing paradigm: data virtualisation.

Instead of creating a new copy of the data required for analysis or for use by an application, data virtualisation creates a virtualised layer that brings together data from heterogeneous sources in real time – or close to it. Put simply, the virtualised layer ‘points’ to the data in its original data store, rather than replicating and moving it.
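To make that contrast concrete, the following is a minimal, self-contained Python sketch rather than any vendor's actual product code: the in-memory dictionaries stand in for two operational systems, the etl_load function copies their combined contents into a warehouse table, while the VirtualView class keeps only references to the sources and resolves each query against them on demand.

```python
# Illustrative sketch only: the dictionaries below stand in for two
# separate operational systems (a CRM and a billing database).
CRM = {"customers": [{"id": 1, "name": "Acme Ltd"}]}
BILLING = {"invoices": [{"customer_id": 1, "amount": 250.0},
                        {"customer_id": 1, "amount": 100.0}]}

def etl_load():
    """ETL approach: extract, transform and load a physical copy up front."""
    warehouse = []
    for cust in CRM["customers"]:
        total = sum(i["amount"] for i in BILLING["invoices"]
                    if i["customer_id"] == cust["id"])
        warehouse.append({"customer": cust["name"], "total_billed": total})
    return warehouse        # a duplicate copy that must now be kept in sync

class VirtualView:
    """Virtualised approach: hold references to the sources, not a copy."""
    def __init__(self, crm, billing):
        self.crm, self.billing = crm, billing        # pointers, not copies

    def query(self, name):
        cust = next(c for c in self.crm["customers"] if c["name"] == name)
        total = sum(i["amount"] for i in self.billing["invoices"]
                    if i["customer_id"] == cust["id"])
        return {"customer": cust["name"], "total_billed": total}

view = VirtualView(CRM, BILLING)
print(view.query("Acme Ltd"))   # reflects the sources' state at query time
```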

It may sound like a subtle difference from existing data warehousing techniques, but for organisations such as drug-maker Pfizer the benefits are very real. According to Pfizer research fellow Dr Michael Linhares, data virtualisation has cut the time it takes to deliver data-centric applications to users from months to just days.

“Instead of doing a grand engineering project, it’s about doing things in shorter, rapid cycles where you can experiment a little bit,” he says. “You can build something quickly, put it out to customers to use and, if they don’t like it, you didn’t make a huge investment and can change it equally quickly.”

Beyond federation

Data virtualisation arose out of the concept of data federation, which was pioneered in the early 2000s. The idea was that instead of consolidating multiple data stores, which was complex, expensive and prone to failure, disparate data could be linked together as and when it was needed.

This approach was dogged by inconsistencies across disparate datasets, however. “There would be so many inconsistencies that pulling it all together in real time served no purpose,” explains Ash Parikh, product management director at Informatica. Because the data could not be trusted, data federation could not be used for mission-critical systems.
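The following hypothetical Python snippet illustrates the problem: two stand-in sources describe the same customer but identify it inconsistently, so a naive real-time join returns nothing useful.

```python
# Hypothetical illustration of the federation problem: two sources hold
# the same customer, but identify it in different formats.
SALES = [{"cust_id": "C-0042", "order_total": 120.0}]
SUPPORT = [{"customer": "0042", "open_tickets": 3}]   # different key format

def naive_federated_join():
    """Link the sources on the fly, with no reconciliation rules."""
    joined = []
    for s in SALES:
        for t in SUPPORT:
            if s["cust_id"] == t["customer"]:          # never matches
                joined.append({**s, **t})
    return joined

print(naive_federated_join())   # [] -- the real-time join serves no purpose
```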

How data virtualisation differs

Where data virtualisation differs from data federation is that it introduces a new, virtual layer between the data sources and the applications. This layer, which operates in its own server environment, applies predetermined rules to the data in order to compile a meaningful and consistent view. It does not create a new copy of the data, but receives queries from applications, fetches the appropriate data from the underlying sources in accordance with the rules, and returns the desired data in the form of a ‘view’ – which provides read-only access – or a ‘data service’, which allows the underlying data to be changed as well.
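As a rough illustration of these two object types, the Python sketch below (the sources, the normalisation rule and the class names are all invented for the example) applies a predetermined rule to reconcile inconsistent customer identifiers, exposes the combined result through a read-only view, and adds a data service that can also write a change back to the underlying source.

```python
# Invented sources and rules, for illustration only.
SALES = [{"cust_id": "C-0042", "order_total": 120.0}]
SUPPORT = [{"customer": "0042", "open_tickets": 3}]

def normalise(key):
    """A predetermined rule: reconcile the sources' inconsistent identifiers."""
    return key.replace("C-", "").zfill(4)

class CustomerView:
    """A 'view': read-only access to a consistent, combined picture."""
    def query(self, cust_id):
        key = normalise(cust_id)
        sale = next(s for s in SALES if normalise(s["cust_id"]) == key)
        supp = next(t for t in SUPPORT if normalise(t["customer"]) == key)
        return {"customer": key,
                "order_total": sale["order_total"],
                "open_tickets": supp["open_tickets"]}

class CustomerDataService(CustomerView):
    """A 'data service': also allows the underlying data to be changed."""
    def close_ticket(self, cust_id):
        supp = next(t for t in SUPPORT
                    if normalise(t["customer"]) == normalise(cust_id))
        supp["open_tickets"] -= 1        # the write goes back to the source

svc = CustomerDataService()
print(svc.query("C-0042"))    # combined, consistent view across both sources
svc.close_ticket("0042")
print(svc.query("C-0042"))    # the change is visible immediately
```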

“These objects [views and data services] encapsulate the logic necessary to access, federate, transform and abstract source data and deliver it to consumers,” explains Robert Eve, executive vice president of marketing at data virtualisation vendor Composite Software.

A typical data virtualisation system includes a development environment where the rules that define these objects can be created and edited. This can take programmatic form for use by software developers, or offer a drag-and-drop interface so that business analysts can create their own views.

Like web services, virtual data objects can be reused in multiple applications, which accelerates application development and removes duplicated effort. It also means that data virtualisation can be implemented gradually, with the scope of the virtual data layer expanding to encompass more applications across the enterprise over time. Views and data services can also be used in combination with one another to perform sophisticated data management operations, Eve adds.
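The sketch below, again using invented class names rather than any vendor's API, shows the kind of reuse Eve describes: two simple view objects are composed into a third, which different applications can then share without duplicating integration logic.

```python
# Hypothetical sketch of reuse and composition of virtual data objects.
class OrdersView:
    def query(self, cust):                      # reads from the order system
        return {"customer": cust, "orders": 12}

class RiskScoreView:
    def query(self, cust):                      # reads from a scoring system
        return {"customer": cust, "risk": "low"}

class CustomerProfileView:
    """A composite view built from the two reusable views above."""
    def __init__(self, orders, risk):
        self.orders, self.risk = orders, risk

    def query(self, cust):
        return {**self.orders.query(cust), **self.risk.query(cust)}

# Both a reporting dashboard and a call-centre app can reuse the same
# object, so neither needs its own integration logic.
profile = CustomerProfileView(OrdersView(), RiskScoreView())
print(profile.query("Acme Ltd"))
```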

Data virtualisation is not without its drawbacks, of course. First of all, the software is not cheap. An all-encompassing suite from US vendor Denodo, for example, has a list price of $150,000 per licence for a quad-core server. Secondly, the model is not right for all circumstances. Speaking to Information Age last year, Dennis Wisnosky, CTO of the US Department of Defense’s Business Mission Area, explained that he had evaluated data virtualisation when looking for a new way to integrate human resources data.

However, when a data virtualisation server ‘translates’ underlying data into a view or data service, it consumes processor cycles. Such was the scale of the queries that the DoD wanted to run that this ‘translation’ put too great a processing burden on the servers, and the queries were unusably slow. It should be noted, however, that the DoD was attempting to query data about every single employee across the organisation – and it is the single largest employer in the world. Wisnosky’s experience need not apply to more everyday data volumes.

One organisation that has had more success with data virtualisation is US health insurance provider HealthNow New York. It has healthcare data spread across 16 enterprise databases and 30,000 Microsoft Access databases. Prior to implementing the Informatica Platform for data services, the organisation had struggled to get data fast enough to the users who needed to consume it – building a data mart could take months. Now, though, it can put together a ‘virtual data mart’ in days.  

The deployment of data virtualisation is one component of HealthNow’s ongoing implementation of master data management (MDM) architecture, says Rob Myers, an enterprise data warehouse solution architect at HealthNow: “We are going down an MDM path, standardising our data, but we are also using data virtualisation to create a ‘data abstraction layer’ throughout our entire organisation. That will become the ‘go-to’ place for data.” 

HealthNow selected Informatica, he adds, because the skills required were similar to those needed for the Informatica data warehousing software that the organisation already used.

A maturing technology

HealthNow’s adoption of data virtualisation is an example of the growing maturity and sophistication of both the technology and its use. According to IT analyst company Forrester Research, early adopters are moving on from tentatively integrating just a handful of sources, driven by the needs of particular projects. They are starting to devise enterprise-wide data virtualisation strategies that bring together hundreds of sources to deliver data in real time to a wider range of mission-critical applications.

Furthermore, they are widening their scope to include unstructured and semi-structured data, and moving beyond read-only use cases. Some are even integrating with external data sources, including Windows Azure Marketplace, Dun & Bradstreet and LexisNexis. The vendors in the space each offer different strengths. Early data virtualisation pioneer Composite Software, for example, developed its tools to be familiar to database developers proficient in standard SQL, making them easy to adopt and offering tight integration to support most use cases, according to Forrester Research.

A similar, ‘best-of-breed’ company is Denodo, says Forrester’s Brian Hopkins: “Composite has a bit more of a structured data approach, so it’s coming more from the background of connecting into a bunch of back-end databases, generating a virtual database on the fly. Denodo is historically stronger in unstructured web content aggregation and virtualisation.”

The two other companies categorised as leaders in Forrester’s ‘Wave’ market analysis are Informatica and IBM. For these two, data virtualisation is a component of a broader suite of data management tools.

“Informatica’s data virtualisation technology is an add-on to its entire data integration pipeline, so you can choose to integrate data virtually or physically,” says Hopkins. “IBM’s data virtualisation component has dozens of different products but it’s a very small piece of the company’s overall information services strategy, which includes quality, cleanliness, traditional integration and metadata management.”

Enterprise software vendors including SAP, Microsoft and Red Hat are bidding to catch up. “SAP has done well in data quality, transformation, integration and performance [while] Microsoft continues to show dramatic improvements in its data virtualisation offering,” states Forrester in its first-quarter 2012 Forrester Wave report. Red Hat, meanwhile, offers the sole open source option.

Like any data management technique, data virtualisation has its strengths and weaknesses, and circumstances in which it will or will not work. Critically, though, it offers a much-needed alternative to the monolithic model of data warehousing that many organisations have found expensive, over-ambitious and ultimately unsuccessful.
