With an annual budget of $550 billion, the US Department of Defense is the largest government department in the world. It employs 700,000 civilians and around 2.5 million soldiers – more people than any other organisation on the planet.
Since its inception in 1949, the DoD has played a pioneering role in information technology. Its research arm, DARPA, invented the precursor to the Internet, ARPAnet, making its record for world-changing innovation pretty much unbeatable.
But even with these credentials, the DoD has not escaped one of the most crippling afflictions in enterprise IT – siloed legacy data sources.
As Dennis Wisnosky, chief architect and chief technology officer for the DoD’s Business Mission Area, explains below, a recent attempt to unify the human resources databases of the various military services took 11 years and $1 billion before being scrapped as a failure.
Now, though, under Wisnosky’s leadership, the DoD is solving that problem with one of DARPA’s own inventions: semantic technology. First developed to help intelligence agencies process data, the technology is being rolled out by Wisnosky to let users query data across the legacy silos without the need to create a data warehouse.
Here, Wisnosky tells Information Age how he came to discover semantic technology, his plan for gradually introducing it to the senior management at the Pentagon and what he expects the benefits to be.
Information Age: What was it that prompted the search for a new approach to data?
Dennis Wisnosky: There was a giant project here at the Department of Defense called DIMHRS (Defense Integrated Military Human Resources System), which was designed to be a single integrated personnel system that could access all the Army, Navy, Air Force and Marine Corps human resources databases around the world.
But these were all static relational databases, and the cost and complexity of connecting them was extremely high. The services couldn’t agree on any shared definitions, and the database schemas that they used were changing even as we were trying to connect them together.
After 11 years and about $1 billion had been spent on the project, it became clear that the interconnection problems were never going to be solved. In January 2010, former deputy secretary of defense Gordon England cancelled the DIMHRS programme, just as he announced he was leaving office.
What alternatives did you consider?
Mr England told Congress that, instead of DIMHRS, we were going to build an enterprise data warehouse. But it had always been part of the DIMHRS plan to build a data warehouse. It didn’t take a rocket scientist to figure out that if you’d tried to build one for 11 years and couldn’t, it wasn’t going to happen.
So we looked for approaches that did not involve a data warehouse, and we came across something called data virtualisation. This means that when you run a query, you reach into your authoritative data stores and aggregate the data in real time. The data doesn’t persist – it’s only there while you answer your question.
We thought that was a great idea, but data virtualisation uses translations – you translate the data into a common definition when you run the query. For example, in an Army database, a definition of a ‘soldier’ might be encoded in one way, but in the Marine Corps database, the definition of a ‘marine’ will be encoded in a different way, so you would have to translate them both to ‘service member’ when you run a query. Every time you do one of these translations it consumes processor cycles, and you end up with a very slow system, so we looked for other ways of doing it.
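To make the trade-off concrete, here is a minimal Python sketch of query-time translation as described above. It is an illustration only, not the DoD's system: the records, field names and the `virtual_query` helper are all hypothetical. Each source row is translated into the shared 'service member' definition every time a query runs, and nothing is kept afterwards.

```python
# Hypothetical source data; field names and codes are invented for illustration.
ARMY_DB = [
    {"id": "A1", "type": "SOLDIER", "name": "Lee"},
    {"id": "A2", "type": "SOLDIER", "name": "Ortiz"},
]
MARINE_DB = [
    {"id": "M1", "kind": "marine", "name": "Kim"},
]

# Counts how many per-row translations have been performed so far.
TRANSLATIONS = {"count": 0}


def army_to_common(row):
    """Translate the Army encoding into the shared definition."""
    TRANSLATIONS["count"] += 1
    return {"id": row["id"], "name": row["name"],
            "role": "service member", "source": "army"}


def marine_to_common(row):
    """Translate the Marine Corps encoding into the shared definition."""
    TRANSLATIONS["count"] += 1
    return {"id": row["id"], "name": row["name"],
            "role": "service member", "source": "marine_corps"}


def virtual_query(predicate):
    """Read every source, translate each row on the fly, return the matches.

    The translated records exist only for the duration of the call.
    """
    results = []
    for rows, translate in ((ARMY_DB, army_to_common),
                            (MARINE_DB, marine_to_common)):
        for row in rows:
            record = translate(row)  # repeated on every query
            if predicate(record):
                results.append(record)
    return results


members = virtual_query(lambda r: r["role"] == "service member")
```

Because the translation runs once per row per query, asking the same question twice doubles the translation work, which is the cost the answer above refers to.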
How did you come across semantic technology?
I was presenting at a conference, and I described the situation with DIMHRS and said that we needed something new. After the talk, someone came up to me and said ‘the Department of Defense already invented the new way’. He was referring to semantic technology, which was developed in the 1990s by DARPA [the Defense Advanced Research Projects Agency] for the intelligence community, which has the job of gathering and analysing both structured and unstructured data from around the world, in as near real time as possible.
How would you explain the technology?
Semantic technology involves defining the meaning of concepts. Using the previous example, it means understanding that the meaning of the words ‘soldier’ and ‘marine’ is ‘service member’.
In semantic technology, the record of all meanings is the ontology, which is made up of RDF [Resource Description Framework] triples. They are called that because they have three components: the subject, the predicate and the object. Examples might be ‘[Dennis] [is] [a person]’ and ‘[Dennis] [works for] [the DoD]’. Put those together and you know that Dennis is a person who works for the DoD.
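The triple structure described above can be sketched in a few lines of Python. This is a simplification for illustration: real RDF identifies subjects and predicates with IRIs and is normally handled by a dedicated library, whereas plain strings, the predicate wording and the `match` helper here are assumptions.

```python
# Each triple is (subject, predicate, object), stored as plain strings.
triples = {
    ("Dennis", "is a", "person"),
    ("Dennis", "works for", "the DoD"),
    ("soldier", "is a kind of", "service member"),
    ("marine", "is a kind of", "service member"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Everything known about Dennis.
about_dennis = match(triples, subject="Dennis")

# Every term whose meaning is 'service member'.
kinds_of_service_member = [
    s for s, _, _ in match(triples, predicate="is a kind of",
                           obj="service member")
]
```

Querying by pattern rather than by table and column is what lets ‘soldier’ and ‘marine’ both resolve to ‘service member’ without either source changing its own vocabulary.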
How does that link back to legacy systems?
There is a standard from the World Wide Web Consortium (W3C) called R2RML, which can be used to translate relational databases into triple stores. We also have another technology called SPARQLizer, which converts SPARQL, the query language for RDF, into SQL, the query language for relational databases. So you run a query based on the definitions in your ontology, and SPARQLizer converts it into a SQL query and it goes into the relational database.
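The general idea of rewriting an ontology-level query into SQL can be illustrated with a toy example in Python. This is not R2RML syntax (real R2RML mappings are themselves written in RDF) and it is not how SPARQLizer works internally; the `MAPPING` dictionary, the table and column names and the `pattern_to_sql` function are all hypothetical.

```python
import sqlite3

# Hypothetical mapping: ontology predicate -> (table, subject column, value column).
MAPPING = {
    "hasName": ("personnel", "person_id", "full_name"),
    "speaksLanguage": ("language_skill", "person_id", "language"),
}


def pattern_to_sql(predicate, obj=None):
    """Turn the triple pattern (?subject, predicate, obj) into parameterised SQL."""
    table, subject_col, value_col = MAPPING[predicate]
    sql = f"SELECT {subject_col}, {value_col} FROM {table}"
    params = ()
    if obj is not None:
        sql += f" WHERE {value_col} = ?"
        params = (obj,)
    return sql + f" ORDER BY {subject_col}", params


# A stand-in legacy relational database with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personnel (person_id TEXT, full_name TEXT);
    CREATE TABLE language_skill (person_id TEXT, language TEXT);
    INSERT INTO personnel VALUES ('p1', 'Lee'), ('p2', 'Ortiz');
    INSERT INTO language_skill VALUES ('p1', 'Arabic'), ('p2', 'French');
""")

# "Who speaks Arabic?" expressed against the ontology, answered by the database.
sql, params = pattern_to_sql("speaksLanguage", "Arabic")
rows = conn.execute(sql, params).fetchall()
```

The person asking the question only needs the ontology's vocabulary (`speaksLanguage`); the mapping decides which table and column that corresponds to, and the data stays in the relational database.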
How did you decide whether or not it would work for you?
After I heard about the technology, I went to the DoD’s deputy chief management officer and said I had this idea that we really need to try. I said I want to establish a little team, and come back to you every 90 days with some results. People thought that I was looney, because 90 days is the speed of light here in the Department of Defense.
So we formed a team and set out to prove that we can take data from two separate sources, federate it easily and come back with a result. For example, one of the first proofs of concept we did was to answer the question: how many service members do we have in Afghanistan that can speak Arabic?
It’s a simple question, but you need answers quickly. And it worked.
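The kind of two-source question described above can be sketched as follows in Python. The data and the `count_deployed_speakers` function are invented for illustration and assume both sources already share a member identifier, which is exactly the agreement the ontology is meant to provide.

```python
# Hypothetical source 1: deployment records.
DEPLOYMENTS = [
    {"member_id": "m1", "location": "Afghanistan"},
    {"member_id": "m2", "location": "Afghanistan"},
    {"member_id": "m3", "location": "Germany"},
]

# Hypothetical source 2: language skills, held in a separate system.
LANGUAGE_SKILLS = [
    {"member_id": "m1", "language": "Arabic"},
    {"member_id": "m2", "language": "Spanish"},
    {"member_id": "m3", "language": "Arabic"},
]


def count_deployed_speakers(location, language):
    """Count members who appear in both sources with the given attributes."""
    deployed = {r["member_id"] for r in DEPLOYMENTS
                if r["location"] == location}
    speakers = {r["member_id"] for r in LANGUAGE_SKILLS
                if r["language"] == language}
    # Federation step: combine the answers from the two sources.
    return len(deployed & speakers)


arabic_speakers_in_afghanistan = count_deployed_speakers("Afghanistan", "Arabic")
```

Neither source is copied into a warehouse; each is asked its own part of the question and only the matching identifiers are combined.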
What is the implementation plan?
After the first proof of delivery, we set up a four-year plan and a detailed two-year plan. The plan has two parts: the first is building the ontology and keeping up to date with state-of-the-art semantic technology, and the second is looking at new business problems to solve.
So far, we’ve done eight 90-day proofs of delivery (PODs). On the ontology side, with every POD we add more data and use more of the standards that the community has developed. That keeps us on track with what the rest of the industry is doing – we don’t want to be behind the industry, but we also don’t want to be too far ahead, so we can recruit people from outside the organisation to work on this.
On the business problem side, we are looking for areas where we can prove that this technology can help us. For example, during the Haiti earthquake, the DoD needed to find any service members who could speak Haitian Creole or Haitian French, who could be deployed in 24 hours and who had 12 months or more service time left. In one of the PODs, we showed that we could answer the question much more quickly than had been the case using the traditional methods.
When will the project go live?
So far, none of our PODs has used live data. But from March 30th 2012, the tools that certain senior personnel at the Pentagon use to answer questions – for example when they have to give information to Congress – will be querying the semantic information.
But I don’t like to think of this as a ‘switchover’ to a new system – the DIMHRS project was about trying to do an instant switchover, and it didn’t work. This is about gradually getting familiar with this technology at a senior level, and a long-term conversion from one way of thinking to another.
What do you expect the financial benefits to be?
Well, we know how much we spent doing this the old-fashioned way, and we know our percentage of success, which wasn’t very good. Semantic technology is based on industry standards, so when new data sets are created using these RDF triples, the cost of building and accessing these large data stores just has to go down.
We’re in talks all the time with people on the other side of the river here in Washington about how we can make sure that we’re moving in the same direction, and that our data can be linked when it needs to be. Projects like [US open government portal] data.gov and [tax transparency site] recovery.gov are all beginning to use the same technology.
What about the organisational benefits?
There is a concept in semantic technology called ‘provenance’. This is about building trust in the data that you have. Ultimately, I think we will be able to have unequivocal trust in the data that we convert into information (for a given point in time – all data is temporal).
Personally, I think this is what will have the biggest impact. We’re going to learn how to trust the answers that we get from data a whole lot more than we do today. That will lead to faster decisions we need to make now, and more certainty in decisions we have to make for the future.
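One way to picture provenance and the point-in-time caveat above is to attach a source and a date to every statement. The Python sketch below is an assumption-laden illustration, not the DoD's design: the `Statement` fields, the source name and the `latest_as_of` helper are hypothetical, and the records are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str   # which system asserted this (provenance)
    as_of: date   # when the assertion was recorded (all data is temporal)


# Invented records from a hypothetical HR system.
STATEMENTS = [
    Statement("member-1", "stationed in", "Germany",
              "hypothetical-army-hr", date(2011, 3, 1)),
    Statement("member-1", "stationed in", "Afghanistan",
              "hypothetical-army-hr", date(2011, 9, 1)),
]


def latest_as_of(statements, subject, predicate, when):
    """Return the most recent matching statement recorded on or before `when`."""
    candidates = [
        s for s in statements
        if s.subject == subject and s.predicate == predicate and s.as_of <= when
    ]
    return max(candidates, key=lambda s: s.as_of, default=None)


mid_2011 = latest_as_of(STATEMENTS, "member-1", "stationed in", date(2011, 6, 1))
early_2012 = latest_as_of(STATEMENTS, "member-1", "stationed in", date(2012, 1, 1))
```

Each answer carries its source and date with it, so a reader can see where a fact came from and for which point in time it holds.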