
Antidote to data cloning

Deduplication is the hottest technology in storage, promising to purge bloated databases and bulging email servers

A massive 75% of all the data held by businesses is replicated – the same information cloned many times over and held across numerous storage servers and individual PCs.

“Terabytes, if not petabytes – is replicated data,” says Steve Mills, the head of IBM’s $20 billion software business.

That is a wasteful and expensive practice. According to research by IDC, the amount of information being created and replicated is growing at 60% a year. That growth is costing organisations dearly: not just in the additional storage units needed to hold all that data, but in the data centre space those units occupy, the energy consumed to keep them spinning and the manpower it takes to ensure the availability and integrity of their contents.

“Most people managing that [situation] don’t get 60% more budget a year,” observes Joe Tucci, CEO of storage systems giant EMC. So, with IDC predicting that the amount of data held globally will have grown tenfold over the period 2006 to 2011 (to 1,773 exabytes), storage managers are hungry for an answer.

Put simply, that means finding a way to stop replicating so much data. But only relatively recently has technology started to emerge that can deal with the innate complexity of scanning for copies of the same data within files. Enter the storage industry’s panacea for runaway data proliferation – deduplication.

Unlike previous ‘solutions’, which have tried to implement single-instance storage (SIS) to eliminate the holding of multiple copies of the same file, ‘dedupe’ scans at a sub-file level, removing redundant sections of data and replacing them with pointers to a master copy.

With SIS, a PowerPoint presentation emailed to everyone in a sales and marketing department, for example, will be kept only once – as long as it remains unchanged. But as different sales executives update the cover page of the presentation with their own name and title, the whole document is stored each time.

The new breed of dedupe technologies goes deeper. By scanning the blocks of data that make up that document file (in fixed- or variable-length segments), they keep a single copy of the static part of the document and small separate parts where it has changed. Another example would be a company logo that appears on all corporate correspondence, office stationery and every page of every presentation – with dedupe, the logo would only be stored once.
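The presentation example can be made concrete with a minimal sketch. It assumes fixed 4 KB blocks and SHA-256 digests as block identifiers; the block size, the file names and the functions `dedupe_store` and `restore` are illustrative choices, not any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes


def dedupe_store(files, block_size=BLOCK_SIZE):
    """Store files as lists of pointers into a shared pool of unique blocks."""
    pool = {}     # block digest -> block bytes (the single master copy)
    recipes = {}  # file name -> ordered list of digests (the pointers)
    for name, data in files.items():
        pointers = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            pool.setdefault(digest, block)  # keep only the first copy seen
            pointers.append(digest)
        recipes[name] = pointers
    return pool, recipes


def restore(name, pool, recipes):
    """Rebuild a file by following its pointers back to the stored blocks."""
    return b"".join(pool[d] for d in recipes[name])


# Two versions of a presentation: identical body, different cover page.
body = b"".join(bytes([i]) * BLOCK_SIZE for i in range(1, 10))  # nine distinct blocks
deck_a = b"A" * BLOCK_SIZE + body
deck_b = b"B" * BLOCK_SIZE + body

pool, recipes = dedupe_store({"deck_a.ppt": deck_a, "deck_b.ppt": deck_b})
logical_blocks = sum(len(r) for r in recipes.values())  # blocks before dedupe
stored_blocks = len(pool)                               # blocks actually kept
print(logical_blocks, stored_blocks)
```

File-level SIS would store both decks in full, because they differ. Here the shared body is stored once and only the two cover-page blocks are kept separately. A fixed block size is the simplest case: if an edit inserts bytes and shifts everything after it, block boundaries no longer line up, which is the problem variable-length segmenting is meant to address.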

Aside from email servers, the ramifications for backups, virtualisation and disaster recovery, all of which involve large quantities of replicated data, are enormous. Indeed, in some views, where there is data there will be a dedupe engine to pare it down. “Deduplication is going to exist in every part of your business… in many forms,” says Tucci of EMC, which sells deduplication under the Avamar brand.

The reason that it will become a pervasive technology is that the payback is already proving compelling. “We’ve seen 20 times compression with a database, and very aggressive deduplication rates with [virtualisation technologies such as] VMware. We really chew that up – sometimes we get 40 to 60 times compression,” says Beth White, vice-president of marketing with dedupe specialist Data Domain.

One of the more established players in the field, Data Domain has recently seen a rush into dedupe by the major storage companies, often through acquisition. Notably, EMC bought Avamar 18 months ago and IBM came late to the party with the acquisition of Diligent Technologies in April.

Others, such as Network Appliance (NetApp), have created and integrated their own versions of dedupe into existing storage appliances.

“People are not looking solely for dedupe – they’re looking for the infrastructure around it,” says Aad Dekker, NetApp’s director of solutions marketing, explaining that dedupe adoption has been driven largely by the trend towards virtualisation and the “80% to 95% reductions in storage” that can be achieved there.

Another force in deduplication, Quantum, argues that there are multiple levels of sophistication, and that the critical elements are the algorithms that underpin the search for multiple copies of the same data. “It’s not as much about the technology as it is about ownership of the dedupe patent, and there aren’t that many,” explains Mark Galpin, the company’s product marketing manager. “Now it’s more about who’s licensed to use them.”

Quantum would know – it has a reputation for fiercely defending its own patent on variable-length dedupe, according to Galpin. A patent infringement case against Riverbed was thrown out earlier this year over a technicality, while a similar dispute with Data Domain saw the smaller vendor hand over 390,000 shares in a cross-licensing agreement just prior to its IPO in 2007.

Copy killer

While dedupe has been accepted in backup and recovery for two or three years, according to Galpin, it is now being innovatively applied to other areas. Besides reducing the indirect costs associated with storage, such as physical space, cooling and infrastructure, the data reduction qualities of dedupe are especially attracting attention.

In wide area network (WAN) optimisation, dedupe is being used by the likes of Silver Peak and Riverbed to speed up data movement over distance by minimising the amount of duplicate data sent. This is particularly useful for dispersed offices, where it enables much faster remote backup: after an initial full backup, the office runs a small deduping appliance locally, and thereafter only the changes are sent.

“You reduce at the edge and store at the centre,” explains Galpin, “which is particularly good for those that don’t have a big, thick link to the organisation’s data centre. Having the data remote to where it is backed up is also good disaster recovery [practice], especially when there’s a limit to how much you can store at the edge.”
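A rough sketch of that “reduce at the edge, store at the centre” exchange follows. It assumes fixed 4 KB chunks identified by SHA-256 digests; the `Centre` class and `backup_from_edge` function are hypothetical names used for illustration, not a description of any vendor's protocol.

```python
import hashlib


def chunks(data, size=4096):
    """Split data into fixed-size chunks (the size is an assumed value)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Centre:
    """Hypothetical central store holding one copy of each chunk."""

    def __init__(self):
        self.store = {}

    def missing(self, digests):
        # Tell the edge which digests the centre has not seen yet.
        return {d for d in digests if d not in self.store}

    def receive(self, new_chunks):
        self.store.update(new_chunks)


def backup_from_edge(data, centre):
    """Send only chunks the centre lacks; return the payload bytes sent."""
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks(data)}
    wanted = centre.missing(by_digest.keys())
    centre.receive({d: by_digest[d] for d in wanted})
    return sum(len(by_digest[d]) for d in wanted)


centre = Centre()
day1 = b"".join(bytes([i]) * 4096 for i in range(8))  # eight distinct chunks
day2 = day1[:4096 * 7] + b"\xff" * 4096               # one chunk changed overnight

first_backup = backup_from_edge(day1, centre)   # initial full backup
second_backup = backup_from_edge(day2, centre)  # only the changed chunk travels
print(first_backup, second_backup)
```

The sketch counts only chunk payloads. A real exchange would also send the list of digests so the centre can reassemble each backup, but digests are small compared with the chunks they stand for.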

While proving vital to speeding backups, dedupe is also seen as having potential when applied to primary storage – although the jury is still out on what form that will take. “Primary storage is not quite there yet,” Galpin says, agreeing with Beth White of Data Domain, who observes, “Primary storage is not going to see as much redundancy anyway.”

But Mark Sorensen, senior VP of EMC’s Storage Division, believes that dedupe will be part of the primary storage picture: “In the next generation of products, deduplication will become as ubiquitous as RAID. It will become part and parcel of all our storage capabilities – not just in backup, but in 2009 you’ll see us roll it out in our primary storage platforms – all designed to make storing information vastly more efficient.”

The mathematics behind dedupe have been around for many years – one developer confided that it was “the kind of problem you would find in an undergraduate IT programme”.

David Messina, IBM’s manager of business development for systems storage, believes dedupe has now come through the evangelistic selling stage and is being accepted as a must-have technology. “When we ask customers ‘Would you buy one?’ there is very little hesitance compared with six to eight months ago,” he says.

Deduping dedupe

While storage growth of 60% a year may make dedupe a suspiciously easy sell – especially considering many appliances can be plugged in and rescuing disk space within an hour or two of opening the box – there are a multitude of factors to consider before adoption.

Some vendors still market file-based SIS as deduplication, usually as an option for a piece of infrastructure such as a database server. “SIS provides anaemic dedupe compared with a fixed- and variable-length dedupe segment,” White explains.

Another decision faced when implementing dedupe is whether to choose ‘post-process’ or ‘in-line’ deduplication. Vendors divide into two camps on that point, and fervently argue about which way is better.

Post-process vendors, like FalconStor, Sepaton and Exagrid, write data to disk or a ‘staging area’ before deduping it. In contrast, Data Domain, Diligent/IBM and Avamar/EMC dedupe ‘in-line’, as the data is sent. Post-process dedupe requires enough disk space for the initial write, and often entails a disk system purchase, but can offer greater flexibility as to when the CPU-intensive activity runs (overnight, for instance, if cycles are at a premium).

In-line dedupe saves that disk space at the potential cost of a performance hit while the data is being received, but is generally quicker overall, since the backup it is working on doesn’t need to be written twice. It is also highly scalable.

“As processors get faster, we get faster,” White says. “We rely on Intel, not Seagate [disk drives].”

IBM also swears by in-line: “It’s more compatible and easier to adapt to traditional technologies like encryption,” says Messina.

Quantum offers both, for the benefit of the indecisive.

Need for speed

Dedupe vendors like to boast about (and market) their dedupe speeds and compression rates, with Data Domain taking the record back from Diligent in May (170 Mbps for a single backup stream and 388 Mbps for multiple streams).

However, “your mileage may vary”, cautions Galpin, particularly when it comes to final compression rates. “Generally speaking, archives achieve 6:1 and backup 10:1. Some vendors promise between 20:1 and 50:1, but it depends on the type of data – some customers get 50:1 in a week, some will never see 10:1.”

The variation obviously depends on the amount of duplication present. “Dedupe needs redundant data, so things like seismic or geological data that are always new, or video data, don’t show a great effect,” White explains.
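Vendors quote savings both as ratios (10:1) and as percentage reductions (90%), and the two are easy to confuse. The short sketch below converts between them, taking a ratio of n:1 to mean that n units of original data occupy one unit after dedupe; the function names are illustrative.

```python
def reduction_percent(ratio):
    """Percentage of space saved for a dedupe ratio of ratio:1."""
    return 100 * (ratio - 1) / ratio


def ratio_from_reduction(percent):
    """Dedupe ratio (the n in n:1) implied by a percentage reduction."""
    return 100 / (100 - percent)


for ratio in (6, 10, 20, 50):
    print(f"{ratio}:1 -> {reduction_percent(ratio):.1f}% less storage")

for percent in (80, 95):
    print(f"{percent}% reduction -> {ratio_from_reduction(percent):.0f}:1")
```

The relationship flattens quickly: moving from 10:1 to 20:1 takes the saving from 90% to 95%, so a doubling of the headline ratio frees only a further five percentage points of the original capacity.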

Data integrity

If anything has hindered the uptake of deduplication, it is a natural reticence among storage executives to start deleting current data.

Vendors claim that the fears about data integrity are unwarranted: “We have built checks into the system to ensure data gets written successfully; it’s there and recoverable all the way through. We encourage testers to abuse the system and kick the power cord – when the power comes back, the appliance works out what happened and starts where it left off.”

However, there is a ‘cosmically insignificant’ but finite possibility that the method many vendors use to dedupe – an approach known as ‘hashing’ – can cause a ‘hash collision’, resulting in corrupted data.

Hashing scans data and assigns hash numbers at a sub-file level, then watches for those numbers to recur during the deduping process. If two hash numbers match, this indicates duplicate data: the duplicate is removed and a pointer is left to identify the original. But an inherent flaw in this process is the possibility of a hash collision, whereby the same hash number is coincidentally assigned to two different pieces of data. The second piece is then wrongly discarded as a duplicate, and the data is corrupted without the system knowing.
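To give a feel for how small that possibility is, the sketch below applies the standard birthday-bound estimate. Every figure in it is an assumption for illustration: an ideal hash with uniformly random 160-bit outputs, 8 KB chunks and one petabyte of unique data. Real products may use different hashes and chunk sizes.

```python
import math


def collision_probability(num_chunks, hash_bits):
    """Birthday-bound estimate of at least one collision among num_chunks
    distinct chunks, assuming an ideal hash with uniformly random outputs."""
    pairs = num_chunks * (num_chunks - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)


# Assumed example: one petabyte of unique data in 8 KB chunks, 160-bit hash.
chunks_in_petabyte = 2 ** 50 // (8 * 1024)  # 2**37 chunks
p = collision_probability(chunks_in_petabyte, 160)
print(f"estimated collision probability: {p:.2e}")
```

Under these assumptions the estimate is roughly one in 2^87. A system that wants to rule out the risk entirely can compare the actual bytes whenever two hash numbers match, at the cost of an extra read.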

“There’s been a lot of discussion about hash collisions,” says Beth White. “The possibility is very, very remote. It’s much more likely that there will be other problems or errors and corruptions.”

Remote as that chance is, it could affect the legal admissibility of deduped data, according to Bob Farkaly, vice-president of worldwide sales for Overland Storage.

“The question seems to be whether or not a sub-file-level deduplicated document can be presented as an unaltered original to a court of law in a compliance situation,” says Farkaly.

Taped up

These concerns over data integrity, along with the desire of most backup managers to have more than a single instance of their data, mean dedupe’s disk-centric approach is unlikely to topple tape as a backup medium in the foreseeable future.

“Tape will have a place for a long time – it’s a good long-term archiving medium and a comfort choice,” says White, “but it’s notoriously unreliable. Gartner found that 15% of tape recoveries fail. If you’re in the middle of a lawsuit then that’s a big deal.”

In many opinions, dedupe’s efficiencies will encourage a move to disk backup. “We think there will be some movement towards tapeless VTL [virtual tape libraries] among our smaller customers, who can get away with storing less data, but we also think there will be continued use of tape with dedupe, hopefully making it more efficient,” says IBM’s Messina.

As for dedupe itself, Messina predicts that the niche will split into two strands of growth – established players in areas where the technology is already applicable, such as backup and disaster recovery, and smaller innovators experimenting with new applications of the technology.

“I think there’s room for a lot of innovation,” he says. “Although it doesn’t sound very ‘gee-whiz’. We plan to adapt it for mainframe VTLs, then potentially for data protection, and in the future possibly apply it to disk arrays.”

Whether dedupe is then seen as a feature or capability or standalone product, it is certainly going to be a central issue in the storage sector in coming years.

As NetApp’s Aad Dekker says: “It’s a phenomenon that can’t be stopped.”

Gnarly wipe out, dude!

Surf- and snow-wear company O’Neill was drowning in data. Credited as the inventor of the modern wetsuit – and now one of the most coveted brands in retail – it had so much data to deal with that it didn’t have enough time to back up each night.

“We used to back up 1.4 terabytes,” says the company’s global IT service and infrastructure manager, Peter Malijaars – a process that would take up to 14 hours.

The answer to that was not found in souped-up processing power or faster disk drives, but in removing duplicated data.

O’Neill introduced deduplication appliances from storage specialist Data Domain to strip replicated data from the backup environments of five of its European sites. In doing so it reduced its volume of stored data by a factor of 18, cutting its backup window from 14 hours to just two.

“Now we [are able to] back up all the archives as well – 5.3 terabytes. Since we’re a clothing company, there are a lot of old designs, but now we only have to back up the changes,” says Malijaars.

Prior to adopting dedupe, O’Neill Europe used a disparate collection of tape machines and a physical service that collected tapes every day and stored them in a secure rented room.

Besides the long backup window, “We saw that every day we had more data, and often a lot of it was the same,” Malijaars says, adding that the tape backup occasionally failed “and you can’t verify exactly what’s on a tape after a 24-hour backup.”

“We knew we had a problem. ICT made the decision [to adopt dedupe], and the only thing business said was to make sure we avoided performance problems and that everything continued to be available 24/7.”

“Now I can verify the backup and restore, disaster recovery is much easier; it only takes two hours,” he says.


By JJ Robinson, jj.robinson@vitessemedia.co.uk