How to overcome repetitive patterns that can’t be discerned from data

According to Cloudtweaks, the amount of big data is growing exponentially. Daniel Price writes that at least 2.5 quintillion bytes of data are created every day of the year.

By 2020, an estimated 40 zettabytes will be in existence. It is thought that the web currently consumes 500 exabytes of data, which amounts to 500 billion gigabytes or half a zettabyte.

Price adds that Amazon is the biggest big data company in the world, hosting more than 1 billion gigabytes of data across more than 1.4 million servers. In comparison, Google and Microsoft operate a stable of about 1 million servers, but to date they have refused to reveal the exact numbers.

This data is being created, emailed and stored online through everything businesses and consumers do with their computers, tablets and smartphones.

Each one of us creates data such as digital images, documents, voice and transactional data. With population growth, data volumes can only increase – but network latency remains an unsolved issue in spite of the availability of a number of WAN optimisation tools.

Network latency

For example, a customer wanted to transfer a 32GB video file over a 500MB satellite link with 600ms of latency, but the process took 20 hours to complete.

As file sizes get bigger and the need to transfer or back up larger amounts of data across the web or a virtual private network (VPN) increases, the time such transfers take will also surge.

Bridgeworks was nevertheless able to reduce the data transfer time to just over 10 minutes with its WANrockIT solution.
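Why latency can dominate a transfer like this can be illustrated with a rough estimate. A single TCP stream cannot move more than one window of data per round trip, so its throughput is capped at roughly the window size divided by the round-trip time. The sketch below rests on assumptions the article does not state: the link is taken to be 500 megabits per second, the 600ms figure is treated as the round-trip time, and the window sizes are illustrative. It is not a model of the customer's actual setup or of how WANrockIT works.

```python
def window_limited_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Rough upper bound on single-stream TCP throughput, in bytes per second."""
    return window_bytes / rtt_seconds


def transfer_hours(file_bytes: float, link_bytes_per_s: float,
                   window_bytes: int, rtt_seconds: float) -> float:
    """Estimated transfer time, limited by the slower of the link and the window."""
    effective = min(link_bytes_per_s,
                    window_limited_throughput(window_bytes, rtt_seconds))
    return file_bytes / effective / 3600


FILE_BYTES = 32 * 10**9            # 32GB file from the example
LINK_BYTES_PER_S = 500 * 10**6 / 8  # assumption: a 500 Mbit/s link
RTT_SECONDS = 0.6                   # assumption: 600ms is the round-trip time

# Illustrative window sizes: 64 KiB, 256 KiB and 64 MiB.
for window in (64 * 1024, 256 * 1024, 64 * 1024 * 1024):
    hours = transfer_hours(FILE_BYTES, LINK_BYTES_PER_S, window, RTT_SECONDS)
    print(f"window {window:>10} bytes: about {hours:.2f} hours")
```

Under these assumptions, a small window leaves most of the link idle, and the transfer time is set by latency rather than by bandwidth.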

In addition to latency, there is a need to overcome repetitive patterns that can’t be discerned from data. 

“Repetitive patterns are a natural part of data creation, such as fields in a spreadsheet that contain repeated numbers or a repeated sequence of characters which may be contained within a phrase,” explains David Trossell, CEO of Bridgeworks.

He adds that when data is compressed or deduplicated, these sequences “are substituted for a short reference number and put in a table that contains the same sequence”.
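The substitution Trossell describes can be sketched in a few lines. The example below is a simplified illustration, not any vendor's implementation: it assumes fixed-size chunks (the `chunk_size` of 8 bytes is arbitrary) and uses each chunk's position in the table as its reference number.

```python
def dedupe(data: bytes, chunk_size: int = 8):
    """Replace repeated chunks with short reference numbers into a table."""
    index = {}    # chunk -> reference number
    table = []    # reference number -> chunk
    refs = []     # the data, expressed as a list of reference numbers
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk not in index:
            index[chunk] = len(table)
            table.append(chunk)
        refs.append(index[chunk])
    return refs, table


def rebuild(refs, table) -> bytes:
    """Reverse the substitution by looking each reference up in the table."""
    return b"".join(table[ref] for ref in refs)


data = b"ABCDEFGH" * 4 + b"12345678" + b"ABCDEFGH"
refs, table = dedupe(data)
print(refs)    # [0, 0, 0, 0, 1, 0]
print(table)   # [b'ABCDEFGH', b'12345678']
assert rebuild(refs, table) == data
```

Six chunks of input are reduced to two table entries plus six small references, which is where the saving comes from when data repeats.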

Data compression and encryption at rest make it either difficult or impossible for the repetitive patterns to be discerned from the data. In the case of encryption, this occurs for security reasons because corporations want to keep prying eyes away from their data. As a result they have a requirement to encrypt it, creating an “inability for dedupe to occur”.

“Corporations use compression to reduce the amount of data sent over the link; so hardware and back-up vendors have begun to incorporate this feature and encryption for security into their products, and once compressed and encrypted you cannot add further layers of compression or deduplication as this can inflate the amount of data that is sent over the link compared to what was originally required,” says Trossell. He therefore thinks that third-party WAN dedupe or compression vendors no longer add as much value as they once did to the data transfer process.
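The effect of trying to compress data that no longer has visible patterns can be checked with a small experiment. This sketch uses Python's standard `zlib` module and treats random bytes from `os.urandom` as a stand-in for encrypted data, which is an assumption made for illustration; the sample record is made up.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means a saving)."""
    return len(zlib.compress(data)) / len(data)


# Highly repetitive input, like a repeated spreadsheet row.
repetitive = b"2024-01-01,GBP,100.00\n" * 5000

# Random bytes as a stand-in for encrypted data: no discernible patterns.
random_like = os.urandom(len(repetitive))

print(f"repetitive data: ratio {compression_ratio(repetitive):.4f}")
print(f"random-like data: ratio {compression_ratio(random_like):.4f}")
```

The repetitive input shrinks to a small fraction of its size, while the random-like input comes out no smaller and typically a few bytes larger, because the compressor adds its own framing without finding anything to remove.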

Recognising patterns

Phil Taylor, director and founder of Flex/50, explains why it is important to recognise repetitive patterns: “When large image files are transmitted across the wide area network (WAN) links, the usual TCP/IP packet structures are wrapped around the data, and multiple packets are assembled to transmit a single image file.” Within the packets often lies redundant header and trailer information.

He adds that there are also bit structures that, in his view, can be spotted as repetitive patterns: “If these packets are sent without tuning or optimisation, the protocol overhead can add significant delays to transmission.” Therefore organisations should lean heavily on increasing their levels of network reliability with, for example, an increased TCP/IP receive window size.
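How large a receive window would need to be depends on the bandwidth-delay product: the amount of data that can be in flight on the link at once. The sketch below computes that figure and then asks the operating system for a matching receive buffer through the standard `SO_RCVBUF` socket option. The link figures (500 Mbit/s and 600ms round trip) are assumptions for illustration, and the operating system is free to cap or adjust the requested size, so the value read back may differ from the one requested.

```python
import socket


def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> int:
    """Bytes that can be in flight on the link during one round trip."""
    return round(bits_per_second * rtt_seconds / 8)


def request_receive_buffer(sock: socket.socket, size_bytes: int) -> int:
    """Ask the OS for a receive buffer and return the size it reports."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    except OSError:
        pass  # some systems reject oversized requests; keep the current size
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Assumed link: 500 Mbit/s with a 600ms round-trip time.
needed = bandwidth_delay_product(500e6, 0.6)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    granted = request_receive_buffer(sock, needed)

print(f"bandwidth-delay product: {needed} bytes")
print(f"receive buffer reported by the OS: {granted} bytes")
```

A gap between the two printed figures is a sign that system-wide limits, rather than the application, are bounding the window.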

In contrast, Claire Buchanan, chief commercial officer at Bridgeworks, argues that ten years ago the bandwidth of the external network connections that were available to users was small, and so organisations exploited their maximum performance in a way that reduced the amount of data that was sent over the connections.

This gave birth to WAN optimisation, which continues, in her view, to do a good job when bandwidth is limited. All is not as it seems, though: reducing the amount of data sent over the pipe often only gives the impression that performance has improved.

Bryan Foss, a visiting professor at Bristol Business School and an independent board advisor, nevertheless comments that there has been a substantial amount of investment made in R&D to look at “compression and decompression techniques that can be run on distributed and low-power processors, such as those included in your smartphone, tablet, TV set-top box or personal video recorder”.

He cites Motive Television’s Tablet TV video services as an example of how such real-time technologies can be delivered over affordable network types.

Cold data, migration and storage

Buchanan moves on to talk about why cold data, migration and storage are significant to how companies can overcome repetitive patterns that can’t be discerned from data.

“In order to dedupe your data it needs to be run through tables several times to gain the maximum performance because the tables have to learn the data strings,” she explains. With regard to cold data and migration, she says this is usually a one-time process and so the benefits of ‘dedupe’ are negated.
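Buchanan's point about tables needing to learn the data can be shown with a toy model. In the sketch below, a chunk seen for the first time must be sent in full, while a chunk already in the table is replaced by a short reference. The chunk size of 64 bytes and the 4-byte reference cost are assumptions chosen for illustration, and the sample data is synthetic.

```python
REFERENCE_BYTES = 4  # assumed cost of sending a reference instead of a chunk


def bytes_sent(data: bytes, table: set, chunk_size: int = 64) -> int:
    """Bytes that cross the link for one pass, updating the table as it learns."""
    sent = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk in table:
            sent += REFERENCE_BYTES   # already learned: send a reference
        else:
            table.add(chunk)          # first sighting: learn it, send in full
            sent += len(chunk)
    return sent


# Synthetic data in which every 64-byte chunk is different.
data = b"".join(i.to_bytes(64, "big") for i in range(1000))

table = set()
first_pass = bytes_sent(data, table)
second_pass = bytes_sent(data, table)
print(f"first pass:  {first_pass} bytes")   # 64000: nothing learned yet
print(f"second pass: {second_pass} bytes")  # 4000: every chunk is known
```

A one-time migration only ever gets the first pass, which is why the table delivers no saving in that case.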

“Cold data is inactive data that is rarely accessed and that could be stored further down the storage hierarchy, so this class of data is often held for retention reasons – it is not generally part of the high-speed access storage sub-system,” argues Taylor. He explains that the problem is that if the data is permitted to build up “without proper hierarchical treatment, then it can become part of the volume of data that is subject to remote network back-up”. Amongst the issues that arise from this is wasted network bandwidth during bulk transmission jobs, but archiving to lower cost devices can help.
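One way to keep cold data out of a remote back-up job is to split files by how recently they were accessed. The sketch below is a minimal illustration under stated assumptions: the file paths and dates are hypothetical, the 180-day threshold is arbitrary, and a real policy would also need to account for retention rules.

```python
from datetime import datetime, timedelta


def split_by_access(last_access: dict, now: datetime, cold_after_days: int = 180):
    """Split files into (hot, cold) lists by time since last access."""
    cutoff = now - timedelta(days=cold_after_days)
    hot = sorted(path for path, when in last_access.items() if when >= cutoff)
    cold = sorted(path for path, when in last_access.items() if when < cutoff)
    return hot, cold


# Hypothetical files and last-access dates.
now = datetime(2024, 6, 1)
last_access = {
    "reports/q1.xlsx": datetime(2024, 5, 20),
    "video/launch.mp4": datetime(2022, 3, 14),
    "archive/2019-ledger.csv": datetime(2020, 1, 2),
}

hot, cold = split_by_access(last_access, now)
print("include in remote back-up:", hot)
print("archive to lower-cost storage:", cold)
```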

Taylor recommends that unnecessary data shouldn’t be sent over the network. He advises businesses to evaluate a wide range of solutions before deciding which one best suits their needs.

Ben Rossi
