Don’t drown in a data lake, or rather a data swamp

Google the term ‘data lake’ and you’ll get nearly 14 million hits. It’s an intriguing topic that’s gained quite a bit of attention. The idea behind a data lake is to have one central platform to store and analyse every kind of data relevant to the enterprise, still in its native format. Being able to combine these various types of data has the potential to bring insights in ways never before imagined.

Many organisations are implementing data lakes as a way to ensure they have access to every type of data that may be useful now or in the future, so it’s no surprise that it’s one of the hottest topics in business intelligence today.

Data is being captured from every transaction and interaction, by both humans and machines. With 2.5 exabytes of data being created every day, what organisation wouldn’t want to get its hands on that to drive business engines and better-quality decisions?

Until recently, storing high volumes of structured and unstructured data was cost prohibitive, and storing unstructured data was just not feasible. The recent evolution of storage technology has played a big role in the rise of big data and data lakes.

Higher density hard drives and solid state drives have significantly reduced the cost of storage. As an example, Amazon S3 cloud storage prices have dropped 86 percent since 2010. As storage has become cheaper, more data is being stored in its native format in the hopes of finding nuggets of information now and in the future.

So, with nearly 14 million articles in Google about data lakes, why is it so difficult to find true data lake success stories? That’s because there aren’t many. Unfortunately, the very same benefits that gave rise to data lakes are the very reasons companies struggle to get value out of them. Companies are storing all kinds of data, just because they can, creating massive repositories that are inaccessible to end users. It then becomes a ‘data graveyard’, the place where good data goes to die.

The dark side of data lakes

Data lakes have been sold as the way to get value out of large volumes of data. Finally, a way to store all kinds of data at relatively low cost, and be able to analyse it too. If only it were that easy. Unfortunately, most data lakes lack governance, many lack the tools and skills to handle large volumes of disparate data, and many lack a compelling business case.

Storing data of all types and varieties in a central platform sounds good on the surface but can create additional issues. Gartner predicts that through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient. Even worse, data can be added to the data lake and accessed without regard to security or access control. This could lead to serious breaches of regulations and data privacy law.
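To make the metadata and access-control point concrete, here is a minimal sketch of a catalogue that refuses datasets arriving without an owner, a description or a list of permitted roles. The names `DatasetEntry`, `Catalogue`, `register` and `can_read` are hypothetical and chosen for illustration; a real data lake would rely on a dedicated catalogue and security product rather than a few lines of Python.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalogue record for one dataset in the lake."""
    name: str
    owner: str
    description: str
    contains_personal_data: bool
    allowed_roles: set = field(default_factory=set)


class Catalogue:
    """Illustrative in-memory catalogue; not a real product's API."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse data that arrives without basic metadata or an access list.
        if not entry.owner or not entry.description:
            raise ValueError(f"{entry.name}: owner and description are required")
        if not entry.allowed_roles:
            raise ValueError(f"{entry.name}: at least one allowed role is required")
        self._entries[entry.name] = entry

    def can_read(self, dataset_name, role):
        # Unknown datasets are denied by default.
        entry = self._entries.get(dataset_name)
        return entry is not None and role in entry.allowed_roles
```

The design choice worth noting is that the check happens at the point of entry: data that cannot be described or secured never reaches the lake in the first place.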

Although many tools can now handle both the variety and volume of data, they are either hard to use or do not have a complete feature set. Undoubtedly, more options will arrive, but there is still a lack of comprehensive and strong solutions. The high volume and velocity of data in data lakes make it difficult for manual discovery to keep up. And, without a solid business use case, it becomes an exercise in chaos and disorder.

How to stay afloat in your data lake

With so many issues, you might conclude that you should shy away from data lakes altogether. Fortunately, that’s not the case. With proper forethought and setup, organisations can get started with data lakes in ways that provide immediate value and set a roadmap that lets the lake evolve with future needs.

First of all, if you are currently working under the “save all data” strategy, thinking this will be a catch-all solution for data, stop now. It’s easy to get caught up in the notion of going after all data on the premise that it might be useful someday, but that is exactly what creates the murky data swamp. Adding more irrelevant data only makes it murkier.

The best strategy for data lakes is to only collect data that is useful now. Data loses its value over time and if you can’t find what you’re looking for in the mess that is the data swamp, it’s pointless to keep adding to it. Projects should only go after sources that can provide useful solutions to clearly defined business problems.

That leads to the second point, which is to create one or more business use cases that lay out exactly what will be done with the data that gets collected. Although this may seem contrary to the fluid and amorphous nature of a data lake, it will help deliver immediate benefits and the vital experience needed. Doing so will allow the organisation to mature its data lake over time and figure out which data to save.
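One way to put the first two points into practice is to tie every ingested source to a named use case and leave everything else out. The sketch below is illustrative only: the use cases, the source names and the function `sources_to_ingest` are all invented for the example.

```python
# Hypothetical use cases, each listing the data sources it actually needs.
use_cases = {
    "reduce_machine_downtime": {"sensor_readings", "maintenance_logs"},
    "cut_customer_churn": {"support_tickets", "billing_history"},
}


def sources_to_ingest(candidate_sources, use_cases):
    """Keep only the candidate sources that at least one use case requires."""
    needed = set().union(*use_cases.values())
    return sorted(s for s in candidate_sources if s in needed)


candidates = ["sensor_readings", "web_clickstream", "billing_history", "old_fax_archive"]
print(sources_to_ingest(candidates, use_cases))
# prints ['billing_history', 'sensor_readings']
```

Sources such as the clickstream are not rejected forever; they simply wait until a use case asks for them.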

Next, operationalise both the data lake and the information derived from it, using a framework that connects them back to the original business use case(s). This is where the rubber meets the road and where the data lake returns its value.

Manual data discovery of the data nuggets in the lake, while not impossible, is not generally scalable and therefore not operational. Machine learning, with tools like Apache Spark, is the predominant mechanism for automating discoveries in data lakes. As useful data is discovered, the modelled output can feed into business processes or be used in downstream analytic technologies for better insights and decision making.

Lastly, even data lakes need a periodic cleansing. Don’t wait until it becomes overloaded and overwhelming. Have a process to determine how long to keep data around, based on the amount of use and the necessary timeliness.

Last month’s sensor data may be worthless if the equipment has been changed out. It’s just like cleaning out your closet: if you haven’t worn something in some time, throw it out. This will keep the data lake at a manageable size and keep it easy to operate.
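A periodic cleansing process of this kind can be sketched in a few lines. The example below assumes each dataset records when it was created, when it was last accessed, and how long it stays timely; the field names, the 180-day idle threshold and the function `datasets_to_purge` are assumptions made for illustration, not a recommended policy.

```python
from datetime import date, timedelta


def datasets_to_purge(datasets, today, max_idle_days=180):
    """Return names of datasets that are unused or past their useful life.

    A dataset is flagged when nobody has accessed it for more than
    max_idle_days (amount of use), or when it is older than its own
    max_age_days (necessary timeliness).
    """
    idle_cutoff = today - timedelta(days=max_idle_days)
    flagged = []
    for d in datasets:
        too_idle = d["last_accessed"] < idle_cutoff
        too_old = (today - d["created"]).days > d["max_age_days"]
        if too_idle or too_old:
            flagged.append(d["name"])
    return flagged


# Hypothetical inventory of the lake's contents.
inventory = [
    {"name": "orders_2024", "created": date(2024, 5, 1),
     "last_accessed": date(2024, 5, 30), "max_age_days": 365},
    {"name": "legacy_export", "created": date(2023, 1, 1),
     "last_accessed": date(2023, 6, 1), "max_age_days": 3650},
    {"name": "old_sensor_feed", "created": date(2024, 4, 1),
     "last_accessed": date(2024, 5, 31), "max_age_days": 30},
]

print(datasets_to_purge(inventory, today=date(2024, 6, 1)))
# prints ['legacy_export', 'old_sensor_feed']
```

The sensor feed is flagged even though it was read yesterday, because its data is only considered timely for 30 days; the legacy export is flagged purely because nobody uses it.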

Data lakes have great potential for discovering new insights, but most have not yet lived up to the hype. Instead, they have become costly and confusing. The challenge has never been in creating them, but in realising the value from them.

Many organisations, wanting to get in on big data, have rushed out and grabbed all the data they could find, drowning in the process. However, with a little forethought and restraint, organisations can slow down and grow their data lake as their knowledge develops and technology matures.

Sourced by Avi Perez, CTO of Pyramid Analytics

Nick Ismail

Nick Ismail was an editor for Information Age (from 2018 to 2022) before moving on to become Global Head of Brand Journalism at HCLTech. He has a particular interest in smart technologies, AI and...
