How to turn a data swamp into a data lake: best and worst practices

It’s fairly early for best practices to emerge with Hadoop data lakes (or data hubs or data reservoirs).

Nevertheless, there are some emerging best practices, as well as some worst practices.

Here are some tips from the big data front lines.

Best practices

1. Where’s the money?

Start with a data lake justification owned and signed off by the business leaders. Regardless of how inexpensive a data lake may seem, it still costs money.

So what is the value of the data lake? Focus as much energy as possible on getting a business VP to sign off on specific use cases.

Onerous as this sometimes is, the ROI and business goals clarification are a huge help in keeping the project focused and on track.

Then recruit the VP’s staff directly into the development process – otherwise they will reject the data lake when it’s operational.

Once engaged, the users will say they don’t know what they need – but they will know it when they see it.

This is an iterative process, ideal for agile methodologies. Get users directly engaged in every step to the finish line. This makes data-lake success rates jump.

2. No thanks – no SANs

Stick with direct attached disks for Hadoop data nodes. Avoid storage area networks (SANs) for now – direct attached disks are faster and cheaper.

Generally, taking hops across the LAN for disk IO slows performance roughly 15% to 20%. Some degradation is from network congestion; some latency is slogging through the SAN operating system.

Don’t put a ton of hardware and software between mappers, reducers, applications and the data – at least for the next few years.

3. Hadoop administrator (HDPA anyone?)

When Hadoop is first installed, the programmers like to play with Hadoop and the cluster.

Once the Hadoop cluster gets bigger than ten nodes, it needs a Hadoop administrator. Recruit a DBA or operations person to the project from the outset for this role.

Get them engaged during the proof-of-concept, through the test-dev cluster, and finally the production cluster.

Someone has to manage patches, recovery, release upgrades, downtime, hardware repairs, and expansions. The HDPA is also the home of cluster governance, connectivity and data-quality stewardship. An indispensable person.

4. Hadoop or Hadump?

The vision of the data lake holding every file ever produced makes no sense. Are you planning for 700 million HDFS files? You should be.

Some big sites have gotten into trouble creating huge files that are then used to create two copies and three derivative files.

Copying one terabyte five times adds up fast. Then imagine doing it nightly for ten different large files 365 times a year.
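To make that arithmetic concrete, here is a back-of-the-envelope sketch. The function name and parameters are illustrative, and the replication factor of three is HDFS's default rather than a figure from any particular site, so treat the output as an order-of-magnitude guide only.

```python
# Rough yearly storage growth from nightly copies and derivatives.
# Assumption: replication_factor=3 is the HDFS default; your cluster may differ.

def yearly_storage_tb(file_size_tb, copies_per_file, files_per_night,
                      nights_per_year=365, replication_factor=1):
    """Return terabytes written per year for a nightly copy routine."""
    nightly_tb = file_size_tb * copies_per_file * files_per_night
    return nightly_tb * nights_per_year * replication_factor


# One terabyte, copied five times, for ten files, every night of the year.
logical_tb = yearly_storage_tb(1, 5, 10)
raw_tb = yearly_storage_tb(1, 5, 10, replication_factor=3)

print(f"Logical data per year: {logical_tb:,} TB")
print(f"Raw disk with 3x replication: {raw_tb:,} TB")
```

Under these assumptions the routine writes 18,250 TB of logical data a year, before any block replication is counted.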

Storage may be cheap but it’s not free. In many countries and provinces, the cost of floor space, power and cooling far outweighs costs per terabyte over a year’s time.

Some files should be captured and retained as an archive. Think of this as insurance – the users think they might need it in five years. But many files grow cold – then colder – until they have no value.

First, establish filenames and lifecycle tags that fit the company. File names should help track ownership to an application or user department.
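One way to enforce such a convention is to parse every file name on arrival. The sketch below assumes a purely hypothetical pattern of department, application, dataset, lifecycle tag and date separated by dots; the tag names and the `parse_lake_filename` helper are invented for illustration and should be replaced by whatever fits the company.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical convention: <dept>.<app>.<dataset>.<lifecycle>.<YYYYMMDD>.<ext>
LIFECYCLE_TAGS = {"raw", "derived", "archive"}  # illustrative tag set


@dataclass(frozen=True)
class LakeFileName:
    department: str
    application: str
    dataset: str
    lifecycle: str
    created: date


def parse_lake_filename(name: str) -> LakeFileName:
    """Split a file name into ownership and lifecycle fields, or raise ValueError."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}: {name}")
    department, application, dataset, lifecycle, stamp, _extension = parts
    if lifecycle not in LIFECYCLE_TAGS:
        raise ValueError(f"unknown lifecycle tag: {lifecycle}")
    created = datetime.strptime(stamp, "%Y%m%d").date()
    return LakeFileName(department, application, dataset, lifecycle, created)


info = parse_lake_filename("finance.billing.invoices.raw.20240131.csv")
print(info.department, info.lifecycle, info.created)
```

Rejecting badly named files at the door is cheaper than tracing their owners years later.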

In the weekly agile-development meetings, get the business users to commit to retention goals on files.

Have the HDPA document those decisions where everyone can find them five years from now. Just a few notes with owner names and simple explanations will be invaluable in five years when there are 700 million files in HDFS.

Then, at some point, consider carefully purging truly cold and no longer useful files. This is especially needed when HDFS is used as a backup alternative to magnetic tape.

If you have an original file and can reproduce derivatives, consider deleting the derivatives. Not planning for surge, merge and purge can hurt data-lake costs and data quality.
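The purge review can be sketched as a small report over the retention notes. Everything here is an assumption for illustration: the `FileRecord` fields, the 90-day idle window for derivatives and the sample paths. The function only lists candidates for a human to review; it deletes nothing.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FileRecord:
    path: str
    owner: str
    retention_days: int   # agreed with the business users
    last_accessed: date
    is_derivative: bool   # True if it can be rebuilt from a retained original


def purge_candidates(records, today, derivative_idle_days=90):
    """Return (path, reason) pairs worth reviewing; nothing is deleted here."""
    candidates = []
    for record in records:
        idle = today - record.last_accessed
        if idle > timedelta(days=record.retention_days):
            candidates.append((record.path, "past retention"))
        elif record.is_derivative and idle > timedelta(days=derivative_idle_days):
            candidates.append((record.path, "reproducible derivative"))
    return candidates


records = [
    FileRecord("/lake/a", "marketing", 365, date(2022, 1, 1), False),
    FileRecord("/lake/b", "finance", 3650, date(2023, 6, 1), True),
    FileRecord("/lake/c", "finance", 3650, date(2023, 12, 1), False),
]
for path, reason in purge_candidates(records, today=date(2024, 1, 1)):
    print(path, reason)
```

Keeping the owner on each record means the review can go back to the department that made the retention commitment.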

Worst practices

1. Shiny new objects

Every week there’s a new Apache project or hot startup vendor aligned to Hadoop. Let’s rewrite our stuff with Spark. Maybe we should switch to Drill, or BlinkDB, or Presto, or Tajo, or Vendor-Y. Let’s change from Lucene to Solr, then to Elasticsearch. It can be paralysing.

Software takes a few years to mature, which creates a dilemma when innovation is rapid. Jump in too fast and the shiny object may fade in popularity, meaning you bet on a lemon. Move too slowly and you have to re-architect applications.

Yet, not all of the Eclipse plug-ins and Apache web services projects became mainstream. Plus, a lot of what is circulating is alpha quality code moving towards beta quality.

The first rule – for those outside Silicon Valley – is to only use open source that’s backed by a commercial support contract.

Know who to call before adopting any new open source. Without the support contract, the code is a lab experiment.

Then watch the early adopters as they post on GitHub, Stack Overflow and other sites. Especially watch the bug reports, which offer insights into product limitations and stabilisation.

Lastly, don’t dive into shiny new objects unless there is an immediate and absolute need for that technology.

In most cases, stick with a reliable vendor distribution because the integration and patching work has been done for you. Thoroughly do your homework – then crawl, walk and run with the new software.

2. Raw data is good enough

A few people still proclaim the end of ETL and data integration. Just the opposite is true.

Data integration means data that is easier to understand and of higher quality. Without DI, users stop using the application and IT gets a black eye for producing garbage results.

This extends to metadata as well. Low-quality data produces low-quality companies.

Schema-on-read doesn’t mean skip data integration. It means skip some of it and respect the risks. It’s common for a corporation to have four different codes for marital status and five for gender (1, 0, M, F, null).
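Reconciling such codes usually comes down to a lookup table per source system. The sketch below is a minimal illustration: the source names `crm` and `billing` and the meaning assigned to each code are hypothetical, and the real meanings have to come from the owning system rather than guesswork.

```python
# Hypothetical per-source lookup tables; confirm each code with the system owner.
GENDER_CODES = {
    "crm": {"M": "male", "F": "female"},
    "billing": {"1": "male", "0": "female"},
}
UNKNOWN = "unknown"


def standardise_gender(source, raw_value):
    """Map a source-specific code to one shared value, or UNKNOWN if unmapped."""
    if raw_value is None:
        return UNKNOWN
    key = str(raw_value).strip().upper()
    return GENDER_CODES.get(source, {}).get(key, UNKNOWN)


rows = [("crm", "m"), ("billing", 0), ("crm", None), ("web", "F")]
for source, raw_value in rows:
    print(source, repr(raw_value), "->", standardise_gender(source, raw_value))
```

Returning an explicit unknown value, rather than guessing, keeps unmapped codes visible so they can be counted and chased up.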

Think of data quality as a series of service level agreements. Since this is a data lake, we can break from data warehouse traditions somewhat.

Some data lake files need to be like purified water, others can be murky. Then ask: “How often will this data be used and shared?”

If the data is not shared, data integration efforts can be moderate. If the data is widely shared, transform and clean the data thoroughly.

Begin by intensely governing 5% to 20% of the data. Categorise the files – e.g. temperature, trust and composition – and teach the users what they are dealing with.

Transform fields and columns, standardise values across files, and do the complex correlations to find the most accurate values from multiple sources. Then expect to expand data quality efforts beyond the 5% to 20% of the files.

Minimise hand coding of ETL as much as possible. Do-it-yourself ETL saddles organisations with a mountain of legacy code, maintenance costs and no business sponsor.

One alternative is to find out if your company has licences for data integration tools like Informatica Vibe, IBM DataStage, SAS and Talend. Use them if you can.

Plus, some Hadoop distros come with tools like Talend for free. Most ETL tools have been modified to push down processing into Hadoop.

3. Clusters sprouting like weeds

Hadoop clusters are like teenagers: four to seven is a party, but any more than that and they eat too much – more than ten typically means losing control.

Admittedly, some cluster proliferation is because YARN and Tez are not yet in widespread use. Nevertheless, every cluster brings with it a minimum administration labour cost and added complexity.

It’s not long before you can’t find enough funding. But this is not the worst part.

A data lake is not a series of disconnected ponds. Eventually it’s obvious that certain critical files need to be in multiple clusters.

With different extracts and ways of slicing and dicing the data, users end up with different views of the data. Files in the clusters quickly get out of sync.

This means users get different answers to the same questions. The fewer clusters, the better – one big one with YARN is ideal if you can manage it.

ThinkBig Analytics expert Jeffrey Breen said: “Just because Hadoop provides a monolithic file system doesn't mean you need to present it that way to users.

“We commonly segregate HDFS into multiple zones, such as a landing zone, to receive raw data from source systems, a work zone for use by ingestion, processing pipelines, and – most importantly – a publication zone to hold the enriched data sets to support the business goals of the system.”
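The zone layout Breen describes can be sketched as a simple path convention. The `/data` root, the directory names and the helper functions below are assumptions made for illustration, not ThinkBig’s actual layout.

```python
import posixpath

# Hypothetical layout: one top-level HDFS directory per zone, in promotion order.
LAKE_ROOT = "/data"
ZONES = ("landing", "work", "publication")


def zone_path(zone, *parts):
    """Build an HDFS-style path inside one of the known zones."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return posixpath.join(LAKE_ROOT, zone, *parts)


def next_zone(zone):
    """Return the zone that data is promoted to next, or None at the end."""
    index = ZONES.index(zone)
    return ZONES[index + 1] if index + 1 < len(ZONES) else None


print(zone_path("landing", "crm", "customers"))
print(next_zone("landing"))
```

Users would normally be given read access to the publication zone only, with the landing and work zones reserved for ingestion and processing jobs.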

Sourced from Dan Graham, Teradata

Ben Rossi
