Causation and correlation: a big (data) headache

Data is everywhere. Once of interest only to business analysts, data is now firmly on the minds of consumers too, from the data they put out there themselves to the vast wealth of data they can access in today’s internet age.

Safely stored, properly understood and effectively leveraged, data is a goldmine that can improve every customer’s experience and inform every business of current habits and how to cater for future needs.

With the explosion of interest in big data over the last few years, businesses are becoming increasingly obsessive about spotting patterns or trends in their data to gain “actionable insight”.

The days of obsession over causation and correlation are upon us. Two of the most conspicuous examples of big data in action are Google’s data-aggregating tools: Google Flu Trends and Google Dengue Trends.

These tools can famously detect the spread of flu and dengue fever by monitoring the frequency of search queries relating to relevant symptoms, and thereby provide an early warning system to governments to prevent pandemics.

However, using data trends based on search results is risky. People often ‘google’ idly, with no real motivation behind the search, so a rise in queries about symptoms need not mean a rise in illness.

Unsurprisingly, the approach has attracted much criticism for overestimating the scale and spread of flu. The reason for this boils down to a fundamental truth in statistical analysis: correlation does not imply causation.

While causation means ‘A causes B’, correlation means that A and B tend to be observed at the same time. These are very different things when it comes to big data and yet the differences so often get ignored.

Another, albeit more amusing, demonstration is Spurious Correlations, a website set up by statistical provocateur and Harvard student Tyler Vigen. It shows that just because two trends seem to fluctuate in tandem, that does not mean the correlation has any meaningful significance.
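To see how easily such coincidences arise, here is a minimal Python sketch. It is not how the Spurious Correlations site works; it simply assumes 200 made-up, completely unrelated trend lines (random walks of 20 points each, with the names `random_walk` and `series` invented for illustration) and searches every pair for the strongest correlation.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, length):
    """A hypothetical 'trend': each point is the previous one plus random noise."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


rng = random.Random(42)  # fixed seed so the run is repeatable

# 200 independent series, e.g. 20 yearly figures each. None influences another.
series = [random_walk(rng, 20) for _ in range(200)]

# Search all 19,900 pairs for the strongest (absolute) correlation.
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in combinations(range(len(series)), 2)
)

print(f"strongest correlation among unrelated series: {best_r:.2f} (pair {best_pair})")
```

With this many pairs to choose from, the best match looks like a striking relationship even though every series was generated independently. The more variables a business trawls through, the more of these accidental matches it should expect to find.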

A classic example is that in summer, ice cream sales and murder rates rise. The two are correlated, but no one in their right mind would think that one causes the other.
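The ice cream example can be sketched with simulated data. The Python below is an illustration under stated assumptions, not real sales or crime figures: a hypothetical `temperature` series drives both `ice_cream` and `incidents`, and neither of those two affects the other. All the coefficients and noise levels are invented.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures: the hidden common cause.
temperature = [rng.uniform(0, 35) for _ in range(2000)]

# Both series depend on temperature plus their own noise, never on each other.
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
incidents = [0.5 * t + rng.gauss(0, 4) for t in temperature]

# Correlation as it appears when temperature is ignored.
raw = pearson(ice_cream, incidents)

# Partial correlation: correlate what remains once temperature is accounted for.
partial = pearson(residuals(ice_cream, temperature), residuals(incidents, temperature))

print(f"correlation, ignoring temperature:    {raw:.2f}")
print(f"correlation, controlling for it:      {partial:.2f}")
```

The first figure comes out strongly positive and the second close to zero. The two series move together only because both follow the weather, which is exactly the kind of underlying variable that a raw correlation hides.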

The ‘what’ not ‘why’

Humans are hard-wired to look for causation. When they see lines rising in sync or results clustering together on a scatterplot, they have an innate desire to assign a reason.

They are conditioned to search out the roots of things so that outcomes can be explained. However, their own cognitive biases and blinkered experiences can often prevent them from seeing the bigger picture.

Misunderstanding causal links can also result in choosing ineffective actions, making poor business decisions and overlooking beneficial alternatives. People must wean themselves off their causation addiction if they want to truly exploit all the great stuff that data can provide.

Traditionally, data science was all about creating a hypothesis and then analysing data to prove or disprove it. However, there is nothing traditional about what can be done with and learnt from data today. Therefore, with the right analytical tools in place, it is far more effective to collate the data first and ask why later.

UK retailer Waitrose dug into its growing customer behaviour data in order to improve conversions and the overall customer experience. Looking at its data over the last few years, it found a strong correlation between the number of people completing the purchasing cycle and the height of the screen they were purchasing from.

Similarly, by analysing product sales, Walmart determined that sales of Strawberry Pop-Tarts correlated closely with regional hurricane warnings. Of course, there could be underlying variables impacting these results, but at this stage understanding the reasons behind the correlations really isn’t that important.

What’s important is knowing that people behave in this way. This knowledge can help businesses make modifications and plan ahead for fluctuations in customer demand.

Big data helps businesses make better decisions – getting bogged down in an endless search for correlation or causation explanations will only prevent them from reaping all of data’s rewards.

Discovering the unexpected is one of the best things about big data. It is an incredible compass in a digital world that is so tricky to navigate. Advances in technology mean that businesses have the potential to discover more correlations, patterns and predictions than ever before.

Right now, correlation does supersede causation, and data science is proving that it can advance even without coherent models, unified theories, or any explanation at all.

In practice, deciding what to do with the data can lead to misinterpretation: analysts are desperate to clutch at causation straws when they should be mindful of the value of correlation too.

The importance of capturing data is well established – businesses and institutions should be wary of creating causes where they don’t exist, or risk their data simply accumulating rather than illuminating them.

Sourced from Guy Mucklow, co-founder of PCA Predict and Triggar

Ben Rossi

