As the big data discussion continues in earnest, it's easy to forget the swathes of data already being generated through more traditional channels.
What's more, the jury is still out as to whether businesses are correctly interpreting and employing the data they already hold. Increasingly, standard frequentist interpretations of statistics are falling short and producing distorted results that may actually be damaging. The answer? A lesser-known branch of statistics based upon Bayesian inference.
For those who may not immediately recognise the distinction between frequentist and Bayesian statistics: the frequentist models that were considered “standard” throughout the 20th century treat probability as long-run frequency – how often an outcome would occur over endless repetitions of an experiment.
Dealing solely in hard numbers and classical objectivism, frequentists draw their conclusions from the observed data alone. Bayesian statistics, on the other hand, propounds an entirely different kind of scientific reasoning. Not only is the initial problem considered, but so too any evidence that accumulates over time: an initial belief (the prior) is updated with each new observation, augmenting and refining results with increasing accuracy.
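The updating idea can be sketched with a simple conjugate model. The scenario and all numbers below are illustrative only, not drawn from any study mentioned here: estimating a process's defect rate, where a Beta prior encodes the initial belief and each batch of observations refines it.

```python
# A minimal sketch of Bayesian updating (hypothetical numbers):
# estimating a defect rate from observed samples.

# Prior belief: roughly a 10% defect rate (Beta(2, 18) has mean 0.1).
alpha, beta = 2.0, 18.0

# Observed batches of parts: (defects, non-defects).
batches = [(3, 47), (1, 49), (4, 46)]

for defects, ok in batches:
    # For the Beta-Binomial model, Bayes' rule is just addition:
    # each posterior becomes the prior for the next batch.
    alpha += defects
    beta += ok

posterior_mean = alpha / (alpha + beta)
print(f"Posterior mean defect rate: {posterior_mean:.3f}")  # → 0.059
```

The point is the workflow, not the arithmetic: belief in, evidence applied, refined belief out – the loop the article describes.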
For the non-mathematically-minded, the differences may verge upon the mundane, if not the absurd – the general belief being that all statistical information is simply “fact”.
However, since the turn of the century, Bayesian statistics have become integral to fields as disparate as biology and battery development, lab research and logistics, with this emphasis on interpretation pushing many industries forward in leaps and bounds.
These new results began to highlight glaring errors in many data sets previously considered reliable, and statisticians are only now beginning to cross check the two sets of results.
This has led many to question our concept of the “facts”, with an eye towards combining both methods and a greater movement towards knowledge, evidence and prediction rather than cold statistical analysis.
Real-world applications
A case in point comes from clothing retailer Zalando. Faced with a complex, labour-intensive, time-consuming and expensive distribution problem, Zalando turned to Bayesian statistics to simplify its operations.
Each item shipped from the Zalando warehouse needed to be weighed manually to identify postage costs. Naturally, when dealing with such a large logistical operation, the manual approach was eating into valuable resources – both labour and money were simply being wasted.
Zalando’s idea was simple: its data scientists were to make use of a rich vein of existing data automatically generated when shipping parcels through its chosen logistics partners.
However, its initial formulas were producing wildly varying results – essentially suggesting that some parcels had weighed less than zero grams. It seemed that Zalando’s simple idea was in need of something a little more complex.
Thankfully, along came the wonder of the Bayesian model. In simple terms, Zalando’s data scientists began to incorporate data from several different sources, including such exotic information as the weights of the many different packaging materials. This, combined with a prior that checks the data against a “common sense” scale, began to generate surprisingly accurate results.
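Zalando's published model is more sophisticated, but the core idea – a “common sense” prior taming an estimation problem that can otherwise produce impossible negative weights – can be sketched as follows. All article counts, weights and parameters here are hypothetical, chosen purely for illustration:

```python
import numpy as np

# Toy version of the parcel-weight problem (hypothetical numbers).
# Each row of A marks which of 3 articles a parcel contained;
# y is the carrier-reported total weight of each parcel in grams.
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)
true_w = np.array([200.0, 350.0, 120.0])  # unknown article weights
rng = np.random.default_rng(0)
y = A @ true_w + rng.normal(0, 5, size=4)  # noisy scale readings

# Plain least squares fits the noise and, on sparse or
# contradictory data, can even return negative weights.
w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Bayesian MAP estimate: a Gaussian prior centred on a
# "common sense" weight (here 250 g) pulls estimates toward
# plausible values, acting as a built-in sanity check.
prior_mean = np.full(3, 250.0)
prior_strength = 0.1  # ratio of noise variance to prior variance
w_map = np.linalg.solve(
    A.T @ A + prior_strength * np.eye(3),
    A.T @ y + prior_strength * prior_mean,
)
print(w_map)  # plausible, strictly positive estimates
```

The prior does here what the article's “common sense” scale does for Zalando: implausible answers are pulled back toward believable ones before they ever reach the shipping label.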
With this increase in accuracy, Zalando was able to automate the entire weighing process, increasing efficiency and saving millions of euros at the same time.
Any business thinking of applying Bayesian statistics to similar problems can find a more detailed account of Zalando’s formula on its company blog.
For Zalando, the process might not have been as simple as originally conceived, but the payoff has most definitely been worth the headache.