Machine learning is rightly heralded as the next step in the big data analytics and business intelligence (BI) revolution. But, as many organisations are finding with early pilot projects, extracting value from machine learning is not just a case of plugging in a new tool and watching productivity, marketing or sales improvements rocket.
Successful machine learning projects – of all kinds – rely on a number of ingredients, including the right choice of topic, the organisation's readiness to experiment and, of course, the available data.
Many companies would acknowledge that the data on their customers' transactions, sales or machinery-usage logs is one of the most precious assets they own – especially as machine learning now offers businesses the opportunity to go far beyond traditional BI, providing accurate predictions of future sales or potential equipment faults, and prescribing automated actions to improve the bottom line and save on ad hoc repairs.
But exactly what data do you need to run a machine learning project? How much data is enough? And what if you don't have the right data?
Data big and small
We are used to assuming that more is better when talking about 'big data'. While that is often true – and essential for applications such as real-time online personalisation – the size needed varies from task to task and from sample to sample.
Although 10 gigabytes of logs may seem too little for machine learning to bring value, it can in fact be just enough. Reasonable and usable samples can start from just tens of thousands of entries. This number may be minuscule for the Googles of this world, but it is sufficient to make a massive, tangible difference for a traditional offline company.
Take the application of machine learning in the HR department of a 75,000-strong organisation. If the company is attempting to predict turnover risk to inform future HR strategy and take preventive action, it would start by examining employee records.
These change enormously every single day, reflecting working hours, role changes, training courses passed, sick days taken, and so on.
While the volume of this data may be regarded as too small, the variety of contributing factors means there is sufficient depth to it to move beyond simple statistics towards machine learning.
At the other extreme, some companies may think they own tons of precious data – such as years of sales reports – only to discover that it is available solely as aggregates, with no raw input stored. Machine learning needs detail to learn from – and quarterly or yearly aggregates are simply not enough for the task.
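The point about aggregates can be made concrete with a toy sketch (the sales figures below are invented for illustration): two stores with completely different daily patterns can produce identical totals, so a model trained only on the aggregate has nothing left to learn from.

```python
# Daily unit sales for two hypothetical stores over the same two weeks
store_a = [10, 10, 10, 10, 10, 10, 10] * 2   # flat, steady demand
store_b = [0, 0, 0, 0, 0, 0, 70] * 2         # all demand lands on one weekday

# The aggregates are indistinguishable
print(sum(store_a), sum(store_b))            # 140 140

# The raw series still carry the signal the aggregate destroys,
# e.g. the day-to-day variability a demand model would learn from:
def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

print(variance(store_a), variance(store_b))  # 0.0 vs 600.0
```

Any forecasting model fed only the quarterly totals would treat these two stores as identical, which is exactly why the raw input needs to be stored.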
The historical period for which the data is available is also important – it should be long enough to reflect all the relevant events and cyclical changes.
For example, if an organisation wants to build a working model to predict product demand for a retail company, it will need at least two or three years of historical data to accommodate the seasonal trends. But if it wants to foresee potential failures of expensive manufacturing equipment that breaks on average once every several years, it will need a far longer history to detect anomalies and anticipate a fault before it happens – perhaps two or three times the period it is predicting for.
At the same time, when you head into sectors with huge customer bases and subscription business models – mobile phone networks, streaming services or online gaming – it's entirely feasible to begin a meaningful machine learning project, such as predicting customer churn, with as little as six months' worth of data.
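A minimal sketch of how such a churn project starts, assuming nothing more than the event log a subscription business already collects (the customer IDs, dates and 60-day threshold below are made up for illustration):

```python
from datetime import date

# (customer_id, day of activity) — the raw usage log
events = [
    ("c1", date(2024, 1, 5)), ("c1", date(2024, 3, 2)), ("c1", date(2024, 6, 20)),
    ("c2", date(2024, 1, 9)), ("c2", date(2024, 2, 1)),
]

snapshot = date(2024, 6, 30)   # end of the six-month window

def features(customer):
    """Per-customer features derived from the raw events."""
    days = sorted(d for c, d in events if c == customer)
    return {
        "n_events": len(days),
        "days_since_last": (snapshot - days[-1]).days,
    }

for c in ("c1", "c2"):
    f = features(c)
    # a simple inactivity label a first churn model could be trained against
    f["at_risk"] = f["days_since_last"] > 60
    print(c, f)
```

The value is in the framing rather than the code: once raw events are turned into per-customer features and labels like these, any standard classifier can be trained on them.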
Often, companies struggle with imperfections in how their data is organised and stored: unstructured, riddled with discrepancies and errors, or divided between a dozen systems that are not yet integrated.
Still, this is no reason to put off running a machine learning pilot – it could actually be the reason for having one. Getting rid of data silos, improving data quality and introducing a single repository is a valid goal, but the process is lengthy, very costly and doesn't bring immediate value. Machine learning does, while also helping to justify further investment in infrastructure and guide data collection strategy.
What companies can and should be doing now is defining the scope and KPIs for proof of concept (PoC) machine learning pilots, and running them before engaging in large-scale infrastructure projects.
Even without an established data warehouse, a dataset can be created and the data scientist can cleanse and analyse it to kick off a number of machine learning projects – and prove the value that can be gained.
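The cleansing step needn't wait for a data warehouse. A sketch of the kind of first pass a data scientist might run over records pulled from unintegrated systems – the field names and rules here are illustrative, not a real schema:

```python
# Records as they might arrive from separate, unintegrated systems
raw_records = [
    {"id": "001", "region": " North ", "sales": "1200"},
    {"id": "001", "region": " North ", "sales": "1200"},   # duplicate row
    {"id": "002", "region": "south",  "sales": ""},        # missing value
    {"id": "003", "region": "SOUTH",  "sales": "950"},     # inconsistent case
]

def clean(records):
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:
            continue                      # drop duplicates by id
        seen.add(r["id"])
        out.append({
            "id": r["id"],
            "region": r["region"].strip().lower(),             # normalise labels
            "sales": int(r["sales"]) if r["sales"] else None,  # flag missing
        })
    return out

cleaned = clean(raw_records)
print(cleaned)
```

A dataset cleaned even this roughly is often enough to train a first PoC model and demonstrate value before any infrastructure spend.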
Where else can the data come from?
One thing that many organisations overlook is the option to buy in external data.
On the one hand, the strongest and most important signals are usually hidden in the data a company already owns. Knowledge of a customer's transactions at a bank, for instance, is a better predictor of whether they will repay a loan than their social media behaviour.
On the other hand, many companies underestimate the value of external factors, such as weather data. Weather influences obvious cases such as demand for ice cream, as well as less obvious scenarios such as the best times to offer tailored recommendations and offers to online gamers, who may be more likely to stay at home and play when the weather is poor.
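Mechanically, bought-in data like this usually just becomes extra columns joined onto internal records by date. A sketch with invented figures (the session counts and rainfall feed are hypothetical):

```python
daily_sessions = {          # internal: gaming sessions per day
    "2024-11-01": 5200,
    "2024-11-02": 7900,
}
weather = {                 # external: daily rainfall in mm, from a bought-in feed
    "2024-11-01": 0.0,
    "2024-11-02": 14.5,
}

# Each day becomes a training example carrying both internal and external features
training_rows = [
    {"date": d, "sessions": s, "rain_mm": weather.get(d)}
    for d, s in sorted(daily_sessions.items())
]
print(training_rows)
```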
Machine learning is quickly transitioning from being a technical topic for the few to a management tool for many. To avoid missing the boat, organisations need to start designing their project scopes now to make sure they are ready for a machine learning strategy in the future.
This will ensure that they understand what data is available, missing and needed, and can start collecting it sooner rather than later, giving them a greater chance of quicker ROI.
Sourced from Jane Zavalishina, CEO at Yandex Data Factory