Succeeding in data science projects — preparation, process and open source

For businesses, investment in machine learning, artificial intelligence (AI) and data science is growing. There is huge potential around data science to create new insights and services for internal and external customers. However, this investment can be wasted if data science projects don’t fulfil their promises. How can we make sure that these projects succeed?

Where we are today

According to McKinsey, around half of the companies it surveyed have adopted AI in at least one function, and a small cohort of companies can already ascribe at least 20% of their earnings before interest and taxes to AI. Around $341.8 billion will be spent on AI solutions during 2021, a rise of 15.2% year over year, according to IDC.

IDC also found that around 28% of AI and ML initiatives have failed so far. Based on the figures above, that would equate to roughly $88.1 billion of spend on tooling associated with failed projects. The analyst firm identified reasons for this including a lack of staff with the necessary expertise and a lack of production-ready data. Alongside this, feeling disconnected and lacking an integrated development environment was another reason projects were not successful.

To improve your chances of success, it is worth spending time looking at how data science works in practice and how your organisation operates. Although the word ‘science’ is in its name, data science requires a blend of art and science to produce the best results. With that understanding in place, you can then examine how to scale up the results and turn data science findings into production operations for the business.

At its most simple level, data science involves coming up with ideas and then using data to test them. Using a mix of different algorithms, designs and approaches, data scientists can seek out new insights from the data that companies create. Through trial, error and improvement, the teams involved can create a range of new insights and discoveries, which can then be used to inform decisions or create new products. This work can then be used to develop machine learning (ML) algorithms and AI deployments.

Improvement #1 – Know the expectations around business goals

The biggest risk around these projects is the gap between business expectations and reality. AI has received a huge amount of hype and attention over the past few years. This means that many projects have unrealistic expectations.

Unrealistic expectations can relate to scope, speed and/or technologies. Great project managers understand how to navigate challenges in scope and speed; it is the misinterpretation of the promises of AI technologies that has been causing the biggest problems for new projects. Rather than being focused on improving a process or delivering a single insight, AI gets envisioned as transforming how a company runs from top to bottom, or as a single project that will deliver a change in profitability within months.

To prevent this problem, it’s important to set out how your projects will support overall business goals. You can then start small with projects that are easy to understand and that can show improvements. Once you have set out some ground rules around what AI can deliver – and punctured the hype balloon around AI to make this all ‘business as usual’ – you can keep the focus on the results that you deliver.

Improvement #2 – Make your team part of the overall process

Another big problem is that teams don’t have the necessary skills to translate their vision into effective processes. While the ideas might be sound, a lack of understanding of the nuances of applying machine learning and statistics in practice can lead to poor outcomes. This issue is also a consequence of the hype around AI and ML: the demand for data science skills means there is fierce competition for those with experience, while even those starting out can command big salaries, so teams often end up staffed with relative newcomers. This lack of real-world experience is what can lead to problems over time.

Even with a realistic vision and experienced staff in place, AI projects can still fail to deliver results. In this case, the reason is normally poor processes, inconsistent communication, and gaps between teams.

To prevent these kinds of problems, it’s important to establish a smoothly operating engineering culture that weaves data science work into the overall production pipeline. Rather than data science being a distinct team, work on how to integrate your data scientists into the production deployment process. This will help minimise the gap from data research and development to production.

Improvement #3 – Use techniques in data-driven hypothesis testing

While it is important to support creativity around data science, any work should have the business’s goals in mind. Put the emphasis on the result you are looking to achieve or discover: use data to prove (or disprove) a hypothesis, and judge it by how well that business goal was met.

The team at Netflix has written about this, and how their approach to shared hypothesis testing helps keep the team focused. By concentrating on specific objectives, you can avoid getting lost or spending time on projects that won’t pay off.
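
To make this concrete, here is a minimal sketch of what data-driven hypothesis testing can look like in practice. It assumes a hypothetical A/B comparison between a current model and a candidate model, measured on conversion rate; the counts and the 0.05 threshold are illustrative placeholders, not figures from Netflix or anyone else.

```python
# Minimal sketch: test whether a candidate model lifts conversion rate.
# The counts below are illustrative placeholders, not real figures.
from statsmodels.stats.proportion import proportions_ztest

# Hypothesis: the candidate model (variant B) converts better than the current one (variant A).
conversions = [1_320, 1_481]   # successes observed for A and B
sessions = [24_500, 24_610]    # total sessions exposed to A and B

# alternative="smaller" tests H1: rate_A < rate_B
z_stat, p_value = proportions_ztest(conversions, sessions, alternative="smaller")

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Evidence that the candidate model improves conversion; consider promoting it.")
else:
    print("No significant lift; revisit the hypothesis before investing further.")
```

The point is not this specific test, but that the hypothesis, the metric and the decision rule are agreed before the work starts, so the outcome ties directly back to a business goal.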

Alongside this, it’s important to evaluate new technologies for how they might help meet your goals. Keeping at the cutting edge is important for data scientists, but it is essential to focus on how any new technology can help meet a specific and measurable business outcome.

Based on these ideas, you can help your data science team take their creativity and apply it to discover interesting results. Once this research starts to find insights, you can then look at how to push this into production. This involves creating bridges from the data science development and research team to those responsible for running production systems, so that new models can be passed across.

Improvement #4 – Use the same open source tools from test to production

One critical element here is that you should encourage everyone to use the same tools on each side. One of the biggest hurdles can be when the data science team delivers a new model and workflow around data, and those responsible for running the model in production then have to re-develop that model to work with the existing infrastructure. The emphasis here is to avoid the old trope of “This worked on my laptop!”, as laptops can’t be pushed to production and rework is expensive.

Using open source can help to achieve this consistency. From databases like Apache Cassandra, through event streaming with Apache Pulsar and data enrichment with Apache Flink, to analytics with Apache Spark, the common tools for working with data are mostly open source and easy to link together. Alongside this open data infrastructure, TensorFlow covers how algorithms and machine learning models are created and tested, and something like Apache Airflow can manage the workflow process that your team has in place. This makes it easier to build a stack that is common to everyone.
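
As an illustration of how these pieces can hang together, the sketch below shows a hypothetical Apache Airflow DAG for a shared research-to-production workflow. The DAG id, task names and task bodies are placeholders; in a real pipeline the tasks would call into Spark, TensorFlow, Cassandra and so on rather than printing messages.

```python
# Hypothetical Airflow DAG sketching a shared research-to-production workflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features():
    print("Pull raw events and build feature tables (e.g. via Spark).")


def train_model():
    print("Train and evaluate the candidate model (e.g. with TensorFlow).")


def publish_model():
    print("Push the validated model artefact to the serving environment.")


with DAG(
    dag_id="churn_model_pipeline",   # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    publish = PythonOperator(task_id="publish_model", python_callable=publish_model)

    # The same DAG runs in development and production, so nothing has to be re-developed.
    extract >> train >> publish
```

Because the workflow definition itself is code, the data science team and the operations team are literally looking at, and running, the same artefact.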

Alongside consistency on tools and infrastructure, both sides need to agree on common definitions and context. This involves setting the right goals and metrics so that everyone knows how the team will be evaluated over time. It should also be an opportunity to keep re-assessing those metrics, so that the emphasis is always on delivering the right business outcomes. Anthropologist Marilyn Strathern described the risk here: “When a measure becomes a target, it ceases to be a good measure.” Teams can end up concentrating so narrowly on metrics and measurement that the overall goal suffers.

Lastly, the role of testing should not be overlooked. Once new models that should have the desired impact have been developed, they should be tested to ensure that they work as expected and are not falling foul of issues in the test data or biases that were not accounted for. Testing – using the same tools and processes as will be used in production – not only helps solidify the value that data science creates, but makes it easier to scale that work out. Skipping this step, or not giving it the right degree of rigour, leads to problems over time.
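
As a sketch of what such pre-deployment checks might look like, the pytest example below assumes hypothetical helpers (load_model, load_holdout_data) and illustrative thresholds; the exact checks will depend on your own models and on the biases you need to guard against.

```python
# Hypothetical pytest checks run before a candidate model is promoted to production.
# load_model(), load_holdout_data() and the thresholds are placeholders for
# whatever your own pipeline provides.
import numpy as np
import pytest

from my_project.models import load_model          # hypothetical helper
from my_project.data import load_holdout_data     # hypothetical helper


@pytest.fixture(scope="module")
def model_and_data():
    model = load_model("churn-candidate")          # hypothetical model id
    X, y, group = load_holdout_data()              # features, labels, sensitive group
    return model, X, y, group


def test_accuracy_meets_baseline(model_and_data):
    model, X, y, _ = model_and_data
    accuracy = (model.predict(X) == y).mean()
    assert accuracy >= 0.80, "Candidate model falls below the agreed baseline."


def test_no_large_gap_between_groups(model_and_data):
    # Crude bias check: accuracy should not differ too much across groups.
    model, X, y, group = model_and_data
    preds = model.predict(X)
    accuracies = [(preds[group == g] == y[group == g]).mean() for g in np.unique(group)]
    assert max(accuracies) - min(accuracies) <= 0.05, "Accuracy gap across groups is too large."
```

Running these checks in the same environment and with the same tooling as production keeps “it passed in research” and “it works in production” from drifting apart.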

Building the future for data science

Data science has huge potential to help businesses improve their operations. It can be used to develop new products, show where to invest, and help people make better decisions in their roles.

To avoid the risk of failure, look at how you can build on an open source data stack to make the process of moving from initial discovery through to full production easier. This consistency should make it easier for your data scientists to work with data, and for your operational staff to put those insights into production.

Written by Denise Gosnell, chief data officer at DataStax
