A recipe for the modern data scientist

The world is churning out 2.5 quintillion bytes of data every day, and 90% of all the data in existence was created in the past two years. Most of this staggering volume comes from apps, social media sites, YouTube and other video sites, online searches, transactional data and, increasingly, machine-to-machine transactions.

The challenge that organisations face with big data lies not only in collecting and storing it, but also in analysing it and managing the output. The McKinsey Global Institute predicts that the US alone will need at least 190,000 more workers with analytics expertise and 1.5 million more data-savvy managers by 2018.

The market is currently struggling to keep up with the ever-increasing demand for so-called data scientists (engineers trained to extract knowledge from the large volumes of data being generated) to help manage the data deluge. This creates an opportunity for those who have the desire and aptitude for data processing and analysis.

Here is a step-by-step overview of the background and mind-set necessary to embark on a successful career in data science, in addition to some useful tips to get started.

1. Gather the ingredients

A lot of people ask: how do I become a data scientist? As with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, the path to data science can become very rewarding.

Data science is a journey, not a destination. In order to thrive and be successful in a data science role, the candidate should possess a few essential ingredients.

The field of data science is continuously changing as new tools and techniques are introduced. The pace is staggering, so data scientists must keep up with recent developments in order to remain competitive and effective.

In the real world of data science, data is often messy, disorganised and of low quality. It’s important to remember that 80% of the time will be spent exploring the data, understanding its quality and how to transform it into a form that will show the signal you are looking for.
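As a rough illustration of what that exploration can look like, the sketch below counts missing values per column in a small CSV sample. The column names and the sample data are invented for the example, and a real project would more likely use a dedicated data-analysis library; this version uses only the Python standard library.

```python
import csv
import io


def profile_columns(rows):
    """Count missing (absent, empty or whitespace-only) values per column."""
    missing = {}
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            is_missing = value is None or value.strip() == ""
            missing[column] = missing.get(column, 0) + (1 if is_missing else 0)
    return total, missing


# Hypothetical sample data, standing in for a messy real-world export.
SAMPLE_CSV = """customer_id,age,country
1,34,UK
2,,US
3,41,
4,,UK
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
total, missing = profile_columns(rows)
for column, count in missing.items():
    print(f"{column}: {count}/{total} missing")
```

A quick profile like this is often the first hint as to which fields can be trusted and which need cleaning before any modelling starts.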

Data science is very exciting, and with so many tools, algorithms and techniques it’s easy to drown in the details. It’s critical to always remember that, ultimately, data science serves some business goal, and that goal should be front-and-centre for all data science activities.

2. Mix them together

Your approach to developing a data science skillset will likely depend on your previous experience. For instance, Java developers, who are already familiar with software engineering principles and thrive on crafting software systems that perform complex tasks, should begin by understanding the various machine learning algorithms: which ones exist, which problems they solve and how they are implemented.

Additionally, it is useful to have experience with a modelling tool like R or MATLAB, while libraries like WEKA, Vowpal Wabbit and OpenNLP provide well-tested implementations of many common algorithms. Those not already familiar with Hadoop will find that learning MapReduce, Pig, Hive and Spark is valuable.
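For readers who have not met the map-reduce model before, here is a minimal single-machine sketch of the idea in plain Python: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group. The function names and sample documents are made up for illustration, and this is not how Hadoop is actually invoked; a real framework distributes these phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input: each string stands in for one input split.
documents = ["big data is big", "data science needs data"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call and each reduce call is independent of the others, the same logic can in principle be spread over many machines, which is what makes the model attractive for large datasets.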

If your background is SQL, you are likely to have been working with data for many years already and understand how to use data to gain business insights. Using Hive, which allows access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step into the world of big data analytics.
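To give a flavour of the kind of query that carries over, the sketch below runs a simple GROUP BY aggregation from Python. It uses the standard library's sqlite3 module as a local stand-in, purely so the example runs without a cluster; the table, columns and rows are invented. Hive's query language supports this style of SELECT with GROUP BY, although dialect details differ between engines.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "UK", 120.0),
        ("bob", "US", 80.0),
        ("carol", "UK", 45.5),
        ("dave", "DE", 60.0),
    ],
)

# A familiar aggregate query: orders and revenue per country.
query = """
SELECT country, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM orders
GROUP BY country
ORDER BY revenue DESC
"""

results = conn.execute(query).fetchall()
for country, order_count, revenue in results:
    print(country, order_count, revenue)

conn.close()
```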

Statistics or machine learning is another fascinating background on which to build a data science career. The first step would be to get familiar with one or more modern programming languages such as Python or Java.

Although R, MATLAB and SAS are excellent tools for statistical analysis and visualisation, they are typically used for data exploration and model development, and rarely used in isolation to build production-grade data products.

In most cases, it is recommended to mix in other programming languages like Java or Python, and to integrate with data platforms like Hadoop, when building end-to-end data products.

Last but not least, Hadoop developers should gain a deep understanding of machine learning and statistics, and how these algorithms can be implemented efficiently for large datasets.


3. Bake

In the coming years, data scientists will need to use larger and more diverse datasets as adoption of big data stores accelerates in the enterprise. It is therefore essential for data scientists to acquire Hadoop skills in order to incorporate these datasets into their processing flow and take advantage of parallel processing on large clusters.

It’s important to remember that, as with many careers, the road to data science is not a walk in the park. Candidates must be willing to learn new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort and personal investment, but it is definitely rewarding.


Sourced from Ofer Mendelevitch, director of data science, Hortonworks

Ben Rossi
