Applying data science

Forward Internet Group is a London-based digital agency with roots in online search and advertising. In 2009 it acquired a price comparison site that allows users to switch utility, broadband and insurance suppliers.

Analysing large quantities of clickstream data has always been a part of Forward’s work. “Internet search advertising generates a lot of data,” explains lead developer Alex Farquhar. “We’re getting gigabytes of data from Google every single day.”

A few years ago, the volume of data it needed to analyse started to exceed the capacity of its existing database infrastructure. “Every time we wanted to ask a fundamental question the query would take two hours to run,” explains Farquhar.

Some of Forward’s senior developers began investigating alternative technologies. One of these was Hadoop, an open source framework for storing and processing large data sets across clusters of machines, an approach often referred to as ‘big data analytics’. They tried it out on a couple of desktop PCs before going on to invest in a dedicated Hadoop cluster.

Getting to grips with the technology, firmly at the cutting edge back in 2009, was not easy. “At the time, Hadoop was still very new, so it was a little rough around the edges,” says Farquhar. “Fortunately, it was around then that [Silicon Valley start-up] Cloudera launched its distribution of Hadoop. Using that served us very well.

“There was also the mental shift of going from a relational database model to the Hadoop mindset, which is completely different,” he says.

“Once we’d got over that, though, it was obviously a lot better,” Farquhar says. “Jobs that used to take two to three hours would run in five minutes, and that was before any specialist tuning. It was obvious that this was going to be the future.”
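The ‘Hadoop mindset’ Farquhar describes can be illustrated with a minimal sketch. Where a relational database would answer a question with a single GROUP BY query, a MapReduce job splits the work into a map step that emits key/value pairs, a shuffle that groups them by key, and a reduce step that combines each group. The example below is plain single-process Python, not Hadoop's API or Forward's code, and the click records and function names are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical clickstream records: (search keyword, clicks).
# Invented for illustration; not Forward's data.
RECORDS = [
    ("broadband", 3),
    ("insurance", 5),
    ("broadband", 2),
    ("energy", 7),
    ("insurance", 1),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    keyword, clicks = record
    yield keyword, clicks


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

On a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data, which is one plausible reason such jobs can finish far sooner than a single large database query.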

While Forward was getting to grips with the Hadoop technology, it was also thinking about how it could make greater use of the data it collects in order to improve its products and services.

The company was developing skills in the statistical programming language R, and Farquhar (who has a PhD in genetics and statistics) and a colleague spent a month investigating how statistical analysis might be used to improve its services. “We started looking at some real ‘blue-sky’ ideas, such as how the weather affects the likelihood of someone switching electricity suppliers,” he recalls.

Everyday component

Since then, statistical analysis has become an everyday component of Forward’s development culture. Analyses typically take one of two forms: monitoring day-to-day activity on a website, or modelling user behaviour in order to optimise site design and service offerings.

It is not always possible to use the results of this statistical modelling, says Farquhar, as the price comparison site’s ethos is to offer the fairest price to any visitor. “We couldn’t start profiling our users on the basis of their demographic data – that would be against everything we stand for.

“However, it’s still really useful to ask questions about who our users are and what they want,” he adds. “It helps us to offer better products.”

Forward’s growing reliance on statistical analysis means that the company requires what Farquhar describes as ‘data scientists’. “We need people who are not just good at statistics, but who also have the skills to bring together different sorts of data and clean it up so that they can do meaningful analysis.”

Why not just hire developers and statisticians separately? “We want people to be multi-skilled,” says Farquhar. “It’s a lot faster and more efficient if people can do many jobs. If we had separate teams, it wouldn’t scale because there would be too many lines of communication.”

However, these people are hard to find, Farquhar says, and the availability of suitably skilled employees may prove to be a bottleneck in the company’s use of data science.


Alan Dobie

Alan Dobie is assistant editor at Vitesse Media Plc. He has over 17 years of experience in the publishing industry and has held a number of senior writing, editing and sub-editing roles. Prior to his current...
