Putting the R in analytics
Open source statistical programming language R, already popular among academics, is making inroads into the enterprise
Businesses crave meaning from their data more than ever before. Business intelligence and analytics ranked as the number one technology priority among CIOs in analyst company Gartner's latest poll, beating both mobile technology and cloud computing.
But for some organisations, the analytical capabilities of traditional BI tools - or even of that ubiquitous data analysis stalwart, Microsoft Excel - are proving insufficient to provide the kind of answers that confer competitive advantage.
The class of analytical technologies that offer businesses more than just the ability to spot trends in historical data is often described as advanced analytics.
Writing in December 2011, then-Forrester Research analyst James Kobielus (who has since joined IBM) asserted that growing interest in advanced analytics stemmed from “many users’ desire to take their BI investments to the next level of sophistication, leveraging multivariate statistical analysis, data mining, and predictive modeling”.
As these terms suggest, these analyses are not the stuff of management reports compiled by everyday employees. The ‘advanced analytics’ trend can be seen as the growing use of sophisticated statistical analysis in business.
Of course, many businesses – especially in the finance and pharmaceuticals industries – have employed statisticians for years. There is an established market for software tools for these statisticians to use, dominated by the SAS Institute and SPSS (which has also joined IBM).
But, as with most software markets these days, there is an established open source alternative available, in this case named R. And what R lacks in commercial backing, say its proponents, it makes up for in community contributions, and therefore the pace of innovation.
Already popular in universities, R is showing signs of increasing adoption in the enterprise. This promises to lower the barriers to entry for advanced analytics, and may accelerate the mathematisation of business management.
The story of R
The predecessor to R is, counter-alphabetically, S, a statistical programming language created at US innovation powerhouse Bell Labs in 1976.
According to John Chambers, one of the programmers who worked on the project, the characteristics of S reflect the circumstances in which it was developed.
“To understand the nature of S, it's important to note the context and motivation,” Chambers writes in Bell Labs' official history of the S project. “We were looking for a system to support the research and the substantial data analysis projects in the statistics research group at Bell Labs. However, little or none of our analysis was standard, so flexibility and the ability to program were essential from the start.”
Unlike the statistical analysis applications that still dominate today, S is a programming language. The key concept is to allow statisticians to “program with data”, in other words to write repeatable routines that perform statistical analyses on whatever data they are given.
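To make the idea of “programming with data” concrete, here is a minimal sketch. It is written in Python rather than S or R, purely for illustration, and the function names and sales figures are hypothetical. The point is only that the analysis is a reusable routine, not a one-off sequence of menu clicks.

```python
import statistics


def summarise(values):
    """Return a small statistical summary of any numeric sample."""
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # A sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
    }


def compare_groups(groups):
    """Apply the same summary routine to every named group of observations."""
    return {name: summarise(vals) for name, vals in groups.items()}


# Hypothetical data, invented for this example only.
sales = {"north": [12.0, 15.0, 11.0, 14.0], "south": [20.0, 18.0, 25.0]}
report = compare_groups(sales)
for region, stats in report.items():
    print(region, stats)
```

Once written, the same routine can be rerun unchanged whenever new data arrives, which is the repeatability the S designers were after.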
In 1988, statistics professor R. Douglas Martin began selling a commercial version of S named S-PLUS. Twenty years later, the company that sprang from Martin's S-PLUS business, Insightful Software, was acquired by middleware vendor TIBCO.
“S-PLUS had pretty much levelled off as a business by then,” says David Smith, VP for marketing and community at Revolution Analytics, who worked for Insightful Software until a few months before the acquisition. “At the same time, advanced analytics were becoming increasingly important in industry. TIBCO saw the opportunity to acquire a company with good technology at a good price.”
The reason S-PLUS had levelled off, Smith says, is the emergence of R. Devised in 1991 by two statistics professors at the University of Auckland, R is an open source project based on the S language, and is available for free under the GNU General Public License.
There is a core team of 20 developers who work on R, including S inventor Chambers, but the open source nature of the language means there is also a community of thousands of developers and academics building extensions and plug-ins.
Universities are typically the biggest users of statistical software and, according to Smith, R really began to take off in academic circles around 2004.
Clearly, the fact that it is free was part of the appeal, he says, but it was not everything. “It wasn't like now when no universities have any money; back then institutions could still afford to buy software,” he says. “It was more the open aspect that was really important to R in academia. People could easily contribute to the project, adding new statistical methods for other people to use.”
This popularity in academia means that R is being taught to statistics students, says Matthew Aldridge, co-founder of UK-based data analysis consultancy Mango Solutions. “We're seeing a lot of academic departments using R, versus SPSS which was what they always used to teach at university,” he says. “That means a lot of students are coming out with R skills.”
Into the enterprise
Finance and accounting advisory Deloitte, which uses R for various statistical analyses and to visualise data for presentations, has found this to be the case. “Many of the analytical hires coming out of school now have more experience with R than with SAS and SPSS, which was not the case years ago,” says Michael Petrillo, a senior project lead at Deloitte's New York branch.
This is one reason why R adoption has started to pick up in the enterprise, says Mango's Aldridge. “If you've hired someone out of university and they say they want to use this tool, you are not going to stop them – especially if it's free.”
Plus, like universities before them, enterprise adopters benefit from the ecosystem of plug-ins and extensions, each representing a particular analysis or data model that might help answer their questions.
Aldridge and his Mango colleague Pugh report that most interest in R comes from the pharmaceutical and financial services sectors.
“In pharmaceuticals, we see R being used in drug discovery and analysis,” says Aldridge. “In finance, it's often used in algorithmic trading: testing new trading strategies and so on.”
One large insurance company uses R as a reporting environment, he says. “They like the flexibility, and the whole idea of not being locked into a commercial vendor.”
But there is growing interest outside these core sectors, he adds, and Mango has worked on deployments for retailers and energy companies.
One disadvantage for R in the enterprise is that there is no commercial organisation to take responsibility for the software. “Because R doesn’t have a company behind it that you can sue, our customers are often interested in some sort of insurance policy,” explains Aldridge.
This is especially true for pharmaceutical companies, whose drug development processes are highly regulated. “Proving that R gives the right answers is a crucial part of the work that we do for pharmaceutical companies,” he says.
Mango Solutions therefore has developed a process to demonstrate the quality of the software. “Rather than download R from the website, we'll give companies a version where we've run lots of tests to make sure that when you add one and one, you really get two, plus a whole documentation trail that shows we've gone about it in the right way,” says Pugh.
Aldridge adds, however, that the R community serves as a rapid testing environment for every new version of the code. “As soon as a new version has been released, you've got thousands and thousands of statisticians going over the code, whereas I’m pretty sure there are elements of commercial products that haven't been tested for some time.”
TIBCO, meanwhile, says that its accountability for the S-PLUS (now known as S+) code base is one of the reasons why an organisation might use its commercial alternative to R. “S+ is a commercial product with well orchestrated, consistent versioning produced by full-time engineering and quality control personnel,” explains Brad Hopper, senior director for industry applications and innovation.
“In some industries, non-validated tools are simply not allowed for certain kinds of analysis,” he adds. “The risk of incorrect results and no single responsible organization for recourse can be a detractor for any open source product. We don’t suggest these risks are high, just that different organisations weigh them differently.”
Addressing the shortcomings
Another possible shortcoming of R is the fact that it analyses data in memory. The quantity of data that it can analyse is therefore constrained by the memory capacity of the machine that is running it.
“If you’ve got a very large amount of data you want to analyse, you can't just plug it into R directly,” says Mango’s Pugh.
R’s open source nature means that the community has built connectors for most database platforms, meaning it can be paired with other technologies to overcome this weakness. “The best way to use R is to play to its strengths, and to work around its weaknesses by integrating it with other technologies,” he says.
“Fortunately, its open nature makes that really easy.”
He warns, however, that not all connectors are made equal. “The quality of the connectors varies, depending on why each was developed in the first place,” he says. “There are over 3,600 add-ons to R, and it can be quite difficult to tell the good ones from the bad ones.”
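The general pattern behind pairing an in-memory analysis tool with a database can be sketched as follows. This is an illustration in Python using the standard library's sqlite3 module, not one of the R connectors discussed here, and the table, column names and figures are hypothetical. The database does the heavy aggregation, so only one small row per group ever reaches the analysis environment.

```python
import sqlite3


def mean_by_group(conn, table, group_col, value_col):
    """Ask the database to aggregate; only one row per group is fetched.

    The table and column names are interpolated into the SQL text, so this
    sketch assumes they come from trusted code, never from user input.
    """
    query = (
        f"SELECT {group_col}, COUNT(*), AVG({value_col}) "
        f"FROM {table} GROUP BY {group_col} ORDER BY {group_col}"
    )
    return {group: (n, avg) for group, n, avg in conn.execute(query)}


# A tiny in-memory database stands in for a large external one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, pnl REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("fx", 1.0), ("fx", 3.0), ("rates", 10.0)],  # invented figures
)
summary = mean_by_group(conn, "trades", "desk", "pnl")
print(summary)
```

The same division of labour applies whatever the database: the raw rows stay where they are, and the statistical environment works on the much smaller result.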
US company Revolution Analytics has taken a different approach to overcoming R’s weaknesses. The company has taken the core R engine and made certain modifications that it sells commercially.
“We have added to the core R algorithms so that firstly they can run in parallel, which means they can use multiple processors and therefore run faster,” explains David Smith. “Secondly, they are distributed and run out-of-memory, which means they are not limited by the memory size, and they can be spread across a big cluster or cloud.”
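The idea Smith describes, splitting the data into chunks, summarising each chunk independently and then combining the partial results, can be sketched in a few lines. This is a generic Python illustration and not Revolution Analytics' implementation. The thread pool merely stands in for multiple processors or machines, and the data is a small in-memory list where a real system would read chunks from disk.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_summary(chunk):
    """Reduce one chunk to three numbers: count, sum and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(summaries):
    """Merge per-chunk summaries into an overall mean and sample variance.

    The sum-of-squares formula is simple but can lose precision on very
    large or badly scaled data; it is used here only to keep the sketch short.
    """
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


def chunks(values, size):
    """Yield successive slices, so only one chunk is handled at a time."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Stand-in data: the numbers 1 to 100.
data = [float(x) for x in range(1, 101)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(chunk_summary, chunks(data, 25)))
mean, variance = combine(partials)
print(mean, variance)
```

Because each chunk is reduced to a handful of numbers before anything is combined, the full dataset never has to sit in one machine's memory at once.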
The company has also developed a server version of the software, which can expose the analytics functionality of the software as a web service API. “This means it can integrate with a data layer, and do advanced statistical processing based on routines and analyses that have been uploaded by the statisticians.”
Smith argues that these enhancements are necessary if R is to be applied to ‘big data’, i.e. data whose volume, velocity and variability outstrip the capabilities of conventional relational databases.
Deloitte is currently preparing a big data pilot using Revolution Analytics’ enhanced R product. “We are using the server-based version of Revolution R to investigate big data analysis capabilities,” says Petrillo.
“We are looking at integration options to [big data programming platform] Hadoop, as well as ability to integrate R code into other applications via a web services framework.”
The most data-intensive application of the technology so far, Smith says, has been hedge funds using it to back-test investment portfolio strategies, simulating the effect of a strategy on years’ worth of financial trading data.
Clearly, this is the realm of specialist statistical expertise. And while not all R deployments will demand these big data enhancements, the software is certainly for use by trained statisticians.
Still, with business demand for advanced analytics growing, and cash-strapped universities turning to free and open software, any organisation that employs statisticians can expect to see a copy of R in its IT environment sooner or later.