The foundations of big data

Hadoop

In 2003 and 2004, Google published papers detailing its distributed file system and its MapReduce cluster computing algorithm. Software engineer Doug Cutting used these papers to create Hadoop, an open source framework for data-intensive, scalable, distributed computing. The framework was given an additional boost in June 2009, when a prominent user, Yahoo, made the source code of the version of Hadoop running in its data centres publicly available. The next version of Apache Hadoop, expected later in 2011, aims to improve the utilisation, scheduling and management of cluster resources.

MapReduce

A software framework, patented by Google, for the distributed processing of large data sets on compute clusters. Hadoop MapReduce is an open source implementation, developed as a volunteer project under the Apache Software Foundation and inspired by Google’s original paper.
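The MapReduce model can be illustrated with the classic word-count example. The sketch below runs the three phases sequentially on one machine; in a real Hadoop cluster the map and reduce tasks run in parallel across many nodes, with the framework handling the shuffle between them.

```python
from collections import defaultdict

# A minimal single-machine sketch of the MapReduce model.
# On a Hadoop cluster, map and reduce tasks run in parallel
# across nodes; here the phases simply run one after another.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate pairs by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

The key property is that the map and reduce functions are side-effect free, which is what lets the framework distribute them freely and rerun failed tasks.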

NoSQL

A set of non-relational database management systems popular with many big data advocates. Cassandra is one example: developed by social networking giant Facebook to power its Inbox search feature, it is designed to spread massive amounts of data across many commodity servers while providing a highly available service with no single point of failure.
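The "no single point of failure" property comes from the way such stores place data: each key is hashed onto a ring of nodes and every value is written to several of them. The toy sketch below illustrates the idea; the node names and replication factor are hypothetical, and real Cassandra uses consistent hashing with tunable consistency levels.

```python
import hashlib

# Illustrative sketch of Cassandra-style data placement (not the
# real implementation). Keys hash to a position on a ring of nodes,
# and each value is written to REPLICATION_FACTOR consecutive nodes,
# so a read can be served even if one replica is down.

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION_FACTOR = 2

store = {node: {} for node in NODES}  # each node's local storage

def replicas_for(key):
    """Choose REPLICATION_FACTOR consecutive nodes, starting at the
    ring position the key hashes to."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def put(key, value):
    """Write the value to every replica node for the key."""
    for node in replicas_for(key):
        store[node][key] = value

def get(key):
    """Read from the first replica that still holds the key."""
    for node in replicas_for(key):
        if key in store[node]:
            return store[node][key]
    return None

put("inbox:user42", ["msg1", "msg2"])
```

Because every key lives on more than one commodity server, losing any single node leaves the data readable from another replica.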

Hive

Another Facebook contribution to the big data toolset, Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for extract, transform and load (ETL) operations, imposing a structure on data stored in Hadoop files that makes it possible to query and analyse them at scale. It was originally designed to help Facebook cope with explosive growth in its multi-petabyte data warehouse.
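The extract-transform-load pattern Hive supports can be sketched in miniature. The file format, field names and aggregation below are purely illustrative; Hive itself expresses this kind of work as SQL-like HiveQL queries that are compiled into MapReduce jobs over files in Hadoop.

```python
import csv
import io

# A toy ETL pipeline in the spirit of what Hive automates over
# Hadoop files. The raw data, schema and aggregation here are
# hypothetical examples, not Hive's actual interface.

raw = (
    "2011-06-01,search,alice\n"
    "2011-06-01,message,bob\n"
    "2011-06-02,search,alice\n"
)

def extract(text):
    """Extract: read raw comma-separated rows from the source."""
    return list(csv.reader(io.StringIO(text)))

def transform(rows):
    """Transform: impose a schema (day, event, user) and count
    events per day."""
    counts = {}
    for day, event, user in rows:
        counts[day] = counts.get(day, 0) + 1
    return counts

def load(counts, warehouse):
    """Load: write the aggregated rows into the warehouse table."""
    warehouse.update(counts)

warehouse = {}
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'2011-06-01': 2, '2011-06-02': 1}
```

In Hive the transform step would be a single HiveQL `GROUP BY` query, with the framework distributing the work across the cluster.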

Ben Rossi
