The Internet of Things will turn Hadoop architectures on their head

The Internet of Things is the concept of a ubiquitous network of devices that facilitates communication between the devices themselves as well as between the devices and humans.

Use cases can be grouped along the scope of an application: ‘personal IoT’ (with the focus on a single person), ‘group IoT’ (setting the scope on a small group of people), ‘community IoT’ (usually in the context of public infrastructure, such as smart cities), and the ‘industrial IoT’ (dealing with apps either within an organisation or between organisations).

Processing data that IoT devices generate lends itself to the big data approach, which means using scale-out techniques on commodity hardware – in a schema-on-read fashion along with open interfaces, such as the Apache Spark API.
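The schema-on-read idea can be sketched with plain Python: raw records are stored as-is, and a schema is applied only when the data is read. The field names and records below are illustrative, not from any real deployment.

```python
import json

# Raw sensor readings stored exactly as they arrived; a schema-on-write
# store would reject the inconsistent fields, a schema-on-read store keeps them.
raw_records = [
    '{"device": "t-001", "temp_c": 21.5}',
    '{"device": "t-002", "temp_c": 19.0, "battery": 0.87}',
    '{"device": "h-001", "humidity": 0.55}',
]

def read_temperatures(lines):
    """Apply a schema at read time: keep only records carrying a temp_c field."""
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            yield record["device"], record["temp_c"]

print(list(read_temperatures(raw_records)))
```

The same pattern is what a call such as Spark's JSON reader performs at scale: the interpretation of the bytes is deferred until query time.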

In order to develop a full-blown IoT application, businesses need to be able to store all the incoming sensor data to build up historical references. There are also dozens of data formats in use in the IoT world, and none of the sensor data is relational per se. Furthermore, many devices generate data at a high rate, which businesses must be able to cope with in an IoT context.

There are a number of common requirements for an IoT data processing platform.

First, the platform should be able to natively deal with IoT data, both in terms of data ingestion and processing.

Second is support for a variety of workload types. IoT applications usually require that the platform support stream processing from the get-go and be capable of handling low-latency queries against semi-structured data items, at scale.
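As a minimal sketch of the stream-processing workload described above, the snippet below computes a tumbling-window average per device over a stream of readings. The event tuples and window size are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, device, value) events arriving as a stream.
events = [
    (0, "s1", 10.0), (2, "s1", 12.0), (4, "s2", 8.0),
    (61, "s1", 11.0), (65, "s2", 9.0),
]

def windowed_average(stream, window_s=60):
    """Tumbling-window average per device -- the kind of continuous
    aggregation a stream processor maintains over sensor data."""
    windows = defaultdict(list)
    for ts, device, value in stream:
        # Integer division buckets each event into its window.
        windows[(ts // window_s, device)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in windows.items()}

print(windowed_average(events))
```

A production stream processor adds exactly what this sketch lacks: incremental state, out-of-order handling and fault tolerance.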

Third is business continuity. Commercial IoT applications usually come with SLAs in terms of uptime, latency and disaster recovery metrics. The platform should hence be able to guarantee those SLAs innately. This is especially critical in the context of IoT applications in domains such as health care, where people’s lives are at stake.

Finally, the platform must ensure secure operation, which is currently considered challenging to achieve in an end-to-end manner. And the privacy of users must be guaranteed by the platform – from data provenance support through data encryption to masking of the data.


Having discussed general requirements for an IoT data processing platform in terms of workloads and operational aspects, it is also important to focus on the insight that every entity in the IoT will be uniquely identifiable and addressable.

This observation is likely more obvious for the participating devices than for, say, humans possessing or operating said devices.

Further, an essential functionality of an IoT data processing platform is authentication – that is, the process of confirming the identity claimed by a participating entity.

A concrete example of a large-scale, real-world deployment of an IoT system that provides authentication of human users based on biometric information is the Aadhaar project.

The Unique Identification Authority of India (UIDAI) is tasked with providing a unique identifier for each of India’s 1.2 billion residents, through the Aadhaar project.

The underlying idea is to enable residents who previously had no formal means of identification to participate in everyday commercial life, such as opening a bank account.

Additionally, the system is expected to reduce the embezzlement of government subsidies, mainly caused by fraudulent claims, by as much as $1.3 billion.

In a nutshell, Aadhaar is a biometric database, covering iris scans, digital fingerprints, a digital photo, and demographic data per resident, on an opt-in basis. Introduced in 2010, the system now has close to 500 million registered residents.

Several measures to ensure the security of resident data have been taken – from the time it is captured all the way to how it is stored.

The use of 2048-bit PKI encryption and tamper detection using HMAC ensures that no one can decrypt and misuse the data, and resident data and raw biometrics are always kept encrypted.
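The HMAC-based tamper detection mentioned above can be illustrated with Python's standard `hmac` module. This is a sketch only: the key handling and payload are invented for the example, and the PKI encryption layer Aadhaar combines with HMAC is omitted here.

```python
import hashlib
import hmac
import os

# Illustrative shared secret key; a real system would manage keys via PKI.
key = os.urandom(32)

def seal(payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so any tampering is detectable."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()

def verify(sealed: bytes) -> bool:
    """Recompute the tag over the payload and compare in constant time."""
    payload, tag = sealed[:-32], sealed[-32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

msg = seal(b'{"resident_id": "...", "biometric_hash": "..."}')
tampered = b"X" + msg[1:]  # flip the first payload byte

print(verify(msg))       # the intact message verifies
print(verify(tampered))  # the altered message fails verification
```

The constant-time comparison via `hmac.compare_digest` matters: a naive byte-by-byte comparison would leak timing information to an attacker.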

The Aadhaar system routinely performs as many as 4.73 million authentications per minute. These authentication requests come with a latency SLA of 200 milliseconds or less over the potentially more than 1 billion residents.
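A quick back-of-the-envelope calculation puts the reported peak load in per-second terms:

```python
# Convert the reported peak authentication rate to requests per second.
auths_per_minute = 4_730_000
auths_per_second = auths_per_minute / 60
print(round(auths_per_second))  # roughly 78,833 requests per second
```

Sustaining tens of thousands of sub-200 ms lookups per second is what forces the scale-out designs discussed earlier.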

It’s fair to assume that comparable requirements in terms of scale and latency will be found increasingly in IoT applications, especially ones with a large number of participating entities (humans and devices alike).

To conclude, a data-processing platform must meet a basic set of high-level requirements in order to be fit for the torrent of data from IoT devices. The Aadhaar project demonstrates it is feasible to successfully roll out IoT applications at scale.


Sourced from Michael Hausenblas, chief data engineer, MapR Technologies
