The definitive guide to the modern database

 

Data streams underpin the modern world. To get a sense of how pervasive data processing has become, consider two examples.

First, money withdrawn from one account appears in another, and there are numerous ways to make that transfer. Second, a mobile phone breaks, yet its data is already stored in the cloud, ready to be restored to a fresh device.

While the database management system (DBMS) concept is widely known, the research area has still not fully matured since its establishment in the 1970s. This has led to numerous vertical DBMS solutions, each applicable to particular cases.

Human-generated vs. others

With help from the internet, digital data has grown exponentially. Every day, new data is generated from previously available information. In general, this corresponds to mathematical models of knowledge growth, in which knowledge acquisition is self-accelerating. According to Eric Schmidt’s estimate, 5 exabytes of digital content existed prior to 2003, while by 2010 that amount of content was being generated in a matter of days.

At the same time, data processing has been a commercial necessity in many domains since the end of the 1980s, and DBMSs have played a central role ever since. From the point of view of commercial business applications, the main interest lies in human-generated data, such as banking transactions.


Structured vs. unstructured

The concept of ‘structuredness’ becomes important when talking about human-generated data. To differentiate structured data from unstructured, take a name such as “John”. Treated just as a sequence of four letters, or as one word, it is unstructured data. The main source of such data today is the human-written part of the internet.

In contrast, if this sequence of letters is known to be the value of a “Name” field in a particular CRM, then the data is structured, because its meaning in a certain context is defined. Another good example of a structured set is any business-critical transactional database, because the contents of a transaction are always defined.

Observing real-world structured datasets reveals a counterintuitive fact: their sizes are fairly moderate in practice. Even if a particular set grows exponentially, its growth rate is orders of magnitude smaller than the exabyte rates of the unstructured internet. As an example, a recent estimate puts a year of Amazon transactions at only 56 gigabytes in total.

Transactions vs. analytics

The evolving big data phenomenon has highlighted the importance of splitting DBMSs into two kinds: those serving transactions in the classic sense (OLTP, online transaction processing) and those performing data analysis (OLAP, online analytical processing), that is, deriving measurements and new observations from existing data sets.

The two technologies offer two views of the same data, each suited to solving certain tasks efficiently. Write conflicts may be frequent in transactions (as when two agents withdraw from the same bank account), while in analytics the whole data set can often be treated as static, so that read queries run at peak performance.
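The write conflict mentioned above can be made concrete with a minimal Python sketch. The `Account` class and the amounts are hypothetical, chosen only for illustration; a real DBMS resolves such conflicts with its own locking or versioning scheme rather than an application-level lock.

```python
import threading

class Account:
    """Hypothetical in-memory account, used only for illustration."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        """Check and update the balance as one indivisible step."""
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True

# Unsafe interleaving, replayed by hand: both agents read before either writes.
balance = 100
seen_by_a = balance        # agent A reads 100
seen_by_b = balance        # agent B reads 100
balance = seen_by_a - 80   # A writes 20
balance = seen_by_b - 80   # B writes 20 again: 160 paid out of 100
lost_update_balance = balance

# Safe version: the lock serialises the two withdrawals.
account = Account(100)
results = []
threads = [
    threading.Thread(target=lambda: results.append(account.withdraw(80)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the safe version exactly one of the two withdrawals succeeds, whichever thread runs first. An analytical read over static data needs no such coordination, which is part of why read queries can run so much faster.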

In fact, in many business applications, heavy analytics are run overnight on a snapshot of the transactional data, often on a separate server, to deliver results the next day. Sometimes, as in search engines or high-frequency trading, analytics have to be performed in parallel with current transactions. In this respect, hybrid solutions that perform well at both OLTP and OLAP are now considered cutting edge.
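The snapshot approach can be sketched with Python's built-in `sqlite3` module. SQLite stands in here for a production DBMS, and the `transfers` table and its rows are made up for the example.

```python
import sqlite3

# "Live" transactional database (in memory, for the sake of the example).
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE transfers (account TEXT, amount INTEGER)")
live.executemany(
    "INSERT INTO transfers VALUES (?, ?)",
    [("alice", 50), ("bob", 20), ("alice", 30)],
)
live.commit()

# Take a snapshot: copy the whole database into a separate connection.
snapshot = sqlite3.connect(":memory:")
live.backup(snapshot)

# Transactions keep arriving on the live database after the snapshot.
live.execute("INSERT INTO transfers VALUES ('bob', 500)")
live.commit()

# Analytics run against the static snapshot and never see the later write.
totals = dict(
    snapshot.execute(
        "SELECT account, SUM(amount) FROM transfers GROUP BY account"
    )
)
```

The aggregate query reads data that cannot change underneath it, so it takes no part in write conflicts. The price is staleness: the late transfer to `bob` is missing from `totals`.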

In-memory vs. disk-based

OLTP DBMSs attract unrelenting attention because OLTP is a foundational technology for many existing and evolving businesses, including digital advertising and electronic trading.

As stated by Dr. Michael Stonebraker, a pioneer in DBMS R&D, the best way to organise data management is to run a specialised OLTP engine on current data. Unlike the OLTP/OLAP split, today’s division of DBMSs into disk-based and memory-based is in fact quite odd.

Look at the cost of 1 MB of RAM and you’ll see a substantial drop around 2005, after which business-critical transactional databases could be held entirely in memory. In fact, moderately priced modern server configurations now include 128 GB of RAM.

Disk-based databases are still of high importance in the OLAP world, but this is not the case for the majority of transactional applications, which benefit greatly from in-memory solutions. Memory-optimised databases do still use disks to implement persistence and reliability features but, in contrast to disk-based DBMSs, they do so in a more “gentle” and efficient way.
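One common way for an in-memory store to stay durable is an append-only log that is replayed on restart. Whether this is what “gentle” refers to here is an assumption; the sketch below only illustrates the general idea, and `LoggedStore` is a hypothetical name, not any product's API.

```python
import json
import os
import tempfile

class LoggedStore:
    """Hypothetical in-memory key-value store with an append-only log."""

    def __init__(self, log_path):
        self.data = {}
        # Recovery: replay every logged write in order.
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value
        self._log = open(log_path, "a", encoding="utf-8")

    def put(self, key, value):
        # Sequential append to disk first, then the in-memory update.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads never touch the disk.
        return self.data.get(key)

    def close(self):
        self._log.close()

log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(log_path)
store.put("alice", 100)
store.put("alice", 70)
store.close()

# Simulate a restart: a fresh instance rebuilds its state from the log.
recovered = LoggedStore(log_path)
recovered_balance = recovered.get("alice")
recovered.close()
```

All reads are served from memory, and the disk only ever sees sequential appends, which are far cheaper than the random page updates of a disk-based design.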


Hardware vs. software, scale-in

According to Dr. Stonebraker, the input in a wide variety of OLTP applications is a high-velocity stream, so speed remains the top concern for transactional purposes.

One way of increasing speed can come from better hardware. But if you try to increase server performance by adding faster CPUs, faster disks and/or faster memory, you might find that the speed isn’t increasing substantially or at all.

This can be explained by physics: even inside a powerhouse server, the distance data must travel through wires and chips stays basically the same. Data in an imaginary silicon cable on the floor of the Atlantic Ocean would travel faster, because its path would be straight.

Considering this observation, one has to put all pieces of data, as well as the software which processes them, as physically close as possible to gain top performance, which is called ‘scaling-in’.

This means removing proxy layers, such as the interfaces of a memory-based DBMS. Taken to the limit, one can imagine a case where all objects inside the application optionally become database objects, so that they can be retrieved through SQL, persisted to disk and written without concurrency conflicts.

In this case the database engine and the application are scaled-in together at the operating system level, so both use the same data without transmitting it through proxies. A further benefit of such an approach is greater ease of use: monotonous code for dealing with wrappers is no longer needed, and for some users this can be a more decisive factor than performance.

Adding faster standard components is insufficient to improve hardware performance. One might recall the recent fight against the laws of physics in high-frequency trading in the US, where microwave transmitters were deployed to gain a few nanoseconds per transaction.

The same problem of overcoming transmission limits needs a solution in current microchip manufacturing. Introducing a new technology as a replacement for silicon sounds too futuristic. A more realistic answer lies in the emerging Systems-on-Chip field, which already produces chips with modern media codecs implemented in silicon.

The same trend might take hold in the transactional world, because it is directly about scaling-in at the level of both hardware and subject domain.

Scale-in vs. scale-out, consistent vs. inconsistent

Another option to increase speed is to partition data across several computers, expecting performance to grow linearly with the number of servers. This is called ‘scaling-out’.
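Partitioning is often done by hashing each record's key to pick a server. The sketch below assumes three shards, modelled as plain dictionaries in one process; the helper names `shard_for`, `put` and `get` are made up for the example.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Each dictionary stands in for a separate server.
shards = [dict() for _ in range(3)]

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

for name, balance in [("alice", 100), ("bob", 50), ("carol", 75)]:
    put(name, balance)
```

A lookup touches only one shard, which is where the hoped-for linear speed-up comes from. An operation that spans two keys on different shards has no such shortcut.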

The catch is that a scaled-out system loses transactional consistency. Consistency means that the system is in a non-conflicting state at all times. For example, when money is transferred from one account to another, the operation should either end with the same amount arriving in the other account or be discarded in full.
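On a single machine, the all-or-nothing behaviour described above is what a database transaction provides. Here is a minimal sketch using Python's built-in `sqlite3` module; the `accounts` table, the `transfer` helper and the amounts are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
)
conn.commit()

def transfer(conn, source, target, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

first_ok = transfer(conn, "alice", "bob", 60)    # succeeds
second_ok = transfer(conn, "alice", "bob", 60)   # would overdraw alice
```

The second transfer credits `bob` before the debit fails, yet the rollback discards that credit, so the total amount of money never changes. Once the two accounts live on different servers, getting the same guarantee requires coordination between them.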

This trade-off is formalised in the CAP theorem, which has been proved mathematically: a distributed system cannot guarantee consistency, availability and tolerance to network partitions all at once. In this respect, scaling-in to a single machine instead of scaling-out seems the most reasonable alternative for transactional applications that depend on consistency.


SQL vs. NoSQL, ‘classical’ vs. ‘modern’

The third option to gain performance is to abandon some DBMS features. Between 2007 and 2012 there was a shift from ‘classical’ DBMSs with rich SQL-based syntax to NoSQL DBMSs: lightweight, non-consistent and fast in-memory stores, sometimes offering substantially restricted querying.

This was mainly about offering a completely different alternative to widely used, but expensive, corporate DBMSs. Reassessing the state of NoSQL, DBMS users came to the conclusion that SQL is a key component in reducing the costs of DBMS adoption and ownership, even outside large enterprises.

Now they request even richer syntax, supporting graph operations and basic OLAP right out of the box. This resulted in the emergence of NewSQL systems, which offer performance, elements of consistency, and some SQL features. The hybridisation of SQL and NoSQL solutions is an increasing trend, leading to even more competition and innovation in the market.

 

Sourced from database company Starcounter


Ben Rossi
