Mining the metadata and more: Tips for good AI data storage practices

It is now common knowledge that the accuracy and efficacy of AI depend on the data used to develop, train, and test the models. More data can sometimes translate into better AI outcomes, but that isn’t always the case. Data sets for AI require proper labelling to guide the training process, as well as sufficient variety to eliminate naive conclusions and bias. It should come as no surprise that having the right data is one of the most crucial components of AI success. What is surprising, however, is that the underlying storage that holds and manages all that precious data does not get the same scrutiny.

Data scientists, who tend to start and grow AI projects, rarely have deep experience with storage. In the majority of cases, AI starts small using local storage on a system with one to four GPUs. The project will typically then expand by tackling more data sets, creating related projects with the same data, or collaborating among team members. Inevitably, the team will need to purchase additional systems, and centralized storage will become a requirement. Unfortunately, a typical AI team defaults to whatever is perceived to be a “fast array,” thinking that fast storage is the best answer. However, just as more data alone does not guarantee better AI, fast storage alone does not guarantee fast results.

In today’s data-intensive world, having the right storage strategy to get the most out of your AI is table stakes. If you’re not evaluating your storage plans appropriately, are you really able to take advantage of all AI has to offer? The answer, quite frankly, is no. If you want to increase the productivity of your AI team and meet the performance requirements of AI workloads, then you need to take a closer look at how your business is approaching data storage. We suggest five strategies to consider when choosing storage for AI.

Embrace the AI Data Pipeline

Developing AI is not a single workload or application. It requires collecting data from multiple sources, organising the data to make it useful, analysing the data with a variety of frameworks and then delivering the model to be used across the organisation. Most of the time spent developing AI goes into collecting and managing the data rather than training on GPUs. The pipeline therefore has multiple steps, with different data tools being used at different times.

Storage for AI should be ready to support the entire data pipeline. It may need file support for the AI frameworks and Hadoop Distributed File System (HDFS) protocol support for the Spark tools used to collect and ingest the data. A single data repository eliminates the overhead of copying data from one system to another. It encourages good data practices of labelling and organisation, which also simplifies team collaboration. It also reduces the total cost of data.

Use your Metadata

Keeping track of data usage across multiple projects and teams who use different applications or frameworks is a challenge. Modern storage systems can simplify tracking and reporting on data use by leveraging metadata. Metadata is data about the data being used; it can track attributes such as when a piece of data was last modified and by whom.

Extensible metadata that adds custom data labelling is common in object storage and available in some distributed filesystems. The metadata can be used to track data origin, add labels and even tag data used for different AI models. There are also emerging data governance tools and metadata management solutions that use APIs to automate data tagging and indexing across different types of storage.
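As a rough illustration of the idea, the sketch below combines the attributes a filesystem already records (size, modification time, owner) with custom labels in a small in-memory catalogue. The function names `describe` and `find_by_label`, the label keys, and the sample file are hypothetical; a real deployment would rely on the metadata features of the storage system itself rather than a Python list.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe(path, labels=None):
    """Return built-in file attributes plus optional custom labels."""
    info = os.stat(path)
    return {
        "path": path,
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "owner_uid": info.st_uid,  # numeric owner id (always 0 on Windows)
        "labels": dict(labels or {}),  # custom tags, e.g. origin or model
    }


def find_by_label(catalog, key, value):
    """Return the paths of catalogue records whose custom label matches."""
    return [r["path"] for r in catalog if r["labels"].get(key) == value]


# Usage: catalogue one temporary file with hypothetical labels.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"id,label\n1,cat\n")
    sample_path = f.name

catalog = [describe(sample_path, {"origin": "camera-feed", "model": "classifier-v1"})]
matches = find_by_label(catalog, "model", "classifier-v1")
os.remove(sample_path)  # clean up the temporary file
```

Here `matches` holds the path of every catalogued file tagged for the hypothetical `classifier-v1` model, which is the kind of query that makes data reuse across projects easier to track.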

Consider Data Tiering

Storage for AI seems to always be growing. Once data has been collected and organised, it is easier to keep it than to recreate it. New projects can be built upon existing data sets. Training and validating new models on old data is typical. However, keeping a large and growing data set on fast primary storage can bust budgets. Automated data tiering driven by usage patterns is widely available. Choose tiering, rather than archiving data, because it keeps that data available for the data scientists. Typical architectures tier flash to hard drives or tier fast file to scalable object storage. Data remains accessible at lower costs.
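A usage-based tiering rule can be sketched in a few lines. The example below is only an illustration, assuming a single policy input (time since last access) and a hypothetical 30-day threshold; the tier names `flash` and `capacity` and the data set names are made up, and commercial tiering engines weigh more signals than this.

```python
from datetime import datetime, timedelta


def choose_tier(last_access, now, hot_days=30):
    """Return 'flash' for recently used data, otherwise 'capacity'."""
    age = now - last_access
    return "flash" if age <= timedelta(days=hot_days) else "capacity"


def plan_tiering(datasets, now, hot_days=30):
    """Map each data set name to the tier the policy would place it on."""
    return {name: choose_tier(ts, now, hot_days) for name, ts in datasets.items()}


# Usage: two hypothetical data sets with different last-access dates.
now = datetime(2024, 1, 31)
datasets = {
    "training-current": datetime(2024, 1, 25),
    "validation-2022": datetime(2023, 6, 1),
}
plan = plan_tiering(datasets, now)
```

The older data set moves to the cheaper tier but stays in the plan, which reflects the point of tiering over archiving: the data remains addressable when a new project needs it.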

Containers and Cloud Mobility

AI projects are scripted and incorporate different libraries and frameworks. Good AI development practices use containers as the development and deployment standard. Containers not only provide some version control, but they can also be deployed as sets of services that work together. Containers also provide relatively easy packaging to move AI applications, ingest or training to the public cloud or edge networks.

Using containers is convenient for the developers, but can be a challenge for storage and backup. Storage for AI should have support for the evolving Container Storage Interface (CSI) standard being developed in the open source community and supported by Red Hat OpenShift and others. The standard enables self-deployment, snapshot management and backup that integrates with Kubernetes.
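With a CSI driver installed, a containerised training job typically requests storage through a Kubernetes PersistentVolumeClaim. The sketch below builds such a claim as a Python dictionary and serialises it to JSON, a format Kubernetes accepts for manifests. The claim name, the storage class `ai-shared-file` and the size are hypothetical, and the `ReadWriteMany` access mode is an assumption that depends on what the chosen driver supports.

```python
import json


def pvc_manifest(name, storage_class, size):
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # Shared read-write access; assumes the CSI driver supports it.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical storage class backed by a CSI driver.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Usage: a claim for a shared training data volume (all values illustrative).
manifest = pvc_manifest("training-data", "ai-shared-file", "500Gi")
manifest_json = json.dumps(manifest, indent=2)
```

Generating the claim in code rather than by hand makes it easier for a data science team to provision storage themselves while the storage class keeps placement and protection policies under central control.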

You will also need a software-defined option that can be deployed in the cloud with the same management, data security and metadata tracking that is available on-premises. Flexing development or delivering edge services in the cloud should operate as an extension of your primary data storage, not a separate silo. Some solutions automatically place data into the cloud in the background and write it back to the main storage. This provides the agility the developers will want with the data protection the business requires.

Use Flash and Fast Networks

While storage performance is not enough on its own, it is still vital. Expensive systems and teams need readily accessible data on demand. A multi-GPU server can crunch through a GB of data per second. For large-scale AI projects, such as those used in research and autonomous driving, the throughput must also scale with the number of servers being used. Parallel filesystems are common in AI clusters because data can be distributed across servers, storage and network. The storage and network need to be sized appropriately for these workloads.
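The sizing arithmetic can be sketched as follows, taking the figure of roughly 1 GB per second per multi-GPU server as the starting point. The server count, the headroom factor and the data set size in the usage lines are hypothetical inputs chosen only to show the calculation.

```python
def required_throughput_gb_s(servers, per_server_gb_s=1.0, headroom=1.0):
    """Aggregate read throughput (GB/s) needed to keep all servers fed."""
    return servers * per_server_gb_s * headroom


def read_time_seconds(dataset_gb, throughput_gb_s):
    """Time to stream a data set once at the given aggregate throughput."""
    return dataset_gb / throughput_gb_s


# Usage: a hypothetical 16-server cluster reading a 10,000 GB data set.
cluster_gb_s = required_throughput_gb_s(16)
with_headroom_gb_s = required_throughput_gb_s(8, 1.0, 1.25)
one_pass_seconds = read_time_seconds(10_000, cluster_gb_s)
```

Because the requirement grows linearly with the number of servers, both the storage system and the network between it and the GPUs have to deliver the aggregate figure, not just the per-server one.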

It isn’t all bandwidth; the AI pipeline requires performance on mixed workloads. High throughput on large files may be required to ingest the data into the AI storage, while low latency and scalable metadata performance speed up data labelling and organisation. Inference, the use of trained AI models, needs low latency and touches only a relatively small amount of data. Understanding the AI data pipeline will help characterise the performance priorities needed.

We recommend a scalable tier of all-flash storage that is much larger than the expected working size of the typical data set. All-flash arrays should not deteriorate in performance as the filesystem becomes full, and should provide low latency and high throughput on both sequential and random file access.

Choosing storage for AI is similar to any project. Understanding your applications and workloads and planning for growth are fundamental, as is putting plans in place for data governance and data backup. Working with, and asking questions of, the data science team will guide you to choose storage strategies that raise their productivity, accelerate AI adoption across the organisation and provide the flexibility to handle future deployments.

Written by Vincent Hsu, VP, IBM Fellow, CTO for Storage and SDE at IBM
