Artificial intelligence (AI) is one of the fastest-growing enterprise technologies.
According to IBM, 42% of firms with more than 1,000 employees now use AI in their business. A further 40% are testing or experimenting with it.
Much of that innovation is being driven by generative AI (GenAI), or large language models (LLMs), such as ChatGPT. Increasingly, these forms of AI are being used in enterprise applications or via chatbots that interact with customers.
Most GenAI systems are, for now, cloud-based, but suppliers are working to make it easier to integrate LLMs with enterprise data.
LLMs and more “conventional” forms of AI and machine learning need significant compute and data storage resources, either on-premise or in the cloud.
Here, we look at some of the pressure points around data storage, as well as the need for compliance, during the training and operational phases of AI.
AI training puts big demands on storage I/O
AI models need to be trained before use. The better the training, the more reliable the model – and when it comes to model training, the more data the better.
“The critical aspect of any model is how good it is,” says Roy Illsley, chief analyst in the cloud and datacentre practice at Omdia. “This is an adaptation of the saying, ‘Poor data plus a perfect model equals poor prediction,’ which says it all. The data must be clean, reliable and accessible.”
As a result, the training phase is where AI projects put the most demand on IT infrastructure, including storage.
But there is no single storage architecture that supports AI. The type of storage will depend on the type of data.
For large language models, most training is done with unstructured data. This will usually be on file or object storage.
Meanwhile, financial models use structured data, where block storage is more common. Some AI projects will use all three types of storage: file, object and block.
Another factor is where the model training takes place. Ideally, data needs to be as close to the compute resources as possible.
For a cloud-based model, this makes cloud storage the typical choice. Bottlenecks in input/output (I/O) in a cloud infrastructure are less of a problem than latency suffered moving data to or from the cloud, and the hyperscale cloud providers now offer a range of high-performance storage options.
The reverse also applies. If data is on-premise, such as in a corporate database or enterprise resource planning system, it might make sense to use local compute to run the model. This allows AI developers more control over hardware configuration.
AI models make extensive use of graphics processing units (GPUs), which are expensive, so making storage keep pace with GPU demands is key. However, in some cases, central processing units are more likely to be a bottleneck than storage. It comes down to the type of model, the data it’s being trained on and the available infrastructure.
“It needs to be as efficient as possible,” says Patrick Smith, field chief technology officer for EMEA at Pure Storage. “That’s the bottom line. You need a balanced environment in terms of the capability and performance of GPUs, the network and the back-end storage.”
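The balance Smith describes can be roughed out with simple arithmetic. The following is a minimal sketch, not a sizing tool: the function names, the GPU count, the per-GPU sample rate and the sample size are all illustrative assumptions, and real figures depend on the model and the data pipeline.

```python
def required_read_throughput_gb_per_s(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read rate (GB/s) storage must sustain so the GPUs are not left idle."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def is_data_path_bottleneck(required_gb_per_s, storage_gb_per_s, network_gb_per_s):
    """The data path can only deliver as fast as its slowest link."""
    deliverable = min(storage_gb_per_s, network_gb_per_s)
    return deliverable < required_gb_per_s


# Hypothetical cluster: 8 GPUs, 2,000 samples/s each, 150 KB per sample.
need = required_read_throughput_gb_per_s(8, 2000, 150_000)

# Hypothetical back end: storage rated at 5 GB/s behind a 1.25 GB/s network link.
starved = is_data_path_bottleneck(need, storage_gb_per_s=5.0, network_gb_per_s=1.25)
```

In this made-up case, the storage array is fast enough on paper, but the network link is not, which is why the GPUs, network and storage need to be considered together.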
The way a business plans to use its AI model will also influence its choice of local or cloud storage. Where the training phase of AI is short-lived, cloud storage is likely to be the most cost-effective, and performance limitations less acute. The business can spin the storage down once the training is complete.
However, if data needs to be retained during the operational phase – for fine-tuning or ongoing training, or to deal with new data – then the on-demand advantages of the cloud are weakened.
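One way to reason about this trade-off is a simple break-even comparison. The sketch below is an assumption-laden illustration: the capacity, prices and cost structure are invented, currency units are arbitrary, and it ignores factors such as data egress charges and hardware refresh cycles.

```python
def cloud_cost(capacity_tb, price_per_tb_month, months):
    """Pay-as-you-go storage: cost grows with every month the data is retained."""
    return capacity_tb * price_per_tb_month * months


def on_prem_cost(upfront, running_per_month, months):
    """Owned storage: a one-off purchase plus monthly running costs."""
    return upfront + running_per_month * months


def break_even_month(capacity_tb, price_per_tb_month, upfront, running_per_month,
                     horizon_months=120):
    """First month in which cumulative cloud spend reaches on-premise spend, or None."""
    for month in range(1, horizon_months + 1):
        if (cloud_cost(capacity_tb, price_per_tb_month, month)
                >= on_prem_cost(upfront, running_per_month, month)):
            return month
    return None


# Hypothetical: 500 TB at 20 per TB per month in the cloud, versus
# 150,000 upfront and 2,500 a month to run equivalent storage locally.
months_to_break_even = break_even_month(500, 20, 150_000, 2_500)
```

With these invented figures, a training run that finishes well before the break-even month favours the cloud, while data retained beyond it for fine-tuning or ongoing training shifts the balance the other way.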
AI inference needs low latency
Once a model is trained, its demands on data storage should reduce. A production AI system runs user or customer queries through tuned algorithms, and these can be highly efficient.
“The model that results from AI training is generally small compared with the scale of compute resources used to train it, and it doesn’t demand too much storage,” says Christof Stührmann, director of cloud engineering at Taiga Cloud, part of Northern Data Group.
Nonetheless, the system still has data inputs and outputs. Users or applications submit queries to the model, and the model returns its results to them.
In this operational, or inference, phase, AI needs high-performance I/O to be effective. The volume of data required can be orders of magnitude smaller than for training, but the time taken to accept an input and return a result may need to be measured in milliseconds.
Key AI use cases such as cyber security and threat detection, IT process automation, biometric scanning for security, and image recognition in manufacturing all need rapid results.
Even in fields where GenAI is used to create chatbots that interact like humans, the system needs to be fast enough for responses to seem natural.
Again, it’s down to looking at the model, and what the AI system is looking to do. “Some applications will require very low latency,” says Illsley. “As such, the AI must be located as close to the user as possible and the data could be a very small part of the application. Other applications may be less sensitive to latency but involve large amounts of data, and so need to have the AI located near storage, with the capacity and performance needed.”
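Whether an inference service is "fast enough" is usually judged against a latency budget at a high percentile rather than on the average. The following is a minimal sketch using the nearest-rank percentile method; the sample latencies and the budgets are invented for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies_ms, budget_ms, pct=99):
    """True if the chosen percentile of observed latencies is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical end-to-end response times for ten queries, in milliseconds.
observed_ms = [12, 15, 11, 40, 13, 14, 12, 16, 13, 90]

median_ms = percentile(observed_ms, 50)
within_50ms = meets_budget(observed_ms, budget_ms=50)
```

Here, the typical response is quick, but a single slow outlier pushes the 99th percentile past a 50ms budget, which is the kind of tail behaviour that matters for latency-sensitive applications.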
Data management for AI
The third impact of AI on storage is the ongoing need to collect and process data.
For “conventional” AI and machine learning, data scientists want access to as much data as possible, on the basis that more data makes for a more accurate model.
This ties into the organisation’s wider approach to data and storage management. Considerations here include whether data is stored on flash or spinning disk, where archives are held, and what policies apply to retaining historic data.
AI training and the inference phase will draw data from across the organisation, potentially from multiple applications, human inputs and sensors.
AI developers have started to look at data fabrics as one way to “feed” AI systems, but performance can be an issue. It’s likely data fabrics will need to be built across different storage tiers to balance performance and cost.
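Balancing performance and cost across tiers comes down to a placement policy. The sketch below shows one very simple rule-based approach; the tier names, the thresholds and the dataset names are hypothetical, and real data fabric products apply their own, more sophisticated policies.

```python
def choose_tier(reads_per_day, days_since_last_access):
    """Place a dataset on a storage tier using illustrative access thresholds."""
    if reads_per_day >= 100:
        return "flash"      # hot data that feeds training or inference directly
    if days_since_last_access <= 90:
        return "disk"       # warm data that is still read occasionally
    return "archive"        # cold data kept mainly for retention


# Hypothetical datasets: (reads per day, days since last access).
datasets = {
    "sensor_feed": (5_000, 0),
    "crm_export": (3, 12),
    "logs_2019": (0, 900),
}

placement = {name: choose_tier(*stats) for name, stats in datasets.items()}
```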
For now, GenAI is less of a challenge, as LLMs are trained on internet data, but this will change as more firms look to use LLMs with their own data.
AI, data storage and compliance
Enterprises need to be sure their AI data is secure and kept in accordance with local laws and regulations.
This will influence where data is kept, with regulators becoming more concerned about data sovereignty. In cloud-based AI services, this brings the need to understand where data is stored during training and inference phases. Organisations also need to control how they store the model’s inputs and outputs.
This also applies to models that run on local systems, although existing data protection and compliance policies should cover most AI use cases.
Nonetheless, it pays to be cautious. “It is best practice to design what data goes into the training pool for AI learning, and to clearly define what data you want and don’t want retained in the model,” says Richard Watson-Bruhn, a data security expert at PA Consulting.
“When firms use a tool like ChatGPT, it can be absolutely fine for that data to be held in the cloud and transferred abroad, but contract terms need to be in place to govern this sharing.”