Author's note: For simplicity, machine learning, deep learning, and similar techniques are all referred to as artificial intelligence in this post.

While writing this article, I found myself reflecting on the state of IT when I started at NetApp in 2012. The world was a different place: GCE and Azure were fighting an uphill battle against incumbent AWS, we were talking about OpenStack for private cloud, and Siri was still in beta. Even though 2012 is an anagram of 2021, I think we can all agree that the entire world has been fundamentally reshaped in ways we could never have envisioned back then.

Amid this never-ending churn, I found myself drawn to identifying and solving the biggest, baddest storage problems I could find. There's no shortage of those, and the broad availability of artificial intelligence presents the latest opportunity for companies to reinvent themselves and make more efficient use of the goldmine of data they're sitting on.

Having been well entrenched in helping solve traditional high-performance computing challenges around storage, I already knew that BeeGFS solves big data problems. But what about AI specifically?

We can roughly divide organizations running 'AI workloads' into two broad categories:

  • Traditional high-performance computing (HPC). In this category, think national labs, research institutes, pretty much anyone who runs what would traditionally be thought of as a 'supercomputer.'
  • Enterprise HPC. This category has seen much more growth in recent years as AI becomes more prevalent and organizations find value in harnessing the power of their data, aided by these techniques.

Traditional HPC has been helping mature these techniques for decades, so it's valuable to examine the tried-and-true infrastructure they've used to achieve success. On the other hand, IT has been fundamentally reshaped over the last decade by the shift to cloud, and modern AI has largely been born in a cloud-native world. Success when developing infrastructure to support modern AI lies in finding the right combination of traditional and cloud-native technologies.

Enter parallel file systems: Versatile storage for AI and big data

For decades, parallel file systems have been crucial in building storage environments that are capable of keeping up with the most demanding supercomputers. From storing massive, mission-critical datasets, to providing high-speed scratch space, to storing the results of long-running and expensive-to-reproduce computations, parallel file systems are vital. Furthermore, the pricing model is typically designed for scale, helping to ensure that they fit within constrained research budgets. Sound anything like AI requirements?

In fact, parallel file systems are a good fit for use throughout an AI data pipeline, whether you're coming from an HPC or an enterprise background. Here are some examples:

  • Data lakes, especially as cost-effective storage behind big data platforms like Hadoop and Spark.
  • Scratch space for data preparation and preprocessing tasks with I/O patterns that are difficult to manage, such as supporting extract, transform, load (ETL) workflows.
  • Storing and retrieving training datasets, especially when you need to keep a large number of GPUs busy (see the sketch after this list).
  • High-speed ingest, inferencing, and real-time analytics where low latency and high bandwidth are critical.
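
To make the training-dataset point concrete, here's a minimal sketch of a PyTorch data loader streaming samples straight from a shared parallel file system mount. The mount path and file layout are hypothetical, and this assumes PyTorch is installed; nothing here is specific to BeeGFS, just to any POSIX namespace shared across GPU nodes.

    # A minimal sketch: streaming training samples straight from a shared
    # parallel file system mount (the /mnt/beegfs path is hypothetical).
    from pathlib import Path

    import torch
    from torch.utils.data import Dataset, DataLoader

    class SharedNamespaceDataset(Dataset):
        """Reads samples directly from the shared mount, no per-node copies."""

        def __init__(self, root="/mnt/beegfs/datasets/train"):
            # Every GPU node sees the same POSIX namespace, so a plain
            # directory listing is all the "catalog" we need.
            self.files = sorted(Path(root).glob("*.pt"))

        def __len__(self):
            return len(self.files)

        def __getitem__(self, idx):
            # torch.load pulls the sample over the parallel file system;
            # striping spreads these reads across storage nodes.
            return torch.load(self.files[idx])

    # Many worker processes issue reads in parallel, which is exactly the
    # concurrent I/O pattern a parallel file system is built to absorb.
    loader = DataLoader(SharedNamespaceDataset(), batch_size=64,
                        num_workers=8, pin_memory=True)

The design choice worth noticing: the "dataset" is just files at a path, so scaling from one GPU node to a hundred changes nothing in the code.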

Clearly, parallel file systems are a powerful tool that can relieve storage pain points. But one of the biggest benefits may not be immediately obvious. Many organizations start small with AI pilots and proofs of concept; understandably, they're hesitant to invest heavily until these approaches have proven their worth. The challenge comes when those projects take off and it's suddenly necessary to scale. Scaling compute and GPU resources is a lot easier than scaling storage, so you need to start with a storage solution that allows your data and consumption to grow in all directions without limits.

We'll get deeper into this subject in the next section. To sum up the discussion so far, parallel file systems are a good choice to solve AI data challenges because they offer a flexible solution to a problem that has a lot of unknowns.

BeeGFS explained: Why BeeGFS makes sense for AI

Some parallel file systems have traditionally been seen as complex to deploy and manage, but BeeGFS has been designed from the beginning for flexibility and simplicity. That, along with its cost-effective performance, is why we choose to offer support alongside the NetApp® E-Series storage systems.

Shameless plugs aside, why does BeeGFS specifically make sense for AI? Let me paint you a picture.

You want to provide space for your end users (possibly data scientists in this context) to store a bunch of arbitrary data, somewhere. We'll call that somewhere a storage namespace. Ideally when users are ready to train, they don't have to copy this data to each of the GPU nodes. Copying data adds extra time to the workflow, and the dataset may exceed the node's internal storage capacity anyway. So the storage namespace must be able to serve the same files to many GPU nodes, regardless of whether the files are very large or very small.
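
As a rough sketch of what a shared namespace buys you: with every GPU node mounting the same files, "distributing the dataset" reduces to each node reading its own slice of a shared directory. The mount path below is hypothetical, and the RANK/WORLD_SIZE environment variables follow the convention used by common distributed launchers such as torchrun.

    # A sketch of why a shared namespace avoids data copies: every node
    # mounts the same files and simply reads its own slice.
    import os
    from pathlib import Path

    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    # One listing of the shared directory; no rsync/scp to local disks.
    all_files = sorted(Path("/mnt/beegfs/datasets/train").glob("*"))

    # Each node takes every world_size-th file: same namespace, disjoint work.
    my_shard = all_files[rank::world_size]
    print(f"node {rank} will read {len(my_shard)} of {len(all_files)} files")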

The design of some storage solutions requires serving the same file from a single storage node. That quickly becomes painful when dealing with tens, hundreds, or even thousands of compute or GPU nodes reading the same file, especially if it's large. To avoid a single overworked piece of hardware, your ideal storage namespace should stripe each file across multiple storage nodes. But what about the other end of the spectrum, where users need to work with a large number of small files? Ideally the storage namespace stores information about files and directories (metadata) separate from file contents, so looking up a bunch of little files doesn't place extra strain on the storage nodes designed to stream the distributed file contents. To avoid bottlenecks, this metadata should also be distributed across multiple nodes. This distribution also needs to be intelligent enough that one node doesn't end up owning the entirety of a massive directory tree, defeating the purpose of a 'distributed file system.'
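
To make the striping idea concrete, here's a toy Python model of round-robin chunk placement. The chunk size and target names are illustrative values, not BeeGFS defaults, and this models only the placement logic, not any actual on-disk format.

    # A toy model of round-robin striping: which storage target serves each
    # chunk of a file. Values here are illustrative, not BeeGFS defaults.
    def chunk_placement(file_size, chunk_size=1 << 20,
                        targets=("t1", "t2", "t3", "t4")):
        """Map each chunk of a file to a storage target, round-robin."""
        placement = []
        for chunk_idx in range((file_size + chunk_size - 1) // chunk_size):
            placement.append((chunk_idx, targets[chunk_idx % len(targets)]))
        return placement

    # A 10 MiB file fans out across all four targets, so a thousand clients
    # reading it hit four servers instead of hammering one.
    for idx, target in chunk_placement(10 * (1 << 20)):
        print(f"chunk {idx:2d} -> {target}")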

These sorts of challenges are essentially what parallel file systems were designed to overcome, and BeeGFS solves them all elegantly. The design of BeeGFS delivers a storage namespace that can flexibly adapt to meet evolving AI storage requirements. As your compute and GPU footprint grows, you can scale the performance and overall capacity of your storage namespace to match. If concurrent file access becomes a bottleneck, or if the number of files and directories you need in the namespace increases, each dimension can be expanded independently.

In particular, BeeGFS excels at data processing use cases: for example, taking one large image file and breaking it into hundreds or thousands of smaller files. Because BeeGFS is POSIX compliant, the resulting files are natively accessible to machine learning and deep learning frameworks, without additional data movement or expensive code changes. In short, BeeGFS provides the flexibility that is key to keeping your storage namespace right-sized for your uniquely evolving AI data requirements, while keeping that data accessible without extra work.
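
Here's a sketch of that tiling workflow using Pillow. The paths are hypothetical; the point is that the large source image and the thousands of output tiles live in the same POSIX namespace, so downstream frameworks open the tiles with ordinary file paths, no copy step or custom client library required.

    # A sketch of the tiling workflow: split one large image into many
    # small files in the same shared namespace (paths are hypothetical).
    from pathlib import Path
    from PIL import Image

    def tile_image(src, out_dir, tile=512):
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        img = Image.open(src)
        for top in range(0, img.height, tile):
            for left in range(0, img.width, tile):
                box = (left, top, min(left + tile, img.width),
                       min(top + tile, img.height))
                # Each tile is an ordinary file any framework can open.
                img.crop(box).save(out / f"tile_{top}_{left}.png")

    tile_image("/mnt/beegfs/raw/slide_0001.tif",
               "/mnt/beegfs/processed/slide_0001")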

Conclusions

I was thinking about calling this article 'Why parallel file systems like BeeGFS go beyond HPC and should be regularly considered for AI in enterprise.' But my editors say that concise titles are better for SEO. However, I do hope that is your main takeaway from this blog post. I see the characteristics of parallel file systems like BeeGFS continuing to make them valuable tools as organizations tackle AI initiatives. And I'd hate to see that value overlooked in a sea of flashy alternatives with exaggerated claims to greatness. Of course, to make BeeGFS fit in a cloud-native world, it needs to support cloud-native platforms like Kubernetes. Stay tuned for future developments.

Let me close by trying to sum up this blog post in a sentence: 'BeeGFS is a good fit for AI anywhere you need to meet or exceed the speed of NAS, be able to scale like object storage, and want to spend more on GPUs than on storage.'

I'd love to hear about your challenges around data. If you want to continue the conversation, drop me a line at joe.mccormick@netapp.com.
