Among the dizzying array of launches at this year’s AWS re:Invent conference, two in particular have stood out as being potential game-changers for HPC on AWS: FSx for Lustre and Elastic Fabric Adapter.
At BioTeam, we’ve been building cloud-based HPC systems for life sciences for a decade, and we’ve seen architectures evolve from a loose federation of EC2 instances sharing an NFS volume to clusters of thousands of cores running high-performance parallel filesystems. As workloads have grown, customers have demanded higher performance out of every component of the stack – CPU, memory, networking and shared storage. While AWS has been driving hard for a long time on instance CPU and memory performance (most recently with the C5 instance family), they have historically not invested as heavily in low-latency networking and shared storage. This has made it hard for users to easily stand up an on-demand cluster with performance comparable to equivalent on-premises hardware. With this week’s announcements, however, AWS seems to be acknowledging this need by making significant investments in high-performance parallel filesystems through FSx and low-latency interconnects through EFA.
FSx for Lustre
In the past, if you wanted an efficient shared filesystem on AWS, you had to roll your own. This, of course, meant a huge investment in infrastructure architecture, and the system was only as resilient as you built it. AWS initially acknowledged this gap by developing EFS, with a straightforward NFS interface and pay-by-the-byte pricing. While EFS showed that Amazon’s mindset was changing, it became obvious fairly quickly that EFS’s performance characteristics were not what was needed to drive substantial HPC-oriented workloads. The “burst credit” system that EFS uses to accommodate short-term workloads can easily get swamped by bulk parallel operations, and performance becomes a tradeoff between bulk throughput (megabytes per second) and latency (time to respond to a request). Bulk throughput was also determined by the size of the filesystem (which scaled with the amount of data stored), leading to bizarre recommendations like “just dd a bunch of junk data to your EFS filesystem to get better performance.”
While EFS works well for many workloads, AWS has recognized for many years that customers sometimes prefer to use a third-party product in a managed fashion rather than using yet another AWS-custom product. Therefore, this year they announced both FSx for Windows and FSx for Lustre. The Lustre product has some pretty compelling top-line statistics: throughput of hundreds of GB/s, millions of IOPS and sub-millisecond latency, all at $0.14/GB-month — cheaper than EFS.
AWS has released a straightforward workshop that walks through creating and using an FSx Lustre filesystem, as well as generating some benchmarks. BioTeam was one of the first public customers to participate in this workshop, and we learned some important things you should know about when and how to use FSx Lustre.
How it works
Creating a filesystem is as straightforward as you would expect – through the web console or the AWS CLI. You specify how big you want the filesystem to be, and what VPC, AZ and subnet it should be attached to. Optionally, you can connect it to an S3 bucket, and it will scan that bucket for objects and load the metadata for those objects into your new filesystem. Loading the actual data from S3 happens lazily (on first access), or can be triggered using Lustre’s HSM operations.
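For illustration, here is a minimal sketch of the creation call through boto3, assuming the launch-day shape of the FSx API; the region, subnet, security group and bucket names are placeholders, and the current documentation should be checked for required fields.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")  # placeholder region

# Create a 3.6 TB Lustre filesystem in a specific subnet, importing object
# metadata from an existing S3 bucket (IDs and bucket name are placeholders).
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=3600,                        # in GB; the minimum size at launch
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    LustreConfiguration={
        "ImportPath": "s3://my-reference-data-bucket",  # metadata loaded at creation, data loaded lazily
        "ExportPath": "s3://my-reference-data-bucket",  # where HSM write-backs will land
    },
)

# The DNS name is what clients will mount.
print(response["FileSystem"]["DNSName"])
```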
Once the filesystem is created, you mount it using a standard Lustre client, pointing to a DNS name provided by FSx. The instances powering the filesystem aren’t visible in your VPC. Data within the filesystem is not replicated, but you can perform an HSM operation to write the data back to the S3 bucket if one was configured at creation time.
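On the client side, the steps look roughly like the sketch below (wrapped in Python to match the other examples). The DNS name and file paths are placeholders, the Lustre client packages are assumed to be installed already, and the `/fsx` mount name reflects what we saw during the workshop and may change.

```python
import subprocess

FSX_DNS = "fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"  # placeholder; taken from the FSx console/API
MOUNT_POINT = "/fsx"

# Mount the filesystem with the standard Lustre client.
subprocess.run(["sudo", "mkdir", "-p", MOUNT_POINT], check=True)
subprocess.run(
    ["sudo", "mount", "-t", "lustre", f"{FSX_DNS}@tcp:/fsx", MOUNT_POINT],
    check=True,
)

# Push a finished result back to the linked S3 bucket via Lustre HSM,
# then check the state of the archive request (file path is a placeholder).
result_file = f"{MOUNT_POINT}/results/output.bam"
subprocess.run(["sudo", "lfs", "hsm_archive", result_file], check=True)
subprocess.run(["sudo", "lfs", "hsm_state", result_file], check=True)
```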
What works well
Performance is fast, as you would expect. BioTeam is planning to do a more in-depth characterization of the system, but an initial test with a 3.6 TB volume against a single c5.2xlarge instance showed at least 5,000 IOPS and 15,000 file stats per second, and was able to easily saturate the provisioned filesystem throughput at 600 MB/s. This filesystem should perform very well for HPC cluster shared-scratch use cases.
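If you want a quick sanity check of a freshly mounted filesystem yourself, a rough probe along the lines of the sketch below is enough to see whether throughput and metadata rates are in the right ballpark. The mount point and file counts are arbitrary, and a single client writing a single stream will not come close to the aggregate numbers a parallel benchmark (fio, mdtest or IOR across many nodes) would show.

```python
import os
import time

TEST_DIR = "/fsx/benchtest"              # assumes the filesystem is mounted at /fsx
os.makedirs(TEST_DIR, exist_ok=True)

# Sequential write throughput: stream 1 GiB in 1 MiB chunks and fsync at the end.
chunk = b"\0" * (1 << 20)
start = time.time()
with open(os.path.join(TEST_DIR, "bigfile"), "wb") as f:
    for _ in range(1024):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())
print(f"write throughput: {1024 / (time.time() - start):.0f} MiB/s")

# Metadata rate: create 10,000 small files, then stat them.
# Note: client-side caching can inflate the stat rate; for a more honest
# number, stat the files from a different node than the one that created them.
for i in range(10_000):
    open(os.path.join(TEST_DIR, f"small_{i}"), "w").close()
start = time.time()
for i in range(10_000):
    os.stat(os.path.join(TEST_DIR, f"small_{i}"))
print(f"stat rate: {10_000 / (time.time() - start):.0f} stats/s")
```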
The S3 integration, particularly for source data, was also seamless, making it perfect for loading reference and input data sets such as genomes, images and clinical observations. Data loaded quickly on demand from S3, and read performance of data pulled from S3 (on the second request) was the same as for newly written data. The output mechanism for FSx to write back to S3 is designed to support multiple filesystems connecting to the same bucket, so a multi-cluster strategy, with each cluster having its own filesystem, should work reasonably well.
If you have data already on an on-premises filesystem (Lustre or NAS) that you’d like to make available to your cloud cluster through FSx, the best path is AWS’s new DataSync service, which can transfer data between on-premises filesystems and S3. Once your data is in S3, simply specify the desired bucket at FSx launch time to give your cloud cluster high-performance shared access to it.
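As a sketch of that pipeline, the DataSync transfer itself can be scripted with boto3 along the lines below; the agent ARN, NFS hostname, bucket and IAM role are all placeholders, and the exact parameters should be checked against the DataSync documentation.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")  # placeholder region

# Source: an on-premises NFS export, reached through a deployed DataSync agent.
src = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",
    Subdirectory="/export/projects",
    OnPremConfig={"AgentArns": [
        "arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"
    ]},
)

# Destination: the S3 bucket that will later back the FSx for Lustre filesystem.
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-reference-data-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Create the transfer task and kick it off.
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="nas-to-s3-for-fsx",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```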
What you should watch out for
The most important thing to note is that FSx Lustre is designed for short-lived scratch space, not for home directories or other permanent use cases. There are several “gotchas” in the current implementation that make this true:
- You pay for provisioned capacity, not utilized capacity. While we mentioned that the cost per GB-month of FSx Lustre is lower than EFS, in practice your EFS costs will probably be lower, simply because you must always over-provision your FSx filesystem. The smallest FSx Lustre filesystem (3.6 TB) will still cost over $500/month (3,600 GB × $0.14/GB-month ≈ $504).
- While you can push output data from FSx Lustre to S3 via Lustre HSM calls, the data will appear in a subfolder within your main S3 bucket. This means that if you try to save costs by flushing your filesystem to S3, tearing it down and then launching it again later, your data will not be in the same place you previously left it! While this might be an annoyance for a per-project workflow, it is untenable for home directories.
- Finally, even if you’re willing to pay the ongoing cost and never shut down your filesystem, none of the data is replicated for resiliency, so there is a risk of data loss if the filesystem fails. Periodically flushing the data to S3 would mitigate this, but given the point above, bringing the filesystem back after a failure would be quite frustrating.
There are some additional good-to-have features that don’t yet exist, which you should be aware of:
- Encryption is included at rest, but not in flight. This may be a limitation within Lustre itself, and is something to consider, particularly for sensitive HIPAA/PII data.
- You can’t have multiple source S3 buckets, you can’t use different buckets for input and output, and objects added to (or removed from) the bucket after the filesystem has been created are not reflected in the filesystem. This is particularly relevant if your workflow involves continually adding objects to S3: only the objects that existed when the filesystem was created will show up in the filesystem metadata.
- CloudFormation-based creation of filesystems is not currently supported, but you can expect that to come fairly soon.
Finally, it appears that the biggest filesystem that can currently be created is around 25 TB. This has implications for the size of workflows that will operate with FSx.
Elastic Fabric Adapter (EFA)
The second major HPC-oriented announcement from AWS this week was the Elastic Fabric Adapter, which promises to deliver super-low-latency communication between instances in an HPC cluster. This capability has long been demanded by customers running MPI workloads, where cross-node communication is critical. In the life sciences industry, this is most common for molecular dynamics simulations, parallel BLAST and NONMEM, but other large-scale simulations such as weather modeling and computational fluid dynamics can also benefit strongly from a very fast interconnect between cluster nodes. EFA promises an “operating system bypass” that gives applications as direct access as possible to the low-level EC2 networking hardware.
This latest feature has been a bit more difficult to get information on, as it is currently in limited preview. However, through conversations with AWS technical staff, we were able to learn some high-level points that are important to share.
- Essentially, EFA gives your MPI library (through libfabric) direct, lower-level access to the interconnect that powers EC2’s Nitro-based instances. You should not need to patch your preferred MPI library if it already works with libfabric, so Open MPI and MPICH should work out of the box (the sketch after this list shows the kind of latency-sensitive communication pattern these workloads depend on).
- It is currently only supported on a very limited set of instance types: c5n.9xlarge, c5n.18xlarge, and p3dn.24xlarge.
- There is not a separate interconnect dedicated to EFA – it is utilizing the bandwidth otherwise available to normal TCP/IP traffic.
- It is definitely not InfiniBand.
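We can’t show EFA-specific setup yet, but the workloads EFA targets are dominated by small-message latency, and a classic ping-pong microbenchmark is the first thing we plan to run once we get access. The sketch below is a generic example of that pattern using mpi4py (our assumption, for consistency with the other examples; any libfabric-backed MPI plus a benchmark suite such as the OSU micro-benchmarks would do just as well).

```python
# Run across two instances with something like:
#   mpirun -n 2 -hostfile hosts python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = np.zeros(8, dtype=np.uint8)   # tiny message, to expose latency rather than bandwidth
iters = 10_000

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    elif rank == 1:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is one round trip, so halve it for the one-way latency.
    print(f"average one-way latency: {elapsed / iters / 2 * 1e6:.2f} microseconds")
```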
As for the remaining details (how to use it, what performance characteristics to expect, any other gotchas), those remain to be seen as we request early access and perform some benchmarks. We will keep you posted!