What Is the Holy Grail of HPC Disaster Recovery?
Disaster Recovery (DR) is a vital process to ensure the rapid recovery of an organisation’s applications, data and hardware that are critical to operations in the event of a natural disaster, network or hardware failure or human error. In this article, we explore how the public cloud is making ideal DR a reality.
Although DR has been around since the birth of computing, interest in applying it to research computing first spiked around ten years ago. With the advent and availability of the public cloud, interest in DR has been reignited and many more organisations involved in research computing are awakening to the possibilities. The need for DR in high performance computing (HPC) has emerged alongside the increased role of Chief Information Officers in academia and research organisations, who have recognised the value of HPC and the vital part it plays in keeping their organisations afloat.
It is well known that the top five percent of commercial organisations using HPC invest heavily in DR and are paying high prices for a cold cluster that sits in an offsite data center waiting to be used on that rare occasion that a disaster might occur. Not only does this approach incur huge overheads, it also presents the challenge of how to make sure the cold cluster stays up to date and the data is always available.
Public cloud opens new possibilities
With the availability of the public cloud, overheads can be significantly reduced because organisations don’t have to buy all their hardware upfront; potentially, you could have something that looks a lot like your cluster sitting in the public cloud.
In the past, research institutions have used software-as-a-service offerings from major public cloud players, which provide a whole range of applications running on their cloud. This lets your users run the same applications in a disaster recovery scenario.
However, this approach is limited: you may have the same applications, but you need the exact version of each application for your specific cluster. For example, a Computational Fluid Dynamics company may be assured that ANSYS or OpenFOAM is running on the cloud, but is it their actual version of ANSYS? Does it have the exact libraries it needs? Does it have the same environment? The last thing you want is to introduce an extra layer of variability during what is likely to be a difficult period if a disaster occurs.
The holy grail of HPC DR is having an exact mirror of your HPC installation available to you if needed, so that the DR cluster sitting on the public cloud is exactly the same environment as your live cluster, with the same libraries and the specific versions of the applications from which your results are derived under normal circumstances.
Organisations need to carefully examine what they want from a DR service. They need to ask themselves: is it a replacement service, is it an exact like-for-like copy, or is it meant to recover from a disaster instantly? These decisions will have a massive impact on the cost of putting together a disaster recovery plan.
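To make the idea of an exact mirror a little more concrete, here is a minimal sketch of how drift between the live cluster and its cloud copy might be detected. It assumes each environment can be summarised as a simple mapping of package name to version; the function name `diff_environments` and the package versions shown are purely illustrative and do not come from any particular tool or site.

```python
def diff_environments(live, mirror):
    """Compare two {package: version} manifests and report any drift.

    Both arguments are hypothetical manifests; how they are collected
    (module listings, package managers, container images) is site-specific.
    """
    shared = set(live) & set(mirror)
    return {
        # On the live cluster but absent from the DR mirror.
        "missing": sorted(set(live) - set(mirror)),
        # On the DR mirror but not on the live cluster.
        "extra": sorted(set(mirror) - set(live)),
        # Present on both, but at different versions: (live, mirror).
        "mismatched": {
            name: (live[name], mirror[name])
            for name in sorted(shared)
            if live[name] != mirror[name]
        },
    }


# Illustrative manifests only; substitute your own inventory.
live_cluster = {"openfoam": "v2306", "openmpi": "4.1.5", "gcc": "12.2.0"}
dr_cluster = {"openfoam": "v2212", "openmpi": "4.1.5"}

report = diff_environments(live_cluster, dr_cluster)
print(report)
```

An empty report would suggest the mirror matches the live cluster at the level of detail captured in the manifests; anything listed is a source of the variability you want to avoid in a disaster.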
Managing data storage in a disaster
A major challenge for HPC DR in the cloud is data storage. A lot of our customers want to shape their public cloud strategy and leverage public cloud storage as a tier of their storage infrastructure, together with a DR strategy.
If you already have a public cloud strategy, it isn’t too hard to make sure you have the right data sets available in the right circumstances. However, it becomes more challenging if you don’t. It is important to consult an experienced systems integrator who can establish that your data can move to the cloud in an emergency or, more appropriately, ensure the right data is periodically fed out to the public cloud so that, when the emergency hits, your data is already there.
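As a rough illustration of feeding the right data out periodically, the sketch below picks which data sets are due for a copy to the cloud. It is an assumption-laden example rather than a recommended design: the `DataSet` fields, the numeric priority scheme and the `plan_sync` function are all hypothetical, and the actual transfer mechanism is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str
    priority: int        # hypothetical scheme: 1 = most critical
    modified_at: float   # last change, seconds since the epoch
    synced_at: float     # last successful copy to the cloud, 0.0 if never


def plan_sync(data_sets, max_priority):
    """Return names of data sets that should be copied in this cycle.

    A data set qualifies if it is important enough (priority at or below
    max_priority) and has changed since it was last copied. The most
    critical come first; ties go to whichever was synced longest ago.
    """
    stale = [
        d for d in data_sets
        if d.priority <= max_priority and d.modified_at > d.synced_at
    ]
    stale.sort(key=lambda d: (d.priority, d.synced_at))
    return [d.name for d in stale]


# Illustrative inventory only.
inventory = [
    DataSet("results", priority=1, modified_at=200.0, synced_at=100.0),
    DataSet("scratch", priority=3, modified_at=300.0, synced_at=0.0),
    DataSet("inputs", priority=2, modified_at=50.0, synced_at=100.0),
    DataSet("home", priority=2, modified_at=150.0, synced_at=100.0),
]

to_copy = plan_sync(inventory, max_priority=2)
print(to_copy)
```

Run on a schedule, a plan like this keeps the priority data sets current in the cloud without paying to copy low-value data such as scratch space.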
Benefits of mirroring
Replicating your existing HPC infrastructure in the public cloud goes beyond supporting DR. It can provide multiple copies of your HPC system on the public cloud for different requirements, whether DR, bursting capacity during planned downtime, testing, development or expansion. For example, if a new technology or piece of software needs to be trialled, it could be tested on the ‘mirror’ (the HPC infrastructure on the public cloud); if it runs successfully, it could then be introduced to the on-premises HPC system.
Increased appetite for DR in HPC
Over the past five years, the costs of public cloud have become a lot more palatable, making it accessible to more organisations. Over this time frame, the types of products being offered by public cloud platforms have also greatly expanded. Amazon Glacier, for example, is now a commonplace archival storage service and is widely regarded as one of the cheapest ways to store data in the public cloud, so people are now looking to take advantage of it.
HPC disaster recovery can be a reality now. Replicating your existing HPC infrastructure is something everyone running high performance or research computing should be looking into, because data, and the ability to work with that data, is what keeps an organisation running. There are always concerns about cost, but it is important to consider what you really need in the event of a disaster. Identifying requirements, priority workloads and priority data sets is central to making sure that you have the appropriate DR in place, particularly for HPC.
Mahesh Pancholi is Research Computing Specialist at high performance compute, storage and data analytics integrator, OCF.