What Does a Big-Storage Future Look Like in HPC?
Article Nov 20, 2018 | by Laurence Horrocks-Barlow, Lead Storage Consultant at OCF
Earlier this year Samsung announced its 30.72TB drives, positioning them as enterprise SSDs that, along with their huge capacity, offer around four times the read and three times the write performance of its consumer SSDs. But at a price of between $10,000 and $20,000, who would actually use them?
Is bigger better in storage?
Clearly these drives are targeted at those organizations with pretty significant budgets, so how do you continue to take best advantage of the largest capacity drives on a budget?
Bigger is better in the storage industry; we always want more of it. Many organizations will choose larger drives and usually rely on traditional hard disk drives because of cost. The alternative, SSDs, are costly and have typically offered less capacity.
Whilst it’s great having 30TB worth of capacity, something that is often ignored is how this amount of storage will impact performance. If a customer requests a certain amount of storage capacity, but also needs performance above a particular rate, you have to consider that most traditional hard drives peak at around 300MB/s. As you put bigger and bigger drives into a system, you reduce the number of drives required to meet the capacity target. With fewer drives sharing the load, the aggregate performance of the system drops, so you can end up buying more capacity than required just to attain the desired performance figure.
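The trade-off between capacity and throughput can be sketched with some simple arithmetic. This is a minimal illustration, not a sizing tool: it assumes throughput scales linearly with drive count, ignores RAID overhead, and uses the roughly 300MB/s per-drive figure quoted above. The function name and the example targets (1,200TB and 30,000MB/s) are hypothetical.

```python
import math

def drives_needed(capacity_tb, throughput_mb_s, drive_tb, drive_mb_s=300):
    """Return (drives to buy, drives for capacity, drives for throughput).

    Assumes aggregate throughput is simply drive count times per-drive
    throughput, and ignores parity or spare overhead (both are assumptions).
    """
    for_capacity = math.ceil(capacity_tb / drive_tb)
    for_throughput = math.ceil(throughput_mb_s / drive_mb_s)
    # Whichever requirement needs more drives decides the purchase.
    return max(for_capacity, for_throughput), for_capacity, for_throughput

# Hypothetical target: 1,200TB usable and 30,000MB/s aggregate.
print(drives_needed(1200, 30000, drive_tb=4))   # (300, 300, 100)
print(drives_needed(1200, 30000, drive_tb=30))  # (100, 40, 100)
```

Under these assumptions, 4TB drives are sized by capacity and comfortably exceed the throughput target, whereas 30TB drives are sized by throughput: 100 drives are needed for performance, which means installing 3,000TB to satisfy a 1,200TB requirement.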
Often people fail to acknowledge that the larger the drive becomes, the more data is potentially at risk should the drive fail. This is the same whether using SSD, tape or hard disk drives. With a failure, you could potentially lose all the data on that particular drive.
Challenge of recovery time
Traditional RAID (redundant array of independent disks) technologies haven’t really moved on since the 1980s when they were first developed. RAID storage uses multiple disks in order to provide fault tolerance, to improve overall performance, and to increase storage capacity in a system. It protects an organization from hardware failure, in particular hard drive or SSD failure. This is in contrast with older storage devices that used only a single disk drive to store data. Many industries still use RAID 6, which allows for two disk failures within the RAID set before any data is lost. However, because of failure rates and rebuild times, you are limited in the number of drives you can place in a RAID group, and you are also limited by the speed of those drives when rebuilding the missing drive and its data.
As the capacity of drives continues to grow at an exponential rate, rebuilds will take much longer. It already takes days to rebuild a drive at today's capacities, so with drives of around 30TB, it could take over a week to reconstruct a failed drive. Such a long recovery time increases the risk that another drive in the RAID group fails before the rebuild completes.
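A rough estimate shows why rebuild time grows with capacity. The sketch below assumes the rebuild is limited by a single sustained rate onto the replacement drive; the rates used are illustrative assumptions, since a production array usually throttles rebuilds to keep serving user I/O.

```python
def rebuild_days(drive_tb, rebuild_mb_s):
    """Days needed to rewrite a whole drive at a sustained rebuild rate.

    Uses decimal units (1TB = 1,000,000MB). The rebuild rate is an
    assumption supplied by the caller, not a measured figure.
    """
    seconds = drive_tb * 1_000_000 / rebuild_mb_s
    return seconds / 86_400  # seconds per day

# Best case: the full ~300MB/s of a hard drive is devoted to the rebuild.
print(round(rebuild_days(30, 300), 1))  # 1.2
# Assumed throttled rate of 50MB/s while the array stays in service.
print(round(rebuild_days(30, 50), 1))   # 6.9
```

Even the best case for a 30TB drive is more than a day, and any realistic throttling pushes the figure towards a week, which is the window in which a second or third failure in the same group becomes a concern.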
These challenges started to be addressed a few years ago in HPC and the cloud, where, rather than using traditional RAID, organizations use de-clustered arrays, which place many more drives into the same pool and distribute data more widely across them. This lessens the impact of a drive failure, meaning only a proportion of the data is affected rather than all of it. It also allows part of the missing data to be re-built before a drive fails completely, and lets all drives participate in the reconstruction after a single drive failure.
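The effect of spreading stripes across a larger pool can be illustrated with a toy placement model. This is a sketch under stated assumptions, not how any vendor's product lays out data: stripes are placed on randomly chosen drives, and the function names, pool sizes, and stripe width are all hypothetical.

```python
import random

def place_stripes(num_drives, stripe_width, num_stripes, seed=0):
    """Place each stripe on `stripe_width` distinct, randomly chosen drives."""
    rng = random.Random(seed)
    return [rng.sample(range(num_drives), stripe_width)
            for _ in range(num_stripes)]

def failure_impact(stripes, failed_drive):
    """Return (fraction of stripes touched by the failure,
    number of surviving drives that hold data needed for the rebuild)."""
    affected = [s for s in stripes if failed_drive in s]
    helpers = {d for s in affected for d in s if d != failed_drive}
    return len(affected) / len(stripes), len(helpers)

# Traditional group: 10 drives, every stripe spans all 10 of them.
traditional = place_stripes(10, 10, 10_000)
print(failure_impact(traditional, failed_drive=0))  # (1.0, 9)

# De-clustered pool: the same stripe width spread over 100 drives.
declustered = place_stripes(100, 10, 10_000)
fraction, helpers = failure_impact(declustered, failed_drive=0)
print(fraction, helpers)  # roughly a tenth of stripes; nearly all other drives help
```

In the traditional layout every stripe is degraded and only nine drives can supply data for the rebuild. In the pooled layout only about one stripe in ten is touched, and the work of reconstructing them is shared across almost the whole pool, which is where the shorter rebuild times come from.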
Convergence of compute and storage
Another noticeable difference in how storage systems are being created and utilized is the convergence of compute and storage. With the availability of fast network interconnects, such as InfiniBand, and the advent of 100 Gigabit Ethernet, it has become possible to populate individual compute nodes with large capacity drives and have them participate in the storage subsystem. This allows for practically linear scaling of storage and performance each time you scale your compute.
In traditional high-performance computing (HPC), this newer approach hasn’t quite caught on yet, and storage and compute are still built as separate system elements. Cloud platforms, by contrast, are becoming more converged in their use of technologies, so the failure of an individual component becomes less of an issue. When running HPC on premises, individual components matter more, particularly in storage systems that use traditional RAID. Using 30TB drives and significantly increasing capacity will drive the HPC market to look more at de-clustered arrays, to allow faster re-build times in the event of a drive failure.
We’ve seen this recently with IBM, Lenovo and NetApp all offering their own de-clustered array products. This will be a more realistic option for organizations looking for larger capacity on a budget.