How South Africa CHPC Responded to Unprecedented Computing Needs To Address the COVID-19 Pandemic
The South Africa Centre for High Performance Computing (CHPC) has joined the ranks of HPC centers that are converging their supercomputing and cloud infrastructures. The 1.3 petaFLOPS Lengau supercomputer has been the fastest machine on the continent. Adding its OpenStack Production Cloud in early 2020 let users begin taking advantage of orchestrated general-purpose computing and storage resources. But when South Africa went into lockdown on March 26, 2020, due to COVID-19, the country’s computing needs vastly outgrew what Lengau and the OpenStack Production Cloud could provide.
Supercomputing at CHPC
As a key center for large-scale computing in Africa, South Africa CHPC supports both academic and industry research, and most recently was part of the effort to identify the South African variant of SARS-CoV-2. Installed in 2016, the Lengau cluster and its Lustre parallel file system have been used on several flagship projects requiring supercomputing-level resources, including advanced weather modeling, energy storage materials, and the MeerKAT array. It has also contributed resources to commercial projects through the Southern African Development Community (SADC) and in other African countries, including Ghana and Kenya. In 2017, CHPC joined the Square Kilometre Array (SKA) project to provide computational capacity for the SKA’s Science Data Processor (SDP). A large part of the SKA is being built in South Africa.
Artist's impression of the 5km diameter central core of Square Kilometre Array (SKA) antennas (Author: SPDO/TDP/DRAO/Swinburne Astronomy Productions, courtesy of the University of South Africa).
Since Lengau’s installation, a growing number of research and industry users have reshaped CHPC’s computing landscape, prompting a fresh look at its infrastructure.
“In addition to supercomputing, researchers also needed non-HPC, general purpose computing support,” said Dora Thobye, technical manager for HPC resources. “They wanted to store their data remotely, so they needed a more typical processing and storage environment rather than Lengau and the Lustre parallel file system.”
CHPC began meeting those needs with a virtualized environment built on VMware virtual machines, while still using Lustre. But growing demand congested the Lustre file system, slowing Lengau performance by 30 to 40 percent, according to CHPC. Addressing the storage challenges took CHPC’s architects in a different direction.
“Much like data from the Large Hadron Collider’s ATLAS detector, computation for SDP data will be shared across many countries and users,” explained Dr Happy Sithole, CHPC’s director.
The computing model to support Atlas is based on cloud services, which led CHPC toward an on-premises, private cloud.
“There were several reasons to consider a private cloud,” stated Sithole. “Since we support many governments and businesses, we needed to address their concerns, such as where instances would be deployed and data sovereignty. We desired greater control over its architecture, access, and security. The option of a private cloud gave our stakeholders more confidence.”
Like the UK Science Cloud at Cambridge University, which is also a main member of the SKA project, the CHPC cloud was built on OpenStack with Ceph storage software.
“OpenStack provides a transparent environment for users around the world to analyze SDP data,” added Sithole. “And OpenStack offers a foundation for our existing heterogeneous computing needs and for a future converged infrastructure that provides both supercomputing and general purpose services.”
The new system was built on Supermicro TwinPro servers with 2nd Gen Intel Xeon Scalable processors and 3 TB of memory per node. A Ceph storage cluster, built from 1.5 PB of mechanical disk and more than 220 TB of Intel SSDs, provided a hierarchical storage architecture for short- and long-term data.
“The new cloud system was designed to support many virtual jobs related to ongoing research, such as custom workflows, pleasingly parallel workloads, and web hosting,” commented Thobye.
After the system was commissioned on March 23, 2020, CHPC technicians began migrating users from the VMware system to the new OpenStack Production Cloud. Then, on March 26, 2020, the country went into lockdown due to COVID-19, and everything changed.
Dealing with a pandemic
Agencies across the government found themselves scrambling for computing capacity. The Department of Health required enormous computing and storage resources to process population tracking-and-tracing data and other related datasets. The Department of Higher Education and Training needed resources for remote learning programs, plus television white-space analytics and analysis of the bandwidth available, and required, to reach outlying communities. Other compute- and data-intensive SARS-CoV-2 projects included DNA sequencing and virus research. Lengau was utilized as much as possible, but the OpenStack Production Cloud, originally sized for a much smaller user population, was overwhelmed.
“Because of the pandemic and all the new users it brought to us, we were running out of compute and storage resources,” explained Thobye.
CHPC turned to Intel and Dell to help upgrade its brand-new cloud system. The OpenStack Production Cloud expansion included the following:
· 15 new compute nodes using Dell PowerEdge R640 servers with Intel Xeon Gold 6248 processors
· 26 new storage nodes using Dell PowerEdge R740XD2 servers with Intel Xeon Silver 4208, 4210, and 4214 processors
· 60 TB of hot data storage using Intel SSD DC drives
· 480 TB of mechanical storage
The expansion was completed in mid-2020 and went into production with a total capacity of 2,212 compute cores, 1.3 PB of cold storage, and 130 TB of hot storage on Intel SSDs. The additional storage and compute capacity on top of the existing OpenStack Production Cloud infrastructure gave users the resources and response times they needed.
“Instead of continuous 100 percent utilization,” commented Dr Sithole, “workloads now consume from 60 to 100 percent of the compute capacity, depending on the activities.”
The expanded cloud supports ongoing pandemic activities by the Department of Higher Education and Training, Department of Health, university research, and other public and private projects to address needs from the pandemic. But it also paves the path for South Africa CHPC’s future.
Paving a new path forward
A growing number of HPC centers around the world are creating hybrid infrastructures. Compute-intensive, parallel performance clusters are converging with data analytics, artificial intelligence/machine learning (AI/ML), and private cloud architectures to address a wide range of user needs under one infrastructure umbrella. Part of the UK Science Cloud’s mission is to support the SDP, and Simon Fraser University in British Columbia built its cloud to process data from the LHC’s ATLAS detector.
“OpenStack provides a different offering for users of the data center,” said Sithole. “This implementation is a step in the right direction to revolutionize our data center as a converged environment. We see this as a continuum between compute-intensive and data-intensive computing. It allows us to easily support both HPC research and general purpose cloud computing in the same infrastructure.”
According to Dr Sithole, the cloud also brings many new tools that will allow users to take advantage of the new environment. Artificial Intelligence (AI) and machine learning (ML) libraries, containerization, and other resources will help users who want to implement AI workloads and explore new approaches to their scientific problems.
“The cloud platform further enables CHPC to gather the technical and operational expertise needed to develop, provision, and operate a national federated OpenStack platform,” stated Thobye. “It will allow for global connectivity in a virtual environment for mega-projects like the Square Kilometre Array and others of similar stature.”
Before the pandemic struck South Africa, CHPC was piloting other Intel technologies, such as Intel Optane persistent memory and Intel Optane storage. CHPC expects that these hierarchical memory and storage technologies can improve large-memory processing performance and efficiency by keeping more data closer to the processors. Such proximity matters for workloads that interact with massive amounts of data, like the SKA’s. These technologies can also accelerate genome sequencing and assembly.
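The idea behind hierarchical memory and storage is to serve frequently accessed data from a small, fast tier while the bulk of the data sits on slower media. A toy two-tier store in Python illustrates the promote-on-access pattern; the tier names, capacity, and eviction policy here are simplified assumptions for illustration, not CHPC’s actual configuration:

```python
class TieredStore:
    """Illustrative two-tier store: a small fast 'hot' tier backed by a
    larger 'cold' tier, promoting data on access. A simplified sketch of
    hierarchical memory/storage, not any production system."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = {}    # fast tier (think persistent memory or SSD)
        self.cold = {}   # slow tier (think mechanical disk)

    def put(self, key, value):
        # New data lands in the bulk (cold) tier.
        self.cold[key] = value

    def get(self, key):
        if key in self.hot:               # fast path: already promoted
            return self.hot[key]
        value = self.cold[key]            # slow path: fetch from cold tier
        if len(self.hot) >= self.hot_capacity:
            # Evict the oldest hot entry to make room (simple FIFO policy).
            self.hot.pop(next(iter(self.hot)))
        self.hot[key] = value             # promote for future fast access
        return value

store = TieredStore(hot_capacity=2)
store.put("genome-1", "ACGT...")
first = store.get("genome-1")   # served from the cold tier, then promoted
second = store.get("genome-1")  # now served from the hot tier
```

Real hierarchies add more tiers and smarter eviction, but the payoff is the same: repeated accesses to large datasets are served from the tier closest to the processors.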
Once the population has been vaccinated and the virus is under control, CHPC’s OpenStack Production Cloud will be able to support many other activities. More commercial members of the Southern African Development Community (SADC) can take advantage of easy access to computing and storage resources. Newer weather models are being explored to help Africa understand and deal with its unique weather events, such as Tropical Cyclone Chalane, which hit Mozambique in late 2020, and Tropical Cyclone Eloise, which made landfall in early 2021, as well as the effects of climate change.
“Once COVID is beyond us,” concluded Dr Sithole, “we have different challenges in Africa. The OpenStack platform gives us AI and other tools that will help find solutions for Africa's unique problems. One of those challenges is the issue of communicable diseases. Ebola, for example, but Ebola is not the worst disease that Africans face. And what we have learned with COVID is that you cannot solve such problems alone. There has to be a concerted effort from everybody together to find cures for the problems that we have. Hopefully, that will accelerate the uptake of the CHPC platform so we can find solutions for those unique African problems as well.”
This article was produced as part of Intel’s editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC and AI communities through advanced technology. The publisher of the content has final editing rights and determines what articles are published.