High Performance Computing - DevOps

What is HPC?

High performance computing (HPC) uses supercomputers or computer clusters to solve advanced computing problems. The common scenarios are:

  • Simulation & Brute-force search
  • Machine Learning & Neural network
  • Big Data Analysis

Simple Parallel Model

In this type of model, the sub-tasks are completely independent. The execution time depends on the available resources and the number of tasks: the more worker nodes are involved, the more sub-tasks run in parallel, and the less time the whole task takes. The typical model is Monte Carlo simulation (a minimal sketch follows Figure 1).

Figure 1: Monte Carlo Simulation
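
As a minimal sketch of this pattern, the snippet below spreads a Monte Carlo estimate of pi over independent worker processes; the worker count and sample size are illustrative, not tuned values.

```python
# Simple parallel model: independent Monte Carlo sub-tasks estimating pi.
import random
from multiprocessing import Pool

def simulate(samples: int) -> int:
    """One independent sub-task: count random points that fall inside the unit circle."""
    hits = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    workers, samples_per_worker = 8, 1_000_000   # illustrative values
    with Pool(workers) as pool:
        # Each worker runs independently; more workers means more sub-tasks in parallel.
        hits = pool.map(simulate, [samples_per_worker] * workers)
    pi_estimate = 4.0 * sum(hits) / (workers * samples_per_worker)
    print(f"pi is approximately {pi_estimate}")
```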

MapReduce

For this kind of model, the common solution is the MapReduce programming model. A MapReduce program is composed of a map procedure/job, which performs filtering and sorting, and a reduce method, which performs a summary operation. [1]

The first step of this model is to split the input data into separate partitions. Then, each mapper task reads its own partition, computes its result, and stores it in a predefined output location. Finally, the reduce function collects all the results from the mappers and summarizes them into the final output (a toy walk-through follows Figure 2).

Figure 2: MapReduce programming model workflow[2]
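
The toy snippet below walks through the same map, shuffle, and reduce steps in a single process; a real framework such as Hadoop distributes these steps across worker nodes.

```python
# Toy illustration of the map / shuffle / reduce flow (word count).
from collections import defaultdict

def map_phase(document: str):
    """Mapper: emit (word, 1) pairs from one input split."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reducer: summarize all values emitted for one key."""
    return key, sum(values)

if __name__ == "__main__":
    splits = ["the quick brown fox", "the lazy dog", "the fox"]
    # Shuffle: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_phase(split):
            grouped[key].append(value)
    results = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(results)   # e.g. {'the': 3, 'quick': 1, ...}
```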

The vanilla MapReduce programming model has some challenges.

  • It is hard to track the status of the mapper procedures. A mapper normally takes a long time to finish its task, and during execution it may be terminated for various reasons, for example when its worker node is evicted from the cluster or crashes unexpectedly. One solution is to introduce a job tracker that traces mapper progress and status.
  • The other is sharing large data sets among the worker nodes. Since the mapper procedures run on different worker nodes, low-latency network storage that supports concurrent writes is required. The Network File System (NFS) can help us here (a rough sketch of both ideas follows this list).
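
As a rough illustration of both mitigations, the sketch below assumes every worker sees the same NFS mount (the path and file layout are hypothetical): each mapper writes its partial result and a heartbeat file that a job tracker can poll to detect crashed or evicted workers.

```python
# Hypothetical mapper heartbeat + result files on a shared NFS mount.
import json
import time
from pathlib import Path

SHARED_DIR = Path("/mnt/nfs/jobs/job-42")   # hypothetical shared NFS mount

def run_mapper(mapper_id, records):
    """Mapper side: do the work, refresh a heartbeat, and write the partial result."""
    heartbeat = SHARED_DIR / f"mapper-{mapper_id}.heartbeat"
    output = SHARED_DIR / f"mapper-{mapper_id}.out"
    partial = 0.0
    for i, record in enumerate(records):
        partial += record                            # placeholder for the real map work
        if i % 1000 == 0:
            heartbeat.write_text(str(time.time()))   # the tracker checks this timestamp
    output.write_text(json.dumps({"mapper": mapper_id, "partial": partial}))

def mapper_is_alive(mapper_id, timeout_seconds=60.0):
    """Tracker side: a mapper whose heartbeat is stale is assumed dead and rescheduled."""
    heartbeat = SHARED_DIR / f"mapper-{mapper_id}.heartbeat"
    if not heartbeat.exists():
        return False
    return time.time() - float(heartbeat.read_text()) < timeout_seconds
```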

Complex Parallel Model

Complex parallelism means the sub-tasks are not completely independent; they need to synchronize parameters among themselves before proceeding to further steps. The typical model is neural network training.

When the neural network executes forward propagation and produces a result that deviates from the expected output, it executes backward propagation to update the network's parameters toward the expectation. It then continues with further training iterations.

Figure 3: Neural Network Training [3]

TensorFlow splits the training set into small batches. Every time a batch is completed, the parameters are updated, and the worker nodes fetch the updated parameters to start the next batch (a minimal sketch follows Figure 4).

Figure 4: TensorFlow Distributed Training [4]
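
To make the batch-wise parameter synchronization concrete, here is a minimal data-parallel sketch using TensorFlow's tf.distribute API; the worker cluster configuration (TF_CONFIG) is assumed to be provided by the platform, and the model and dataset are placeholders.

```python
# Minimal multi-worker data-parallel training sketch with tf.distribute.
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are replicated; gradients are aggregated across
    # workers after every batch, so all replicas see the same updated
    # parameters before the next batch starts.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy dataset standing in for the real training set, split into small batches.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(64)

model.fit(dataset, epochs=2)
```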

Complex Parallel vs. Simple Parallel

| Simple Parallel | Complex Parallel |
| --- | --- |
| The output is uncertain; it depends on the mapper procedure and the reduce function. | The output is certain; it is the parameter set held on the parameter device. |
| The sub-tasks are independent. | The sub-tasks are not independent; they share the parameters. |

Some big data analyses mix simple and complex parallelism into a larger analysis system, which involves a more complex resource scheduler. For example, Hadoop provides a scheduler that can start the reducer even before the mapper procedures have finished. In most situations, such complex scenarios are customized by the users.

HPC Cluster

A complete HPC cluster contains the following components.[5]

  • A cluster provisioner that ensures node homogeneity.
  • Nodes, the servers that provide resources to the cluster.
  • A scheduler that queues up workloads against the cluster resources.
  • A network for communication between the nodes.
  • A general-purpose storage solution used to store the applications and user data.
  • A high-speed, low-latency clustered file system generally used for computational storage.
  • Identity management to keep user access consistent throughout a cluster.
  • An observability & monitoring system that provides insight into workload resource utilisation for the scheduler.
  • Object storage, in some cases.

Figure 5: HPC cluster resources[5]

Let’s look into each component in detail.

Cluster provisioner

Node homogeneity is very important in HPC to ensure workload consistency. In most scenarios, the provisioner is deployed as a service that collects information from the infrastructure, ranks the resources, and manages the workload of each one. In cloud management this is called Metal-As-A-Service (MAAS). Kubernetes naturally provides this functionality, which makes our work easier (see the sketch below).
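
As a small sketch of how node homogeneity could be checked on Kubernetes, the snippet below groups nodes by instance type and kubelet version using the official Python client; the choice of grouping criteria is our assumption.

```python
# Group cluster nodes by instance type and kubelet version; a homogeneous
# HPC partition should collapse into a single bucket.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
nodes = client.CoreV1Api().list_node().items

buckets = Counter(
    (
        (n.metadata.labels or {}).get("node.kubernetes.io/instance-type", "unknown"),
        n.status.node_info.kubelet_version,
    )
    for n in nodes
)
for (instance_type, kubelet), count in buckets.items():
    print(f"{count} node(s): {instance_type}, kubelet {kubelet}")
```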

Head nodes

The head nodes act as the entry point of the HPC cluster. They are where users interact with the input and output of their workloads and get access to the local storage system available to the cluster.

The scheduler is part of the head nodes' functionality and acts as the brain of the whole cluster. It receives requests, enqueues tasks, assigns them to the available nodes, and tracks job status. Schedulers are aware of resource availability and utilisation and do their best to consider any locality that might affect performance. Their main purpose is to schedule compute jobs based on optimal workload distribution.

Compute nodes

Compute nodes are the processing component of an HPC cluster. They execute the workload using local resources such as the CPU, GPU, or TPU. Other resources, such as memory, storage, and the network adapter, are also considered. Depending on how the workload uses those components, it can be limited by one or more of them during execution. For example, workloads that use a lot of memory might be limited by memory bandwidth or capacity. Workloads that either consume or generate a large amount of data during computation might be limited in their processing speed by network bandwidth or storage performance constraints, if that data is written to storage as part of the computation. Some workloads might simply need plenty of computational resources and be limited by the processing ability of the cluster.

Networks

If a job runs in complex parallel mode and needs inter-process communication across nodes, the network becomes an important factor in building the cluster. The ideal network has low latency and high bandwidth, but this is not fully achievable in a cloud environment, where the provider limits the bandwidth of the nodes. What we can do is choose network-optimized nodes and hope they bring low enough latency and sufficient bandwidth.

Storage

There are three types of storage, as mentioned in the component list above.

The general-purpose storage is used to store the application binaries and their libraries, as well as node system data such as local journal logs and hardware monitoring output. Normally, the block storage provided by the cloud is what we need for this.

The cluster file system plays a more important role in the HPC cluster. It is a shared file system that serves storage resources from multiple servers and can be mounted and used by multiple clients at the same time. We use this file system to store the initial large data sets and the computing results, which bypasses the network limitations of individual nodes and results in low latency and high performance.

The last type is object storage, which acts like cold storage for the HPC cluster. We can archive old results to the object storage so that resources are freed for other tasks. For example, once a convolutional neural network (CNN) has been trained well, the model can be saved to the object storage. A serving application can later load the CNN model from the object storage to provide the corresponding service (a short sketch follows).
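
As a hedged sketch of this archival flow, the snippet below uses boto3 against an S3-compatible object store; the endpoint, bucket, object key, and file names are placeholders.

```python
# Archive a trained model to S3-compatible object storage, then fetch it back.
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

# Archive the finished model so later serving jobs can fetch it on demand.
s3.upload_file("cnn_model.h5", "hpc-archive", "models/cnn_model.h5")

# A serving application can later pull the model back down before loading it.
s3.download_file("hpc-archive", "models/cnn_model.h5", "/tmp/cnn_model.h5")
```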

Auxiliary Services

The auxiliary services provide security functions and monitor the cluster status. They do not provide any computing power, but they keep the system running stably.

HPC over Kubernetes

Although HPC can run on a bundle of physical machines or VMs in the cloud, we choose to set up HPC over Kubernetes as our solution. This solution has several advantages. First, Kubernetes includes a powerful set of tools to control the life cycle of applications, e.g. parameterised redeployment in case of failures, state management, etc. Furthermore, Kubernetes incorporates an advanced scheduling system which can even specify different schedulers for each job. Kubernetes supports software-defined infrastructures and resource disaggregation by leveraging container-based deployment and particular drivers (e.g. the Container Network Interface driver) based on standardised interfaces. These interfaces enable the definition of abstractions for fine-grained control of computation, state, and communication in multi-tenant cloud environments along with optimal usage of the underlying hardware resources. [6]

Volcano

Volcano is a cloud native batch system and the CNCF's first batch computing project; Huawei is the major contributor to the project. It was developed to extend cloud native from microservices to big data, artificial intelligence, and HPC by providing the following capabilities (a minimal job-submission sketch follows the list):

  • Full lifecycle management for jobs
  • Scheduling policies for high-performance workloads
  • Support for heterogeneous hardware
  • Performance optimization for high-performance workloads[7]
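
Although Volcano has no official Python SDK, a Volcano Job can still be submitted through the generic CustomObjectsApi of the Kubernetes Python client. The manifest below is a minimal sketch; the image, namespace, and resource figures are illustrative assumptions, not our production values.

```python
# Submit a minimal Volcano Job via the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()

volcano_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "monte-carlo-sim"},
    "spec": {
        "minAvailable": 4,          # gang scheduling: start only when 4 pods fit
        "schedulerName": "volcano",
        "tasks": [
            {
                "replicas": 4,
                "name": "simulate",
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "worker",
                                "image": "registry.example.com/quant/sim:latest",   # placeholder
                                "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
                            }
                        ],
                    }
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=volcano_job,
)
```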

This solution meets most of our requirements, but it still leaves several gaps.

First, there is no good job-flow mechanism in Volcano. A job flow defines the relationships among several jobs. In our case, we want to spread the Monte Carlo simulation across different machines to improve simulation performance, but we still need an application to summarize the results. This requires two dependent jobs: one to fan the simulation out, and one to summarize the partial results into the final result. Unfortunately, the latest Volcano version does not support this natively. We need to set up a customized job-flow platform ourselves, or use a third-party solution, such as BoCloud's JobFlow, to orchestrate the jobs. [8] (A minimal client-side orchestration sketch follows.)
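
In the absence of a native job-flow feature, a minimal client-side orchestration could look like the sketch below: poll the mapping job until it reaches a terminal phase, then submit the summarizing job. The job names and the status field layout are assumptions based on the Volcano Job CRD.

```python
# Naive two-step job flow: wait for the simulation job, then submit the summary job.
import time
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
GROUP, VERSION, NS, PLURAL = "batch.volcano.sh", "v1alpha1", "default", "jobs"

def wait_for_completion(job_name, poll_seconds=30):
    """Poll a Volcano Job until it reports a terminal phase."""
    while True:
        job = api.get_namespaced_custom_object(GROUP, VERSION, NS, PLURAL, job_name)
        phase = job.get("status", {}).get("state", {}).get("phase", "Pending")
        if phase in ("Completed", "Failed", "Aborted"):
            return phase
        time.sleep(poll_seconds)

summary_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "summarize-results"},
    # Task definition omitted for brevity; same shape as the simulation manifest above.
    "spec": {"minAvailable": 1, "schedulerName": "volcano", "tasks": []},
}

# Step 1: the fan-out simulation job was submitted as in the previous sketch.
if wait_for_completion("monte-carlo-sim") == "Completed":
    # Step 2: only then submit the job that summarizes the partial results.
    api.create_namespaced_custom_object(GROUP, VERSION, NS, PLURAL, body=summary_job)
```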

Second, there are no Python libraries for the Volcano platform. Our quant developers have no time to investigate a complex platform and modify their code to adapt to the new cluster. Therefore, we need to provide a library with which quant developers can write and debug their code easily on their personal PCs, and then, after changing a few settings, run the same code on the Volcano platform seamlessly (a rough sketch follows).
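
A rough sketch of the kind of wrapper library we have in mind is shown below; all names are hypothetical. The same entry point either runs the simulation locally for debugging or hands it off to a Volcano submission path when a setting is changed.

```python
# Hypothetical wrapper: same API for local debugging and cluster submission.
import os

def run(simulation, *, backend=None, workers=4):
    """Run `simulation(worker_id)` either on the developer's PC or on the cluster."""
    backend = backend or os.getenv("HPC_BACKEND", "local")
    if backend == "local":
        # Local debugging path: plain multiprocessing on the quant's workstation.
        from multiprocessing import Pool
        with Pool(workers) as pool:
            return pool.map(simulation, range(workers))
    elif backend == "volcano":
        # Cluster path: package the function and submit a Volcano Job manifest
        # (as in the earlier submission sketch) instead of executing in-process.
        raise NotImplementedError("would wrap the CustomObjectsApi submission shown above")
    else:
        raise ValueError(f"unknown backend: {backend}")
```

The design intent is that switching `HPC_BACKEND` from `local` to `volcano` is the only change a quant developer has to make.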

Reference

  1. https://en.wikipedia.org/wiki/MapReduce
  2. https://www.geeksforgeeks.org/hadoop-architecture/
  3. https://gfycat.com/miniaturedependentcob
  4. https://data-flair.training/blogs/distributed-tensorflow/
  5. https://maas.io/blog/hpc-cluster-architecture-part-4
  6. https://journalofcloudcomputing.springeropen.com/articles/10.1186/s13677-021-00231-z
  7. https://www.cncf.io/blog/2022/04/07/cloud-native-batch-system-volcano-moves-to-the-cncf-incubator/
  8. https://github.com/BoCloud/JobFlow