Virtual Open Systems Scientific Publications
The 2017 International Conference on High Performance Computing & Simulation (HPCS-2017), Genoa, Italy.
Virtualization, HPC, RDMA, API Remoting, Disaggregated datacenter
This work was supported by the Exanest project. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671553. The work presented in this paper reflects only authors' view and the European Commission is not responsible for any use that may be made of the information it contains.
Remote DMA (RDMA) engines are widely used in clusters and data-centres to improve the performance of data transfers between applications running on different nodes of a computing system. RDMA engines are today supported by most network architectures and distributed programming models. However, with the massive usage of virtualization, most applications will use RDMA from virtualized systems, and the virtualization of such I/O devices poses several challenges. This paper describes a generic para-virtualization framework based on API Remoting, providing at the same time the flexibility of software-based virtualization and the low overhead of hardware-assisted solutions. The solution presented in this paper targets the KVM hypervisor, but is not bound to any target network architecture or specific RDMA engine, thanks to virtualization at the level of the programming API. In addition, two of the major limitations of para-virtualization are addressed: data sharing between host and guest, and interactions between guests and the hypervisor. Experimental results show near-native performance for the final user of the RDMA (i.e., maximum transfer bandwidth), with higher overhead confined to the API functions used to initialize the RDMA device or to allocate/deallocate RDMA buffers.
Data transfers have always been a main concern in large clusters and data-centres, amplified by applications' constant demand for higher bandwidth and lower communication latency. Remote DMA (RDMA) engines are the response to this type of problem, enabling Network Interface Cards (NICs) to perform DMA-like memory data transfers between nodes of the same computing system. Various network interconnection protocols used in data-centres, such as Infiniband and Ethernet through RDMA over Converged Ethernet (RoCE), already provide support for RDMA engines. The main advantages of this approach are a drastic reduction in latency and reduced CPU involvement, and thus higher bandwidth, compared with other communication paradigms such as network sockets. User libraries based on RDMA transfers are used for databases, scientific computing and cloud computing in order to optimize communication and data movement between application instances. In parallel, large clusters and data-centres rely extensively on virtualization as a tool for improved utilization of system resources (e.g., memory, disk, CPU) and hardware consolidation. This is achieved by running multiple virtual instances of a system on the same hardware machine. In addition, virtualization is used for resilience thanks to facilities such as live migration and snapshots of virtualized systems.
The virtualization of an I/O peripheral such as an RDMA engine can be implemented mainly in three ways: direct device pass-through, hardware-assisted virtualization with PCI Single-Root I/O Virtualization (SR-IOV), or para-virtualization. Direct device pass-through, although enabling almost native performance, creates a 1-to-1 mapping between the device and one virtual machine. This means that an RDMA device on a compute node cannot be shared among multiple virtual machines, losing the benefits of virtualization in terms of better distribution of available hardware resources. PCI SR-IOV overcomes the problem of sharing the device between multiple virtual machines, but requires hardware support that is not always available and reduces the effectiveness of snapshots and live migration. Finally, para-virtualization, being a software-based technique, offers the highest level of flexibility of the three, but suffers from a major drawback: high virtualization overhead due to frequent interactions with the hypervisor and to the handling of data sharing. RDMA devices can usually be programmed either at the bare metal level or via a dedicated user-space API. Virtualizing the bare metal interface would lead to a dedicated virtualization solution for each device on the market, while virtualizing at the API level creates a generic solution that can be easily adapted to new APIs, and enables devices using the same programming API to be virtualized with the same implementation of the framework.
In this paper we present a generic and lightweight RDMA para-virtualization framework for the KVM hypervisor that overcomes the limitations of hardware-assisted solutions while also avoiding the overheads usually associated with para-virtualization. The solution virtualizes the RDMA engine at the user-space library level by using a technique called API Remoting, based on an API interception mechanism and a split-driver architecture. The benefits of this solution are threefold: 1. One virtualization framework core for multiple devices/APIs. 2. Native sharing of the device among multiple virtual machines. 3. Low virtualization overhead due to reduced interactions between guests and hypervisor.
From the application perspective, the virtualization framework is completely transparent, since the part of the driver installed in each guest (the frontend) exports the same stubs as the original programming API. Internally, the frontend intercepts API function calls and redirects them to the host. On the host side, the second part of the virtualization framework (the backend) is in charge of collecting requests from the various guest frontends and relaying them to the physical device. It then becomes the responsibility of the backend to orchestrate requests from multiple guests, creating the illusion of multiple RDMA devices available on the platform. This approach separates the specific API implementation from the virtualization core, making it easy to extend API Remoting to new APIs. The communication between frontend and backend is ensured by a third component, the transport layer, in charge of actually passing the requests from the frontends to the backend process.
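The split between frontend, transport layer and backend described above can be illustrated with a minimal single-process sketch. It is not the framework's implementation: `FakeRdmaLibrary`, `alloc_buffer`, `free_buffer`, the queue-based `Transport` and the per-guest ownership check are all hypothetical names and choices made for illustration, and the backend is run inline rather than as a separate host process.

```python
from collections import deque


class FakeRdmaLibrary:
    """Hypothetical stand-in for the native user-space RDMA API on the host."""

    def __init__(self):
        self.next_handle = 1
        self.buffers = {}

    def alloc_buffer(self, size):
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = bytearray(size)
        return handle

    def free_buffer(self, handle):
        del self.buffers[handle]
        return 0


class Transport:
    """Hypothetical transport layer: a queue of requests from frontends."""

    def __init__(self):
        self.requests = deque()

    def send(self, guest_id, func_name, args):
        self.requests.append((guest_id, func_name, args))


class Backend:
    """Host side: relays queued requests to the device's API."""

    def __init__(self, library, transport):
        self.library = library
        self.transport = transport
        self.owners = {}  # buffer handle -> guest that allocated it

    def process_one(self):
        guest_id, func_name, args = self.transport.requests.popleft()
        # Assumed policy: a guest may only free buffers it allocated itself.
        if func_name == "free_buffer" and self.owners.get(args[0]) != guest_id:
            raise PermissionError("guest does not own this buffer")
        result = getattr(self.library, func_name)(*args)
        if func_name == "alloc_buffer":
            self.owners[result] = guest_id
        elif func_name == "free_buffer":
            del self.owners[args[0]]
        return result


class Frontend:
    """Guest side: exports the same function names as the native API."""

    def __init__(self, guest_id, transport, backend):
        self.guest_id = guest_id
        self.transport = transport
        self.backend = backend

    def _remote_call(self, func_name, *args):
        self.transport.send(self.guest_id, func_name, args)
        # A real frontend would wait for a reply; here the backend runs inline.
        return self.backend.process_one()

    def alloc_buffer(self, size):
        return self._remote_call("alloc_buffer", size)

    def free_buffer(self, handle):
        return self._remote_call("free_buffer", handle)


library = FakeRdmaLibrary()
transport = Transport()
backend = Backend(library, transport)
guest_a = Frontend("guest-a", transport, backend)
guest_b = Frontend("guest-b", transport, backend)

# Both guests use the same API names and share one underlying library.
handle_a = guest_a.alloc_buffer(4096)
handle_b = guest_b.alloc_buffer(8192)
```

In this sketch, supporting another API would mean adding stub methods to `Frontend` and matching functions to the library object, while `Transport` and the dispatch logic in `Backend` stay unchanged, which mirrors the separation between API implementation and virtualization core described above.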
However, this is not enough for a full solution, since the virtualization of a device like an RDMA engine must take additional factors into account: interactions with the hypervisor, and guest-host data sharing. The former becomes a performance issue when the frequency of interactions is high, and should be minimized since every interaction with the hypervisor (hypercall) implies a guest exit, which is a notoriously expensive operation. The solution presented in this paper limits the interaction between the virtual machine and the hypervisor to the control-plane only, completely avoiding such interactions during regular RDMA operations. The control-plane is implemented in the proposed solution using virtio, a well-known para-virtualization framework that uses circular buffers (vrings) in shared memory for guest-host communication. The second problem, guest-host data sharing, also known as the data-plane, is also of utmost relevance and is addressed in this paper within the RDMA transport layer. RDMA operations involve data transfers of buffers allocated by user-space applications, and in the bare metal operation of the device they do not imply data copies, since the RDMA hardware has direct access to those buffers. When virtualization comes into the picture, data buffers have to be shared between guest user-space and the RDMA device, and guest-host data copies should be avoided in order to minimize the performance loss due to virtualization. In this paper guest-host data sharing is implemented with a zero-copy mechanism, enabling true memory sharing between guest and host, extended down to the RDMA device.
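The two ideas in this paragraph, a circular buffer for control messages and zero-copy sharing of data buffers, can be illustrated with a small single-process analogy. This sketch does not use virtio and is not the paper's mechanism: `ControlRing` is a hypothetical simplification of the vring concept, the `bytearray` stands in for memory pages shared between guest and host, and passing an (offset, length) descriptor once at buffer registration is an assumption made for the example.

```python
class ControlRing:
    """Fixed-size circular buffer, loosely modelled on the vring idea."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.head = 0  # count of descriptors written by the producer (guest)
        self.tail = 0  # count of descriptors read by the consumer (host)

    def push(self, descriptor):
        if self.head - self.tail == self.size:
            raise BufferError("ring full")
        self.slots[self.head % self.size] = descriptor
        self.head += 1

    def pop(self):
        if self.head == self.tail:
            raise BufferError("ring empty")
        descriptor = self.slots[self.tail % self.size]
        self.tail += 1
        return descriptor


# Stand-in for memory shared between guest and host (assumption).
shared_memory = bytearray(64)

# Guest user space writes its payload directly into the shared region.
guest_view = memoryview(shared_memory)
payload = b"rdma payload"
guest_view[16:16 + len(payload)] = payload

# Control plane: only a small descriptor (offset, length) is exchanged,
# assumed here to happen once when the buffer is registered.
ring = ControlRing(4)
ring.push((16, len(payload)))

# Host side: build a view onto the same memory; no payload bytes are copied.
offset, length = ring.pop()
host_view = memoryview(shared_memory)[offset:offset + length]

# A later guest write is visible through the host view, because both
# views refer to the same underlying memory.
guest_view[16:20] = b"RDMA"
```

The point of the analogy is that after the descriptor exchange, data written by the guest is visible to the host side without any further message or copy, which is the property the zero-copy data-plane aims for.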
The RDMA virtualization solution presented in this paper has been tested with an FPGA implementation of an RDMA engine designed for the Unimem Global Address Space (GAS) memory system. Unimem has been developed within the Euroserver FP7 project, to enable a system-wide shared memory abstraction between the nodes of a data-centre. The prototyping system is based on ARMv8-A architecture, but it should be noted that the API Remoting RDMA virtualization framework has no limitations with respect to the target host processor.
The rest of the paper is organized as follows: Section II provides a comparison with the state-of-the-art solutions for the virtualization of RDMA devices. Section III describes the target RDMA device and its userspace API. Section IV provides the detail of the API Remoting based RDMA virtualization. In Section V a set of experimental results is presented to validate the proposed solution. Finally, Section VI concludes the paper and identifies possible extensions.