vScale
SMP virtual machines (VMs) have been deployed extensively in clouds to host multithreaded applications. A widely known problem is that when CPUs are oversubscribed, scheduling delays due to VM preemption give rise to many performance problems through their impact on thread synchronization and I/O efficiency. Dynamically changing the number of virtual CPUs (vCPUs) according to the available physical CPU (pCPU) cycles has been shown to be a promising approach. Unfortunately, there are currently no efficient mechanisms to support such vCPU-level elasticity.
We present vScale, a cross-layer design that enables SMP-VMs to adaptively scale their vCPUs at the cost of only microseconds. vScale consists of two extremely lightweight mechanisms: i) a generic algorithm in the hypervisor scheduler to compute VMs' CPU extendability, based on their proportional shares and CPU consumption, and ii) an efficient method in the guest OS to quickly reconfigure the vCPUs. vScale can be tightly integrated with existing OS/hypervisor primitives and has very little management complexity. With our prototype in Xen/Linux, we evaluate vScale's performance with several representative multithreaded applications, including the NPB suite, the PARSEC suite and the Apache web server. The results show that vScale can significantly reduce the VM's waiting time, and thus can accelerate many applications, especially synchronization-intensive and I/O-intensive ones.
Comparison of the Three CPU Schedulers in Xen
The primary motivation for enterprises to adopt virtualization technologies is to create a more agile and dynamic IT infrastructure — with server consolidation, high resource utilization, the ability to quickly add and adjust capacity on demand — while lowering total cost of ownership and responding more effectively to changing business conditions. However, effective management of virtualized IT environments introduces new and unique requirements, such as dynamically resizing and migrating virtual machines (VMs) in response to changing application demands. Such capacity management methods should work in conjunction with the underlying resource management mechanisms. In general, resource multiplexing and scheduling among virtual machines is poorly understood. CPU scheduling for virtual machines, for instance, has largely been borrowed from process scheduling research in operating systems. However, it is not clear whether a straightforward port of process schedulers to VM schedulers would perform just as well. We use the open source Xen virtual machine monitor to perform a comparative evaluation of three different CPU schedulers for virtual machines. We analyze the impact of the choice of scheduler and its parameters on application performance, and discuss challenges in estimating application resource requirements in virtualized environments.
CPU SCHEDULERS FOR VIRTUAL MACHINES
Before describing Xen’s CPU schedulers, we first establish some terminology and classical ways of classifying CPU schedulers. A large fraction of CPU schedulers belong to the popular class of Proportional Share (PS) schedulers, and there are compelling reasons to use PS scheduling for CPU allocation in VMs. PS scheduling allocates CPU in proportion to the number of shares (weights) that VMs have been assigned. This gives end users a very natural way of thinking about CPU allocations, and it scales well to a large number of VMs and processors.
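As a rough illustration (not taken from Xen), the sketch below computes each VM's proportional CPU fraction from a set of example weights; the domain names and weight values are assumptions made up for this example:

    /* Hypothetical example: each VM's target CPU fraction is its weight
     * divided by the total weight of all VMs. */
    #include <stdio.h>

    struct vm { const char *name; int weight; };

    int main(void) {
        struct vm vms[] = { {"Dom1", 256}, {"Dom2", 256}, {"Dom3", 512} };
        int n = 3, total = 0;
        for (int i = 0; i < n; i++)
            total += vms[i].weight;
        for (int i = 0; i < n; i++)
            printf("%s gets %.1f%% of the CPU\n",
                   vms[i].name, 100.0 * vms[i].weight / total);
        /* prints 25.0%, 25.0% and 50.0% for Dom1, Dom2 and Dom3 */
        return 0;
    }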
CPU schedulers can be further distinguished as supporting work-conserving (WC-mode) and/or non-work-conserving (NWC-mode) modes. In the WC-mode, the shares are merely guarantees, and the CPU is idle if and only if there is no runnable client. For example, with two clients of equal weights, if one client is blocked, the other client can consume the entire CPU.
With the NWC-mode, the shares are caps, i.e., each client owns its fraction of the CPU. With the same two equal-weight clients, each client can get up to 50% of the CPU, but neither can get more than 50% even if the rest of the CPU is idle.
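The difference between the two modes can be sketched as follows; the function and values are illustrative assumptions, not part of any scheduler's API:

    /* Illustrative sketch of the allocation one of two equal-weight clients
     * sees, depending on whether its peer is runnable and on the mode. */
    #include <stdio.h>

    double allocation(double share, int peer_runnable, int work_conserving) {
        if (peer_runnable)
            return share;                /* both runnable: each gets its share */
        return work_conserving ? 1.0     /* WC: idle capacity goes to the runnable client */
                               : share;  /* NWC: the share is a hard cap */
    }

    int main(void) {
        printf("WC-mode,  peer blocked: %.0f%%\n", 100 * allocation(0.5, 0, 1)); /* 100% */
        printf("NWC-mode, peer blocked: %.0f%%\n", 100 * allocation(0.5, 0, 0)); /*  50% */
        return 0;
    }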
We also distinguish preemptive and non-preemptive CPU schedulers. Preemptive schedulers rerun the scheduling decision whenever a new client becomes ready: if the new client has “priority” over the running client, the CPU preempts the running client and executes the new client. Non-preemptive schedulers only make decisions when the running client gives up the CPU, allowing every running client to finish its CPU slice. Having a preemptive scheduler is important for achieving good performance of I/O-intensive workloads in a shared environment. These workloads are often blocked waiting for I/O events, and their performance can suffer when competing with CPU-intensive jobs if the CPU scheduler is non-preemptive. However, choosing the right quantum size may alleviate this problem.
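A minimal sketch of this wake-up-path difference, with hypothetical names and a priority field standing in for whatever policy the scheduler actually uses:

    /* A preemptive scheduler re-evaluates as soon as a client becomes runnable,
     * while a non-preemptive one defers the decision to the end of the slice. */
    struct client { int priority; };

    int should_preempt(const struct client *running, const struct client *woken,
                       int preemptive) {
        if (!preemptive)
            return 0;                               /* wait for the running client to yield */
        return woken->priority > running->priority; /* preempt only if the woken client has priority */
    }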
CPU SCHEDULERS IN XEN
Xen is unique among VM platforms because it allows users to choose among different CPU schedulers. But this choice comes with the burden of choosing the right scheduler and configuring it. Over the course of the last three years, three different CPU schedulers were introduced, all allowing users to specify CPU allocation via CPU shares (weights). Below, we briefly characterize the main features that motivated their inclusion in Xen at the time.
Borrowed Virtual Time (BVT)
BVT is a fair-share scheduler based on the concept of virtual time, dispatching the runnable VM with the smallest virtual time first. Additionally, BVT provides low-latency support for real-time and interactive applications by allowing latency-sensitive clients to “warp” back in virtual time to gain scheduling priority. The client effectively “borrows” virtual time from its future CPU allocation.
The scheduler accounts for running time in terms of a minimum charging unit (mcu), typically the frequency of clock interrupts. The scheduler is configured with a context switch allowance C, which is the real time by which the current VM is allowed to advance beyond another runnable VM with equal claim on the CPU (the basic time slice or time quantum of the algorithm). C is typically some multiple of mcu. Each runnable domain receives a share of CPU in proportion to its weight weight_i. To achieve this, the virtual time of the currently running Dom_i is incremented by its running time divided by weight_i.
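This virtual-time accounting can be sketched as follows; the structure and field names are assumptions made for illustration, not the actual BVT/Xen implementation:

    /* The running domain's virtual time advances by its running time divided
     * by its weight, and the runnable domain with the smallest (warp-adjusted)
     * virtual time runs next. */
    #include <limits.h>

    struct bvt_dom {
        long vtime;    /* accumulated virtual time, in mcu */
        long weight;   /* proportional share */
        long warp;     /* virtual-time credit for latency-sensitive domains */
        int  runnable;
    };

    void charge(struct bvt_dom *d, long ran_mcu) {
        d->vtime += ran_mcu / d->weight;   /* heavier weight => virtual time advances slower */
    }

    struct bvt_dom *pick_next(struct bvt_dom *doms, int n) {
        struct bvt_dom *best = 0;
        long best_evt = LONG_MAX;
        for (int i = 0; i < n; i++) {
            long evt = doms[i].vtime - doms[i].warp;  /* effective (warped) virtual time */
            if (doms[i].runnable && evt < best_evt) {
                best_evt = evt;
                best = &doms[i];
            }
        }
        return best;
    }

Latency-sensitive domains gain priority here simply because a positive warp lowers their effective virtual time relative to their competitors.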
In summary, BVT has the following features:
• preemptive (if warp is used), WC-mode only;
• optimally fair: the error between the fair share and the actual allocation is never greater than the context switch allowance C plus one mcu;
• low-overhead implementation on multiprocessors as well as uniprocessors.
The lack of NWC-mode in BVT severely limited its usage in a number of environments, and led to the introduction of the next scheduler in Xen.
Simple Earliest Deadline First (SEDF)
SEDF uses real-time algorithms to deliver guarantees. Each domain Dom_i specifies its CPU requirements with a tuple (s_i, p_i, x_i), where the slice s_i and the period p_i together represent the CPU share that Dom_i requests: Dom_i will receive at least s_i units of time in each period of length p_i. The boolean flag x_i indicates whether Dom_i is eligible to receive extra CPU time (WC-mode). SEDF distributes this slack time in a fair manner after all runnable domains have received their CPU share. One can allocate 30% CPU to a domain by assigning either (3 ms, 10 ms, 0) or (30 ms, 100 ms, 0). The time granularity in the definition of the period impacts scheduler fairness.
For each domain Dom_i, the scheduler tracks two additional values (d_i, r_i), sketched in code after the summary below:
• d_i - the time at which Dom_i's current period ends, also called the deadline. The runnable domain with the earliest deadline is picked to be scheduled next;
• r_i - the remaining CPU time of Dom_i in the current period.
In summary, SEDF has the following features:
• preemptive, WC and NWC modes;
• fairness depends on the value of the period;
• implements per-CPU queues: this implementation lacks global load balancing on multiprocessors.
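A minimal sketch of this per-domain bookkeeping, with assumed field names rather than the actual Xen SEDF code:

    /* Charge consumed time against the remaining slice r, refill the slice at
     * each period boundary, and always run the runnable domain with the
     * earliest deadline that still has slice left. */
    struct sedf_dom {
        long s, p;    /* requested slice and period, e.g. 3 ms every 10 ms = 30% CPU */
        int  x;       /* eligible for extra (slack) time, i.e. WC-mode */
        long d, r;    /* current deadline and remaining slice in this period */
        int  runnable;
    };

    void account(struct sedf_dom *dom, long ran, long now) {
        dom->r -= ran;           /* charge the time just consumed */
        if (now >= dom->d) {     /* current period is over */
            dom->d += dom->p;    /* next deadline */
            dom->r  = dom->s;    /* refill the slice */
        }
    }

    struct sedf_dom *pick_next(struct sedf_dom *doms, int n) {
        struct sedf_dom *best = 0;
        for (int i = 0; i < n; i++)
            if (doms[i].runnable && doms[i].r > 0 && (!best || doms[i].d < best->d))
                best = &doms[i];
        return best;             /* earliest deadline first */
    }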
Credit Scheduler
The Credit scheduler is Xen’s latest PS scheduler, featuring automatic load balancing of virtual CPUs (VCPUs) across physical CPUs on an SMP host. Before a CPU goes idle, it will consider other CPUs in order to find a runnable VCPU. This approach guarantees that no CPU idles when there is runnable work in the system.
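A hypothetical sketch of that balancing step (data layout and names assumed, not the actual Xen code):

    /* Before idling, a physical CPU scans its peers' run queues and takes a
     * waiting VCPU if it finds one. */
    struct vcpu;
    struct pcpu { struct vcpu *runq[8]; int runq_len; };

    struct vcpu *steal_work(struct pcpu *peers, int n_peers) {
        for (int i = 0; i < n_peers; i++)
            if (peers[i].runq_len > 0)
                return peers[i].runq[--peers[i].runq_len];  /* take a queued VCPU */
        return 0;  /* nothing runnable anywhere: this CPU may go idle */
    }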
Each VM is assigned a weight and a cap. If the cap is 0, the VM can receive any extra CPU (WC-mode). A non-zero cap (expressed as a percentage) limits the amount of CPU a VM receives (NWC-mode). The Credit scheduler uses 30 ms time slices for CPU allocation: a VM (VCPU) receives 30 ms before being preempted to run another VM. Once every 30 ms, the priorities (credits) of all runnable VMs are recalculated, and the scheduler monitors resource usage every 10 ms. To some degree, Credit's computation of credits resembles the virtual time computation in BVT. However, BVT has a context switch allowance C for defining a different size of the basic time slice (time quantum), and additional low-latency support (via warp) for real-time applications.
In summary, Credit has the following features:
• non-preemptive, WC and NWC modes;
• global load balancing on multiprocessors.
In the next section, we present results of a performance study comparing these schedulers and their features in more detail.
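Before that comparison, a rough sketch of the credit accounting described above; the structure, field names and constants are simplifications assumed for illustration, not the actual Xen implementation:

    /* Credits are redistributed in proportion to weight once per 30 ms round;
     * a cap of 0 means WC-mode, a non-zero cap limits the VM's CPU consumption
     * (NWC-mode). */
    struct credit_vm {
        int weight;
        int cap;       /* 0 = uncapped (WC-mode), otherwise max CPU in percent */
        int credits;   /* current credit balance */
        int used_pct;  /* CPU consumed in the current accounting window */
    };

    void recalc_credits(struct credit_vm *vms, int n, int credits_per_round) {
        int total_weight = 0;
        for (int i = 0; i < n; i++)
            total_weight += vms[i].weight;
        for (int i = 0; i < n; i++)               /* once every 30 ms */
            vms[i].credits += credits_per_round * vms[i].weight / total_weight;
    }

    int may_run(const struct credit_vm *vm) {
        if (vm->cap && vm->used_pct >= vm->cap)
            return 0;                             /* NWC-mode: cap reached */
        return vm->credits > 0;                   /* VMs holding credit run first */
    }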