4th USENIX Symposium on Networked Systems Design & Implementation (NSDI '07)
Pp. 229-242 of the Proceedings
Black-box and Gray-box Strategies for Virtual Machine Migration
Timothy Wood, Prashant Shenoy, Arun Venkataramani, and Mazin Yousif
Univ. of Massachusetts Amherst
Virtualization can provide significant benefits in data centers by enabling virtual machine migration to eliminate hotspots. We present Sandpiper, a system that automates the task of monitoring and detecting hotspots, determining a new mapping of physical to virtual resources and initiating the necessary migrations. Sandpiper implements a black-box approach that is fully OS- and application-agnostic and a gray-box approach that exploits OS- and application-level statistics. We implement our techniques in Xen and conduct a detailed evaluation using a mix of CPU, network and memory-intensive applications. Our results show that Sandpiper is able to resolve single server hotspots within 20 seconds and scales well to larger data center environments. We also show that the gray-box approach can help Sandpiper make more informed decisions, particularly in response to memory pressure.
1 Introduction

Data centers--server farms that run networked applications--have become popular in a variety of domains such as web hosting, enterprise systems, and e-commerce sites. Server resources in a data center are multiplexed across multiple applications--each server runs one or more applications and application components may be distributed across multiple servers. Further, each application sees dynamic workload fluctuations caused by incremental growth, time-of-day effects, and flash crowds.
Applications need to operate above a certain performance level specified in terms of a service level agreement (SLA); consequently, effective management of data center resources while meeting SLAs is a complex task.
One possible approach for reducing management complexity is to employ virtualization. In this approach, applications run on virtual servers that are constructed using virtual machines, and one or more virtual servers are mapped onto each physical server in the system. Virtualization of data center resources provides numerous benefits. It enables application isolation since malicious or greedy applications cannot impact other applications co-located on the same physical server. It enables server consolidation and provides better multiplexing of data center resources across applications. Perhaps the biggest advantage of employing virtualization is the ability to flexibly remap physical resources to virtual servers in order to handle workload dynamics. A workload increase can be handled by increasing the resources allocated to a virtual server, if idle resources are available on the physical server, or by simply migrating the virtual server to a less loaded physical server. Migration is transparent to the applications and all modern virtual machines
support this capability [6,15]. However, detecting workload hotspots and initiating a migration is currently handled manually. Manually-initiated migration lacks the agility to respond to sudden workload changes; it is also error-prone since each reshuffle might require migrations or swaps of multiple virtual servers to rebalance system load. Migration is further complicated by the need to consider multiple resources--CPU, network, and memory--for each application and physical server.
To address this challenge, this paper studies automated black-box and gray-box strategies for virtual machine migration in large data centers. Our techniques automate the tasks of monitoring system resource usage, hotspot detection, determining a new mapping and initiating the necessary migrations. More importantly, our black-box techniques can make these decisions by simply observing each virtual machine from the outside and without any knowledge of the application resident within each VM. We also present a gray-box approach that assumes access to a small amount of OS-level statistics in addition to external observations to better inform the migration algorithm. Since a black-box approach is more general by virtue of being OS and application-agnostic, an important aspect of our research is to understand if a black-box approach alone is sufficient and effective for hotspot detection and mitigation. We have designed and implemented the Sandpiper system to support either black-box, gray-box, or combined techniques. We seek to identify specific limitations of the black-box approach and understand how a gray-box approach can address them.
Sandpiper implements a hotspot detection algorithm that determines when to migrate virtual machines, and a hotspot mitigation algorithm that determines what and where to migrate and how many resources to allocate after the migration. The hotspot detection component employs a monitoring and profiling engine that gathers usage statistics on various virtual and physical servers and constructs profiles of resource usage. These profiles are used in conjunction with prediction techniques to detect hotspots in the system. Upon detection, Sandpiper's migration manager is invoked for hotspot mitigation. The migration manager employs provisioning techniques to determine the resource needs of overloaded VMs and uses a greedy algorithm to determine a sequence of moves or swaps to migrate overloaded VMs to underloaded servers. We have implemented our techniques using the Xen virtual machine monitor and conduct a detailed
experimental evaluation on a testbed of two dozen servers using a mix of CPU-, network- and memory-intensive applications. Our results show that Sandpiper can alleviate single server hotspots in less than 20s and more complex multi-server hotspots in a few minutes. We also find that Sandpiper imposes negligible overheads and that gray-box statistics enable Sandpiper to make better migration decisions when alleviating memory hotspots.
The rest of this paper is structured as follows. Section 2 presents background and an overview of the system, and Sections 3-6 present our design of Sandpiper. Section 7 presents our experimental evaluation, while Sections 8 and 9 present related work and our conclusions, respectively.
2 Background and System Overview
Existing approaches to dynamic provisioning have either focused on dynamic replication, where the number of servers allocated to an application is varied, or dynamic slicing, where the fraction of a server allocated to an application is varied; none have considered application migration for dynamic provisioning, primarily since migration is not a feasible option in the absence of virtualization. Since migration is transparent to applications executing within virtual machines, our work considers this third approach--resource provisioning via dynamic migrations in virtualized data centers. We present a system for automated migration of virtual servers in a data center to meet application SLAs. Sandpiper assumes a large cluster of possibly heterogeneous servers. The hardware configuration of each server--its CPU, network interface, disk and memory characteristics--is assumed to be known to Sandpiper. Each physical server (also referred to as a physical machine or PM) runs a virtual machine monitor and one or
more virtual machines. Each virtual server runs an application or an application component (the terms virtual server and virtual machine are used interchangeably). Sandpiper currently uses Xen to implement such an architecture. Each virtual server is assumed to be allocated a certain slice of the physical server resources. In the case of CPU, this is achieved by assigning a weight to the virtual server, and the underlying Xen CPU scheduler allocates CPU bandwidth in proportion to the weight. In the case of the network interface, Xen has yet to implement a similar fair-share scheduler; a best-effort FIFO scheduler is currently
used and Sandpiper is designed to work with this constraint. In the case of memory, a slice is assigned by
allocating a certain amount of RAM to each resident VM. All storage is assumed to be on a network file
system or a storage area network, thereby eliminating the need to move disk state during VM migrations.
Figure 1: The Sandpiper architecture.

Sandpiper runs a component called the nucleus on each physical server; the nucleus runs inside a special virtual server (domain-0 in Xen) and is responsible for gathering resource usage statistics on that server (see Figure 1). It employs a
monitoring engine that gathers processor,
network interface and memory swap
statistics for each virtual server. For gray-box approaches, it implements a daemon within each virtual server to gather OS-level statistics and perhaps application logs. The nuclei periodically relay these statistics to the Sandpiper control plane. The control plane runs on a distinguished node and implements much of the intelligence in Sandpiper. It comprises three components: a profiling engine, a hotspot detector, and a migration manager (see Figure 1). The
profiling engine uses the statistics from the nuclei to construct resource usage profiles for each virtual server and aggregate profiles for each physical server. The hotspot detector continuously monitors these usage profiles to detect hotspots--informally, a hotspot is said to have occurred if the aggregate usage of any resource (processor, network or memory) exceeds a threshold or if SLA violations occur for a ``sustained'' period. Thus, the hotspot detection component determines when to signal the need for migrations and invokes the migration manager upon hotspot detection, which attempts hotspot mitigation via dynamic migrations. It implements algorithms that determine what virtual servers to migrate from the overloaded servers, where to move them, and how much of a resource to allocate the virtual servers once the migration is complete (i.e., determine a new resource allocation to meet the target SLAs). The migration manager assumes that the
virtual machine monitor implements a migration mechanism that is transparent to applications and uses this mechanism to automate migration decisions; Sandpiper
currently uses Xen's migration mechanisms.
3 Monitoring and Profiling in Sandpiper
This section discusses online monitoring and profile generation in Sandpiper.

3.1 Unobtrusive Black-box Monitoring

The monitoring engine is responsible for tracking the processor, network and memory usage of each virtual server. It also tracks the total resource usage on each physical server by aggregating the usages of resident VMs. The monitoring engine tracks the usage of each resource over a measurement interval and reports these statistics to the control plane at the end of each interval.
In a pure black-box approach, all usages must be inferred solely from external observations and without relying on OS-level support inside the VM. Fortunately, much of the required information can be determined directly from the Xen hypervisor or by monitoring events within domain-0 of Xen. Domain-0 is a
distinguished VM in Xen that is responsible for I/O processing; domain-0 can host device drivers and act as a ``driver'' domain that processes I/O requests from other domains [3,9]. As a result, it is
possible to track network and disk I/O activity of various VMs by observing the driver activity in domain-0. Similarly,
since CPU scheduling is implemented in the Xen hypervisor, the CPU usage of various VMs can be determined by tracking scheduling events in the hypervisor. Thus, black-box monitoring can be implemented in the nucleus by tracking various domain-0 events and without modifying any virtual server. Next, we discuss CPU, network and memory
monitoring using this approach.
CPU Monitoring: By instrumenting the
Xen hypervisor, it is possible to provide domain-0 with access to CPU scheduling events which indicate when a VM is scheduled and when it relinquishes the CPU. These events are tracked to
determine the duration for which each virtual machine is scheduled within each measurement interval. The Xen 3.0 distribution includes a monitoring application called XenMon that tracks the CPU usages of the resident virtual machines using this approach; for simplicity, the monitoring engine employs a modified version of XenMon to gather CPU usages of resident VMs over a configurable measurement interval.
It is important to realize that these statistics do not capture the CPU overhead incurred for processing disk and network I/O requests; since Xen uses domain-0 to process disk and network I/O requests on behalf of other virtual machines, this processing overhead gets charged to the CPU utilization of domain-0. To properly account for this request processing overhead, analogous to proper accounting of interrupt processing overhead in OS kernels, we must apportion the CPU utilization of domain-0 to other virtual machines. We assume that the monitoring engine and the nucleus impose negligible overhead and that the CPU usage of domain-0 is primarily due to requests processed on behalf of other VMs. Since domain-0 can also track I/O request events based on the number of memory page exchanges between domains, we determine the number of disk and network I/O requests that are processed for each VM. Each VM is then charged a fraction of domain-0's usage based on the proportion of the total I/O requests made by that VM. A more precise approach requiring a modified scheduler has been proposed in prior work.

Network Monitoring: Domain-0 in Xen
implements the network interface driver and all other domains access the driver via clean device abstractions. Xen uses a virtual firewall-router (VFR) interface; each domain attaches one or more virtual interfaces to the VFR. Doing so enables
Xen to multiplex all its virtual interfaces onto the underlying physical network interface. Consequently, the monitoring engine can
conveniently monitor each VM's network usage in domain-0. Since each virtual interface looks like a modern NIC and Xen uses Linux drivers, the monitoring engine can use the Linux /proc interface (in /proc/net/dev) to monitor the number of bytes sent and received on each interface. These statistics are gathered over the measurement interval and returned to the control plane.

Memory Monitoring: Black-box
monitoring of memory is challenging since Xen allocates a user-specified amount of memory to each VM and requires the OS within the VM to manage that memory; as a result, the memory utilization is only known to the OS within each VM. It is possible to instrument Xen to observe memory accesses within each VM through the use of shadow page tables, which Xen's migration mechanism uses to determine which pages are dirtied during migration. However, trapping each memory access results in a significant application slowdown and is only enabled during migrations. Thus, memory usage statistics are not directly available and must be inferred.
The only behavior that is visible externally is swap activity. Since swap partitions reside on a network disk, I/O requests to swap partitions need to be processed by domain-0 and can be tracked. By tracking the reads and writes to each swap partition from domain-0, it is possible to detect memory pressure within each VM. The recently proposed Geiger system has shown that such passive observation of swap activity can be used to infer useful information about the virtual memory subsystem, such as working set sizes. Our monitoring engine tracks the number of read and write requests to swap partitions within each measurement interval and reports it to the control plane. Since substantial swapping activity is indicative of memory pressure, our current black-box approach is limited to reactive decision making and cannot be proactive.

3.2 Gray-box Monitoring
Black-box monitoring is useful in scenarios where it is not feasible to ``peek inside'' a VM to gather usage statistics. Hosting environments, for instance, run third-party applications, and in some cases, third-party installed OS distributions. Amazon's Elastic Compute Cloud (EC2) service, for instance, provides a ``barebone'' virtual server where customers can load their own OS images. While OS instrumentation is not feasible in such environments, there are environments such as corporate data centers where both the hardware infrastructure and the applications are owned by the same entity. In such scenarios, it is feasible to gather OS-level statistics as well as application logs, which can potentially enhance the quality of decision making in Sandpiper.
Sandpiper supports gray-box monitoring, when feasible, using a light-weight monitoring daemon that is installed inside each virtual server. In Linux, the monitoring daemon uses the /proc interface to gather OS-level statistics of CPU, network, and memory usage. The memory usage monitoring, in particular, enables proactive detection and mitigation of memory hotspots. The monitoring daemon can also process logs of applications such as web and database servers to derive statistics such as request rate, request drops and service times. Direct monitoring of such application-level statistics enables explicit detection of SLA violations, in contrast to the black-box approach that uses resource utilizations as a proxy metric for SLA monitoring.
3.3 Profile Generation in Sandpiper
The profiling engine receives periodic reports of resource usage from each nucleus. It maintains a usage history for each server, which is then used to compute a profile for each virtual and physical server. A profile is a compact description of that server's resource usage over a sliding time window W. Three black-box profiles are maintained per virtual server: CPU utilization, network bandwidth utilization, and swap rate (i.e., page fault rate). If gray-box monitoring is permitted, four additional profiles are maintained: memory utilization, service time, request drop rate and incoming request rate. Similar profiles are also maintained for each physical server, which indicate the aggregate usage of resident VMs. Each profile contains a distribution and a time series. The distribution, also referred to as the distribution profile, represents the probability distribution of the resource usage over the window W. To compute a CPU distribution profile, for instance, a histogram of observed usages over all intervals contained within the window is computed; normalizing this histogram yields the desired probability distribution.
While a distribution profile captures the variations in the resource usage, it does not capture temporal correlations. For instance, a distribution does not indicate whether the resource utilization increased or decreased within the window. A time-series profile captures these temporal fluctuations and is simply a list of all reported observations within the window. For instance, the CPU time-series profile is a list of the reported utilizations within the window. Whereas time-series profiles are used by the hotspot detector to spot increasing utilization trends, distribution profiles are used by the migration manager to estimate peak resource requirements and provision accordingly.
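As an illustration, both profile types can be maintained per resource with a single sliding window; the sketch below is our own minimal rendering of this idea (the class name, window size, and bin count are illustrative, not Sandpiper's actual parameters):

```python
from collections import deque

class UsageProfile:
    """Sliding-window usage profile for one resource: a time-series
    profile (the raw observations) plus a distribution profile
    (a normalized histogram of the same observations)."""

    def __init__(self, window_size=200, num_bins=10):
        self.window = deque(maxlen=window_size)  # time-series profile
        self.num_bins = num_bins

    def report(self, usage):
        """Record one per-interval observation, e.g. CPU utilization in [0, 1]."""
        self.window.append(usage)

    def time_series(self):
        """Observations within the window, oldest first."""
        return list(self.window)

    def distribution(self):
        """Normalized histogram of observed usages over the window."""
        if not self.window:
            return [0.0] * self.num_bins
        counts = [0] * self.num_bins
        for u in self.window:
            bin_idx = min(int(u * self.num_bins), self.num_bins - 1)
            counts[bin_idx] += 1
        total = len(self.window)
        return [c / total for c in counts]
```

An old observation falling out of the window automatically updates both views, mirroring the sliding-window semantics described above.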
4 Hotspot Detection
The hotspot detection algorithm is responsible for signaling a need for VM migration whenever SLA violations are detected implicitly by the black-box approach or explicitly by the gray-box approach. Hotspot detection is performed on a per-physical-server basis in the black-box approach--a hotspot is flagged if the aggregate CPU or network utilization on the physical server exceeds a threshold or if the total swap activity exceeds a threshold. In contrast, explicit SLA violations must be detected on a per-virtual-server basis in the gray-box approach--a hotspot is flagged if the memory utilization of the VM exceeds a threshold or if the response time or the request drop rate exceed the SLA-specified values.

To ensure that a small transient spike does not trigger needless migrations, a hotspot is flagged only if thresholds or SLAs are exceeded for a sustained time. Given a time-series profile, a hotspot is flagged if at least k out of the n most recent observations as well as the next predicted value exceed a threshold. With this constraint, we can filter out transient spikes and avoid needless migrations. The values of k and n can be chosen to make hotspot detection aggressive or conservative. For a given n, small values of k cause aggressive hotspot detection, while large values of k imply a need for more sustained threshold violations and thus a more conservative approach. Similarly, larger values of n incorporate a longer history, resulting in a more conservative approach. In the extreme, k = n = 1 is the most aggressive approach that flags a hotspot as soon as the threshold is exceeded. Finally, the threshold itself also determines how aggressively hotspots are flagged; lower thresholds imply more aggressive migrations at the expense of lower server utilizations, while higher thresholds imply higher utilizations with the risk of potentially greater SLA violations. In addition to requiring k out of n violations, we also require that the next predicted value exceed the threshold.
This additional requirement ensures that the hotspot is likely to persist in the future based on current observed trends. Also, predictions capture rising trends, while preventing declining ones from triggering needless migrations. Sandpiper employs time-series prediction techniques to predict future values.
Sandpiper relies on the auto-regressive family of predictors, where the k-th order predictor uses k prior observations in conjunction with other statistics of the time series to make a prediction. To illustrate the first-order AR(1) predictor, consider a sequence of observations: u_1, u_2, ..., u_k. Given this time series, we wish to predict the demand in the (k+1)-th interval. The first-order AR(1) predictor makes use of the previous value u_k, the mean of the time series values, denoted mu, and a parameter phi which captures the variations in the time series. The prediction u_hat_{k+1} is given by:

  u_hat_{k+1} = mu + phi * (u_k - mu)
As new observations arrive from the nuclei, the hot spot detector updates its predictions and performs the above checks to flag new hotspots in the system.
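Putting the pieces together, the k-out-of-n check combined with an AR(1) prediction can be sketched as follows (our own illustration; phi is a fixed placeholder here, whereas in practice it would be estimated from the observed series):

```python
from statistics import mean

def ar1_predict(series, phi=0.5):
    """First-order autoregressive prediction:
    u_hat = mu + phi * (u_k - mu), where mu is the series mean and
    u_k the most recent observation. phi is a placeholder value."""
    mu = mean(series)
    return mu + phi * (series[-1] - mu)

def hotspot_flagged(series, threshold, k, n, phi=0.5):
    """Flag a hotspot only if at least k of the n most recent
    observations AND the next predicted value exceed the threshold,
    filtering out transient spikes."""
    recent = series[-n:]
    violations = sum(1 for u in recent if u > threshold)
    return violations >= k and ar1_predict(series, phi) > threshold
```

A rising series whose last three of five samples exceed the threshold is flagged with k = 3, n = 5, while a single transient spike is not.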
5 Resource Provisioning
A hotspot indicates a resource deficit on the underlying physical server to service the collective workloads of resident VMs. Before the hotspot can be resolved through migrations, Sandpiper must first estimate how many additional resources are needed by the overloaded VMs to fulfill their SLAs; these estimates are then used to locate servers that have sufficient idle resources.
5.1 Black-box Provisioning
The provisioning component needs to estimate the peak CPU, network and memory requirement of each overloaded VM; doing so ensures that the SLAs are not violated even in the presence of peak workloads.

Estimating peak CPU and network bandwidth needs: Distribution profiles are used to estimate the peak CPU and network bandwidth needs of each VM. The tail of the usage distribution represents the peak usage over the recent past and is used as an estimate of future peak needs. This is achieved by computing a high percentile (e.g., the 95th percentile) of the CPU and network bandwidth distribution as an estimate of the peak needs. Since both the CPU scheduler and the network packet scheduler in Xen are work-conserving, a VM can use more than its fair share, provided that other VMs are not using their full allocations. In case of the CPU, for instance, a VM can use a share that exceeds the share determined by its weight, so long as other VMs are using less than their weighted share. In such instances, the tail of the distribution will exceed the guaranteed share and provide insights into the actual peak needs of the application. Hence, a high percentile of the distribution is a good first approximation of the peak needs.
However, if all VMs are using their fair shares, then an overloaded VM will not be allocated a share that exceeds its guaranteed allocation, even though its peak needs are higher than the fair share. In such cases, the observed peak usage (i.e., the tail of the distribution) will equal its fair share, and the distribution profile will under-estimate the actual peak need. To correct for this under-estimate, the provisioning component must scale the observed peak to better estimate the actual peak. Thus, whenever the CPU or the network interface on the physical server is close to saturation, the provisioning component first computes a high-percentile of the observed distribution and then adds a constant amount D to scale up this estimate.
Consider two virtual machines that are assigned CPU weights of 1:1, resulting in a fair share of 50% each. Assume that VM_1 is overloaded and requires 70% of the CPU to meet its peak needs. If VM_2 is underloaded and only using 20% of the CPU, then the work-conserving Xen scheduler will allocate 70% to VM_1. In this case, the tail of the observed distribution is a good indicator of VM_1's peak need. In contrast, if VM_2 is using its entire fair share of 50%, then VM_1 will be allocated exactly its fair share. In this case, the peak observed usage will be 50%, an underestimate of the actual peak need. Since Sandpiper can detect that the CPU is fully utilized, it will estimate the peak to be 50% + D. The above example illustrates a fundamental limitation of the black-box approach--it is not possible to estimate the true peak need when the underlying resource is fully utilized. The scale-up amount D is simply a guess and might end up over- or under-estimating the true peak.
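This black-box estimation logic can be sketched as below (the function name, the 95th-percentile choice, and the scale-up value are illustrative assumptions, not Sandpiper's actual parameters):

```python
def estimate_peak_usage(samples, fair_share, resource_saturated,
                        percentile=0.95, delta=0.10):
    """Black-box peak estimate: take a high percentile of the observed
    usage distribution; if the underlying resource is saturated, the
    observed tail caps out at the VM's fair share, so add a constant
    delta to scale up the estimate (a guess, per the text above)."""
    ordered = sorted(samples)
    idx = min(int(percentile * len(ordered)), len(ordered) - 1)
    peak = ordered[idx]
    if resource_saturated and peak >= fair_share:
        peak += delta  # observed peak hit the cap; true need is higher
    return peak
```

For the example above, a VM pinned at its 50% fair share on a saturated CPU would be provisioned for 50% plus the scale-up amount.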
Estimating peak memory needs: Xen allows a fixed amount of physical memory to be assigned to each resident VM; this allocation represents a hard upper-bound that cannot be exceeded regardless of memory demand and regardless of the memory usage in other VMs. Consequently, our techniques for estimating the peak CPU and network usage do not apply to memory. The provisioning component uses observed swap activity to determine if the current memory allocation of the VM should be increased. If swap activity exceeds the threshold indicating memory pressure, then the current allocation is deemed insufficient and is increased by a constant amount. Observe that techniques such as Geiger that attempt to infer working set sizes by observing swap activity can yield a better estimate of memory needs; however, our current prototype uses the simpler approach of increasing the allocation by a fixed amount whenever memory pressure is observed.
5.2 Gray-box Provisioning
Since the gray-box approach has access to application-level logs, information contained in the logs can be utilized to estimate the peak resource needs of the application. Unlike the black-box approach, the peak needs can be estimated even when the resource is fully utilized.
To estimate peak needs, the peak request arrival rate is first estimated. Since the number of serviced requests as well as the number of dropped requests are typically logged, the incoming request rate is the summation of these two quantities. Given the distribution profile of the arrival rate, the peak rate is simply a high percentile of the distribution. Let lambda_peak denote the estimated peak arrival rate for the application.
An application model is necessary to estimate the peak CPU needs.
Applications such as web and database servers can be modeled as G/G/1 queuing systems. The behavior of such a G/G/1 queuing system can be captured using the following queuing theory result:

  lambda >= [ s + (sigma_a^2 + sigma_b^2) / (2 * (d - s)) ]^(-1)

where d is the mean response time of requests, s is the mean service time, and lambda is the request arrival rate. sigma_a^2 and sigma_b^2 are the variance of inter-arrival time and the variance of service time, respectively. Note that response time includes the full queueing delay, while service time only reflects the time spent actively servicing a request. While the desired response time is specified by the SLA, the service time s of requests as well as the variances of inter-arrival and service times sigma_a^2 and sigma_b^2 can be determined from the server logs. By substituting the SLA-specified response time for d, a lower bound lambda_cap on the request rate that can be serviced by the virtual server is obtained. Thus, lambda_cap represents the current capacity of the VM.
To service the estimated peak workload lambda_peak, the current CPU capacity needs to be scaled by the factor lambda_peak/lambda_cap. Observe that this factor will be greater than 1 if the peak arrival rate exceeds the currently provisioned capacity. Thus, if the VM is currently assigned a CPU weight w, its allocated share needs to be scaled up by the factor lambda_peak/lambda_cap to service the peak workload.
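Assuming the queuing relation above holds, the capacity estimate lambda_cap and the weight scale-up can be sketched as follows (our own rendering; function and variable names are illustrative):

```python
def current_capacity(s, var_arrival, var_service, d_sla):
    """Lower bound on the request rate a VM can service, from the
    G/G/1 result: lambda_cap = 1 / (s + (sig_a^2 + sig_b^2) / (2*(d - s))),
    with s the mean service time and d_sla the SLA response time."""
    assert d_sla > s, "SLA response time must exceed mean service time"
    return 1.0 / (s + (var_arrival + var_service) / (2.0 * (d_sla - s)))

def scaled_cpu_weight(w, lam_peak, s, var_arrival, var_service, d_sla):
    """Scale the VM's CPU weight by lambda_peak / lambda_cap so the
    VM can service the estimated peak arrival rate."""
    lam_cap = current_capacity(s, var_arrival, var_service, d_sla)
    return w * (lam_peak / lam_cap)
```

For instance, a VM with a 10 ms mean service time, small variances, and a 50 ms SLA has a capacity of roughly 80 req/s; a 100 req/s estimated peak then scales its weight by 1.25.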
The peak network bandwidth usage is simply estimated as the product of the estimated peak arrival rate lambda_peak and the mean requested file size; this is the amount of data transferred over the network to service the peak workload. The mean request size can be computed from the server logs.

6 Hotspot Mitigation
Once a hotspot has been detected and new allocations have been determined for overloaded VMs, the migration manager invokes its hotspot mitigation algorithm. This algorithm determines which virtual servers to migrate, and where, in order to dissipate the hotspot. Determining a new mapping of VMs to physical servers that avoids threshold violations is NP-hard--the multi-dimensional bin packing problem can be reduced to this problem, where each physical server is a bin with dimensions corresponding to its resource constraints and each VM is an object that needs to be packed with size equal to its resource requirements. Even the problem of determining if a valid packing exists is NP-hard. Consequently, our hotspot mitigation algorithm resorts to a heuristic to determine which overloaded VMs to migrate and where, such that migration overhead is minimized. Minimizing the migration overhead (i.e., the amount of data transferred) is important, since Xen's live migration mechanism works by iteratively copying the memory image of the VM to the destination while keeping track of which pages are being dirtied and need to be resent. This requires Xen to intercept all memory accesses for the migrating domain, which significantly impacts the performance of the application inside the VM. By reducing the amount of data copied over the network, Sandpiper can minimize the total migration time, and thus, the performance impact on applications. Note that network bandwidth available for application use is also reduced due to the background copying during migrations; however, on a gigabit LAN, this impact is small.
Once the desired resource allocations have been determined by
either our black-box or gray-box approach, the problem of finding servers with sufficient idle resources to house overloaded VMs is identical for both. The migration manager employs a greedy heuristic to determine which VMs need to be migrated. The basic idea is to move load from the most overloaded servers to the least-overloaded servers, while attempting to minimize data copying incurred during migration. Since a VM or a server can be overloaded along one or more of three dimensions--CPU,