PowerNap: Eliminating Server Idle Power
David Meisner email@example.com
Brian T. Gold firstname.lastname@example.org
Thomas F. Wenisch email@example.com
Advanced Computer Architecture Lab The University of Michigan
Computer Architecture Lab Carnegie Mellon University
Data center power consumption is growing to unprecedented levels: the EPA estimates U.S. data centers will consume 100 billion kilowatt hours annually by 2011. Much of this energy is wasted in idle systems: in typical deployments, server utilization is below 30%, but idle servers still consume 60% of their peak power draw. Typical idle periods, though frequent, last seconds or less, confounding simple energy-conservation approaches. In this paper, we propose PowerNap, an energy-conservation approach where the entire system transitions rapidly between a high-performance active state and a near-zero-power idle state in response to instantaneous load. Rather than requiring fine-grained power-performance states and complex load-proportional operation from each system component, PowerNap instead calls for minimizing idle power and transition time, which are simpler optimization goals. Based on the PowerNap concept, we develop requirements and outline mechanisms to eliminate idle power waste in enterprise blade servers. Because PowerNap operates in low-efficiency regions of current blade center power supplies, we introduce the Redundant Array for Inexpensive Load Sharing (RAILS), a power provisioning approach that provides high conversion efficiency across the entire range of PowerNap's power demands. Using utilization traces collected from enterprise-scale commercial deployments, we demonstrate that, together, PowerNap and RAILS reduce average server power consumption by 74%.

Categories and Subject Descriptors C.5.5 [Computer System Implementation]: Servers

General Terms Design, Measurement
Keywords power management, servers

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'09, March 7-11, 2009, Washington, DC, USA. Copyright © 2009 ACM 978-1-60558-215-3/09/03...$5.00

1. Introduction

Data center power consumption is undergoing alarming growth. By 2011, U.S. data centers will consume 100 billion kWh at a cost of $7.4 billion per year. Unfortunately, much of this energy is wasted by systems that are idle. At idle, current servers still draw about 60% of peak power [1, 6, 13]. In typical data centers, average utilization is only 20-30% [1, 3]. Low utilization is endemic to data center operation: strict service-level agreements force operators to provision for redundant operation under peak load. Idle-energy waste is compounded by losses in the power delivery and cooling infrastructure, which increase power consumption requirements by 50-100%.

Ideally, we would like to simply turn idle systems off. Unfortunately, a large fraction of servers exhibit frequent but brief bursts of activity [2, 3]. Moreover, user demand often varies rapidly and/or unpredictably, making dynamic consolidation and system shutdown difficult. Our analysis shows that server workloads, especially interactive services, exhibit frequent idle periods of less than one second, which cannot be exploited by existing mechanisms.

Concern over idle-energy waste has prompted calls for a fundamental redesign of each computer system component to consume energy in proportion to utilization. Processor dynamic frequency and voltage scaling (DVFS) exemplifies the energy-proportional concept, providing up to cubic energy savings under reduced load. Unfortunately, processors account for an ever-shrinking fraction of total server power, only 25% in current systems [6, 12, 13], and controlling DVFS remains an active research topic [17, 30]. Other subsystems incur many fixed power overheads when active and do not yet offer energy-proportional operation.

We propose an alternative energy-conservation approach, called PowerNap, that is attuned to server utilization patterns. With PowerNap, we design the entire system to transition rapidly between a high-performance active state and a minimal-power nap state in response to instantaneous load. Rather than requiring components that provide fine-grain power-performance trade-offs, PowerNap simplifies the system designer's task to focus on two optimization goals: (1) optimizing energy efficiency while napping, and (2) minimizing transition time into and out of the low-power nap state.

Based on the PowerNap concept, we develop requirements and outline mechanisms to eliminate idle power waste in a high-density blade server system. Whereas many mechanisms required by PowerNap can be adapted from mobile and handheld devices, one critical subsystem of current blade chassis falls short of meeting PowerNap's energy-efficiency requirements: the power conversion system. PowerNap reduces total ensemble power consumption when all blades are napping to only 6% of the peak when all are active. Power supplies are notoriously inefficient at low loads, typically providing conversion efficiency below 70% under 20% load. These losses undermine PowerNap's energy efficiency.

Directly improving power supply efficiency implies a substantial cost premium. Instead, we introduce the Redundant Array for Inexpensive Load Sharing (RAILS), a power provisioning approach where power draw is shared over an array of low-capacity power supply units (PSUs) built with commodity components. The key innovation of RAILS is to size individual power modules such that the power delivery solution operates at high efficiency across the entire range of PowerNap's power demands. In addition, RAILS provides N+1 redundancy, graceful compute capacity degradation in the face of multiple power module failures, and reduced component costs relative to conventional enterprise-class power systems.

Through modeling and analysis of actual data center workload traces, we demonstrate:

Analysis of idle/busy intervals in actual data centers. We analyze utilization traces from production servers and data centers to determine the distribution of idle and active periods. Though interactive servers are typically over 60% idle, most idle intervals are under one second.

Energy-efficiency and response time bounds. Through queuing analysis, we establish bounds on PowerNap's energy efficiency and response time impact. Using our models, we determine that PowerNap is effective if state transition time is below 10ms, and incurs no overheads below 1ms. Furthermore, we show that PowerNap provides greater energy efficiency and lower response time than solutions based on DVFS.

Efficient PowerNap power provisioning with RAILS. Our analysis of commercial data center workload traces demonstrates that RAILS improves average power conversion efficiency from 68% to 86% in PowerNap-enabled servers.

2. Understanding Server Utilization

It has been well-established in the research literature that the average server utilization of data centers is low, often below 30% [2, 3, 6]. In facilities that provide interactive services (e.g., transaction processing, file servers, Web 2.0), average utilization is often even worse, sometimes as low as 10%. Figure 1 depicts a histogram of utilization for two production workloads from enterprise-scale commercial deployments. Table 1 describes the workloads running on these servers. We derive this data from utilization traces collected over many days, aggregated over more than 120 servers (production utilization traces were provided courtesy of HP Labs). The most striking feature of this data is that the servers spend the vast majority of time under 10% utilization.

Figure 1: Server Utilization Histogram. Real data centers are under 20% utilized. (Axes: Time (%) vs. Utilization (%); series: IT, Web 2.0.)

Figure 2: Server Power Breakdown. No single component dominates total system power. (Axis: % Server Power; components: CPU, Fans, I/O & Disk, Memory, Other.)

Table 1: Enterprise Data Center Utilization Traces.
Workload   Avg. Utilization   Description
Web 2.0    7.4%               "Web 2.0" application servers
IT         14.2%              Enterprise IT Infrastructure apps

Data center utilization is unlikely to increase for two reasons. First, data center operators must provision for peak rather than average load. For interactive services, peak utilization often exceeds average utilization by more than a factor of three. Second, to provide redundancy in the event of failures, operators usually deploy more systems than are actually needed. Though server consolidation can improve average utilization, performance isolation, redundancy, and service robustness concerns often preclude consolidation of mission-critical services.

Low utilization creates an energy efficiency challenge because conventional servers are notoriously inefficient at low loads. Although power-saving features like clock gating and
dynamic voltage and frequency scaling (DVFS) nearly eliminate processor power consumption in idle systems, present-day servers still dissipate about 60% as much power when idle as when fully loaded [4, 6, 13]. Processors often account for only a quarter of system power; main memory and cooling fans contribute larger fractions. Figure 2 reproduces typical server power breakdowns for the IBM p670, Sun UltraSparc T2000, and a generic server specified by Google, respectively.

2.1 Frequent Brief Utilization

Clearly, eliminating server idle power waste is critical to improving data center energy efficiency. Engineers have been successful in reducing idle power in mobile platforms, such as cell phones and laptops. However, servers pose a fundamentally different challenge than these platforms. The key observation underlying our work is that, although servers have low utilization, their activity occurs in frequent, brief bursts. As a result, they appear to be under a constant, light load.

To investigate the time scale of servers' idle and busy periods, we have instrumented a series of interactive and batch processing servers to collect utilization traces at 10ms granularity. To our knowledge, our study is the first to report server utilization data measured at such fine granularity. We classify an interval as busy or idle based on how the OS scheduler accounted the period in its utilization tracking. The traces were collected over a period of a week from seven departmental IT servers and a scientific computing cluster comprising over 600 servers. We present the mean idle and busy period lengths, average utilization, and a brief description of each trace in Table 2. Figure 3 shows the cumulative distribution for the busy and idle period lengths in each trace. The key result of our traces is that the vast majority of idle periods are shorter than 1s, with mean lengths in the 100's of milliseconds. Busy periods are even shorter, typically only 10's of milliseconds.

Table 2: Fine-Grain Utilization Traces.
Workload   Utilization   Avg. Busy Interval   Avg. Idle Interval   Description
Web        26.5%         38 ms                106 ms               Department web server
Mail       55.0%         115 ms               94 ms                Department POP and SMTP servers
DNS        17.4%         194 ms               923 ms               Department DNS and DHCP server
Shell      32.0%         51 ms                108 ms               Interactive shell and IMAP support
Backup     22.2%         31 ms                108 ms               Continuous incremental backup server
Cluster    64.3%         3.25 s               1.8 s                600-node scientific computing cluster

Figure 3: Busy and Idle Period Cumulative Distributions. (Panels: Percent vs. Busy Period (ms) and Percent vs. Idle Period (ms); series: Web, Mail, DNS, Shell, Backup, Cluster.)

2.2 Existing Energy-Conservation Techniques

The rapid transitions and brief intervals of server activity make it difficult to conserve idle power with existing approaches. The recent trend towards server consolidation is partly motivated by the high energy cost of idle systems. By moving services to virtual machines, several services can be time-multiplexed on a single physical server, increasing average utilization. Consolidation allows the total number of physical servers to be reduced, thereby reducing idle inefficiency. However, server consolidation, by itself, does not close the gap between peak and average utilization. Data centers still require sufficient capacity for peak demand, which inevitably leaves some servers idle in the average case. Furthermore, consolidation does not save energy automatically; system administrators must actively consolidate services and remove unneeded systems.

Although support for sleep states is widespread in handheld, laptop and desktop machines, these states are rarely used in current server systems. Unfortunately, the high restart latency typical of current sleep states renders them unacceptable for interactive services; current laptops and desktops require several seconds to suspend using operating system interfaces (e.g., ACPI). Moreover, unlike consumer devices, servers cannot rely on the user to transition between power states; they must have an autonomous mechanism that manages state transitions.

Recent server processors include CPU throttling solutions (e.g., Intel Speedstep, AMD Cool'n'Quiet) to reduce the large overhead of light loads. These processors use DVFS to reduce their operating frequency linearly while gaining cubic power savings. DVFS relies on operating system support to tune processor frequency to instantaneous load. In Linux, the kernel continues lowering frequency until it observes ~20% idle time. Improving DVFS control algorithms remains an active research area [17, 30]. Nonetheless, DVFS can be highly effective in reducing CPU power. However, as Figure 2 shows, CPUs account for a small portion of total system power.

Energy proportional computing seeks to extend the success of DVFS to the entire system. In this scheme, each system component is redesigned to consume energy in proportion to utilization. In an energy-proportional system, explicit power management is unnecessary, as power consumption varies naturally with utilization. However, as many components incur fixed power overheads when active (e.g., clock power on synchronous memory busses, leakage power in CPUs, etc.), designing energy-proportional subsystems remains a research challenge.

Energy-proportional operation can be approximated with non-energy-proportional systems through dynamic virtual machine consolidation over a large server ensemble. However, such approaches do not address the performance isolation concerns of dynamic consolidation and operate at coarse time scales (minutes). Hence, they cannot exploit the brief idle periods found in servers.

Figure 4: PowerNap. (Timeline of one nap cycle: the server operates at full performance to finish existing work; system components nap while the server is idle; the NIC detects the arrival of work; the server returns to full performance to finish work as quickly as possible.)
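To make the time-scale argument concrete, the following minimal sketch estimates how much idle time a power-saving mechanism could reclaim if it can only act on idle intervals of at least a given length. The function name, the thresholds, and the sample interval lengths are illustrative assumptions, not values taken from the traces, and transition costs are ignored, so the result is only an upper bound.

```python
def exploitable_idle_fraction(idle_intervals_ms, min_interval_ms):
    """Fraction of total idle time found in intervals of at least
    min_interval_ms (an upper bound on what a mechanism reacting at
    that time scale could use; transition cost is ignored)."""
    total = sum(idle_intervals_ms)
    if total == 0:
        return 0.0
    usable = sum(t for t in idle_intervals_ms if t >= min_interval_ms)
    return usable / total


# Hypothetical idle-interval lengths in milliseconds (illustrative only).
sample_idle_ms = [20, 80, 150, 400, 90, 1200, 60]

# A mechanism that operates at a one-minute time scale reclaims nothing here.
coarse = exploitable_idle_fraction(sample_idle_ms, 60_000)

# A mechanism that can react within 10 ms can use every interval.
fine = exploitable_idle_fraction(sample_idle_ms, 10)
```

With sub-second idle intervals like these, any approach that needs minutes to react sees no usable idle time at all, which is the gap a fast-transition mechanism is meant to close.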
3. PowerNap

Although servers spend most of their time idle, conventional energy-conservation techniques are unable to exploit these brief idle periods. Hence, we propose an approach to power management that enables the entire system to transition rapidly into and out of a low-power state where all activity is suspended until new work arrives. We call our approach PowerNap.

Figure 4 illustrates the PowerNap concept. Each time the server exhausts all pending work, it transitions to the nap state. In this state, nearly all system components enter sleep modes, which are already available in many components (see Section 4). While in the nap state, power consumption is low, but no processing can occur. System components that signal the arrival of new work, expiration of a software timer, or environmental changes remain partially powered. When new work arrives, the system wakes and transitions back to the active state. When the work is complete, the system returns to the nap state.

PowerNap is simpler than many other energy conservation schemes because it requires system components to support only two operating modes: an active mode that provides maximum performance and a nap mode that minimizes power draw. For many devices, providing a low-power nap mode is far easier than providing multiple active modes that trade performance for power savings. Any level of activity often implies fixed power overheads (e.g., bus clock switching, power distribution losses, leakage power, mechanical components, etc.). We outline mechanisms required to implement PowerNap in Section 4.

3.1 PowerNap Performance and Power Model

To assess PowerNap's potential, we develop a queuing model that relates its key performance measures (energy savings and response time penalty) to workload parameters and PowerNap implementation characteristics. We contrast PowerNap with a model of the upper-bound energy savings possible with DVFS. The goal of our model is threefold: (1) to gain insight into PowerNap behavior, (2) to derive requirements for PowerNap implementations, and (3) to contrast PowerNap and DVFS.
Figure 5: PowerNap and DVFS Analytic Models. (Panel (a): PowerNap; y-axis: Work in Queue; annotations: Arrival, Suspend.)
We model both PowerNap and DVFS under the assumption that each seeks to minimize the energy required to serve the offered load. Hence, both schemes provide identical throughput (matching the offered load) but differ in response time and energy consumption.

PowerNap Model. We model PowerNap as an M/G/1 queuing system with arrival rate $\lambda$, and a generalized service time distribution with known first and second moments $E[S]$ and $E[S^2]$. Figure 5(a) shows the work in the queue for three job arrivals. Note that, in this context, work also includes time spent in the wake and suspend states. Average server utilization is given by $\rho = \lambda E[S]$.

To model the effects of PowerNap suspend and wake transitions, we extend the conventional M/G/1 model with an exceptional first service time. We assume PowerNap transitions are symmetric with latency $T_t$. Service of the first job in each busy period is delayed by an initial setup time $I$. The setup time includes the wake transition and may include the remaining portion of a suspend transition as shown for the rightmost arrival in Figure 5(a). Hence, for an arrival $x$ time units from the start of the preceding idle period, the initial setup time is given by:

$$I = \begin{cases} 2T_t - x & \text{if } 0 \le x < T_t \\ T_t & \text{if } x \ge T_t \end{cases}$$
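As an informal illustration of this piecewise definition, the sketch below evaluates the setup time for a single arrival and estimates its mean by sampling. The function names are hypothetical, and the sampling assumes that the arrival offset $x$ is exponentially distributed with rate $\lambda$, as implied by the Poisson arrivals of an M/G/1 queue.

```python
import random


def setup_time(x, t_t):
    """Initial setup time I for an arrival x time units into the idle
    period, with symmetric suspend/wake latency t_t (same units as x)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x < t_t:
        # Arrival during the suspend transition: the remaining suspend
        # time (t_t - x) must elapse before the wake transition (t_t).
        return 2 * t_t - x
    # Arrival while fully napping: only the wake transition is paid.
    return t_t


def mean_setup_time(arrival_rate, t_t, samples=100_000, seed=0):
    """Monte Carlo estimate of E[I], assuming x ~ Exponential(arrival_rate)."""
    rng = random.Random(seed)
    total = sum(
        setup_time(rng.expovariate(arrival_rate), t_t) for _ in range(samples)
    )
    return total / samples
```

For example, with `t_t = 10` an arrival at `x = 4` yields a setup time of 16, while any arrival at `x >= 10` yields exactly 10, so the estimated mean always lies between `t_t` and `2 * t_t`.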
The first and second moments $E[I]$ and $E[I^2]$ are:

$$E[I] = \int_0^\infty I \lambda e^{-\lambda x}\,dx = 2T_t + \frac{1}{\lambda}e^{-\lambda T_t} - \frac{1}{\lambda}$$

$$E[I^2] = \int_0^\infty I^2 \lambda e^{-\lambda x}\,dx = 4T_t^2 - 2T_t^2 e^{-\lambda T_t} - \left(\frac{4T_t}{\lambda} + \frac{2}{\lambda^2}\right)\left[1 - (1 + \lambda T_t)e^{-\lambda T_t}\right]$$

We compute average power as $P_{avg} = P_{nap} \cdot F_{nap} + P_{max}(1 - F_{nap})$, where the fraction of time spent napping $F_{nap}$ is given by the ratio of the expected length of each nap period $E[N]$ to the expected busy-idle cycle length $E[C]$:

$$F_{nap} = \frac{E[N]}{E[C]} = \frac{\int_0^{T_t} (0)\lambda e^{-\lambda t}\,dt + \int_{T_t}^{\infty} (t - T_t)\lambda e^{-\lambda t}\,dt}{\frac{E[S] + E[I]}{1 - \lambda E[S]} + \frac{1}{\lambda}} = \frac{e^{-\lambda T_t}(1 - \lambda E[S])}{1 + \lambda E[I]}$$

The response time for an M/G/1 server with exceptional first service is due to Welch:

$$E[R] = \frac{\lambda E[S^2]}{2(1 - \lambda E[S])} + \frac{2E[I] + \lambda E[I^2]}{2(1 + \lambda E[I])} + E[S]$$

Note that the first term of $E[R]$ is the Pollaczek-Khinchin formula for the expected queuing delay in a standard M/G/1 queue, the second term is additional residual delay caused by the initial setup time $I$, and the final term is the expected service time $E[S]$. The second term vanishes when $T_t = 0$.

DVFS model. Rather than model a real DVFS frequency control algorithm, we instead model the upper bound of energy savings possible with DVFS. For each job arrival, we scale instantaneous frequency $f$ to stretch the job to fill any idle time until the next job arrival, as illustrated in Figure 5(b), which gives $E[f] = f_{max}\rho$. This scheme maximizes power savings, but cannot be implemented in practice because it requires knowledge of future arrival times. We base power savings estimates on the theoretical formulation of processor dynamic power consumption $P_{CPU} = \frac{1}{2}CV^2Af$. We assume $C$ and $A$ are fixed, and choose the optimal $f$ for each job within the range $f_{min} < f < f_{max}$. We impose a lower bound $f_{min} = f_{max}/2.4$ to prevent response time from growing asymptotically when utilization is low. We chose a factor of 2.4 between $f_{min}$ and $f_{max}$ based on the frequency range provided by a 2.4 GHz AMD Athlon. We assume voltage scales linearly with frequency (i.e., $V = V_{max}(f/f_{max})$), which is optimistic with respect to current DVFS implementations. Finally, as DVFS only reduces the CPU's contribution to system power, we include a parameter $F_{CPU}$ to control the fraction of total system power affected by DVFS. Under these assumptions, average power $P_{avg}$ is given by:

$$P_{avg} = P_{max}\left(1 - F_{CPU}\left(1 - \left(\frac{E[f]}{f_{max}}\right)^3\right)\right)$$
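The average-power expressions of this section can be evaluated numerically with a short sketch such as the one below. It is a rough illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, $E[I]$ is taken as $2T_t - (1 - e^{-\lambda T_t})/\lambda$, the default $P_{nap}$ of 5% of $P_{max}$ follows the value assumed in the analysis, the DVFS expression is read so that power equals $P_{max}$ when $E[f] = f_{max}$, and clamping the mean frequency ratio at $f_{min}/f_{max} = 1/2.4$ is a simplification of per-job frequency selection.

```python
import math


def powernap_fraction_napping(lam, e_s, t_t):
    """F_nap = exp(-lam*T_t) * (1 - lam*E[S]) / (1 + lam*E[I])."""
    # Closed-form E[I]; expm1 keeps precision when lam * t_t is tiny.
    e_i = 2 * t_t + math.expm1(-lam * t_t) / lam
    return math.exp(-lam * t_t) * (1 - lam * e_s) / (1 + lam * e_i)


def powernap_avg_power(utilization, e_s, t_t, p_nap=0.05, p_max=1.0):
    """Average power P_nap*F_nap + P_max*(1 - F_nap) at a given utilization.

    e_s and t_t must use the same time unit; power is relative to p_max.
    """
    if not 0 < utilization < 1:
        raise ValueError("utilization must be in (0, 1)")
    lam = utilization / e_s  # from rho = lam * E[S]
    f_nap = powernap_fraction_napping(lam, e_s, t_t)
    return p_nap * f_nap + p_max * (1 - f_nap)


def dvfs_avg_power(utilization, f_cpu, p_max=1.0, min_ratio=1 / 2.4):
    """Upper-bound DVFS power with E[f]/f_max = rho, clamped at f_min/f_max.

    f_cpu is the fraction of system power that DVFS can affect.
    """
    ratio = max(utilization, min_ratio)
    return p_max * (1 - f_cpu * (1 - ratio ** 3))


# Example: 20% utilization, 38 ms mean service time, times in milliseconds.
fast_nap = powernap_avg_power(0.2, 38.0, 1.0)    # T_t = 1 ms
slow_nap = powernap_avg_power(0.2, 38.0, 100.0)  # T_t = 100 ms
dvfs_40 = dvfs_avg_power(0.2, 0.4)               # F_CPU = 40%
```

In this sketch a short transition time keeps PowerNap's power close to the ideal $P_{nap}(1 - \rho) + P_{max}\rho$, while a long transition time erodes most of the savings.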
Response time is given by:

$$E[R] = E\left[\frac{R_{base}}{f/f_{max}}\right]$$

where $R_{base}$ is the response time without DVFS.

Figure 6: PowerNap and DVFS Power and Response Time Scaling. (Panels: (a) Power Scaling, Avg. Power (% max power) vs. % utilization, for DVFS with F_CPU = 100%, 40%, and 20% and for PowerNap with T_t = 100 ms, 10 ms, and 1 ms; (b) Response Time Scaling, relative response time vs. % utilization, for DVFS and for PowerNap with T_t = 100 ms, 10 ms, and 1 ms.)

Table 3: Per-Workload Energy Savings.
Workload   PowerNap Energy Savings   DVFS Energy Savings
Web        59%                       23%
Mail       35%                       21%
DNS        77%                       23%
Shell      55%                       23%
Backup     61%                       23%
Cluster    34%                       18%

3.2 Analysis

Power Savings. Figure 6(a) shows the average power (as a fraction of peak) required under PowerNap and DVFS as a function of utilization. For DVFS, we show power savings for three values of $F_{CPU}$. $F_{CPU} = 100\%$ represents the upper bound if DVFS were applicable to all system power. $20\% < F_{CPU} < 40\%$ bound the typical range in current servers. For PowerNap, we construct the graphs with $E[s] = 38$ ms and $E[s^2] = 3.7E[s]$, which are both estimated from the observed busy period distribution in our Web trace. We assume $P_{nap}$ is 5% of $P_{max}$. We vary $\lambda$ to adjust utilization,