Analysis of Self-Similarity in I/O Workload Using Structural Modeling
Mar?äa E. G?ä mez and Vicente Santonja ? o Departamento de Inform?ä tica de Sistemas y Computadores. a Universidad Polit?ä cnica de Valencia. e megomez, visan @disca.upv.es
This paper demonstrates that disk-level I/O requests are self-similar in nature. We show evidence, visual and mathematical, that I/O accesses are consistent with self-similarity. For this analysis we have used two sets of disk activity traces collected from various systems over different periods of time. In addition to studying the aggregated I/O workload that is directed to the storage system, we perform a structural modeling of the workload, in order to understand the underlying causes that produce the observed self-similarity. This structural modeling shows that self-similar behavior can be explained by combining two different approaches: the ON/OFF source model and Cox??s model. The former applies to those processes that remain active during the whole trace, the latter to the sources that show a very short activity time.
Many researchers (see  for example) have pointed out the pressing need for representative I/O workloads which can be used in performance models to compare different designs, or to predict the ef?cacy of new proposals. There are two different approaches to solve this problem: using traces extracted from real systems; or generating a synthetic workload. The ?rst approach presents problems: such as security of the data contained in the trace; the overhead introduced in the system while the traces are being captured; the representativity of the trace; the large amount of space needed to store the traces; and the lack of ?exibility to vary parameters in order to conduct a sensitivity study. The second approach is to use a synthetic workload generator, i.e. an application which generates a sequence of disk requests. Sometimes, these generators include many simplifying assumptions (e.g., Poisson arrivals). As a consequence, the analysis performed with these workloads are unrealistic, and could lead to wrong predictions. The results of feeding the system with a trace are some-
times suf?ciently valuable that the researchers would probably pay the price of keeping the full traces, but in many cases they would like to have an analysis of a trace with suf?cient detail to allow the original trace ?ªor something very much like it?ª to be reconstructed on demand.
Maybe the most dif?cult part is to identify a set of characteristic parameters that accurately represent certain loads. A disk request is de?ned by a set of values. Some of them identify the access characteristics: device, type (read or write), start location, and size. The evolution of these parameters for a sequence of accesses is known as the access pattern. Another essential parameter is the time these accesses are presented to the storage subsystem. The sequence of these time values is the arrival pattern . This paper deals with the analysis and modeling of the arrival pattern. The model obtained could be combined with a model of the access pattern to build a synthetic storage workload generator. This generator will be able to generate large amounts of trace data on the ?y, rather than using a stored trace. For our analysis, we have used two sets of traces of the disk activity collected from different systems over two periods of time: two months in 1992 and one week in November 1996. Traces in the ?rst set are described in detail in . In their paper, Ruemmler and Wilkes present an exhaustive analysis of the traces and come to the conclusion that disk accesses are very bursty. Traces in the second set have been extracted from the same environment, but four years later. Ganger  makes a start towards the characterization of storage workloads and workload synthesis. He shows that commonly used simplifying assumptions about storage loads are incorrect. In particular, by analyzing arrival patterns from access patterns in isolation, he concludes that the disk arrival process does not match that generated by a Poisson process. In fact, disk arrivals are neither independent nor exponentially distributed, as is commonly assumed in many studies. Moreover, Ganger  points to the possibility of self-similarity in I/O traces. In  the authors present evidence con?rming that I/O workload is consistent with self-similarity. They also developed a model for the
Time Unit =50000 msec
self-similar I/O arrival patterns based on ON/OFF sources. But in their paper, the behavior of ON/OFF sources is not analyzed since the traces used in the study do not contain information about the process generating each disk request. The study of the self-similarity phenomenon was initially applied to computer networks, successfully characterizing network traf?c. Further investigations have demonstrated that the aggregated traf?c on Ethernet LANs  and on WANs  is self-similar. Other studies have shown that
variable-bit-rate (VBR) video traf?c shows long-range dependencies, which is also an indicator of self-similarity . Most recently, the existence of self-similarity in highlevel ?le system events has been veri?ed . All these studies offer results indicating that using a Poisson process introduces a serious failure in the modeling when the workload has a self-similar property. Subsequently, some studies [9,
10, 11] offered a physical explanation for the existence of self-similarity in LAN and WAN traf?c. They demonstrate that the self-similarity of Ethernet traf?c could be attributed to the ON/OFF behavior of the traf?c sources within their system. The superposition of many ON/OFF sources with strictly alternating ON- and OFF-periods, with heavytailed lengths, produces a self-similar aggregated workload. The study of the basic component characteristics of the aggregated workload giving rise to the self-similar phenomenon is called structural modeling. Structural models for WAN traf?c [10, 11] also show empirically observed selfsimilarity in aggregated traf?c with in?nite variance phenomena in individual applications, e.g., telnet, ftp, and http. In this paper, we are interested both in the study of selfsimilarity in aggregated I/O workloads directed to a storage system; and in the structural modeling of the workload, in order to understand the underlying causes that produce this phenomenon. The paper is organized as follows. Section 2 is a short introduction to self-similarity and presents different methods for detecting and estimating the level of self-similarity in a set of observations. In section 3 we describe the traces used in our study. In section 4, we analyze the presence of the self-similarity phenomenon in the I/O workload. Section 5 presents a structural modeling of the I/O workload. Finally, section 6 concludes the paper.
0 0 400 100 200 300 Time Unit =5 sec 400 500 600
0 0 35 200 400 600 800 1000 1200 1400 1600 1800 Time Unit =0.5 sec
0 0 200 400 600 800 1000 1200 1400 1600 1800
Figure 1. Pictorial demonstration of selfsimilarity: Disk accesses per time unit on six different time scales.
independent of the time-scale at which it is displayed. This fact provides strong evidence of the self-similar nature of the process. On the other hand, the process models that are commonly used to model disk arrivals (e.g., based on Poisson inter-arrival times) look uniform at large time scales.
2.1. The mathematics of self-similarity
We have just presented an intuitive graphical description of self-similarity. Now, we brie?y present a formal mathematical de?nition. An extended treatment can be found elsewhere [6, 7, 11]. Consider a covariance stationary stochastic process ?ä ? ? ?Ì with mean , variance ? , and autocorrelation ?. In particular, has an autocorrelafunction ??ä ?Ì ?? ??ä??Ì for ? ? ?, tion function of the form ??ä ?Ì ? and ??ä??Ì is a slowly varying function at in?nity, as i.e., D ?? ? ?ä??ä????Ì ??ä??Ì?Ì ?, ?? ?. ?ä??Ì A new covariance stationary time series ?ä??Ì ?ä ??Ì for ? ? (with autocorrelation function ??ä??Ì ) is obtained by averaging non-overlapping blocks
2. Self-similar stochastic processes
Intuitively, a self-similar process is a process that looks similar across all time-scales. Figure 1 provides pictorial evidence of the self-similar nature of I/O traces. Each plot within the ?gure represents the number of disk accesses per time unit. Plot (a) uses a time unit of 50 seconds; plot (b) uses 5 seconds; and plot (c) 0.5 seconds. Observe that it is dif?cult to distinguish between the plots, the disk access arrival process appears to exhibit a bursty behavior that is
of size ? from the original series
. In other words,
The process is said to be exactly second-order self? ? ? ?, if for any similar with Hurst parameter ?? ? ?, var?ä ?ä??Ì?Ì ? ??? and ??ä??Ì
?ä ?Ì ??ä ?Ì ? The process is said to be asymptotically second-order self-similar with Hurst parameter ?? ? ? ? ? if for large ??ä ?Ì as ? ? enough , ??ä??Ì ?ä ?Ì Mathematically, self-similarity manifests itself in several equivalent ways: (i) The variance of the sample mean decreases more slowly than the reciprocal of the sample size (slowly decaying variances), i.e., var?ä
different manifestations of the same property, that is, that the process is asymptotically or exactly second-order self-similar. So the problem of estimating the degree of self-similarity can be approached from three different points of view: (i) analysis of the R/S statistic for different block sizes, (ii) analysis of the variances of the aggregated processes ?ä??Ì , and (iii) periodogram-based analysis in the frequency domain. Below, we brie?y review two of the techniques that will be used in our analysis of I/O traces. R/S analysis. R/S analysis provides a method of inferring the Hurst parameter . We have de?ned the R/S statistic and seen that the relationship between the R/S statistic and ?? is given by (2). Taking the logarithm of both sides of the relation we get
, and ? denotes a ?nite positive constant. with ? ? In the rest of the paper is a ?nite positive constant. (ii) The autocorrelations decay hyperbolically rather than exponentially, implying a non-summable autocorrelation func?? ??ä ?Ì ? (long-range dependence). (iii) tion , i.e., The spectral density ?ä??Ì obeys a power law near the ori?- , as ?, with ? - ? and gin, i.e., ?ä ?Ì ? - ? ? ?. As described in , self-similar processes provide an explanation for an empirical law known as the Hurst effect. The re-scaled adjusted range (R/S) statistic for a set of ob? ? ???Ì having mean ?ä???Ì and servations ?ä sample variance ? ? ?ä???Ì is given by the relation
?º?ä???Ì ? ?ä???Ì ?Ì
??D?? ?ä???Ì ?? D??
as ?? ?. Thus we can estimate ?? by plotting D?? ?ä ?º?ä???Ì ? ?ä???Ì ?Ì versus D?? ?ä???Ì for varying values of ??. This plot, known as Pox plot, should result in a roughly linear graph with a slope equal to ?? . So a least-squares linear ?t estimates the Hurst parameter. Given a set of ? ? ? ?Ì, it is divided into observations ?ä ? disjoint subsets of length ?ä? ? ?Ì. The R/S statistic ?º?ä? ???Ì ? ?ä? ???Ì is then computed for the starting points ? of the ? disjoint subsets and for values of ?? satisfying the relation ?ä? ? ??Ì ?? ?? ? . Variance-time
plot. We have seen in section 2.2 that for a self-similar process, the relation between the variance of the aggregated process ?ä??Ì and the block size ? is given by (1). Taking the logarithm of both sizes of the equation results in the relation
?º ?ä ???Ì ? ?ä???Ì
? ???ä? ?? ? ?ä???Ì ? ? ???ä? ?? ??
where ? ?ä ??? ??? ?? ?Ì? ?ä???Ì, ?. While short-range dependent sets of observations seem to ? ? as ?? ?, long-range satisfy ?º?ä???Ì ? ?ä???Ì ??? dependent processes (e.g., self-similar processes) follow
D?? ?ä ? ?Ì ? ? D?? ?ä??Ì
?º?ä???Ì ? ?ä???Ì
. This is known as the Hurst effect.
2.2. Estimating the Hurst parameter
We now present two different methods of detecting and estimating the level of self-similarity in a set of observations. The Hurst parameter, ?? , is a measure of the level of self-similarity of a time-series, and a value in the range ?ä? ??Ì indicates self-similarity. We have seen that slowly decaying variances, long-range dependence, and spectral density obeying a power law are
as ? ?. Thus we can derive an estimate of ? by plotting D?? ?ävar?ä ?ä??Ì ?Ì?Ì versus D?? ?ä??Ì for many values of ?. This plot, known as a variance-time plot, should result in a linear series of points with slope ?? . An estimate of ?? can be obtained by calculating ? and using the relation ?? ? ? ? ?. Slopes between ?? and ? correspond to Hurst parameters between ? and ? , so the process is self-similar. It should be emphasized that both R/S analysis and variance-time plots provide estimates of the level of selfsimilarity in a series. This explains the discrepancy between the estimates of ?? obtained using the two methods.
3. The I/O traces
In our study we have used two sets of I/O traces, both collected in the same systems but over different periods of
time. Traces in the ?rst set contain a detailed characterization
of every low-level disk access generated by two systems over a two-month period. The traced systems were: cello, a timesharing system with a storage subsystem consisting of eight disks, used by a small group of researchers at HP Laboratories; and snake acting as a ?le server for an HP-UX cluster of nine clients at Berkeley, which has a storage subsystem of three disks. cello was traced from 4/18/92 to 6/20/92, and snake from 4/25/92 to 6/27/92. A detailed description of these traces, including a thorough analysis of their main characteristics, can be found in . The second set of traces was collected from one of the systems traced in 1992, cello, but were obtained four years later. The set contains all disk accesses generated by cello from 11/13/96 to 11/18/96. At that time, the storage subsystem of cello consisted of six disks. The determination of the self-similarity characteristics of the I/O workload was performed using both sets of traces. For the structural analysis, only the second set can be used because only these traces contain information relating each access with a process (pid). This pid corresponds to the process that was the ?nal responsible of the I/O access being sent to the disk. Note that the process that causes the physical access to disk is not necessarily the same that requests the data. This is due to the interference of the ?le cache and the operating system.
than ? . Table 1 shows the results of applying the variance method to some traces.
day 5/3/92 5/4/92 5/5/92 5/6/92 5/7/92 5/8/92 5/9/92 5/10/92 disk 5 0.74 0.73 0.83 0.88 0.63 0.74 0.79 0.74 disk 6 0.73 0.70 0.64 0.66 0.65 0.64 0.70 0.69 disk 22 0.67 0.65 0.63 0.78 0.65 0.68 0.78 0.73
Table 1. ?? using the variance method for the eight hours of greatest activity during the days between 5/3/92 to 5/10/92, for the three disks in the snake system
4. Self-similarity in disk arrival patterns
Initially we present some results obtained from the analysis of the ?rst set of traces. A large amount of data from this set has been analyzed, but in this paper we only show the results obtained by the analysis of the traces corresponding to the snake system from 5/3/92 to 5/10/92, and to the cello system from 6/12/96 to 6/19/96. We chose these subsets of data because they captured periods of high activity. In ?gure 1, we have given a pictorial evidence of the existence of self-similarity in disk arrival patterns of the I/O traces. Now, using the R/S and variance statistical methods mentioned above, we demonstrate the self-similar nature of disk arrival patterns in a more rigorous way. First, we investigate the self-similar nature of the accesses presented to a single disk. Then, we analyze the existence of self-similarity in the arrival pattern directed to the whole storage system, i.e., considering the requests directed to all the disks.
Finally, we separately studied the self-similar phenomenon in reads and writes. To analyze the accesses that are presented to a single disk, we applied the two mentioned methods for estimating the Hurst parameter, repeating the study for each disk in the snake system. All results indicate the existence of selfsimilarity for all the days considered and for the three disks of the system. We obtained estimates of ?? always greater
We present the plots obtained when using the two methods for the trace corresponding to disk 5 and for the eight hours of greatest activity on 5/10/92. Figure 2.a depicts the R/S analysis. Fitting a least-square line leads to a estimated value of ?? of ? ?. The variance-time plot, shown in ?gure 2.b, provides an estimate ?? of ? . Both methods give very close values for ?? . A similar study was made for the arrival pattern of the accesses directed to the whole snake storage subsystem on the same days, offering values of ?? between 0.7 and 0.92 for the eight hours of greatest activity in the storage subsystem. This con?rms the existence of self-similarity in the arrival pattern of the complete storage subsystem. Moreover, for these accesses we identi?ed three periods during the day with different levels of I/O activity. Repeating the study for each period, we conclude that the value of ?? is larger when the storage subsystem shows more activity. For example: for the 5/3/92 trace we obtained an estimate ?? of 0.72 for the period with greatest activity; 0.57 for the period with least activity; and 0.63 for the intermediate period. All these values were obtained using the variance method (Table 2.a). In the cello system, the level of activity remains at similar levels throughout the day. So we studied the accesses over the whole day as a simple period. As the level of activity increases in cello, so the Hurst parameter increases. Again this con?rms that the parameter ?? depends on the level of I/O activity. Studying reads and writes in isolation, the analysis con?rmed the self-similar nature of read and write arrival patterns, obtaining greater values of ?? for writes, i.e. writes are more self-similar than reads. For both systems, cello and snake, it seems that the self-similar nature of I/O workload is due mainly to writes. For example, in cello for the 6/14/92 trace, we obtained an estimate ?? of 0.85 for the
R/S Analysis 3
day 5/3/92 Busy Period Normal Period Low Period 5/6/92 Busy Period Normal Period Low Period day 6/12/92 6/13/92 6/14/92 6/15/92 6/16/92
R+W 0.72 0.63 0.57 0.87 0.7 0.45 R+W 0.9 0.87 0.85 0.83 0.84
Reads 0.63 0.47 0.61 0.6 0.61 0.3
Writes 0.73 0.51 0.64
3 log(m) Variance-Time Analysis
Reads 0.75 0.75 0.68 0.78 0.8
0.88 0.7 0.4 ( a) Writes 0.885 0.84 0.84 0.8 0.84 ( b)
Table 2. (a) ?? using variance method for the traces corresponding to snake. (b) ?? using variance method for the traces corresponding to cello.
2.5 log(m) 3 3.5 4 4.5
Figure 2. Graphical methods for testing the self-similarity phenomenon on 5/10/92
load composed of writes and reads - the estimate of ?? obtained was 0.68 for reads and 0.84 for writes, i.e., the degree of self-similarity estimated for writes was very close to the degree of self-similarity estimated for the global workload. Table 2.b shows the values of ?? obtained applying the variance method, using the traces corresponding to the days from 6/12/92 to 6/16/92 in system cello. Below, we present some results obtained using the second set of traces. These results con?rm that four years later and despite changes in the operating and storage system, the I/O workload is self-similar. So, self-similarity seems to be an intrinsic property of the I/O workload for these kind of systems. In this case, we studied the existence of self-similarity in the arrival patterns of the accesses presented to the whole storage
system. We have made an exhaustive analysis of a three-day trace (from 11/13/96 to 11/15/96), hour by hour. For each hour, we also separately analyzed the self-similar nature of reads and writes. Figure 3 shows the results of applying the R/S and variance methods for the trace of hour 0 of 11/13/96. In this example, the value of ?? obtained for the whole workload with the R/S method is 0.91 (?gure 3.a), and 0.90 (?gure 3.b) with the variance method. When the analysis is applied
only to read accesses, we obtained a value of ?? of 0.96 with the R/S method and 0.93 with variance method. And, analyzing the write accesses, we obtained a value of 0.73 using the R/S method and 0.77 with the variance method. As ?gure 4.a shows, for all the one-hour traces considered in the study, we obtained estimated values for H greater than 0.5. The values obtained for the whole workload (writes and reads) ?uctuate between 0.55 and 0.95 for the one-hour traces. When we separately analyzed reads and writes we obtained the following results: for reads the hourly values vary between 0.57 and 0.98; and for writes, we obtained values from 0.48 to 0.84. In the analysis of the write access pattern, only in one case (one hour) did we obtain a value smaller than 0.5. As ?gure 4.a shows the values of ?? have a daily cycle that follows approximately the same pattern as the number of accesses to the storage system shown in ?gure 4.b. Thus, the number of accesses per hour seems to be an important factor in the value of ?? obtained for that hour. For example, when the number of accesses in one hour is greater than 400,000 the estimated value for H is over 0.9. When the number of accesses is smaller than 75,000 the estimated values of ?? are always under 0.6. The largest values of ?? are obtained every day at the same time, from 11 a.m. to 12 a.m., corresponding to the peak hour. Another signi?cant observation from ?gure 4.a is that ?? is always greater for reads than for writes in the same hour, despite the fact that the number of reads is usually smaller. This is different from what we obtained for the workload of 1992. In that case, the writes were the most self-similar
1 R+W R W
2 Estimated H 1 3 2.5 1.5 2 log(n) 2.5 3 3.5