Timestamp Synchronization for Event Traces of Large-Scale Message...

By Norman Ruiz,2014-05-27 14:59
10 views 0
Timestamp Synchronization for Event Traces of Large-Scale Message...



     Timestamp Synchronization for Event Traces of Large-Scale Message-Passing Applications

     Daniel Becker1 , Rolf Rabenseifner2 , and Felix Wolf1


     Forschungszentrum J??lich, John von Neumann Institute for Computing (NIC) u 52425 J??lich, Germany u {d.becker,f.wolf}@fz-juelich.de www.fz-juelich.de 2 University of Stuttgart, High-Performance Computing-Center (HLRS) 70550 Stuttgart, Germany rabenseifner@hlrs.de www.hlrs.de

     Abstract. Identifying wait states in event traces of

    message-passing applications requires measuring temporal

    displacements between concurrent events. In the absence of synchronized hardware clocks, linear interpolation techniques can already account for dierences in oset and drift, assuming that the drift of an individual processor is not time dependant. However, inaccuracies and drifts varying in time can still cause violations of the logical event ordering. The controlled logical clock algorithm accounts for such violations in point-to-point communication by shifting message events in time as much as needed while trying to preserve the length of intervals between local events. In this article, we describe how the controlled logical clock is extended to collective communication to enable a more complete correction of realistic message-passing traces. In addition, we present a parallel version of the algorithm that is intended to scale to thousands of application processes and outline its implementation within the framework of the scalasca toolkit. Keywords: Performance analysis, event tracing, clock synchronization.



     Event tracing is a frequently applied technique for post-mortem performance analysis of message-passing applications because it can be used to analyze temporal relationships between concurrent activities. Obviously, the accuracy of such analyses depends on the comparability of timestamps taken on dierent processors. Inaccurate timestamps can not only cause a given interval to appear shorter or longer than it actually was, but also change the logical event order, which requires that a message can only be received after it has been sent. This is also referred to as the clock condition. To avoid violations of this condition, the error of timestamps should ideally be smaller than one half of the message latency. Often, however, the clocks accessible from dierent processors are entirely nonsynchronized or only synchronized

within disjoint partitions (e.g., smp-node or

     F. Cappello et al. (Eds.): EuroPVM/MPI 2007, LNCS 4757, pp. 315?C325, 2007. c Springer-Verlag Berlin Heidelberg 2007


     D. Becker, R. Rabenseifner, and F. Wolf

     multicore-chip). Clock synchronization protocols, such as ntp [4], can align the clocks to a certain degree, but are often not accurate enough for our purposes. Assuming that all local clocks on a parallel machine run at dierent but constant speeds (i.e., drifts), their time can be described as a linear function of the global time. This approach is used in the tracing library of the scalasca toolkit [2], which performs oset measurements between all local clocks and an arbitrarily chosen master clock once at program initialization and once at program nalization. However, as the assumption of constant drift is only an approximation, violations of the clock condition may still occur. The controlled logical clock (clc) [6] is a method to retroactively correct timestamps violating the clock condition. As the modication of individual timestamps might change the length of local intervals and even introduce new violations, the correction takes the context of the modied event into account by stretching the local time axis in the immediate vicinity of the aected event. The current clc algorithm, however, is limited by two factors. First, it covers only point-to-point operations and ignores collective ones. Second, it is a serial algorithm designed for a single global trace le. In this article, we describe how the controlled logical clock is extended to collective communication to enable a more complete correction of realistic message-passing traces. In addition, we present a parallel version of the algorithm that is intended to scale to thousands of application processes and outline its implementation design within the framework of scalasca [2], a performance-analysis tool that can be used to automatically identify idle times in event traces of large-scale message-passing programs. The outline of this article is as follows: In Section 2, we start with a short description of scalasca's event model and its parallel trace analysis approach, followed by a review of the basic clc mechanism in Section 3. In Section 4, we describe our extensions required to handle collective operations. After that, we present the new parallel algorithm design in Section 5. Finally in Section 6, we summarize our paper and give an outlook on future work.


     Event Model and Replay-Based Parallel Analysis

     Because we plan to integrate the extended clc algorithm with the scalasca trace-analysis tool, we describe it in terms of the scalasca event model, which is similar to the vampir event model [5], for which the algorithm has been originally designed. As far as message passing

    is concerned, the two models dier only in the way they express collective communication, which the original algorithm ignores anyway. The information scalasca records for an individual event includes at least a timestamp, the location (i.e., the process) causing the event and the event type. Depending on the type, additional information may be supplied. The event model distinguishes between programming-model independent events, such as entering and exiting code regions, and events related to mpi operations. The latter include events representing point-to-point operations, such as sending and receiving messages, and events representing the completion of collective operations.

     Timestamp Synchronization for Event Traces


     These collective exit events are specializations of normal exit events carrying additional information (i.e., the communicator) that allows identifying concurrent collective exits belonging to the same collective operation instance. Table 1 illustrates the event sequences recorded for typical mpi operations. To facilitate trace analysis for large numbers of application processes, the scalasca analyzer scans the trace data in parallel. After creating one analysis process per (target) application process, the analyzer loads the entire trace data into the potentially distributed main memory and performs a parallel replay of the applications communication behaviour, thereby examining each communication operation using an operation of similar type. During this procedure, the analyzer measures temporal dierences both between remote and between local events, which requires the time stamps to be as accurate as possible. The execution time of the analyzer mainly depends on communication, which resembles the original communication of the target application. For details, please see [2].

     Table 1. Exemplary event sequences recorded for typical mpi operations Function name MPI Send() MPI Recv() MPI Allreduce() Event sequence (enter, send, exit) (enter, receive, exit) (enter, collective exit) for each participating process


     Controlled Logical Clock

     Non-synchronized processor clocks may cause inaccurate timestamps in event traces. A clock condition violation occurs if the receive event of a message has an earlier timestamp than its matching send event. That is, the happened-before relation e ?ú e [6] between two events e and e with their respective timestamps C(e) and C(e ) does not hold. A clock condition violation between two events is dened as: (1) e, e : e ?ú e ?Ä C(e) ?Ý C(e ). The clc algorithm restores the clock condition using happened-before relationships between distributed events derived from point-to-point communication event semantics. More precisely, if

    the condition is violated for a send-receive event pair, the receive event is moved forward in time. The correction is only applied if the trace contains clock condition violations. To preserve the length of intervals between local events, events immediately following or preceding the corrected event are moved forward as well. This adjustment is called forward and backward amortization, respectively. Note that the accuracy of the adjustment depends on the accuracy of the original timestamps. Therefore, the algorithm benets from weak

    pre-synchronization, such as the aforementioned linear interpolation. In this section, we review the clc algorithm including forward and backward amortization. The interested reader can nd a detailed description of the clc algorithm and a review of further synchronization approaches in [6] and [7].


     D. Becker, R. Rabenseifner, and F. Wolf


     CLC with Forward Amortization

     The clc algorithm is an enhancement of Lamport's logical clock [3] and was introduced by Rabenseifner [6]. The algorithm requires timestamps with limited errors, which can be achieved through weak pre-synchronization. To denote timestamps computed by clc, we use the symbol LC . In the following, LC is modeled with t as the wall clock time and T (t) as the global time to which the process clocks Ci (t) (i = 0..n 1) are synchronized. Next, n is the number of processes, ej is the j th event on process i and so i E = {ej |i = 0..n 1, j = 0..jmax (i)} is the set of all events in the trace. In i addition, the set of matching send and receive pairs is dened with M = {(el , en )|el = send event, en = matching receive event}. k m k m (2)

     Note that the send event always marks the beginning of a send operation whereas a receive event marks the end of a receive operation. By contrast, ej is i an internal event if it is neither a send nor a receive event. Furthermore, ?Äi is the minimal dierence between two events on process i and ?Ìk,i is the minimum j message delay of messages from process k to process i. Finally, ?Ãi is a control j variable with ?Ãi ?Ê [0, 1]. For each process, LCi is now dened as l max(LCk (ek ) + ?Ìk,i , LCi (ej1 ) + ?Äi , i j LCi (ej1 ) + ?Ãi (Ci (t(ej ))Ci (t(ej1 ))), i i i Ci (t(ej ))) if (el , ej ) ?Ê M (3) i k i LCi (ej ) := el i k max(LC (ej1 ) + ?Ä , i i i j LCi (ej1 ) + ?Ãi (Ci (t(ej ))Ci (t(ej1 ))), i i i Ci (t(ej ))) otherwise. (4) i As can be seen, the algorithm consists of two equations. Equation (3) adjusts the timestamps of receive events while Equation (4) modies timestamps of internal and send events. Note that for each process, the terms LCi (ej1 ) + ?Äi i j and LCi (ej1 ) + ?Ãi (Ci (t(ej ))Ci (t(ej1 ))) must be omitted for the rst event i i i (j = 0). Through the term Ci (t(ej )) in Equation (3) and Equation

    (4), the algorithm i ensures that a correction is only applied if the trace violates the clock condition. The new timestamps satisfy the clock condition, since the term LCk (el ) + ?Ìk,i k in Equation (3) ensures that LC (ej ) is put forward compared to Ci (t(ej )) if i i needed in case of a clock condition violation. To ensure that the clock does j not stop after a clock condition violation, the term LCi (ej1 ) + ?Ãi (Ci (t(ej )) i i Ci (t(ej1 )) in Equation (3) and Equation (4) approximates the duration of the i

     Timestamp Synchronization for Event Traces


     original communication after a clock condition violation. This mechanism is called forward amortization. j Moreover, Rabenseifner has shown that ?Ãi with a constant value can cause LC to be faster than the fastest clock among all process-local clocks Ci [7]. Cyclic changes of physical clock drifts may cause an avalanche eect that enlarges the value of clock corrections and propagates until the end. To avoid this eect, j a control loop is used to nd the optimal value of ?Ãi . The controller tries to limit the dierences between LC and T , i.e., the controller estimates the output error indirectly because T (t(ej )) is unknown. If 1 ?Ã is chosen smaller than the i maximal drift dierences, the controller will enlarge 1 ?Ã (e.g., to 1%) to ensure j that any propagation is bounded by this factor. To calculate ?Ãi for each event, j the controller requires a global view of the event data. Mainly, ?Ãi is kept less than 1 minus the maximal drift of the clocks, however, in most cases a xed ?Ã = 0.99 or 0.999 is good enough because physical clock drifts are normally less than 104 . For subsequent events of the same process, the term LCi (ej1 ) + ?Äi i in Equation (3) and Equation (4) causes LC to advance at least a small number j of ticks ?Äi if the controller has reduced ?Ãi to nearly zero. Rabenseifner describes the control mechanism in more detail in [7]. A jump discontinuity in LC of ??t is caused by the term LCk (el ) + ?Ìk,i in k Equation (3) if LC (ej ) of the violating receive event is put forward compared to i j Ci (t(ej )). The term LCi (ej1 ) + ?Ãi (Ci (t(ej ))Ci (t(ej1 )) in Equation (3) implei i i i ments a forward amortization of such a jump. That is, the clock LCi for subsequent j events of process i runs with the speed of Ci reduced by the factor ?Ãi . 3.2 Backward Amortization

     Backward amortization is applied to smooth jump discontinuities caused by the rst part of the clc algorithm. This is done by slowly building up the ascension to a jump ??t using a piecewise process-local linear correction in an amortization interval LA of appropriate size before the violating receive event [7] (Figure 1). The compensation isrealized by setting the timestamps forward. If there are no violating send events in the backward amortization interval of a process i, then the dash-dotted linear interpolation can be used. In Figure 1, the

    horizontal axis b represents LCi , which is equal to LCi (i.e., the state after forward amortization) b but without the jump ??t at event r. The vertical axis shows osets to LCi after applying dierent stages of backward amortization. Naturally, the oset at r corresponds to the jump ??t. Note that the smaller the gradient of a clock in this gure, the better the correction and the smaller the perturbation of preceding events. Therefore, the ratio ??t/LA should be only a few percent. Apparently, adjacent clock condition violations cause a larger perturbation. In addition, not to violate the clock condition, the correction must not advance the timestamps of send events farther than LCm ?Ìi,m of the corresponding receive event en of a process m. These upper limits are shown as circled m values above the locations of the send events. If these limits are smaller than the dashed-dotted line (here at events s1 and s2 ), then a reduced piecewise linear interpolation function must be used, see the dotted line in Figure 1. As can be


     D. Becker, R. Rabenseifner, and F. Wolf

     Clocks ?C LC ib of process i Clocks: LC i' LCiI ideal backward amortization in the absence of conflicting sends LCiA piece-wise linear backward amortization Events : r = Receive event s = Send event i = Internal event Jump ??t due to LC' k(e kl)+?Ì i.k in Eq.(3) LC i b with LC i := LC'i without jump ??t


     (LC'm(e mn) - ?Ì i.m)


     Corresponding receive event , i.e., (s3,emn) ?Ê M X X i s3 s1 i s2 Amortization interval LA i r

     Fig. 1. Algorithm of the backward amortization

     seen, the clock error rate is higher than the desired ??t/LA in the interval (s2 , r). For each receive event with a jump, the backward amortization algorithm is applied independently. If there are additional receive events inside the amortization interval during such a calculation step, then these events can be treated like internal events, because advancing the timestamp of a receive event further cannot violate the clock condition.


     Extended Controlled Logical Clock

     Unfortunately, the clc algorithm in its present state is only designed to correct clock condition violations related to point-to-point communication. Collective communication semantics are ignored. In this section, we extend the algorithm including forward and backward amortization to correctly handle collective communication. Again, we start with considering happened-before relationships among

    collective communication events. We start with a description of the extended forward amortization followed by the extended backward amortization. 4.1 Extended CLC with Forward Amortization

     The clc algorithm requires the detection of clock condition violations. The happened-before relation is used to synchronize the timestamp of the receive event with the timestamp of the corresponding send event, i.e., the receive event is put forward in time if a clock condition violation has occurred. A single collective operation can be considered as a composition of many point-to-point communications. Using this model, we determine collective send and receive pairs occurring during a collective operation instance. We distinguish several types of collective operations (e.g., 1-to-N, N-to-1, etc.). Depending on the type, some of the enter events in a collective operation instance can be regarded as send events and some of the collective exit events as receive events. In the following, we review the dierent types of collective operations to identify happened-before relationships based on the decomposition of collective operations into send and receive pairs. With S and R we denote the set of send and receive events in a collective operation instance i, respectively. For each call to a collective operation, the set of all send-receive pairs M is enlarged by adding S ?Á R.

     Timestamp Synchronization for Event Traces


     1-to-N: One root process sends its data to N other processes. Example are MPI Bcast, MPI Scatter, and MPI Scatterv. S only contains the send event of the root process (i.e., its enter event), whereas R contains receive events from all processes of the communicator (i.e., all collective exit events) with a data length greater zero, i.e., the set may be smaller than the size of the communicator in the case of variable length operations (MPI ????v). N-to-1: One root process receives its data from N processes. Examples are MPI Reduce, MPI Gather, and MPI Gatherv. R only contains the receive event on the root process (i.e., its collective exit event). S is the set of send events (i.e., all enter events) on all processes of the communicator with a data length greater zero. Given that the root process is not allowed to exit the operation until the last process enters the operation, the latest enter event is the relevant send event to fulll the collective clock condition. Hence, if S contains more than one element, the term LCk (el ) + ?Ìk,i in Equation (3) must be replaced by the maximum k of LCk (el ) + ?Ìk,i over all el ?Ê S. That is, Equation (3) must be replaced by k k max( (maxel with(el ,ej )?ÊM LCk (el ) + ?Ìk,i ), k k k i LCi (ej1 ) + ?Äi , i j LCi (ej1 ) + ?Ãi (Ci (t(ej ))Ci (t(ej1 ))), i i i LCi (ej ) := i Ci (t(ej ))) if (el , ej ) ?Ê M i k i el k ???? otherwise.

     (3 ) (4)

     N-to-N': All processes of the communicator are sender and receiver. Examples are MPI Allreduce, MPI Allgather, MPI Alltoall, and MPI Barrier with N'=N, and the variable length operations MPI Reduce scatter, MPI Allgatherv, and MPI Alltoallv. S and R are dened by all those enter and collective exit events whose processes contribute input data or receive output data. For a call to MPI Barrier, all processes of the communicator contribute to S and R. Special cases: For MPI Scan and MPI Exscan, the set of messages added to M cannot be expressed as the Cartesian product S ?Á R. Below, el refers to the k enter event of a collective operation instance and ej refers to the collective exit i event and, thus, the set of messages added to M has the form {(el , ej ) | k = 0..N1, i = 0..kx} k i with x = 0 for MPI Scan and x = 1 for MPI Exscan. Independently of collective operation type, it is important to optimize the handling of S ?Á R in Equation (3'). A parallelized algorithm of the extended clc should attempt to reduce the eort to O(log N ).


     D. Becker, R. Rabenseifner, and F. Wolf


     Extended Backward Amortization

     To extend the backward amortization algorithm for collective routines, the upper bounds for the send events (see Figure 1) must be adapted to collective events: If ejm is the send event of a collective routine, an upper bound for the piecewise i linear interpolation at ejm is dened by minel ?ÊR LCk (el ) ?Ìi,k with R being k i k the receive event data set dened in Section 4.1.


     Parallel Timestamp Synchronization

     Event tracing of applications running on thousands of processes [8] requires a scalable synchronization scheme. In this section, we present a parallel version of the extended clc algorithm. 5.1


     The accuracy of the clc algorithm depends on the accuracy of the original timestamps and therefore a pre-synchronization is required. This can be achieved through a linear interpolation where all process-local clocks are mapped onto a single master clock. Given that dierent clocks vary in oset and drift, oset values between worker processes and one master process measured at program start and at program end are used to nd a linear correction function. The oset values are measured using the remote clock reading technique introduced by Cristian [1]. As a byproduct, the minimum transfer delay can be estimated during the oset measurements. 5.2 Parallel Post-mortem Timestamp Synchronization

     scalasca's replay-based approach of analyzing separate

    process-local trace les in parallel can handle traces from thousands of processes. We can achieve comparable scalability for the clc algorithm if we also implement it using a parallel replay. This has the additional advantage that it can be seamlessly integrated into the existing analysis framework.

     Fig. 2. Non-linear drifts of physical clocks measured on an Inniband cluster in comparison to Send-Recv and Allreduce latency

     Timestamp Synchronization for Event Traces


     Preparation: While each scalasca analysis process reads the local trace le of the corresponding application process into memory, the linear correction is applied to all timestamps based on the previous oset measurements at program start and end. The resulting timestamps are taken as the Ci . Inaccurate Ci can occur for two reasons: (i) inaccurate oset measurements and (ii) time-dependant clock drift. Figure 2 shows the non-linear behavior of the clocks Ci after such linear correction on an infiniband cluster. Clock errors are still signicantly larger than point-to-point and collective latencies, i.e., violations of the clock condition can still occur. Logical clock synchronization algorithm: To apply the extended clc algorithm, a parallel traversal of the event stream is performed. Whenever reaching communication events, the corresponding communication operation is replayed to exchange the timestamps of communication events for their later comparison. For each event, a new timestamp is calculated using the extended clc algorithm. The comparison between the remote timestamp and the local timestamp is used to nd clock condition violations. Depending on the type of the original communication operation, dierent timestamps are exchanged using dierent mpi function calls, as listed in Table 2.

     Table 2. Timestamps exchanged depending on the type of operation during forward amortization Type of operation P2P 1-to-N N-to-1 N-to-N' MPI Scan MPI Exscan timestamp exchanged timestamp of send event timestamp of root enter event max( all enter event timestamps ) max( all enter event timestamps ) max( some enter event timestamps ) max( some enter event timestamps ) MPI function MPI Send MPI Bcast MPI Reduce MPI Allreduce MPI Scan MPI Exscan

     Note, that in Equation (3'), the parallel calculation of the maximum over all corresponding send events (maxel with(el ,ej )?ÊM LCk (el ) + ?Ìk,i ) in the case k k k i of N-to-1, N-to-N', MPI Scan, and MPI Exscan can not be implemented with the mpi function identied in Table 2 if ?Ìk,i is not the same for all pairs of processes. Therefore in Equation (3'), ?Ìk,i must be substituted by mink,i (?Ìk,i ). The exchanged timestamps are based on the LC values calculated up to the specic event. The control mechanism used for the controlled logical

    clock requires a global view of the trace data to calculate ?Ãi as described in Section 3. Establishing a global view of the trace data is not feasible with the replay-based approach since communication would be required for each single event. Therefore, we eventually have to perform multiple passes until the maximum error e is below a predened threshold . For the rst pass through the trace les, we propose to use ?Ã = const < 1, for subsequent passes a ?Ãj+1 < ?Ãj should be used. Backward amortization algorithm: The backward amortization requires a second replay of the target application's communication behavior. Timestamps are


     D. Becker, R. Rabenseifner, and F. Wolf

     Table 3. Timestamps exchanged depending on the type of operation during backward amortization

     Type of operation P2P 1-to-N N-to-1 N-to-N MPI Scan MPI Exscan timestamp exchanged MPI function timestamp of receive event MPI Send min( all collective exit event timestamps ) MPI Reduce timestamp of root collective exit event MPI Bcast min( all collective exit event timestamps ) MPI Allreduce min( some collective exit event timestamps ) MPI Scan min( some collective exit event timestamps ) MPI Exscan

     exchanged at synchronization points of the application. However, as explained in Section 3, the former sender now needs data from the former receiver and so the roles between sender and receiver are switched during the backward amortization. Depending on the type of operation, the collective receiver needs the timestamp of the relevant collective send event which are shown in Table 3. For MPI Scan and MPI Exscan, a communicator with reverse rank ordering must be used. The exchanged timestamps are based on the LC values after completion of the extended clc algorithm. After receiving the data, each process temporally stores the timestamps to locally apply the backward amortization if LC exhibits a jump disconutinity. Note that this happens after the forward amortization has already been applied. Given that most mpi implementations use binomial tree algorithms to perform their collective operations, our replay-based approach reduces the communication complexity automatically to O(log N ). Moreover, the stepwise parallel replay during the backward amortization phase could be replaced by a single collective operation per communicator for the entire trace - provided that sucient memory is available.



     In this paper, we have extended the clc algorithm to take collective communication semantics into account so that now a more complete correction of realistic message-passing traces can be achieved. Although the extended clc algorithm only needs information about the

Report this document

For any questions or suggestions please email