By George Harris,2014-05-27 15:01
12 views 0



     2009 Third International Conference on Network and System Security


     Van-Hau PHAM Networking and Security Department, EURECOM Sophia Antipolis, France Email: pham@eurecom.fr Marc DACIER Symantec Research Labs Sophia Antipolis, France Email: Marc Dacier@symantec.com

     Abstract?ªIn this paper, we propose a method to identify and group together traces left on low interaction honeypots by machines belonging to the same botnet(s) without having any a priori information at our disposal regarding these botnets. In other terms, we offer a solution to detect new botnets thanks to very cheap and easily deployable solutions. The approach is validated thanks to several months of data collected with the worldwide distributed Leurr?ä .com system. To distinguish the e relevant traces from the other ones, we group them according to either the platforms, i.e. targets hit or the countries of origin of the attackers. We show that the choice of one of these two observation viewpoints dramatically in?uences the results obtained. Each one reveals unique botnets. We explain why. Last but not least, we show that these botnets remain active during very long periods of times, up to 700 days, even if the traces they left are only visible from time to time. Keywords-honeypot; attack trace analysis; botnet detection;

     I. I NTRODUCTION There is a consensus in the security community to say that botnets are today??s plague of the Internet. A lot of attention has been paid to detect and eradicate them. Several approaches have been proposed for this purpose. By identifying the so called Command and Control (C&C) channels, one can keep track of all IPs connecting to it. The task is more or less complicated, depending on the type of C&C (IRC[1], [2], [3], [4], HTTP [5], [6], fast-?ux based or not [7], [8], P2P [9], [10], [11], etc.) but, in any case, one needs to have some insight about the channels and the capability to observe all communications on them. Another approach consists in snif?ng packets on a network and in recognizing patterns of bot-like traf?c. This is, for instance, the approach pursued by [12], [13] and [14], [15]. The solutions mostly aim at detecting compromised machines in a given network rather than to study the botnets themselves as they only see the bots that exist within the network under study. In this work, we are interested in ?nding a very general technique that would enable us to count the amount of various botnets that exist, their size and their lifetime. As opposed to previous work, we are not interested in studying a particular botnet in details or in detecting compromised

    nodes in a given network. We also do not want to learn the various protocols used by bots to communicate in order

     978-0-7695-3838-9/09 $26.00 ? 2009 IEEE DOI 10.1109/NSS.2009.46 365 360

     to in?ltrate the botnets and obtain more precise information about them [4]. By doing so, we certainly will not be able to get as much in depth information about this or that botnet but our hope is to provide insights into the bigger picture of today??s (and yesterday??s) botnets activities. The solution described in the following is generic and simple to deploy widely. It relies on a distributed system of low interaction honeypots. Based on the traces left on these honeypots, we provide a technique that groups together the traces that are likely to have been generated by groups of machines controlled by a similar authority. Since we have no information regarding the C&C they obey to, we do not know if these machines are part of a single botnet or if they belong to several botnets that are coordinated. Therefore, to avoid any ambiguity, we write in the following that they are part of a army of zombies. An army of zombies can be a single botnet or a group of botnets the actions of which are coordinated during a given time interval. In this paper, we propose a technique to identify and study the size as well as the lifetime of such armies of zombies. The approach does not pretend to be able to identify all armies of zombies that could be found in our dataset. At the contrary, we show that, depending on how the dataset is preprocessed, i.e. depending on the observation viewpoint, different armies can be found. Exhaustiveness is not our concern at this stage but, instead, we are interested in offering an approach that could easily be widely adopted. The idea exposed here is similar, in its spirit, to the one presented in the paper coauthored by Allmann et al. [16]. However, instead of ?? [????] leveraging the deep understanding of network detectives and the broad understanding of a large number of network witnesses to form a richer understanding of large-scale coordinated attackers??, our approach relies on a diverse yet limited number of low interaction honeypots. They do not need to be neither as smart as the network detectives nor as numerous as the network witnesses proposed in that work. Both approaches are quite complementary. Finally, Kitti et al have proposed an approach to detect related attacks in [17]. The method has been validated thanks to data collected from DShield project [18]. In that work, related attacks are understood as attacks mounted

     by the same sources against different networks which is a narrower view of the problem than ours. The remainder of the paper is organised as follows. Section II de?nes the terms used in the paper. Section III describes the dataset we have used and what we mean when we refer to the notion of observation viewpoint. It provides some motivation for

    the work. In Section IV, we describe the method itself and provide the main characteristics of the results obtained as well as two precise, yet anecdotal, examples of armies detected thanks to our method. Finally, Section V concludes the paper. II. T ERMINOLOGY In order to avoid any ambiguity, we introduce a few terms that will be used throughout the text. Some of them are taken from [19]. Readers who are familiar with the Leurr?ä .com e project are invited to skip this Section. ? Platform: A physical machine simulating, thanks to honeyd [20], the presence of three distinct machines. A platform is connected directly to the Internet and collects tcpdump traces that are fed daily into the centralized Leurr?ä .com??s database. e ? Leurr?ä .com: The Leurr?ä .com project is a distributed e e system of such platforms deployed in more than 50 different locations in 30 different countries (see [21] for details) ? A Source corresponds to an IP address that has sent at least one packet to, at least, one platform. A given IP address can correspond to several distinct sources. Indeed, a given IP remains associated to a given source as long as there is no more than 25 hours between 2 packets received from that IP. After that, a new source identi?er will be assigned to the IP. By grouping packets by sources instead of by IPs, we minimize the risk of gathering packets sent by distinct physical machines that have been assigned the same IP dynamically after 25 hours, or machines that have the same IP address seen from the outside due to side-effect of Network Address Translation. ? An Attack, in the context of this paper, is de?ned as the packets exchanged between one source and one platform. ? A Cluster is made of a group of sources that have left highly similar network traces on all platforms they have been seen on. Clusters have been precisely de?ned in [22]. ? An Observed cluster time series ?µT,c,op is a function de?ned over a period of time T , T being de?ned as a time interval (in days). That function returns the amount of sources per day associated to a cluster c that can be seen from a given observation viewpoint op. The observation viewpoint can either be a speci?c platform or a speci?c country of origin. In the ?rst case, ?µT,c,platf ormX returns, per day, the amount of

     366 361


     number of sources

     Cluster 60322 attacks on 7 platforms 5,8,11,????,21

     40 20 0 380



     395 time(day)





     number of sources

     Cluster 0 coming from Spain

     200 100 0 295


     305 time(day)




     Figure 1. on the top plot, cluster 60232 attacks seven platforms from day 393 to day 400. On the bottom plot, peak of activities of cluster 0 from Spain on day 307

     sources belonging to cluster c that have hit platf ormX . Similarly, in the second case, ?µT,c,countryX returns, per day, the amount of sources belonging to cluster c that are geographically located in countryX . Clearly, ?i?Êcountries ?µT,c,i = we always have: ?µT,c = ?x?Êplatf orms ?µT,c,x An attack event is de?ned as a set of observed cluster time series exhibiting a particular shape during a limited time interval. The set can be a singleton. We denote the attack event i as ei = (Tstart , Tend , Si ) where the attack event starts at Tstart , ends at Tend and Si contains a set of observed cluster time series identi?ers (ci , opi ) such that all ?µ[Tstar ?Tend ),ci ,opi are strongly correlated to each other ?(ci , opi ) ?Ê Si . As an example, the top plot of Figure 1 represents the attack event 225 which consists of a given cluster attacking seven platforms. Each curve represents the amount of sources of that cluster observed from one of these platforms. As we can observe, the attack event starts at day 393 and ends at day 400. According to our convention, we have e225 = (393, 400, {(60232, 5), (60232, 8), ????, (60232, 31)}). Similarly, the bottom plot of Figure 1 represents an attack event due to one cluster during a single day and mostly due to a single country (e14 = (307, 307, {(0, ES)}))

     III. IMPACT OF THE OBSERVATION VIEWPOINT A. Dataset Description For our experiments, we have selected the traces observed on 40 platforms out of 50 at our disposal. All these 40 platforms have been running for more than 800 days. None of them has been down for more than 10 times and each of them has been up continuously for at least 100 days at least once. They all have been up for a minimum of 400 days over that period. We denote by T , the time series representing the total amount of sources observed, day by

     day, on all these 40 platforms. We can split that time series per country1 of origin of the sources. This gives us 231 time series T SX where the ith point of such time series indicates the amount of sources, observed on all platforms, located in country X. We represent by T S L1 the set of all these Level 1 time series. To reduce the computational

    cost, we keep only the countries from which we have seen at least 10 sources on at least one day. This leaves us with 85, instead of 231, time series. We represent by T S L1 this re?ned set of Level 1 time series. Then, we split each of these time series by cluster to produce the ?nal set of time series ?µ[0?800),ci ,countryj ?ci and ?countryj ?Ê bigcountries . The ith point of the time series ?µ[0?800),X,Y indicates the amount of sources originating from country Y that has been observed on day i attacking any of our platforms thanks to the attack de?ned by means of the cluster X. We represent by T S L2 the set of all these Level 2 time series. In this case |T S L2| is equal to 436,756 which corresponds to 3,284,551 sources. As explained in [19], time series that barely vary in amplitude over the 800 days are meaningless to identify attack events and we can get rid of them. Therefore, we only keep the time series that highlight important variations. We represent by T S L2 this re?ned set of Level 2 time series. In this case |T S L2 | is equal to 2,420 which corresponds to 2,330,244 sources. We have done the very same splitting and ?ltering by looking at the traces on a per platform basis instead of on a per country of origin basis. The corresponding results are given in Table I.

     T S consists of 3,477,976 sources OVP country platform |T S L1| 231 40 |T S L1 | 85 40 (94,4% TS) (100% TS) |T S L2| 436,756 395,712 |T S L2 | 2,420 2,127 sources 2,330,244 2,538,922 (67% of T S) (73% of T S)

     To do this, in a ?rst step, we use a sliding window of L days to compute the Pearson correlation of all pairs of time series. That is, we compute the correlation of N time series for T-L+1 time interval {[1, L], [2, L + 1], . . . [T ? L, T ]}. As a result, we obtain, for every pair of time series in N, the time intervals during which they are correlated. Then we group together all pairs of cluster time series that are correlated together over the same period of time. Each such group constitutes an attack event as de?ned before. It is worth noting that this method, which we refer to as M 1 in the sequel, can not detect attack events made of a single cluster time series. This is typically the case for peaks of activities occurring on a single day. In such cases, it is more ef?cient to apply another, less expensive, algorithm to identify the attack events. For the sake of conciseness, we do not to include the description of this second method, M 2. C. Impact of the Observation Viewpoint 1) Results on Attack Event Detection: We have applied these algorithms against our 2 distinct datasets, namely T Scountry and T Splatf orm . As shown in Table II, for T Scountry , method M1 (resp. second method M2) has found 549 (resp. 43) attack events, accounting for a total of 552,492 sources (resp. 21,633). Similarly, with T Splatf orm , applying M1 (resp. M2) leads to 564 (resp. 126) attack events, containing 550,305 (resp. 28,067) sources.


     AE-set-I(T Scountry ) AE-set-II(T Splatf orm) No.AEs No.sources No.AEs No.sources M1 549 552,492 564 550,305 43 21,633 126 28,067 M2 Total 592 574,125 690 578,372 No.AEs: amount of attack events M1,M2: methods represented in Section III-B

     all sources observed on the period under study, OV P : observation viewpoint, T S L1: set of time series at country/platform level, T S L1 : set of signi?cant time series in T S L1, T S L2 : set of all cluster time series, T S L2 set of strongly varying cluster time series


     Table I

     2) Analysis: The table highlights the fact that depending on how we decompose the initial set of traces of attacks (i.e the initial time series T S), namely by splitting it by countries of origin of the attackers or by platforms attacked, different attacks events show up. To assess the overlap between attack events detected from different observation viewpoints we use the common source ratio, namely csr, measure as follows: csr(e, AEop ) =

     e ?ÊAEop

     |e ?É e |


     B. Attack Event Detection Having de?ned the time series we are interested in, we now need to identify all time periods during which 2 or more of these observed cluster time series are correlated together.

     1 We

     use Maxmind to get the geographical location of IPs

     in which e ?Ê AEop and |e| is the amount of sources in attack event e, AEop is AEcountry and AEop is AEplatf orms (or vice versa). Figure 2 represents the two cumulative distribution functions corresponding to this measure. The point (x, y) on the curve means that there are y ? 100% of attack events obtained thanks to Tcountry (resp Tplatf orms ) that have less than x ? 100% of sources in common with all attack events

     367 362

     Empirical CDF 1 0.9 0.8 0.7 0.6 TPlatform T


     0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 common source ratio 0.8 1

     Figure 2.

     CDF common source ratio

     obtained thanks to Tplatf orms (resp Tcountry ). The Tcountry curve represents the cumulative distribution obtained in this ?rst case and the Tplatf orms one represents the CDF obtained when starting from the attacks events obtained with the intial Tplatf orms set of time series. As we can notice, around 23% (resp. 25%) of attack events obtained by starting from the Tcountry (resp. Tplatf orm ) set of time series do

    not share any source in common with any attack events obtained when starting the attack even identi?cation process from the Tplatf orm (resp. Tcountry ) set of time series. This corresponds to 136 (16,919 sources) and 171 (75,920 sources) attack events not being detected. In total, there are 288,825 (resp. 293,132) sources present in AE-SetI (resp. AE-Set-II), but not in AE-Set-II (resp. AE-Set-I). As a ?nal note, there are in total 867,248 sources involved in all the attack events detected from both datasets which correspond to 25% the attacks observed in the period under study. 3) Explanation: The reasons why we can not rely on a single viewpoint to detect all attacks events are described below. Split by country: Suppose we have one botnet B made of machines that are located within the set of countries {X, Y, Z}. Suppose that, from time to time, these machines attack our platforms leaving traces that are also assigned to a cluster C. Suppose also that this cluster C is a very popular one, that is, many other machines from all over the world continuously leave traces on our platforms that are assigned to this cluster. As a result, the activities speci?cally linked to the botnet B are lost in the noise of all other machines leaving traces belonging to C. This is certainly true for the cluster time series (as de?ned earlier) related to C and this can also be true for the time series obtained by splitting it by platform, ?µ[0?800),C,platf ormi ?platf ormi ?Ê 1..40. However, by splitting the time series corresponding to cluster C by countries of origins of the sources, then it is quite likely that the time series ?µ[0?800),C,countryi ?countryi ?Ê {X, Y, Z} will be highly correlated during the periods in which the botnet present in these countries will be active

     368 363

     against our platforms. This will lead to the identi?cation of one or several attack events. Split by platform: Similarly, suppose we have a botnet B made of machines located all over the world. Suppose that, from time to time, these machines attack a speci?c set of platforms {X, Y, Z} leaving traces that are assigned to a cluster C. Suppose also that this cluster C is a very popular one, that is, many other machines from all over the world continuously leave traces on all our platforms that are assigned to this cluster. As a result, the activities speci?cally linked to the botnet B are lost in the noise of all other machines leaving traces belonging to C. This is certainly true for the cluster time series (as de?ned earlier) related to C and this can also be true for the time series obtained by splitting it by countries, ?µ[0?800),C,countryi ?countryi ?Ê bigcountries . However, by splitting the time series corresponding to cluster C by platforms attacked, then it is quite likely that the time series ?µ[0?800),C,platf ormi ?platf ormi ?Ê {X, Y, Z} will be highly correlated during the periods in which the botnet in?uences the traces left on the sole platforms concerned

    by its attack. This will lead to the identi?cation of one or several attack events. The top plot of Figure 3 represents the attack event 79. In this case, we see that the traces due to the cluster 175309 are highly correlated when we group them by platform attacked. In fact, there are 9 platforms involved in this case, accounting for a total of 870 sources. If we group the same set of traces by country of origin of the sources, we end up with the bottom curves of Figure 3 where the speci?c attack event identi?ed previously can barely be seen. This highlights the existence of a botnet made of machines located all over the world that target a speci?c subset of the Internet.


     40 30 20 10 0





















     Figure 3. top plot represents the attack event 79 related to cluster 17309 on 9 platforms. The bottom plot represents the evolution of this cluster by country. Noise of the attacks to other platforms decrease signi?cantly the correlation of observed cluster time series when split by country

     IV. O N THE



     So far, we have identi?ed what we have called attack events which highlight the existence of coordinated attacks launched by a group of compromised machines, i.e. a zombie army. It would be interesting to see if the very same army manifests itself in more than one attack event.

    To do this, we propose to compute what we call the action sets. An action set is a set of attack events that are likely due to the same army. In this Section, we show how to build these action sets and what information we can derive from them regarding the size and the lifetime of the zombie armies. A. Identi?cation of the armies

     obtained with a value of ?Ä = 10%. Other values could, possibly, have delivered more armies but the point we want to make is that these armies exist, not that we have found a method to ?nd all of them. For such value of ?Ä we have identi?ed 40 (resp. 33) zombie armies from AE-set-I (resp. AE-set-II) which have issued a total of 193 (resp. 247) attack events. Figure 4 represents the distribution of attack events per zombie army. Its top (resp. bottom) plot represents the distribution obtained from AE-set-I(resp. AE-set-II). We can see that the largest amount of attack events for an army is 53 (resp. 47) whereas 28 (resp. 20) armies have been observed only two times.

     # of zombie armies

     30 20 10 0 0 20 10 0 0 10 20 30 40 50 amount of attack events 60 70

     1) Similarity Measures: In its simplest form, a zombie army is a classical botnet. It can also be made of several botnets, that is several groups of machines listening to distinct C&C. This is invisible to us and irrelevant. What matters is that all the machines do act in a coordinated way. As time passes, it is reasonable to expect members of an army to be cured while others join. So, if the same army attacks our honeypots twice over distinct periods of time, one simple way to link the two attack events together is by noticing that they have a large amount of IP addresses in common. More formally, we measure the likelihood of two attacks events e1 and e2 to be linked to the same army by means of their similarity de?ned as follows: sim(e1 , e2 ) = max( |e1 ?Ée2 | , |e1 ?Ée2 | ) if |e1 ?É e2 | < 200 |e1 | |e2 | 1 otherwise

     # of zombie armies


     20 30 40 50 amount of attack events



     Figure 4.

     Zombie Army Size

     B. Main Characteristics of the Zombie armies In this section, we will analyze the main characteristic of the zombie armies. Lifetime of Zombie Army Figure 5 represents the cumula1

     We will say that e1 and e2 are caused by the same army if and only if sim(e1 , e2 ) > ?Ä. This only makes sense for reasonable values of ?Ä. We address this issue in the next subsections. 2) Action Sets: We now use the sim() function to group together attack events into action

    sets. To do so, we build a simple graph where the nodes are the attack events. There is an arc between two nodes e1 and e2 if and only if sim(e1 , e2 ) > ?Ä. All nodes that are connected by at least one path end up in the same action set. In other words, we have as many action sets as we have disconnected graphs made of at least two nodes; singleton sets are not counted as action sets. We note that our approach is such that we can have an action set made of three attack events e1 , e2 and e3 where sim(e1 , e2 ) > ?Ä and sim(e2 , e3 ) > ?Ä but where sim(e1 , e3 ) < ?Ä. This is consistent with our intuition that armies can evolve over time in such a way that the machines present in the army can, eventually, be very different from the ones found the ?rst time we have seen the same army in action. 3) Results: We skip, for the sake of conciseness, the discussion on how to de?ne the optimal value for the threshold delta. In this paper, results presented have been




     0.4 0.2

     country platform 100 200 300 400 500 duration (day) 600 700 800

     0 0

     Figure 5.

     CDF duration

     tive distribution of minimum lifetime of zombie armies obtained from T Splatf orm and T Scountry (see Section IV-A3). According to the plot, around 20% of zombie armies have existed for more than 200 days. In the extreme case, two armies seems to have survived for 700 days! Such result seems to indicate that either i) it takes a long time to cure compromised machines or that ii) armies are able to stay active for long periods of time, despite the fact that some of their members disappear, by continuously compromising new ones. Lifetime of Infected Host in Zombie Armies In fact, we can classify the armies into two classes as mentioned

     369 364

     in the previous Section. For instance, Figure 6a represents the similarity matrix of zombie army 33, ZA33. To build this matrix, we ?rst order its 42 attack events according to their time of occurrence. Then we represent their similarity relation under an 42 ?Á 42 similarity matrix M . The cell (i,j) represents the value of sim() of the ordered attack event ith and j th . Since, M is a symmetric matrix, only half of it is shown. As we can see, we have a very high similarity measure between almost all the attacks events, around 60%. This is also true between the very ?rst and the very last attack events. In this case, the time elapsed between the ?rst and the last event is 753 days!

     C. Illustrated Examples After having offered a high level overview

Report this document

For any questions or suggestions please email