Context-aware Clustering of DNS Query Traffic


     David Plonka

     University of Wisconsin-Madison

     Paul Barford

     University of Wisconsin-Madison, Nemean Networks

     plonka@cs.wisc.edu

     pb@cs.wisc.edu

     ABSTRACT

     The Domain Name System (DNS) is one of the most widely used services in the Internet. In this paper, we consider the question of how DNS traffic monitoring can provide an important and useful perspective on network traffic in an enterprise. We approach this problem by considering three classes of DNS traffic: canonical (i.e., RFC-intended behaviors), overloaded (e.g., black-list services), and unwanted (i.e., queries that will never succeed). We describe a context-aware clustering methodology that is applied to DNS query responses to generate the desired aggregates. Our method enables the analysis to be scaled to expose the desired level of detail of each traffic type, and to expose their time-varying characteristics. We implement our method in a tool we call TreeTop, which can be used to analyze and visualize DNS traffic in real-time. We demonstrate the capabilities of our methodology and the utility of TreeTop using a set of DNS traces that we collected from our campus network over a period of three months. Our evaluation highlights both the coarse and fine level of detail that can be revealed by our method. Finally, we show preliminary results on how DNS analysis can be coupled with general network traffic monitoring to provide a useful perspective for network management and operations.

     Categories and Subject Descriptors: C.2.3 [Network Operations]: Network management, Network monitoring, C.4 [Performance of Systems]: Measurement Techniques

     General Terms: Design, Experimentation, Measurement, Performance

     Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IMC'08, October 20-22, 2008, Vouliagmeni, Greece. Copyright 2008 ACM 978-1-60558-334-1/08/10 ...$5.00.

     1.

     INTRODUCTION

     Methods for classifying and identifying key characteristics of network traffic have important implications in network management, traffic engineering and network security. For example, the popularity and large sizes of the files distributed through peer-to-peer (P2P) applications can consume a significant percentage of the bandwidth in a network. The ability to accurately identify P2P traffic can enable it to be throttled at the network border to the benefit of other more critical traffic types. Similarly, the ability to identify malicious traffic accurately and in a timely fashion can, in the best case, enable an attack to be blocked before it is completed, or can at least enable its effects to be mitigated quickly.

     The key challenge in accurately identifying different traffic types and their characteristics is that there is no inherent mechanism for this task. In years past, port numbers could be used to classify a large percentage of network traffic, primarily due to the limited diversity of applications. However, there is a wide variety of applications in use today, and many of these use ephemeral ports or standard protocols such as HTTP for communication, which defeats simple classification via port numbers. In the case of malicious traffic, there is strong incentive to actively obfuscate payloads (e.g., via packing and morphing methods), which makes the identification problem even more challenging. Finally, encrypted traffic transmitted via standard protocols represents perhaps the most significant classification challenge since it would seem that almost no details could be discerned.

     Prior work on non-port-based approaches to identifying network traffic includes payload-based analysis, behavioral analysis and clustering analysis. Payload-based approaches (e.g., [28, 12]) are standard, e.g., in network intrusion detection systems (NIDS) and in some commercially available traffic shaping systems. This approach tries to match packet payloads to a library of signatures composed of unique byte sequences associated with particular attacks or applications. A disadvantage of the payload-based approach is that byte sequences are often not unique to a particular traffic type, which leads to the well-known false alarm problem in NIDS. Classification methods based on behavioral characteristics such as [19, 29, 20] focus on building statistical models of transport layer metrics such as connection duration and packet size to distinguish applications. Cluster-based approaches such as [13, 22] take the next logical step by using standard machine learning methods to divide traffic into groups based on similarity of transport layer characteristics. We believe that these methods have merit but are ultimately limited by the diversity of information available to them from the protocols that are being used.

     While traffic classification using methods such as the aforementioned can be useful, they often omit key details that are required to diagnose and remedy problems and are likely to never
     be able to fully distinguish all traffic types accurately. We argue that a broader perspective is necessary.

     In this paper, we investigate the question of how Domain Name System queries could be used to provide important and unique insights on network traffic. Our motivation for this work is the observation that DNS is used by almost all applications in the Internet, and the conjecture that the plain-text DNS query/response traffic is a rich source of information on network traffic that might otherwise be difficult to understand. For example, while prior classification methods might accurately identify application traffic as HTTP, information from the DNS queries that precede this traffic could be used to further label the traffic with prominent domain names. (Throughout the remainder of this paper, we refer to standard or expected DNS traffic as "canonical".) DNS is also now routinely used for black-listing services (throughout the remainder of this paper, we refer to this type of DNS traffic as "overloaded"), which are critical for spam checking, but increasingly used for other purposes (see Section 3 for details). Understanding the nature of this traffic could be useful in network operations. Finally, there are many queries that never succeed, but still require DNS resources. So, any improvement in understanding this category of DNS traffic (throughout the remainder of this paper, we refer to this type of traffic as "unwanted") will be important to network operations and security administrators.

     The starting point for our work was a set of traces of DNS query/response traffic continuously gathered from our campus network from January through April, 2008. This data set comprised over 11 billion total query responses for tens of thousands of clients. With a data set this large and diverse, a principled analysis method is required in order to extract, visualize and evaluate the desired information. Our approach to analyzing the DNS traces is data-driven and context-aware. In particular, we apply a clustering methodology that is guided by DNS syntax and semantics to decompose the query/response traces into the three major categories described above. We also employ IP prefix and domain name search trees to divide clusters into more detailed subclusters and aggregates. Rather than relying on single fields, we distinguish additional unwanted and overloaded traffic types by identifying combinations of query names, response codes, and answer values. Additionally, we employ a "reflexive clustering" method that uses these multiple dimensions for creating groups where the interpretation of one group is based on the context of the other.

     We implemented our context-aware clustering method in a tool we call TreeTop. This tool enables both off-line and real-time analysis and visualization of DNS query/response traffic. Specifically, TreeTop analyzes query/response traffic with a variety of filters and summarizes it in tabular or graphical reports. TreeTop is currently in operational use in our campus network and is also available to the community [25].

     When applied to our DNS query/response traces, TreeTop highlighted a number of interesting characteristics that demonstrate the utility of our approach. First, we found a diurnal cycle consistent with standard packet traffic. The profile for this traffic is relatively smooth and clearly highlights a wide variety of popular applications such as Facebook, Google, etc. Next, we automatically identified approximately 200 black-lists and found black-list traffic to be of significant volume continually while also marked by high magnitude spikes. Finally, we defined and measured a new high-volume category of unwanted, avoidable queries due to incorrect use of resolver search lists. While the details of these results are derived from our local dataset, our approach and TreeTop can be used to investigate similar activity in other networks.

     The remainder of this paper is organized as follows. In Section 2, we discuss prior studies that are related to our own. In Section 3, we provide an overview of DNS including details that are pertinent to this paper. In Section 4, we describe the measurement infrastructure used to gather our DNS query traces, and details on the traces themselves. In Section 5, we describe our context-aware clustering method, and in Section 6, we describe the implementation of the method in our TreeTop tool. The results of the analysis of our dataset are provided in Section 7. We outline future work, summarize, and conclude in Section 8.

     2.

     RELATED WORK

     Methods for analyzing the characteristics of network traffic behavior have been described in a large number of prior studies. Of particular relevance to our work are prior studies that describe techniques for classifying network traffic including [19, 20, 29, 13, 22]. These methods have been shown to be highly accurate, and we consider the information that they produce to be complementary to what is produced by our DNS query analysis.

     Our approach to clustering is informed by the work of Cho, et al. in [7] and Estan, et al. in [14]. Both employ hierarchical aggregation based on IP packet header information and the latter describes a dynamic method for creating minimal, multidimensional clusters of interest. Our work diverges from those techniques by utilizing the DNS traffic payload to create new clusters and by introducing hierarchical aggregation by domain names.

     Clustering methods have been applied to network traffic in several other studies. For example, Estan and Varghese describe a method for efficient identification of heavy hitter flows that is based on a fixed cluster definition [15]. Zhang, et al. describe a method for detecting anomalous BGP route advertisements based on clustering update behavior [39]. Finally, Yegneswaran et al. use cluster analysis as the key component of an algorithm for intrusion signature generation in [37]. Our work differs from these in that our clustering techniques are customized to the unique semantics of the DNS.

     There is a growing literature on the empirical characteristics of DNS behavior and performance (e.g., [3, 18, 21]). These studies have focused on volume and diversity of query types from both the client and server perspectives, and shed light on the impact of specific mechanisms such as client-side caching. More recent studies have focused on how the DNS can provide insights on unwanted activity. For example, Whyte, et al. describe a method for identifying scanning behavior associated with worm infections based on maintaining whitelists of known DNS records [36].

     Other works propose new tools for passive monitoring of DNS traffic. Wessels, et al. [34, 35] introduce a tool to identify high volume types of DNS traffic. While we build upon this tool (dnstop), our analysis differs in that we perform clustering based on the DNS response answer values rather than just queries and response codes, and then we apply DNS measurements to classify IP traffic in general. In [33], Weimer introduced a tool that populates a database from passive DNS traces and effectively identified abusive behavior including botnet activity.

     Likewise, Zdrnja, et al. [38] record DNS trace information to a database and subsequently identify DNS anomalies including fast flux domains typically associated with botnet activity [10, 26]. While that work mentions DNS traffic due to anti-spam tools, our work differs in that we isolate and measure this black-list traffic, and we monitor all DNS query responses from clients, not just authoritative answers. Several of the methods described in these papers are incorporated into our DNS monitoring framework.

     Finally, Ren, et al. [27] propose visualization techniques for DNS data. Our motivation is similar to this study in some respects and we also utilize time series data as input, but our work differs in analysis method and in the tree-based visualizations that we produce.
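
     To make the idea of hierarchical aggregation by domain names concrete, the following is a minimal sketch, not TreeTop's actual implementation. It assumes query names are plain dotted strings and stores them in a tree keyed by label, from the top-level domain downward, so that each node carries the total count for every name beneath it. The class and method names (DomainNode, DomainTree, add, count_for) are hypothetical.

```python
class DomainNode:
    """One label in the domain hierarchy, with an aggregate query count."""

    def __init__(self):
        self.count = 0
        self.children = {}


class DomainTree:
    """Hypothetical domain-name tree: counts roll up from hosts to TLDs."""

    def __init__(self):
        self.root = DomainNode()

    @staticmethod
    def _labels(name):
        # "www.example.com." -> ["com", "example", "www"]
        return list(reversed(name.rstrip(".").lower().split(".")))

    def add(self, qname, count=1):
        """Record `count` queries for qname at every level of its suffix chain."""
        node = self.root
        node.count += count
        for label in self._labels(qname):
            node = node.children.setdefault(label, DomainNode())
            node.count += count

    def count_for(self, suffix):
        """Return the aggregate count for all names at or below `suffix`."""
        node = self.root
        for label in self._labels(suffix):
            node = node.children.get(label)
            if node is None:
                return 0
        return node.count


tree = DomainTree()
for name in ["www.example.com", "mail.example.com", "www.example.org"]:
    tree.add(name)

print(tree.count_for("example.com"))  # aggregate over www and mail
print(tree.count_for("org"))
```

     Because every insertion updates each ancestor, an aggregate for any domain suffix is available with a single walk down the tree, which is the property that makes volume-based reporting at different levels of detail inexpensive.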

     3.

     DNS MECHANICS

     We are primarily concerned with analysis of DNS packets sent in response to queries from end-hosts, i.e., those at the periphery of the Internet. As in [38], we analyze just the responses (replies) because the details of the query are repeated in the corresponding DNS response packet. Here we present a partial overview of the DNS service as it is used by these hosts and provide definitions of the terms we use in this paper.

     DNS query packets and response packets have a similar form, and are typically exchanged between clients and name servers using the UDP "domain" service port 53. The packet contains a header, a question section, and an answer section. Generally, queries are performed with query names that are Internet domain names. The Internet domain space is hierarchical (footnote 1), with a well-known set of top-level domains, such as "com," "net," and "org." Institutions have sub-domains, such as "example.com" and "example.org," in which they can arbitrarily create sub-domains and entries such as "www.example.com."

     DNS client hosts typically perform queries by using a resolver that is supplied with the operating system. The most common queries are for the IP addresses associated with domain names. These queries have a type IPv4 Address (A) or IPv6 Address (AAAA, known as "quad A") and contain a string-based query name such as "www.example.org," to which a DNS name server typically responds with either "No error" (NOERROR) or "Nonexistent Domain" (NXDOMAIN). In the NOERROR case, one or more IP addresses, such as 192.0.2.2, are returned in the response packet's answer section. Other common query types include those for Mail eXchanger (MX) records used to route e-mail, Pointer (PTR) records used to translate IP addresses to names, Service Location (SRV) records used for automatic discovery of services, and Text (TXT) records used for various purposes. Each query type may have its own corresponding answer type. We refer the reader to either [32] or [23], [24] for a thorough introduction to DNS packet structure and service semantics.

     4.

     EMPIRICAL DATASETS

     In this work, we are interested in DNS traffic, i.e., queries and corresponding replies, exchanged between Internet hosts and trustworthy recursive name servers. To assure the legitimacy of the servers, we monitor only the traffic involving those servers under the campus' administrative control. This spares us from having to question the validity of responses because the campus DNS servers perform recursive queries, on their clients' behalf, only to zone-authoritative name servers (based on referrals from the Internet's trusted root servers). Thus, we avoid rogue DNS servers such as those investigated in [9].

     For off-line analysis, we capture DNS traffic exchanged between campus client hosts and the campus' recursive anycast [2] DNS service. Our university operates a recursive name service consisting of four geographically dispersed server machines that answer queries received at one of the service's two IP addresses, which are in different campus network prefixes. As such, this recursive anycast DNS service exemplifies current best practice for a large, highly-reliable lookup service that serves tens of thousands of clients. The complication introduced by anycast is that any of the servers could handle a specific client's request, so we monitor all servers simultaneously, and combine the traces at synchronized points in time to get a complete view.

     In this paper, we consider a traffic trace from January 8, 2008 through April 21, 2008. Tables 1 (footnote 2) and 2 show the query types and response codes as percentages of total DNS traffic observed during this time. The active client numbers are based on the count of clients observed performing queries in a five minute interval. Figure 1 presents the traffic as a time series. While the details have been omitted for space, note the rich set of characteristics involving multiple dimensions in the measurement data.
     (The weeks labeled 2, 3, and 12 are during the January inter-semester and spring recesses, and thus had lower traffic volume due to fewer active clients.)

     Query Type    Queries/Sec     Active Clients
     A              671  (54%)     4521  (87%)
     PTR            310  (25%)     1386  (26%)
     AAAA           120  (10%)      906  (17%)
     MX              99   (8%)      197   (3%)
     TXT             25   (2%)      112   (2%)
     SRV              5   (0%)      145   (2%)
     any           1236 (100%)     5183 (100%)

     Table 1: DNS query distribution: average rates and average numbers of active clients by query type. An "active client" is one that has performed a DNS query within a given five minute interval.
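
     The "active client" definition used in Table 1 can be illustrated with a short sketch. It is an illustration under assumptions rather than the measurement code used for the paper: the record layout (timestamp in seconds, client address, query type) and the function name active_clients are hypothetical.

```python
from collections import defaultdict

INTERVAL_SECONDS = 300  # five-minute measurement interval, as in Table 1


def active_clients(records, interval=INTERVAL_SECONDS):
    """Count distinct clients per interval and query type.

    `records` is an iterable of (timestamp_seconds, client_ip, query_type)
    tuples (a hypothetical layout). The result maps each interval start to
    {query_type: distinct_client_count}; the pseudo-type "any" counts every
    client seen in the interval regardless of query type.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for timestamp, client, qtype in records:
        start = int(timestamp // interval) * interval
        seen[start][qtype].add(client)
        seen[start]["any"].add(client)
    return {
        start: {qtype: len(clients) for qtype, clients in by_type.items()}
        for start, by_type in seen.items()
    }


sample = [
    (0, "192.0.2.10", "A"),
    (12, "192.0.2.10", "PTR"),   # same client, second query type
    (40, "192.0.2.11", "A"),
    (310, "192.0.2.10", "A"),    # falls into the next interval
]
counts = active_clients(sample)
print(counts[0])
print(counts[300])
```

     In the first interval of the sample, two clients issue A queries and one of them also issues a PTR query, so the per-type counts sum to more than the "any" count. This is the same effect that keeps the active-client percentages in Table 1 from adding to 100%.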

     For online analysis in real-time, we also monitor traffic at individual DNS servers and on an individual workstation. That is, the traffic is observed within the end host, either the DNS server or client host, at its network interface.
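
     As an illustration of what can be read from a single response observed at a network interface, the sketch below decodes the fixed 12-byte DNS header and the question section of a UDP payload using the standard message layout. It is a simplified example, not the capture code used in this work: it assumes exactly one question with an uncompressed name, maps only the query types and response codes named in this paper, and the function names are hypothetical.

```python
import struct

# Numeric codes for the response codes and query types discussed in the text.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}
QTYPES = {1: "A", 12: "PTR", 15: "MX", 16: "TXT", 28: "AAAA", 33: "SRV"}


def parse_response(payload):
    """Return (query_name, query_type, response_code, answer_count)."""
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (six 16-bit fields).
    _ident, flags, qdcount, ancount, _ns, _ar = struct.unpack("!6H", payload[:12])
    if not flags & 0x8000:
        raise ValueError("not a response (QR bit is clear)")
    if qdcount != 1:
        raise ValueError("this sketch expects exactly one question")
    rcode = flags & 0x000F  # response code is the low four bits of the flags

    # Question name: length-prefixed labels terminated by a zero byte.
    labels, offset = [], 12
    while True:
        length = payload[offset]
        offset += 1
        if length == 0:
            break
        if length & 0xC0:
            raise ValueError("compressed names are not handled in this sketch")
        labels.append(payload[offset:offset + length].decode("ascii"))
        offset += length
    qtype, _qclass = struct.unpack("!2H", payload[offset:offset + 4])
    return (".".join(labels), QTYPES.get(qtype, str(qtype)),
            RCODES.get(rcode, str(rcode)), ancount)


def build_example_response():
    """Hand-built NXDOMAIN response for an A query on www.example.org."""
    header = struct.pack("!6H", 0x1234, 0x8183, 1, 0, 0, 0)
    question = b"\x03www\x07example\x03org\x00" + struct.pack("!2H", 1, 1)
    return header + question


example = build_example_response()
print(parse_response(example))
```

     Because the question is repeated in the response, these four values are available from the reply alone, which is consistent with analyzing only responses.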

     Footnote 1: See Figure 5 for a graphical example of a portion of the DNS hierarchy.

     Footnote 2: The percentages of active clients in Table 1 are not expected to add to 100% because any given active client can issue multiple types of queries in a measurement interval.

     Figure 1: DNS query and response rates, January 8, 2008 through April 21, 2008. Query rates by type are plotted above the horizontal axis and the corresponding response rates by code are plotted below. See Tables 1 and 2 for the rate values.

     Response Code    Responses/Sec
     NOERROR           729  (59%)
     NXDOMAIN          480  (39%)
     SERVFAIL           27   (2%)
     any              1236 (100%)

     Table 2: DNS response distribution: average rates by response code.

     5.

     ANALYSIS METHOD

     Our initial observation about the measurement data, presented in time series in Figure 1, is that the DNS query responses have a rich set of characteristics not unlike those seen when measuring all Internet traffic (i.e., not just DNS) involving a similar number of hosts. This observation motivated our analysis goals and the methods we developed to achieve them.

     5.1

     Goals

     We have two primary goals for off-line and real-time DNS traffic analysis:

     1. Distill Useful DNS Traffic Types. The number of combinations of DNS packet field values is large, similar to that of TCP and UDP IP headers in general IP traffic. This suggests applying analysis techniques successful in prior work, i.e., aggregation-based clustering techniques inspired by [7] and [14], both of which use hierarchical, volume-based clustering to more succinctly store and represent an otherwise overwhelming number of measures. Thus, our foremost goal is to distill the measurement data so that we can present essential, concentrated clusters that will be useful in both research and operations.

     2. Enable Flexible Analysis. Our second goal is a flexible analysis of DNS traffic such that we can answer new questions and conveniently apply the knowledge gleaned from our analysis to broader Internet traffic applications. For example, we wish to use the knowledge of the domain names by which clients refer to Internet hosts for the measurement and analysis of IP traffic in general. That is, we want to classify traffic by familiar domain name identifiers.

     5.2

     Methods

     We use two methods to achieve our goals:

     1. Context-awareness. Our first method is to form clusters by leveraging the knowledge of DNS syntax and semantics. Instead of attempting to apply general clustering methods (e.g., simple K-means), we use knowledge of the protocol itself and knowledge gleaned from prior work to assemble DNS-specific clusters. Our starting point for context-aware clustering is based on our specification of three general types of DNS queries. While other high-level taxonomies of DNS traffic are certainly possible, we argue that the following three classes support the goal of making the resulting analysis useful in both research and operations.

     Unwanted Traffic. Many of the prior empirical studies of DNS traffic discuss high-volume anomalies observed in the data, and are driven by concern about their potential impact on local and Internet-wide DNS operations. These anomalies are within an important class of unwanted DNS traffic including all sorts of misdirected and malformed queries, such as those with IP addresses as query names, unknown Top Level Domains (TLDs), PTR queries for RFC 1918 addresses, and names containing invalid characters.

     Overloaded Traffic. The DNS has come to be both extended and reused for new purposes in both foreseen and creative ways, i.e., it has become overloaded. By this we mean that an earlier function of the DNS is overloaded with new meaning (rather than meaning that the DNS service is experiencing excessive load due to these new purposes). In light of these new uses, there is the danger of misinterpreting this "overloaded" traffic as either unwanted or typical DNS traffic, so we wish to identify and isolate it in analyses. The primary examples of applications that overload the DNS are "black-lists." The most common intent and use of these lists is to limit spam or network abuse by providing a mechanism for determining whether or not a given IP address or domain name is currently a member of a list that is maintained by some "listing service" (both community-based and commercial services are available). These lists exist in many varieties including Real-time Blackhole Lists (RBLs), DNS Black-Lists (DNSBLs), DNS White-Lists (DNSWLs), Uniform Resource Identifier Black-Lists (URIBLs), Spam URI Real-time Black-Lists (SURBLs), and Right-Hand-Side Black-Lists (RHSBLs, for testing the domain name portion of an email address). Black-lists employ an informal protocol [1] atop DNS and, in doing so, they overload the meanings of the DNS A query type and its response codes. For instance, a given IP address or fully qualified domain
     name (FQDN) is tested by prepending it to the black-list's domain name, performing a DNS lookup, and testing for "magic numbers" in the returned answer. While the meaning of these numbers is defined by the particular black-listing service, black-lists clearly overload the DNS query types, response codes, and answers, thus requiring special context-aware treatment in our clustering method to isolate this traffic from the canonical. In Section 6, we explain in detail how we cluster this traffic using a technique we call "reflexive clustering."

     Canonical Traffic. This class of traffic is the expected, well-behaved DNS traffic. Essentially, it is what is likely to be left over once the unwanted and overloaded traffic is removed, and is most often used to identify hosts and services, such as converting domain names to IP addresses or the reverse (A, AAAA, or PTR queries), routing electronic mail (MX queries), etc. Canonical traffic uses the RFC-defined query classes, types, and response codes in a well-defined fashion. We have significant interest in the canonical traffic and the clients involved in it since our intent is to apply the information gleaned to improve identification and analysis of the subsequent IP traffic involving those clients. The DNS query/response traffic is a compelling, transparent source of additional information about Internet traffic beyond what is available in packet headers. DNS traffic is of relatively low volume (compared with all IP traffic involving a given population of clients), making it practical to process in real-time. Lastly, it is not obscured by encryption mechanisms that thwart general payload analysis.

     With these categories, our method improves the analysis of DNS traffic by using clusters involving multiple fields of the response packets (such as query name, response code, and answer values) and reflexive clusters prepared from other clusters in a DNS-specific way.
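
     The three traffic classes can be illustrated with a rough rule-based sketch. It is not the clustering method of this paper, which identifies black-lists automatically; here the black-list zone is supplied by hand, the TLD set is a small illustrative subset, and all names (BLACKLIST_ZONES, KNOWN_TLDS, dnsbl_query_name, is_listed, classify) are hypothetical. The sketch also assumes two conventions commonly used by DNS black-lists: IPv4 octets are reversed before being prepended to the list's zone, and listed entries are answered with addresses in 127.0.0.0/8.

```python
import ipaddress
import re

KNOWN_TLDS = {"com", "net", "org", "edu", "arpa"}   # illustrative subset only
BLACKLIST_ZONES = {"dnsbl.example.org"}              # hypothetical list zone
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")
# Letters, digits, hyphens; underscore allowed because SRV names use it.
VALID_LABEL = re.compile(r"^[A-Za-z0-9_]([A-Za-z0-9_-]*[A-Za-z0-9_])?$")


def dnsbl_query_name(ip, zone):
    """Name a client would look up to test an IPv4 address against a list."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(answers):
    """True if any A answer is a 127.0.0.0/8 'magic number' (assumed convention)."""
    return any(ipaddress.ip_address(a) in LOOPBACK for a in answers)


def is_ip_literal(name):
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def is_private_ptr(name):
    """True for reverse names of RFC 1918 addresses, e.g. 4.3.168.192.in-addr.arpa."""
    suffix = ".in-addr.arpa"
    if not name.endswith(suffix):
        return False
    octets = name[:-len(suffix)].split(".")
    if len(octets) != 4:
        return False
    try:
        addr = ipaddress.ip_address(".".join(reversed(octets)))
    except ValueError:
        return False
    return any(addr in net for net in RFC1918)


def classify(qname, qtype):
    """Assign a query to 'overloaded', 'unwanted', or 'canonical'."""
    name = qname.rstrip(".").lower()
    labels = name.split(".")
    # Check black-list zones first so their lookups are not mistaken for unwanted.
    if any(name.endswith("." + zone) for zone in BLACKLIST_ZONES):
        return "overloaded"
    if is_ip_literal(name):
        return "unwanted"            # IP address used as a query name
    if not all(VALID_LABEL.match(label) for label in labels):
        return "unwanted"            # invalid characters or empty labels
    if labels[-1] not in KNOWN_TLDS:
        return "unwanted"            # unknown top-level domain
    if qtype == "PTR" and is_private_ptr(name):
        return "unwanted"            # PTR query for an RFC 1918 address
    return "canonical"


probe = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
print(probe, classify(probe, "A"), is_listed(["127.0.0.2"]))
print(classify("www.example.org", "A"))
```

     Testing for black-list zones before the unwanted rules reflects the concern raised above: a black-list lookup that returns NXDOMAIN simply means "not listed" and should not be counted as a failed, unwanted query.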
     In short, we form clusters using the contextual knowledge of DNS traffic and its idiosyncrasies for unwanted, overloaded, and canonical traffic.

     2. Utilize Purpose-built Data Structures. Our method to achieve the goal of flexible clustering and analysis in real-time is to utilize efficient, high-performance data structures to handle IP addresses and domain names. (In contrast, a relational database as the data store is a good choice for off-line analysis as in [33] and [38].) The ability to store, look up, and report IP addresses and domain names is key to identifying and measuring the unwanted, overloaded, and canonical types of traffic. Furthermore, an implementation will benefit if these data structures can be combined
