Accelerating DPI Traffic Classifiers


     Niccolò Cascarano

     Politecnico di Torino C.so Duca degli Abruzzi 24 Torino, Italy

     Luigi Ciminiera

     Politecnico di Torino C.so Duca degli Abruzzi 24 Torino, Italy

     Fulvio Risso

     Politecnico di Torino C.so Duca degli Abruzzi 24 Torino, Italy

     niccolo.cascarano@polito.it luigi.ciminiera@polito.it



     Traffic classification through Deep Packet Inspection (DPI) is considered extremely expensive in terms of processing costs, leading to the conclusion that this technique is not suitable for analyzing traffic at high speed. This paper shows that some of these performance issues derive from inefficient implementation choices in the DPI classifier that ignore common characteristics of network traffic. We present some optimizations that can dramatically decrease the processing cost and can even improve the classification precision.

     This paper analyzes the characteristics of network traffic and proposes new optimizations that are able to substantially reduce the cost of a DPI classifier. For each optimization we evaluate the impact on the processing load and on the accuracy of the classification process. Our analysis is based on real traffic traces, some of them also containing a relevant portion of peer-to-peer traffic, which is known to be challenging for DPI classifiers. Moreover, two of the three traffic traces used have been collected using the GT suite [2], thus providing an objective ground truth for evaluating the classification accuracy. This paper is organized as follows. Section 2 presents the existing works on DPI optimizations, while Section 3 introduces the general architecture of a DPI classifier. Section 4 summarizes the procedures used to evaluate the proposed optimizations, while Section 5 presents the optimizations and evaluates their impact in terms of processing load and classification precision. Finally, Section 6 concludes the paper.



     Deep packet inspection (DPI) is perhaps the most common technique for traffic classification, thanks to its (relatively) straightforward hardware implementation and to the fact that real alternative techniques are still in the research domain. For example, statistical classifiers have still to demonstrate their scalability with respect to the high number of protocols and applications (often 100+) we need to distinguish. However, DPI classifiers are considered expensive in terms of CPU and memory consumption. In fact, only a few implementations are able to work at wire speed at 10 Gbps without sampling techniques or other tricks that decrease the amount of processing power required to handle the traffic. A careful examination of network traffic makes clear that the processing cost can often be decreased by simply avoiding some (useless) processing, which can even improve the classification precision. An obvious example refers to network sessions that cannot be classified due to the use of encryption. Since the DPI engine usually repeats its classification attempt at every new packet [1], this can entail a large waste of computational resources, especially if the "unclassifiable session" generates a large amount of data. Moreover, this approach increases the misclassified traffic, due to the probability that a signature matches a random payload.



     Papers on DPI classification usually focus on the analysis or the definition of new techniques for performing fast and scalable regular expression matching [3-5]; none of these works analyzes the whole process of DPI classification, thus missing some optimizations that could be implemented in the DPI architecture itself. Many other works [6-11] present new alternative classification techniques assuming that DPI is computationally too complex to be used in real-time systems, but they do not perform a complete analysis or at least a comparison with a real DPI classifier; hence DPI classification is used mostly for post-processing traffic traces (off-line) to derive ground truth information. The work of Moore et al. [12] is perhaps the only one that tries to investigate some possible improvements for a DPI classifier; the authors propose a sequence of increasingly complex techniques that basically increase the amount of data to be analyzed until the algorithm returns a positive match on a known protocol. However, they limit their analysis to the classification precision and do not investigate the impact in terms of processing cost. Furthermore, they use old traffic traces that do not contain peer-to-peer traffic, which is known to be a challenge for DPI classifiers because of its frequent use of encryption and content-hiding techniques to avoid identification.

     A (relatively) widespread technique to reduce the processing cost of traffic classification consists in using only the first portion of the session data for classification. For instance, the BRO intrusion detection system [13] stops the pattern matching after examining 1 KByte of application payload, while Snort [14] allows specifying (for each signature) the limit of application data to be analyzed before stopping the search. However, the amount of data being analyzed before stopping the classification process has been chosen as "reasonable" based on the user's experience. Also in this case, no scientific analysis has been performed to justify this choice and to evaluate its impact on classification precision and computational load.



     A DPI classifier relies on the observation that each application uses specific protocol headers to initiate and/or control the information transfer. DPI mostly uses regular expressions (commonly called signatures) to identify the peculiar sequence of data of an application protocol. When a packet belonging to a session not yet associated to an application-level protocol is delivered to a DPI classifier, the classifier compares the application data with all the available signatures. Usually, only one signature matches, and in this case the packet (hence the entire TCP/IP session) is associated to the corresponding application. In case several signatures match the same payload, a heuristic is used to select the best one (e.g., the longest signature, priority, etc.). The signature checking is skipped for all the packets belonging to TCP/IP sessions that have already been successfully classified. Although there are several types of DPI classifiers, the most common categories are (i) packet based, per-flow state (in short PBFS), which analyzes data on a packet-by-packet basis as soon as packets are received by the classifier, and (ii) message based, per-flow state (in short MBFS), which analyzes application-level payload as a unique stream of data, after applying TCP/IP normalization (Section 3.2). The PBFS and MBFS approaches have already been investigated in previous works [1,12], but some interesting aspects are still open. For instance, it is currently unclear which approach is the best for traffic classification, although we tend to agree with [1], which suggests that PBFS is usually enough. In fact, in Section 3.2 we speculate why TCP/IP normalization can be avoided in the traffic classification problem, while Section 5 will provide some experimental results supporting this assumption.
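     The per-flow bookkeeping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two signatures, the protocol labels and the function names below are hypothetical, not the classifier or signature database used in the paper, and ties between matching signatures are not broken heuristically as a real engine would.

```python
import re

# Hypothetical signatures for illustration only (not the NetPDL database).
SIGNATURES = {
    "http": re.compile(rb"^(?:GET|POST|HEAD) "),
    "ssh":  re.compile(rb"^SSH-\d\.\d"),
}

flow_table = {}  # 5-tuple -> protocol label, or None while still unclassified

def classify_packet(five_tuple, payload):
    """PBFS: classify packet by packet, keeping only per-flow state."""
    label = flow_table.get(five_tuple)
    if label is not None:
        return label                        # session already classified: skip matching
    for proto, sig in SIGNATURES.items():
        if sig.search(payload):             # first match wins in this sketch
            flow_table[five_tuple] = proto  # label the whole TCP/IP session
            return proto
    flow_table[five_tuple] = None           # unknown so far; retried at the next packet
    return None
```

     Note how signature checking is repeated at every packet of a still-unclassified session, which is exactly the behavior that makes "unclassifiable" sessions expensive.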

     TCP/IP normalization aims at recovering from IP fragmentation (i.e., when an IP packet is larger than the MTU of the physical network and hence must be split into several fragments) and performing TCP reassembly (e.g., when an application-layer message cannot fit in a single IP packet and hence is split across different packets). These techniques can be used in DPI classifiers in order to examine a portion of data that is larger than a single IP packet. In fact, the regular expressions used in a DPI engine might require a larger number of bytes than the ones available in the first packet (with a valid payload) of the TCP/UDP session. This happens frequently when the regular expression contains the Kleene closure operator (i.e., '*'), which means that an arbitrary number of characters might be present in the specified position in the data under analysis. Our speculation is that TCP/IP normalization (and its associated overhead) is often useless in traffic classification, because the number of bytes that need to be examined for a correct classification of TCP/UDP sessions is usually small, i.e., the data contained in the first packet of the session is usually enough. If this assumption is true, the MBFS approach can be replaced with the simpler PBFS one without (or with a limited) loss of precision.



     This Section presents the parameters taken into consideration in our analysis and describes the methodology used to evaluate the processing cost and the classification accuracy of the proposed optimizations, comparing these results with the ones obtained by the baseline classifier. In addition, it presents the traffic traces used in the evaluation.


     Parameters under evaluation

     We define three parameters for the evaluation of each optimization: (i) the performance speedup of the new classifier compared to the PBFS classifier presented in Section 3.1 and used as baseline, and the accuracy in terms of (ii) percentage of unknown traffic and (iii) misclassified traffic. The traffic correctly classified can be derived by complementing the unknown and misclassified traffic, i.e., Correct = 100% - (Unknown + Misclassified). The "unknown traffic" is the traffic that does not match any known protocol signature. The "misclassified traffic" is the traffic that matches a protocol signature that does not correspond to the application that generated it. Obviously, the last parameter can be calculated only on traffic traces generated with the GT suite. The performance speedup is the ratio between the processing cost of the classifier used as baseline and that of the optimized one, when classifying the same traffic trace. The processing cost has been evaluated by running the classifier on our traces and measuring the average processing cost per packet. Measurements have been done using the RDTSC assembly instruction available on Pentium-compatible processors and include only the time spent in the DPI classifier, excluding all the portions of code outside the classifier (e.g., loading packets from disk). In order to mitigate the effects of context switching and cache filling, measurements were executed with no other jobs running on the machine
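     The accuracy bookkeeping just described amounts to simple percentage arithmetic. The sketch below (with made-up byte counts in the example) shows the relation Correct = 100% - (Unknown + Misclassified) and the speedup ratio; the function names are ours, not from the paper's code.

```python
def accuracy(unknown_bytes, misclassified_bytes, total_bytes):
    """Return (correct, unknown, misclassified) as percentages of total bytes."""
    unknown = 100.0 * unknown_bytes / total_bytes
    misclassified = 100.0 * misclassified_bytes / total_bytes
    correct = 100.0 - (unknown + misclassified)  # Correct = 100% - (Unknown + Misclassified)
    return correct, unknown, misclassified

def speedup(baseline_ticks_per_pkt, optimized_ticks_per_pkt):
    """Performance speedup: baseline per-packet cost divided by optimized cost."""
    return baseline_ticks_per_pkt / optimized_ticks_per_pkt
```

     For instance, with hypothetical counts, accuracy(25, 15, 100) yields (60.0, 25.0, 15.0).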


     Baseline DPI classifier

     Our "baseline classifier" is the DPI classifier presented in [1], which is based on the NetBee library [15] and can identify the protocols present in the NetPDL database [16]. In addition to being a pure PBFS classifier, it can analyze correlated sessions such as the ones created at run-time by some protocols (e.g., FTP or SIP), whose network parameters are negotiated in the control connection. Since these sessions are usually used to transport a large amount of data, the capability to recognize them may provide a boost in classification precision, as shown in [12].



     Table 1: Details on traffic traces used

     Data set    Date and Duration            Bytes                Packets
     POLITO-GT   December 10, 2008, 68 hours  202 GB (76.8% TCP)   330M (69.8% TCP)
     UNIBS-GT    December 17, 2008, 56 hours  3.5 GB (99.4% TCP)   4.72M (67.6% TCP)
     POLITO      December 20, 2007, 12 hours  419 GB (94.7% TCP)   579M (92.3% TCP)

     Table 2: Cost of the most important components of the baseline PBFS classifier

     Block name                    Cost (ticks)
     Sess. ID Extraction           78
     Sess. Lookup/Update           49
     Pattern Match (POLITO, UDP)   2988
     Pattern Match (POLITO, TCP)   9051
     Pattern Match (min, max)      76 - 19147

     apart from essential system daemons. The measurement platform was an Intel dual Xeon 5160 at 3 GHz with 4 GB of RAM, running Ubuntu 8.04 32-bit; the code under examination was compiled with GCC v4.2.4 at the -O3 optimization level and always executed on the same CPU core.

     Table 3: Classification cost and precision of the baseline classifier

     Data set          Avg. cost (ticks/pkt)   Unknown traffic (bytes)   Misclass. traffic (bytes)
     POLITO-GT (tcp)   6681                    72.7%                     16.9%
     UNIBS-GT (tcp)    2503                    29.2%                     7.73%
     POLITO (tcp)      743                     5.67%                     N/A
     POLITO-GT (udp)   242                     0.43%                     57.7%
     UNIBS-GT (udp)    758                     14.7%                     1.39%
     POLITO (udp)      709                     15.3%                     N/A


     Traffic traces used

     Table 1 summarizes the most important characteristics of the three traffic traces that we used to evaluate the goodness of the proposed optimizations. All of them were collected through the well-known tcpdump tool at the border routers of the Politecnico di Torino and University of Brescia campuses and then properly anonymized. The "POLITO-GT" and "UNIBS-GT" traces were obtained using the GT suite [2], thus we know exactly which application generated each session; this information allows evaluating the absolute accuracy of the classification process in terms of correctly classified, misclassified and unknown traffic. POLITO-GT contains mainly peer-to-peer and WebTV traffic, generated by 10 virtual machines running Edonkey, Bittorrent, Skype, PPLive, TVAnts and SopCast on Windows XP, plus four real machines running Linux, Windows Vista and MacOS X with default settings and used by regular users. WebTV applications were executed with an automatic turnover of 1 hour, while P2P applications were downloading and seeding some popular resources for the whole duration of the capture. This traffic dataset is known to be very challenging for DPI classifiers because of the massive presence of P2P traffic. Some signatures related to WebTV protocols were not available at all, while some have been derived through reverse engineering. Skype traffic is encrypted, and P2P clients adopt hiding techniques to avoid classification. The UNIBS-GT trace was collected using the GT suite in a research laboratory where about 20 PhD students were asked to run the GT daemon while doing their normal activities. This trace is smaller than POLITO-GT in terms of volume, but it is interesting because it contains normal users' activity, including P2P file sharing. Both POLITO-GT and UNIBS-GT include traffic coming from a limited number of hosts, due to the difficulty of deploying the GT suite over many clients. Thus, to extend the evaluation scenario, we decided to add the POLITO trace, which represents the traffic generated by a medium-large network; this trace includes traffic generated by about 6000 hosts during an entire working day. For this trace we do not have the ground truth, so we can only evaluate the amount of classified and unknown traffic, but this is still interesting for evaluating the impact of the proposed optimizations, at least in terms of processing costs, in a more realistic scenario.


     Performance of the baseline classifier

     Table 2 reports the cost of executing the most important portions of a PBFS DPI classifier, notably the Session ID Extraction (i.e., the extraction of the source/destination IP address and source/destination TCP/UDP port from each incoming packet), the Session Lookup (i.e., the operation of determining whether the current session has already been classified) and the Pattern Matching, which represents the most important component of the per-packet processing cost. While the cost of the first two components is independent of the input packet, the pattern matching heavily depends on the payload being analyzed and hence on the traffic trace used in the evaluation. For instance, the cost of most pattern matching algorithms depends on the number of bytes to be evaluated, while the average per-packet cost also depends on the number of packets required for classifying a session, and more. Therefore it is difficult to present an average processing cost for the pattern matching engine (in fact Table 2 reports different values for TCP and UDP traffic for the specific traffic trace), because of the many parameters it depends on (i.e., average packet size, average session length, presence of unclassifiable sessions, type of regular expressions used, etc.). The cost reported for the pattern matching algorithm has been evaluated by implementing the signature database as a Deterministic Finite Automaton (DFA). The cost range (last row in Table 2) represents the cost of executing the DFA on from 1 byte up to 1460 bytes of input. We chose the DFA as pattern matching algorithm mainly because (i) its cost is linear and depends only on the number of input characters analyzed and (ii) it supports the merging of multiple regular expressions, avoiding a linear dependency on the number of patterns. As a possible drawback, the memory occupied could grow exponentially. Section 5.1 will demonstrate that this event does not occur when using our database of signatures, which justifies our choice of a simple and efficient DFA engine. Additionally, Section 5.1 will also show the possible performance penalties when using a more sophisticated regular expression engine.
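     The linear-cost property of the DFA is easy to see in a toy table-driven automaton. The two merged "signatures" below (the literal prefixes GET and SSH) are hypothetical examples, not patterns from the NetPDL database; the point is that the matcher performs exactly one table lookup per input byte, regardless of how many patterns were merged into the table.

```python
# States: 0=start, 1='G', 2='GE', 3='GET' (accept http), 4='S', 5='SS', 6='SSH' (accept ssh)
TRANSITIONS = {
    (0, ord("G")): 1, (1, ord("E")): 2, (2, ord("T")): 3,
    (0, ord("S")): 4, (4, ord("S")): 5, (5, ord("H")): 6,
}
ACCEPTING = {3: "http", 6: "ssh"}

def dfa_classify(payload: bytes):
    """One transition per byte: cost is linear in the bytes scanned."""
    state = 0
    for byte in payload:
        state = TRANSITIONS.get((state, byte))
        if state is None:
            return None                 # dead state: no merged signature can match
        if state in ACCEPTING:
            return ACCEPTING[state]
    return None
```

     A production DFA would use a dense state-transition table rather than a dictionary, but the per-byte cost model is the same.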

     Table 4: Cost of the DFA and PCRE pattern matching algorithms on real traffic

     Block name              Cost (ticks)
     DFA (min - avg - max)   76 - 3980 - 19147
     PCRE (min - avg - max)  35.7K - 2.08M - 9.16M


     Table 3 reports the classification results (in terms of average processing cost per packet and unknown / misclassified traffic) obtained on the three traffic traces by the baseline DPI classifier. The POLITO-GT trace has a very high percentage of unknown TCP traffic. This is expected because the signatures for the WebTV protocols, which represent the largest part of the traffic captured, are partially unknown and partially derived through reverse engineering (and not very precise). With respect to the UDP portion, the result is even more problematic because of the high percentage of misclassified traffic. The UNIBS-GT trace is less critical than POLITO-GT, since the percentage of misclassified traffic is reasonably low; we still have a large portion of unknown traffic due to the use of P2P file sharing applications. For the POLITO trace we only have the information on unknown traffic, which turns out to be lower than in the other two traces for the TCP case (most hosts on the network use only "standard" applications such as web and email). Vice versa, the unknown UDP traffic is rather high, probably due to the presence of Skype traffic that goes mostly undetected.


     Figure 1: Cost and memory occupation of the PCRE implementation of the pattern matching algorithm.

     We compared the pattern matching algorithm implemented by the "PCRE" library1 [19], which uses a Non-Deterministic Finite Automaton (NFA) that conducts a depth-first search of the pattern tree, and a DFA generated by the lexical analyzer generator "Flex" [20]. Both use the same regular expressions in input and operate over a full-size IP packet (1460 bytes of payload) filled with fake data that do not match any pattern. We measured the processing cost and the memory occupancy starting with a single regular expression, then repeating the same test with N, N+1, ... patterns, until reaching the total of 48 patterns contained in our protocol database. Analyzing Figure 1, which refers to PCRE, three points are immediately evident: (i) as expected, the processing cost increases with the number of regular expressions to be tested, (ii) its absolute value is very high (on the order of millions of clock ticks on our test machine) and (iii) the growth is not regular. In particular, the last point depends on the type of signature added. In this respect, we can divide the regular expressions into three classes: (i) anchored regexps (i.e., beginning with the '^' sign), which are satisfied only if the pattern is found at the beginning of the payload, (ii) anchored regexps containing the Kleene closure (i.e., the '*' wildcard), and (iii) non-anchored regexps containing the Kleene closure, which can match at any point of the input data. Signatures of type (ii) or (iii) force the algorithm to analyze all the data before stating whether the regular expression matches or not. The big cost gaps in Figure 1 occur when one signature of these two types is added to the previous set. Figure 2 shows the outcome of the same test for the DFA engine. The most relevant point is that the processing cost does not depend on the number of regular expressions represented by the DFA and is two orders of magnitude smaller than in the PCRE case. On the other hand, the memory occupation is definitely larger (at least 3 orders of magnitude, although still acceptable) than the one used by PCRE, and its growth is not linear as in the previous case.

     1 We chose this implementation because of its representativeness, attested by its wide diffusion in many different open source tools and commercial products across a wide range of network applications.



     This Section presents a set of possible optimizations that improve the behavior of a DPI traffic classifier.


     Efficient pattern matching engine

     As discussed in Section 4.3, the main cost of a DPI classifier derives from the execution of the pattern matching algorithm. There are several possible implementations of regular expression search, but techniques based on a Deterministic Finite Automaton (DFA) appear to be the best in terms of performance. The DFA computational cost depends only on the length of the input sequence, independently of the number of regular expressions to be checked. The drawback of the DFA is memory occupation; depending on the characteristics of the regular expressions, the automaton that represents the set of regular expressions may require a large amount of memory. There are many pattern search algorithms alternative to the DFA; some even have a cost that is less than linear with respect to the length of the input (e.g., the Boyer-Moore algorithm [17] used by Snort [14]), but the problem of these algorithms is that they cannot combine multiple patterns together. Hence, the advantage brought by the lower computational cost and the better memory consumption is nullified by the need to test all the signatures sequentially, leading to a cost that depends linearly on the number of signatures to be tested. In fact, DPI classifiers that implement these algorithms use some heuristics to reduce the number of signatures to be tested on each session [18]. We speculate that the choice of the proper regular expression engine can provide a big boost in performance, and we tested the cost and the memory occupied by two different pattern matching engines.
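     The sequential-versus-merged trade-off described above can be sketched as follows. The pattern list is hypothetical, and Python's re module stands in for the matching engine (its compiled patterns are not a pure DFA, so this illustrates only the interface difference, not the DFA cost model): the sequential variant may scan the payload once per signature, while the merged variant scans it once in total.

```python
import re

# Hypothetical signature prefixes for illustration only.
PATTERNS = [rb"^GET ", rb"^SSH-", rb"^\x13BitTorrent", rb"^USER "]

# Sequential: up to len(PATTERNS) scans of the payload per packet.
sequential = [re.compile(p) for p in PATTERNS]

def match_sequential(payload):
    for sig in sequential:
        if sig.search(payload):
            return True
    return False

# Merged: all signatures combined into one alternation, one scan per packet.
merged = re.compile(b"|".join(b"(?:%s)" % p for p in PATTERNS))

def match_merged(payload):
    return merged.search(payload) is not None
```

     Both variants accept and reject the same payloads; only the number of passes over the data differs.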


     Table 5: Execution cost on different types of regular expressions

     Signature type                           Match (ticks)   No-match (ticks)
     Anchored                                 1663            1415
     Anchored + Kleene                        5622            1367
     Not anchored + Kleene                    5503            3300
     Not anchored + Kleene and backtracking   5290            13659


     Figure 2: Cost and memory occupation of the DFA implementation of the pattern matching algorithm.

     In fact, the memory occupation of the DFA increases linearly when the input patterns contain only expressions of type (i) (first region on the left); the slope increases when we also add expressions of type (ii) (second region), and it tends to explode exponentially when we add the second expression of type (iii). This is due to the possible ambiguities in the input patterns, which force the addition of a large number of states to match all the possible cases. It is worth noting that the number of states explodes when at least two expressions of type (iii) are merged in the same DFA. Although DFA state explosion has been widely investigated and several solutions exist (e.g., [3-5]), we believe that a careful selection of the regular expressions could even make these techniques unnecessary. For instance, in this paper we classified the traffic using the regular expressions defined in the NetPDL database [16], which includes only two patterns of type (iii), one for a protocol encapsulated in TCP (http) and the other in UDP (nt-security-log). Consequently, it is possible to generate two different DFAs for TCP- and UDP-based protocols; this avoids any memory explosion, and the total amount of memory used was about 3 MBytes, roughly split in half between TCP and UDP.

     Table 4 reports the execution cost of both DFA and PCRE with the complete set of regular expressions on real traffic. For both implementations we measured the minimum and maximum costs experienced by a packet in our traces, as well as the average cost per packet. The results show the evident gap between the DFA algorithm and the heavier PCRE implementation, which is about 3 orders of magnitude in all cases (minimum / maximum / average). We can conclude that a simpler and more efficient pattern matching engine can have a dramatic impact on the performance of the DPI classifier, as shown in our tests. However, this requires a careful definition of the regular expressions (i.e., avoiding, whenever possible, the Kleene closure and using anchored patterns) in order to avoid the drawbacks of the simpler engines. Furthermore, the regular expressions used must not be context-sensitive (e.g., (a*)(b*)(\1)(\2)), since such expressions are not supported by a DFA implementation. In other words, a DFA implementation guarantees better performance but limits the expressiveness of the regular expressions that can be used. As in the previous case, we speculate that these regular expressions are very rare in the real world; for instance, the database of signatures we used did not contain any of them. The problem of writing "better" regular expressions will be investigated in the next Section.


     Writing better regular expressions

     As we have seen, the form of the regular expressions is instrumental in enabling faster pattern matching engines. In this Section we demonstrate that the form of the regular expression may have an additional impact on the overall performance, even with the same engine. We evaluated on the POLITO trace four different versions of the signature used by different classifiers2 to recognize the HTTP protocol, measuring the average cost of each regular expression in terms of clock ticks per packet and differentiating between matching and no-matching costs. Table 5 shows that the anchored version3 of the signature has a lower cost in both the match and no-match cases, because the signature requires that the pattern start at the beginning of the packet (and hence matching usually stops after a few bytes).
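     The gap between the anchored and non-anchored rows of Table 5 can be reproduced qualitatively by counting how many starting offsets a naive matcher has to try on a payload that matches nothing. The signature below is a hypothetical HTTP-like pattern, and the offset count is only a proxy for the clock-tick costs measured in the paper.

```python
import re

def starting_offsets_tried(pattern, payload, anchored):
    """Count the starting offsets a naive engine tries before giving up."""
    offsets = [0] if anchored else range(len(payload) + 1)
    tries = 0
    for off in offsets:
        tries += 1
        if pattern.match(payload, off):   # attempt a match at this offset only
            return tries
    return tries

signature = re.compile(rb"(?:GET|POST|HEAD) [^ ]+ HTTP/1\.[01]")
payload = bytes(1460)                     # full-size payload that matches nothing

anchored_cost = starting_offsets_tried(signature, payload, anchored=True)    # 1
floating_cost = starting_offsets_tried(signature, payload, anchored=False)   # 1461
```

     The anchored version rejects the packet after a single attempt, while the non-anchored one retries at every offset, mirroring the no-match column of Table 5.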
