Graph-based P2P Traffic Classification at the Internet Backbone...

By Jennifer Flores,2014-05-27 15:08

Graph-based P2P Traffic Classification at the Internet Backbone

Marios Iliofotou, Hyun-chul Kim, Michalis Faloutsos, Michael Mitzenmacher§, Prashanth Pappu, and George Varghese

University of California, Riverside; CAIDA and Seoul National University; University of California, San Diego; §Harvard University; Conviva, Inc.

Abstract: Monitoring network traffic and classifying applications are essential functions for network administrators. In this paper, we consider the use of Traffic Dispersion Graphs (TDGs) to classify network traffic. Given a set of flows, a TDG is a graph with an edge between any two IP addresses that communicate; thus TDGs capture network-wide interactions. Using TDGs, we develop an application classification framework dubbed Graption (Graph-based classification). Our framework provides a systematic way to harness the power of network-wide behavior, flow-level characteristics, and data mining techniques. As a proof of concept, we instantiate our framework to detect P2P applications, and show that it can identify P2P traffic with recall and precision greater than 90% in backbone traces, which are particularly challenging for other methods.

I. INTRODUCTION

An important task when monitoring and managing large networks is classifying flows according to the application that generates them. Such information can be utilized for network planning and design, QoS and traffic shaping, and security. In particular, detecting P2P traffic is a potentially important problem for ISPs that want to manage such traffic, and for specific groups such as the entertainment industry in legal and copyright disputes. Detecting P2P traffic is also of particular interest since it represents a large portion of Internet traffic, with more than 40% of the overall volume in some networks [11].

Most current application classification methods can be naturally categorized according to their level of observation: payload-based signature-matching methods [16], [14], flow-level statistical approaches [6], [18], or host-level methods, such as BLINC [13], [24]. Each existing approach has its own pros and cons, and no single method clearly emerges as a winner. Relevant problems that need to be considered include identifying applications that are new, and thus without a known profile; operating at backbone links [2], [13]; and detecting applications that intentionally alter their behavior. Flow-level and payload-based approaches require per-application training and will thus not detect traffic from emerging protocols. Host-based approaches can detect traffic from new protocols [13], but have weak performance when applied at the backbone [2]. In addition, most tools including BLINC [13] (which has 28 parameters) require fine-tuning and careful selection of parameters [2]. We discuss the limitations of previous methods in more detail in §IV.

In this paper, we use the network-wide behavior of an application to assist in classifying its traffic. To model this behavior, we use graphs where each node is an IP address, and each edge represents a type of interaction between two nodes. We use the term Traffic Dispersion Graph or TDG to refer to such a graph [10]. While we recognize that some previous efforts [3], [5] have used graphs to detect worm activity, they have not explored the full capabilities of TDGs for application classification.

We propose a classification framework, dubbed Graption, as a systematic way to combine network-wide behavior and flow-level characteristics. Graption first groups flows using flow-level features, in an unsupervised and agnostic way, i.e., without using application-specific knowledge. It then uses TDGs to classify each group of flows. As a proof of concept, we instantiate our framework and develop a P2P detection method, which we call "Graption-P2P". Compared to other methods, Graption-P2P is easy to configure and requires very little a priori knowledge (mainly a few intuitive parameters). The experimental part of our paper shows that:

• Graption-P2P identifies over 90% of P2P traffic with precision greater than 95% in backbone traces.
• Graption-P2P performs better than BLINC in P2P identification at the backbone. For example, Graption-P2P identifies 95% of BitTorrent traffic while BLINC identifies only 25%.
• Even a single backbone link contains enough information to generate TDGs that can be used to classify traffic. In addition, TDGs of the same application seem fairly consistent across different times and locations.

The rest of the paper is organized as follows. In §II we define TDGs, and identify TDG-based metrics that differentiate between applications. In §III we present the Graption framework and our instantiation, Graption-P2P. In §IV we discuss related work. In §V we conclude the paper.

II. TRAFFIC DISPERSION GRAPHS

Definition. Throughout this paper, we assume that packets can be grouped into flows using the standard 5-tuple {srcIP,

Name      Date/Time          Duration   Flows
TR-PAY1   2004-04-21/17:59   1 hour     38,808,604
TR-PAY2   2004-04-21/19:00   1 hour     37,612,752
TR-ABIL   2002-09/(N/A)      1 month    2,057,729

TABLE I. Set of backbone traces from the Cooperative Association for Internet Data Analysis (CAIDA). Statistics for the TR-ABIL trace are reported only for the first five-minute interval.

srcPort, dstIP, dstPort, protocol}. Given a group of flows S, collected over a fixed-length time interval, we define the corresponding TDG to be a directed graph G(V, E), where the set of nodes V corresponds to the set of IP addresses in S, and there is a link (u, v) ∈ E from u to v if there is a flow f ∈ S between them. In this paper, we consider bidirectional flows. We define a TCP flow to start on the first packet with the SYN flag set (referred to as the SYN-packet), so that the initiator and the recipient of the flow are defined for the purposes of direction. For UDP flows, direction is decided upon the first packet of the flow.

Data Set. To study TDGs, we use three backbone traces from a Tier-1 ISP and the Abilene (Internet2) network. These traces are summarized in Table I. All data are IP anonymized and contain traffic from both directions of the link. The TR-PAY1 and TR-PAY2 traces (see footnote 1) were collected from an OC48 link of a commercial US Tier-1 ISP at the Palo Alto Internet eXchange (PAIX). The TR-ABIL trace is a publicly available data set collected from the Abilene (Internet2) academic network connecting Indianapolis with Kansas City. The Abilene trace consists of five randomly selected five-minute samples taken every day for one month, and covers both day and night hours as well as weekdays and weekends.

Ground Truth. We used a Payload-based Classifier (PC) to establish the ground truth of flows for the TR-PAY1 and TR-PAY2 traces. Both traces contain up to 16 bytes of payload in each packet, thereby allowing the labeling of flows using the signature matching techniques described in [2], [13]. Running the PC over the TR-PAY1 and TR-PAY2 traces we find 14% of the traffic to be P2P, 28% Web, 6% DNS, and the rest to belong to other applications, such as Email, FTP, NTP, SNMP, etc. For our study, we remove the 2% of traffic that remained unclassified and the 28% that contained no payload.

A. Identifying P2P TDGs

Identifying the right metrics to compare graph structures is a challenging question that arises in many disciplines [17].
Our approach is to consider several graph metrics, each capturing a potentially useful characteristic, until a set of metrics is found that distinguishes the target graphs. To select an appropriate set of metrics, we generate a large number of TDGs using all our traces (Table I), thus observing TDGs over two different locations at the backbone. For the

     1 The authors thank CAIDA for providing this set of traffic traces. Additional information for these traces can be found in the DatCat, Internet Measurement Data Catalog [26], indexed under the label "PAIX".

     TR-PAY1 and TR-PAY2 traces, we use the payload-based classifier (PC) in order to select which flows belong to each TDG. Since the TR-ABIL trace does not have any payload information, we use port numbers [2] to assign flows to applications. We can use port numbers for the TR-ABIL trace since it was collected in 2002, when most P2P applications used their default port numbers [7], [12]. We only use the TR-ABIL trace to verify our TDG observations over a second location in the backbone and we do not use it in the final evaluation of our classifier. By using

the month-long TR-ABIL trace, we can study the consistency of TDGs over different times of the day and over weekdays and weekends. We observe TDGs over 5-minute intervals. This interval length gives good classification results and stability of TDG metrics over time.

For each TDG we generate a diverse set of metrics. Our metrics capture various aspects of TDGs including the degree distribution, degree correlations, connected components, and distance distribution. For additional details about these metrics we refer the reader to [9], [17]. To select the right set of metrics we use various graph visualizations and trial and error. Finding a less ad hoc approach is beyond the scope of this work.

Two TDG visualization examples are shown in Figure 1. We see that FastTrack (P2P) has a denser graph than HTTPS, that is, a higher average degree, where the average node degree k is given by k = 2|E|/|V|. We utilize two other metrics that capture the directionality of the edges in the graph and the distances between nodes. The directionality is useful since we know that pure clients only initiate traffic, pure servers should never initiate traffic, and that some P2P nodes play both roles. To capture this quantitatively, we define InO to be the percentage of nodes in the graph that have both incoming and outgoing edges. The distance between two nodes is defined as the length of their shortest path in the graph. The diameter of a graph is defined as the maximum distance between all pairs of nodes, which makes it a sensitive metric [17]. For a more robust metric, we use the effective diameter (EDiam), which we define as the 90th percentile of all pairwise distances in the graph.

From our measurements, we empirically derive the following two rules for detecting P2P activity.

Rule 1: k > 2.8 and InO > 1%
Rule 2: InO > 1% and EDiam > 11

With these simple rules, we can correctly identify all P2P TDGs from both backbone locations (Abilene backbone and Tier-1 ISP).
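The three metrics and the two rules can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. All function names and the toy flows are hypothetical. The sketch also makes three assumptions that the text does not state: pairwise distances are computed on the undirected version of the graph, only reachable pairs contribute to EDiam, and a TDG is flagged as P2P when either rule holds.

```python
import math
from collections import defaultdict, deque

def build_tdg(flows):
    """Collapse 5-tuple flows into a set of directed edges (initiator -> recipient)."""
    return {(src_ip, dst_ip) for src_ip, _sport, dst_ip, _dport, _proto in flows}

def nodes_of(edges):
    return {ip for edge in edges for ip in edge}

def average_degree(edges):
    """k = 2|E| / |V|."""
    nodes = nodes_of(edges)
    return 2 * len(edges) / len(nodes) if nodes else 0.0

def ino_percent(edges):
    """Percentage of nodes that have both incoming and outgoing edges."""
    nodes = nodes_of(edges)
    initiators = {u for u, _ in edges}
    recipients = {v for _, v in edges}
    return 100 * len(initiators & recipients) / len(nodes) if nodes else 0.0

def effective_diameter(edges, percentile=90):
    """Given percentile of pairwise shortest-path lengths.

    Assumption: edges are treated as undirected and unreachable pairs are ignored.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    distances = []
    for start in adj:  # breadth-first search from every node
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        distances.extend(d for d in dist.values() if d > 0)
    if not distances:
        return 0
    distances.sort()
    return distances[math.ceil(percentile * len(distances) / 100) - 1]

def looks_like_p2p(edges):
    """Apply Rule 1 and Rule 2; combining them with OR is an assumption."""
    k = average_degree(edges)
    ino = ino_percent(edges)
    rule1 = k > 2.8 and ino > 1.0
    rule2 = ino > 1.0 and effective_diameter(edges) > 11
    return rule1 or rule2

# Toy client-server TDG: five clients each open one flow to the same server.
star_flows = [(f"10.0.0.{i}", 40000 + i, "10.0.1.1", 443, "TCP") for i in range(1, 6)]
# Toy P2P-like TDG: four hosts that both initiate and receive flows.
mesh_pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c"), ("b", "d")]
mesh_flows = [(u, 6881, v, 6881, "TCP") for u, v in mesh_pairs]

star = build_tdg(star_flows)
mesh = build_tdg(mesh_flows)
print(looks_like_p2p(star), looks_like_p2p(mesh))
```

The sketch uses only the standard library so that every step of each metric stays visible. The all-pairs breadth-first search is fine for toy graphs but would need sampling or a graph library for backbone-scale TDGs.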
Intuitively, P2P hosts need to be connected with a large set of peers in order to perform tasks such as answering content queries and sharing files, which can explain the higher average degree compared to client-server applications. An additional characteristic of P2P applications is the duality of roles, with many hosts acting both as client and server. The duality of roles is in turn captured by the high InO value. We further speculate that the decentralized architecture of some P2P applications (such as BitTorrent) can explain the high diameters in some P2P TDGs. Additional speculation on why these three metrics effectively capture P2P behavior is provided in [9] and is omitted here due to space limitations.
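A short worked calculation makes the intuition concrete. It uses two idealized graphs that are assumptions for illustration and do not come from the traces: a pure client-server TDG in which n clients each open one flow to a single server, and a P2P-like TDG in which each of n peers initiates flows to m distinct peers.

```latex
\begin{align*}
\text{Client-server star:}\quad & |V| = n + 1, \quad |E| = n, \\
 & \bar{k} = \frac{2|E|}{|V|} = \frac{2n}{n+1} < 2 < 2.8, \qquad \mathrm{InO} = 0\%. \\[4pt]
\text{P2P-like graph:}\quad & |V| = n, \quad |E| = nm, \\
 & \bar{k} = \frac{2|E|}{|V|} = \frac{2nm}{n} = 2m > 2.8 \quad \text{for } m \ge 2.
\end{align*}
```

Under these assumptions the star can never satisfy either rule: no node both initiates and receives flows, so InO is 0%, and both rules require InO > 1%. The P2P-like graph already exceeds the degree threshold when every peer contacts two others, and it satisfies Rule 1 as soon as more than 1% of its peers also receive a flow.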

     [Figure 1: TDG visualizations for FastTrack (P2P) and HTTPS; graph drawings not reproduced.]

     [...] and then use graph metrics on the remaining traffic. In addition to port inspection, we can also examine the payload of a flow in order to verify that it follows the expected application-layer interactions. As future work, our goal is to select metrics that can further help to separate collaborative applications (e.g., DNS) from P2P. We discuss similar topics again in §III-C.

     We do not claim that our thresholds are universal, but our measurements suggest that small adjustments to these simple parameters allow our methodology to work on different backbone links. Furthermore, the three thresholds (InO, EDiam, and average degree) are observed to remain stable over time.

     III. THE GRAPTION FRAMEWORK

