Apache Eagle: eBay's open-source distributed real-time Hadoop data security solution

By Marcus Tucker, 2015-11-03 13:54

    A few days ago, eBay formally announced that it has open-sourced its distributed real-time security monitoring solution, Apache Eagle, and that the project has officially entered the Apache Incubator. Apache Eagle provides an efficient distributed streaming policy engine that is highly real-time, scalable, easily extensible, and interaction-friendly; at the same time, it integrates machine learning to build profiles of user behavior, providing real-time, intelligent protection for data security across the Hadoop ecosystem.


    With the development of big data, more and more successful enterprises and organizations are adopting data-driven business models. At eBay, we have tens of thousands of scientists, engineers, and analysts who access and analyze petabytes of data every day in order to give our users an unparalleled experience. In our global business, we also make extensive use of massive amounts of big data to connect with our hundreds of millions of users.

    In recent years, Hadoop has gradually become the most popular solution in the field of big data analytics, and eBay has long been using Hadoop to mine value from its data. For example, we use big data to improve the user search experience, to identify and optimize precision advertising, to enrich our product catalog, and to analyze click streams in order to understand how users use our online marketplace platform.

    At present, eBay's Hadoop clusters total more than 10,000 nodes, with a storage capacity exceeding 170 PB and more than 2,000 active users, and the scale is still growing. At the same time, to support increasingly diverse requirements, we have introduced more and more kinds of data storage and analysis frameworks, such as Hive, MapReduce, Spark, and HBase. The resulting management and monitoring challenges are increasingly serious, and data security is among the most important of them.

    In the big data era, security has become more critical than ever; in particular, as the world's leading e-commerce company, eBay must guarantee the absolute security of user data in Hadoop. Our security measures usually fall into the following categories: access control, security isolation, data classification, data encryption, and real-time data activity monitoring. However, after extensive trials and research, we realized that no existing product or solution could fully satisfy our requirements for monitoring data activity over massive real-time data streams and across diverse use-case scenarios. To close this gap, eBay decided to build Eagle from scratch.

    Eagle is an open-source distributed real-time Hadoop data security solution. It monitors data activity in real time, immediately detects access to sensitive data or malicious operations, and takes immediate measures in response.

    We believe Eagle will become one of the core components in the field of Hadoop data security, so we decided to share it with the whole community. We have donated Eagle to the Apache Software Foundation as an Apache Incubator project, and we look forward to developing it collaboratively with the open-source community so that Eagle can grow and meet a broader range of needs.

    Eagle's data activity monitoring can be used in the following typical scenarios:

    - Monitoring data access traffic in Hadoop

    - Detecting illegal intrusions and violations of security rules

    - Detecting and preventing loss of, and illicit access to, sensitive data

    - Policy-based real-time detection and alerting

    - Anomaly detection of data access based on user-behavior models

    Eagle has the following features:

    - Highly real-time: we fully understand the importance of timeliness and fast response in security monitoring, so from the very beginning Eagle was designed to do everything possible to produce alerts within seconds; once a combination of factors confirms a truly dangerous operation, measures can be taken immediately to stop the illegal behavior.

    - Scalable: at eBay, Eagle is deployed on multiple large Hadoop clusters holding hundreds of petabytes of data and serving more than 800 million data access events every day, so Eagle must be highly scalable in order to handle massive amounts of real-time data.

    - Easy to use: usability is also one of Eagle's core design principles. With the Eagle Sandbox, a user needs only a few minutes to set up an environment and start experimenting. To keep the user experience as simple as possible, we have built in many good examples; with just a few mouse clicks, a policy can be created and added.

    - User profiles: Eagle has built-in support for creating user profiles from user behavior in Hadoop using machine learning algorithms. We offer a variety of default machine learning algorithms to choose from for modeling; based on models built from historical behavior over different HDFS feature sets, Eagle can detect anomalous user behavior in real time and raise alerts.

    - Open source: Eagle has been developed according to open-source standards and builds on many open-source big-data products, so we decided to open-source Eagle under the Apache license, giving back to the community while also seeking its feedback, collaboration, and support.

    Overview of Eagle

    Data Collection and Storage

    Eagle provides a highly extensible programming API that supports integrating any type of data into the Eagle policy execution engine. For example, Eagle's HDFS audit event monitoring module receives audit data in real time via Kafka, collected from the NameNode by a Log4j appender or by a Logstash agent; Eagle's Hive monitoring module collects the logs of running Hive query jobs through the YARN API, while ensuring high scalability and fault tolerance.
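The wiring from the NameNode to Kafka mentioned above can be sketched as a Log4j configuration fragment. This is an illustrative assumption, not Eagle's documented setup: the topic name, broker list, and appender class path (which varies across Kafka versions) are placeholders, while `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit` is HDFS's standard audit logger name.

```properties
# Hypothetical fragment: route HDFS NameNode audit events to a Kafka topic
# that Eagle's HDFS audit monitoring module can consume.
log4j.appender.KAFKA_HDFS_AUDIT=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.KAFKA_HDFS_AUDIT.topic=sandbox_hdfs_audit_log
log4j.appender.KAFKA_HDFS_AUDIT.brokerList=localhost:9092
log4j.appender.KAFKA_HDFS_AUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA_HDFS_AUDIT.layout.ConversionPattern=%m%n
# Attach the appender to the HDFS audit logger in addition to its normal output.
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,KAFKA_HDFS_AUDIT
```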

    Real-Time Data Processing

    Stream Processing API. Eagle provides a highly abstract streaming API that is independent of the underlying physical platform; Apache Storm is supported by default, but the API can be extended to any other stream processing engine, such as Flink or Samza. This abstraction layer lets developers define monitoring data processing logic without binding it to any specific stream processing platform: simply by reusing, connecting, and assembling components such as data transformation, filtering, and joins with external data, they can build a DAG (directed acyclic graph) that meets their requirements, and they can easily integrate their business logic with the Eagle policy engine framework in a programmatic way. Internally, the Eagle framework compiles the DAG that describes the business logic into a native application for the underlying stream processing platform, such as an Apache Storm topology, thereby achieving platform independence.

    The following example shows how Eagle processes events and alerts:

    StormExecutionEnvironment env = ExecutionEnvironmentFactory.getStorm(config); // storm env
    StreamProducer producer = env.newSource(new KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1) // declare kafka source
        .flatMap(new AuditLogTransformer())              // transform event
        .groupBy(Arrays.asList(0))                       // group by 1st field
        .flatMap(new UserProfileAggregatorExecutor())    // aggregate one-hour data by user
        .alertWithConsumer("userActivity", "userProfileExecutor"); // ML policy evaluation
    env.execute(); // execute stream processing and alert

    Alerting Framework. The Eagle alerting framework consists of a metadata API, a policy engine service provider API, a policy partitioner API, and an alert deduplication framework:

    - Metadata API: lets users declare the schema of events, including which attributes an event has, the type of each attribute, and how to dynamically resolve the attribute values at runtime when a user configures a policy.

    - Policy engine service API: lets developers easily extend Eagle with new policy engines in the form of plug-ins. The Siddhi CEP engine is Eagle's default, first-class policy engine, and machine learning algorithms can also run as a policy engine.

    - Extensibility: Eagle's policy engine service provider API allows you to plug in a new policy engine:

public interface PolicyEvaluatorServiceProvider {
     public String getPolicyType(); // literal string to identify one type of policy
     public Class getPolicyEvaluator(); // get policy evaluator implementation
     public List getBindingModules(); // policy text with json format to object mapping
}

public interface PolicyEvaluator {
     public void evaluate(ValuesArray input) throws Exception; // evaluate input event
     public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef); // invoked when policy is updated
     public void onPolicyDelete(); // invoked when policy is deleted
}

    - Policy Partitioner API: allows policies to run in parallel on different physical nodes, and also lets you define a custom partitioner class. These features allow policies and events to be executed in a fully distributed way.

    - Scalability: Eagle achieves scalable execution of large numbers of policies by supporting the policy partitioning interface:

public interface PolicyPartitioner extends Serializable {
     int partition(int numTotalPartitions, String policyType, String policyId); // method to distribute policies
}

    Figure 1a: Eagle's scalable policy enforcement framework

    Machine Learning module:

    Eagle supports defining user behavior patterns, or user profiles, from users' historical behavior on the Hadoop platform. With this capability, anomalous behavior can be detected intelligently without fixed thresholds being preconfigured in the system. User profiles in Eagle are generated by machine learning algorithms and are used to decide whether a user's behavior is anomalous, by checking whether the user's current real-time behavior diverges to a certain degree from the pattern of the corresponding historical model. Eagle currently ships with two built-in anomaly detection algorithms: Eigenvalue Decomposition and Density Estimation. These algorithms read data from the HDFS audit logs, then segment, review, and cross-analyze the data, periodically creating a profile behavior model for each user. Once the models have been generated, Eagle's real-time streaming policy engine can identify anomalies in near real time, distinguishing whether a user's current behavior is suspicious or inconsistent with their historical behavior model.

    The figure below briefly illustrates the data flow of Eagle's current offline user-profile training and online real-time detection:

    Figure 1b: User-profile offline training and anomaly detection framework

    Eagle's online real-time anomaly detection based on user profiles is implemented within Eagle's general policy framework: a user profile is defined as a policy in the Eagle system, and user-profile policies are executed by a machine learning evaluator that implements Eagle's unified policy interface; the policy definition includes the feature vectors used for anomaly detection, kept consistent between online detection and offline training.

    In addition, Eagle provides an automatic training scheduler, which dispatches the offline training program according to the time period and granularity configured in files or through the UI, creating user profiles and behavior models in batch on Spark. By default, the system retrains the models monthly, with a model granularity of one minute.

    The basic ideas behind Eagle's built-in machine learning algorithms are as follows:

    Kernel Density Estimation

    The basic idea of this algorithm is to compute, from the training sample data, a probability density distribution function for each user. First, we mean-normalize each feature of the training dataset; normalization puts all datasets onto the same scale. Then, to estimate the probability distribution of the random variables, we use a Gaussian distribution function to compute the probability density. Assuming all features are mutually independent, the joint Gaussian probability density can be computed as the product of the probability densities of the individual features. In the online real-time detection phase, we first compute the probability of each user behavior in real time. If the probability of a user's current behavior falls below a certain threshold, we flag it as anomalous and raise an alert; the threshold itself is derived entirely by the offline training program, using the Matthews Correlation Coefficient method.

    Figure 1c: Histograms of user behavior along single dimensions
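The density-estimation idea above can be sketched as follows. This is an illustrative toy, not Eagle's implementation: each feature is modeled by an independent Gaussian fitted on training data (mean and variance assumed precomputed), the joint density is the product of the per-feature densities, and behavior is flagged when the density falls below the threshold.

```java
// Toy sketch of density estimation under the feature-independence assumption.
class GaussianDensityScorer {
    private final double[] mean;     // per-feature means from training data
    private final double[] variance; // per-feature variances from training data

    GaussianDensityScorer(double[] mean, double[] variance) {
        this.mean = mean;
        this.variance = variance;
    }

    // Joint probability density of one behavior vector: product of
    // independent per-feature Gaussian densities.
    double density(double[] x) {
        double p = 1.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - mean[i];
            p *= Math.exp(-d * d / (2 * variance[i]))
                    / Math.sqrt(2 * Math.PI * variance[i]);
        }
        return p;
    }

    // A behavior is anomalous when its density is below the threshold
    // (which Eagle derives offline via the Matthews Correlation Coefficient).
    boolean isAnomalous(double[] x, double threshold) {
        return density(x) < threshold;
    }
}
```

Behavior near the training mean scores a high density and passes, while behavior far from it scores a vanishingly small density and is flagged.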

    Eigenvalue Decomposition

    In this algorithm, we consider the main purpose of generating a user profile to be discovering valuable user behavior patterns. To achieve this, we can combine features in turn and then observe how they influence each other. When the dataset is very large, as in the scenarios we usually encounter, the number of normal patterns is so great that the feature sets of abnormal patterns are easily overlooked. Since normal behavior patterns usually lie in a low-dimensional subspace, we may be able to reduce the dimensionality of the dataset to better understand the user's real behavior patterns; this method also helps denoise the training dataset. Based on the variance of the features computed over a large amount of user data, and typically taking 95% of the variance as the benchmark in our use-case scenarios, we obtain the number k of principal components that account for 95% of the variance; we then treat the first k principal components as the user's normal subspace, while the subspace spanned by the remaining (n - k) principal components is recognized as abnormal.

    During online real-time anomaly detection, if the user's behavior pattern is close to the normal subspace, the behavior is considered normal; otherwise, if it is close to the abnormal subspace, an alert is raised immediately, because we believe user behavior should normally lie in the normal subspace. To determine whether the current user behavior is closer to the normal or the abnormal subspace, we use the Euclidean distance.

    Figure 1d: Composition of important user behavior patterns
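The subspace test described above can be sketched as follows. Again this is an illustrative toy, not Eagle's code: given the top-k orthonormal principal components of the training data (assumed precomputed by an offline eigendecomposition), a behavior vector is projected onto that "normal" subspace, and the Euclidean distance from the vector to its projection (the residual) measures how abnormal it is.

```java
// Toy sketch of anomaly detection via distance to the normal PCA subspace.
class SubspaceAnomalyDetector {
    // k orthonormal basis vectors spanning the normal subspace (precomputed offline).
    private final double[][] components;

    SubspaceAnomalyDetector(double[][] components) {
        this.components = components;
    }

    // Euclidean distance from x to its orthogonal projection onto the normal subspace.
    double residual(double[] x) {
        double[] proj = new double[x.length];
        for (double[] c : components) {
            double dot = 0;
            for (int i = 0; i < x.length; i++) dot += c[i] * x[i];
            for (int i = 0; i < x.length; i++) proj[i] += dot * c[i];
        }
        double sq = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - proj[i];
            sq += d * d;
        }
        return Math.sqrt(sq);
    }

    // A large residual means the behavior lies far from the normal subspace.
    boolean isAnomalous(double[] x, double threshold) {
        return residual(x) > threshold;
    }
}
```

For example, with the xy-plane as the normal subspace, any behavior vector inside that plane has zero residual, while a vector along the z-axis has a residual equal to its length and is flagged.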

    Eagle Services

    Policy Manager: The Eagle policy manager provides an interactive, user-friendly interface and REST APIs so that users can easily define and manage policies with just a few mouse clicks. The Eagle user interface makes policy management, identification and import of sensitive metadata, browsing of HDFS or Hive resources, alert dashboards, and other functions very easy to use.

    The Eagle policy engine supports the Siddhi CEP engine and the machine learning engine by default; below are a few sample policies based on Siddhi CEP.

    - Single-event policy (alert when a user accesses a sensitive column in Hive):

from hiveAccessLogStream[sensitivityType=='PHONE_NUMBER'] select * insert into outputStream;

    - Window-based policy (alert when a user accesses the directory /tmp/private more than 5 times within 10 minutes):

hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user, count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream;

    Query Service: Eagle provides a SQL-like REST API for comprehensive computation, querying, and analysis over massive datasets, supporting operations such as filtering, aggregation, histograms, sorting, top-N, arithmetic expressions, and pagination. Eagle supports HBase as its default data store, and also supports JDBC relational databases. When HBase is chosen as the store in particular, Eagle natively inherits HBase's ability to store and query massive amounts of monitoring data: Eagle's query framework compiles the user's SQL-like query syntax into HBase-native Filter objects, and uses HBase coprocessors to further improve response speed.


    Eagle at eBay

    At present, the Eagle data activity monitoring system has been deployed on a Hadoop cluster of more than 2,500 nodes to protect hundreds of petabytes of data, and by the end of the year we plan to extend it to another ten Hadoop clusters, covering all of eBay's more than 10,000 Hadoop nodes. In our production environment, we have set up some basic security policies for HDFS and Hive data, and we will keep introducing more policies before the end of the year to guarantee the absolute security of important data. The current Eagle policies cover a variety of patterns, including access patterns, frequently accessed datasets, predefined query types, Hive tables and columns, HBase tables, and all the policies generated from machine-learning-based user profile models. We also have a broad range of policies to prevent data loss, copying of data to insecure locations, unauthorized access to sensitive data, and so on. The flexibility and extensibility of Eagle's policy definitions will make it easy for us to keep adding increasingly sophisticated policies in the future to support more diverse use-case scenarios.

    Follow-up Plans

    Over the past two years at eBay, in addition to data activity monitoring, the Eagle core framework has also been widely used for monitoring node health, Hadoop application performance metrics, Hadoop core services, and the overall health of entire Hadoop clusters, among many other areas. We have also built a series of automated mechanisms, for example around nodes, which save our platform department a great deal of manual labor and effectively improve the utilization of cluster resources.

    Here are some of the features we are currently developing:

    - Extending the machine learning models to support Hive and HBase

    - A highly extensible API for integrating with widely used monitoring and alerting platforms and tools, such as Ganglia and Nagios, as well as for importing sensitive data definitions, for example through integration with Dataguise

    - In addition, we are actively preparing other Hadoop cluster monitoring modules, which we expect to release to the open-source community in subsequent releases, for example:

    - HBase monitoring

    - Hadoop job performance monitoring

    - Hadoop node monitoring
