Flume Architecture and Source Code Analysis
I have been reading the Flume source code recently, so I am writing up these study notes for readers who also want to learn Flume.
1. Introduction to Flume
Flume is a distributed system open-sourced by Cloudera for reliably collecting, aggregating, and moving large volumes of log data into storage. It provides reliable message delivery through a transaction mechanism, supports scaling out through a built-in load balancing mechanism, and ships with a set of default components that can be used directly.
Common Flume use cases: logs -> Flume -> real-time computation (e.g., Kafka + Storm); logs -> Flume -> offline computation (e.g., HDFS, HBase); logs -> Flume -> ElasticSearch.
2. Overall Architecture
Flume is built from three main components: Source, Channel, and Sink. The data flow is shown in the figure below:
1. The Source ingests logs, reading data in from files, the network, Kafka, and so on. There are two ingestion models: polling (pull) and event-driven;
2. The Channel aggregates and stages data, buffering it temporarily in memory, local files, a database, Kafka, etc. Log data does not stay in the pipeline for long; it is soon consumed by a Sink;
3. The Sink delivers data to storage, moving logs from the Channel into HDFS, HBase, Kafka, ElasticSearch, etc., where they can then be analyzed or queried by systems such as Hadoop, Storm, or ElasticSearch.
An Agent contains all three components at once; the Source and Sink run asynchronously and do not affect each other.
Suppose we want to collect and index Nginx access logs; we could deploy as follows:
1. The Agent is deployed on the same machine as the Web Server;
2. The Source is an ExecSource that tails the log with the tail command;
3. The Channel is a MemoryChannel, since losing a little log data is not a serious problem here;
4. The Sink is an ElasticSearchSink that writes to ElasticSearch; here you can configure a list of ElasticSearch server IP:PORT pairs to increase processing capacity.
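The deployment above can be sketched as a Flume agent configuration. This is a hedged example: the agent name `a1`, the log path, and the host names are assumptions, and ElasticSearchSink property names may differ between Flume versions.

```properties
# Agent a1: ExecSource (tail) -> MemoryChannel -> ElasticSearchSink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the Nginx access log (path is an assumption)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer; data may be lost on a crash, which is acceptable here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write to ElasticSearch; multiple host:port pairs increase capacity
a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
a1.sinks.k1.hostNames = es1:9300,es2:9300
a1.sinks.k1.indexName = nginx_access
a1.sinks.k1.channel = c1
```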
The above describes how logs flow. For more complex log collection we need to filter logs at the Source, write to multiple Channels, and handle failover/load balancing at the Sink. Flume provides the following by default:
1. Logs collected by the Source are handed to the ChannelProcessor component, which first filters them through Interceptors. If you know the Servlet specification, this is similar to its Filter concept (see "Servlet 3.1 translation - Filters"); an Interceptor can drop logs or modify their content;
2. After filtering, the ChannelSelector takes over. Two selectors are provided by default: replicating and multiplexing. The replicating selector copies each log to multiple Channels, while the multiplexing selector routes each log to the matching Channel according to configured conditions. Writes to multiple Channels may fail, and there are two ways to handle failure: retry later, or ignore it. Retries generally use exponential backoff.
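The Interceptor and multiplexing-selector steps above can be sketched like this (the header name `logType`, the channel names, and the regex are illustrative assumptions):

```properties
# Filter events at the Source with a regex interceptor (drop DEBUG lines)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = .*DEBUG.*
a1.sources.r1.interceptors.i1.excludeEvents = true

# Route by the "logType" event header; unmatched events go to the default channel
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logType
a1.sources.r1.selector.mapping.access = c1
a1.sources.r1.selector.mapping.error = c2
a1.sources.r1.selector.default = c1
# "optional" channels: a failed write to them is ignored rather than retried
a1.sources.r1.selector.optional.error = c3
```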
With the Channel in between, the Source produces logs into the Channel and the Sink consumes them from it; the two are asynchronous, so the Sink only needs to watch its own Channel for changes.
So far we can filter/modify logs at the Source and replicate/route a message to multiple Channels. For the case where a Sink fails to write, Flume provides the following strategies by default:
The default policy uses a single Sink: if it fails, the transaction fails and is retried later.
Flume also provides a failover strategy:
The failover strategy defines priorities for multiple Sinks; if one fails, events are routed to the Sink with the next-highest priority. A Sink is considered failed as soon as it throws an exception: it is removed from the live Sink list and retried after an exponentially increasing wait, starting at 1 s by default with a maximum retry wait of 30 s.
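A minimal failover sink-group sketch, assuming two sinks `k1` and `k2` are already defined (the 30000 ms `maxpenalty` matches the 30 s maximum retry wait mentioned above):

```properties
# Group two sinks; k1 (priority 10) is preferred, k2 takes over on failure
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Cap on the exponential backoff for a failed sink, in milliseconds (30 s)
a1.sinkgroups.g1.processor.maxpenalty = 30000
```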
Flume also provides load balancing strategies:
Two load balancing algorithms are provided by default: round-robin and random. Sink selection goes through a SinkSelector abstraction similar to the ChannelSelector, and the failure-compensation mechanism is similar to failover's, except that backoff on failure is disabled by default and must be enabled by setting the backoff parameter to true.
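A load-balancing sink-group sketch, again assuming sinks `k1` and `k2` exist:

```properties
# Load-balance events across k1 and k2 using round-robin (or "random")
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# Off by default: temporarily blacklist failed sinks with exponential backoff
a1.sinkgroups.g1.processor.backoff = true
```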
These are some of Flume's core components. How the Source and Sink run asynchronously, the transaction mechanism provided by the Channel, and so on will be covered when we analyze each component later.
Suppose we need to collect logs from a large number of clients and buffer or centrally process them; we can add an aggregation tier, giving an overall architecture like the following:
1. First is the log collection tier. Agents in this tier are deployed on the same machines as the applications and collect logs such as Nginx access logs, then ship them via RPC to the collection/aggregation tier. This tier should pick up logs quickly and hand them off to the collection/aggregation tier;
2. The collection/aggregation tier collects and aggregates logs and can perform fault-tolerant processing such as failover or load balancing to improve reliability; a file Channel can also be used in this tier to buffer data on disk;
3. The collection/aggregation tier can filter or modify the data and then store or process it, e.g., store it to HDFS, or feed it into Kafka and process the data in real time with Storm.
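The two-tier deployment above can be sketched with Avro RPC between the tiers. Host names, ports, and paths here are assumptions:

```properties
# --- Collection tier (one agent per application machine) ---
agent.sources = tail
agent.channels = mem
agent.sinks = avroOut

agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /var/log/nginx/access.log
agent.sources.tail.channels = mem

agent.channels.mem.type = memory

# Ship events over Avro RPC to the aggregation tier
agent.sinks.avroOut.type = avro
agent.sinks.avroOut.hostname = collector-host
agent.sinks.avroOut.port = 4545
agent.sinks.avroOut.channel = mem

# --- Collection/aggregation tier (collector machine) ---
collector.sources = avroIn
collector.channels = fileCh
collector.sinks = hdfsOut

collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4545
collector.sources.avroIn.channels = fileCh

# FileChannel buffers data on disk for durability
collector.channels.fileCh.type = file

collector.sinks.hdfsOut.type = hdfs
collector.sinks.hdfsOut.hdfs.path = hdfs://namenode/flume/nginx/%Y%m%d
collector.sinks.hdfsOut.hdfs.useLocalTimeStamp = true
collector.sinks.hdfsOut.channel = fileCh
```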
This has been an overview of Flume, from its core components to a typical deployment architecture; the implementation details are covered in the following sections.