The Spark ecosystem and Codis, an open source distributed service based on Redis
On January 24, veteran Spark evangelist Chen and Liu Qi, a senior system architect at Wandoujia, jointly presented a practice-sharing session on distributed systems built with Spark and Redis.
Chen: Spark Ecosystem & Internals
Chen (@CrazyJvm), Spark evangelist
In his talk, Chen first briefly reviewed the development of the Spark community in 2014: the current release of Spark is 1.2, and three major versions were released during 2014: 1.0, 1.1, and 1.2. Chen then analyzed the Spark ecosystem in detail:
Spark: What & Why
Spark is a very fast, highly general engine for large-scale data processing. Its commonly cited features are speed, ease of use, generality, and compatibility with Hadoop. First, generality: Spark supports batch processing, stream processing, graph computation, machine learning, and many other scenarios. Second, compatibility with Hadoop: given that most companies still store their data in HDFS, Spark was designed to work well with HDFS, so data already stored in HDFS can be used by Spark directly, without any migration work.
Spark vs. Hadoop
Why choose Spark? As pictured above, Chen compared Spark with Hadoop in two scenarios: iterative computation and multidimensional queries over HDFS data:
- Iterative computation. In this scenario, Hadoop has to read from and write to HDFS (disk) many times, incurring heavy I/O plus serialization and deserialization overhead. In addition, every write to HDFS stores three replicas, which adds replication overhead.
- Multidimensional queries over a batch of HDFS data. When hundreds or thousands of dimension queries run against the same data on HDFS, Hadoop performs each query separately, reading the data from disk every time. Because the same data set is read from disk again and again, there is clearly room for improvement.
In both scenarios, Spark can cache the shared data in memory, which avoids the disk I/O overhead and greatly improves performance.
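To make the caching argument concrete, here is a minimal sketch in plain Python (not Spark code). `DiskDataset` is a hypothetical stand-in for a data set stored on HDFS that simply counts how many times it is read in full; the two functions contrast re-reading on every iteration with reading once and reusing the in-memory copy.

```python
class DiskDataset:
    """Hypothetical stand-in for a data set on HDFS; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.disk_reads = 0

    def read(self):
        # Each call models one full scan from disk.
        self.disk_reads += 1
        return list(self._records)


def iterate_without_cache(dataset, iterations):
    """Every iteration goes back to 'disk' for the same data."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total


def iterate_with_cache(dataset, iterations):
    """Read once, then reuse the in-memory copy on every iteration."""
    cached = dataset.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total
```

Both functions return the same result; only the number of simulated disk reads differs (one per iteration versus one in total).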
Why Spark is so Fast
Spark has long been known for its speed. Besides in-memory computing, what makes Spark so fast? Here Chen pointed to three aspects: the DAG (directed acyclic graph, described in detail below), the thread model, and optimizations such as delay scheduling.
Thread model. Hadoop uses a process-based model: each task launches a new child JVM to do its computation. JVM reuse is possible, but even setting aside the problems that JVM reuse itself has, every JVM startup is expensive. Spark, by contrast, creates a thread pool when the application starts, so the startup cost of a task is very small.
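The thread-pool idea can be illustrated with Python's standard `concurrent.futures` module. This is only an analogy, not how a Spark executor is implemented: the pool is created once, and many small tasks are then handed to the same few worker threads instead of each task paying a startup cost. The function name `run_tasks` is made up for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_tasks(num_tasks, pool_size):
    """Run num_tasks small tasks on a pool created once up front.

    Returns the task results and the number of distinct worker threads used.
    """
    seen_threads = set()
    lock = threading.Lock()

    def task(i):
        # Record which worker thread picked up this task.
        with lock:
            seen_threads.add(threading.get_ident())
        return i * i

    # The pool is started once; tasks reuse its threads.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(task, range(num_tasks)))
    return results, len(seen_threads)
```

Calling `run_tasks(20, 4)` runs twenty tasks on at most four threads, so no per-task thread or process is started.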
Optimization: delay scheduling. When a task is assigned to a host whose computing resources (CPU, memory, and so on) happen to be exhausted, Spark's delay scheduling mechanism waits a short while rather than shipping the data the task needs over the network to another host. This is a big advantage when the volume of data to be processed is very large. The idea is often summed up as "move the computation to the data, not the data to the computation".
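The decision rule behind delay scheduling can be sketched as follows. This is a simplified illustration, not Spark's scheduler: the function `schedule`, its parameters, and the three decision labels are all hypothetical. In real Spark the waiting time is governed by the `spark.locality.wait` setting.

```python
def schedule(data_host, free_slots, waited, max_wait):
    """Decide where a task should run (toy model of delay scheduling).

    data_host:  host that holds the task's input data
    free_slots: dict mapping host name -> number of free task slots
    waited:     how long the task has already waited (seconds)
    max_wait:   how long we are willing to wait for a local slot
    """
    if free_slots.get(data_host, 0) > 0:
        return ("run_local", data_host)      # data and compute on one host
    if waited < max_wait:
        return ("wait", data_host)           # prefer waiting over moving data
    for host in sorted(free_slots):
        if free_slots[host] > 0:
            return ("run_remote", host)      # give up locality after max_wait
    return ("wait", data_host)               # nothing free anywhere yet
```

For example, with the data on `"a"`, no free slot on `"a"`, and one free slot on `"b"`, the task waits until `max_wait` has elapsed and only then runs remotely on `"b"`.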
Spark internals
The Berkeley Data Analytics Stack
The stack includes the resource management frameworks Apache YARN and Apache Mesos, and Tachyon, a memory-based distributed file system. Next comes Spark itself, and above it the systems that implement the various kinds of functionality, such as the machine learning library MLlib, graph computation with GraphX, and stream processing with Spark Streaming. Higher still are, for example, SparkR, a favorite of analysts, and BlinkDB, which can be forced to return query results within a few seconds.
It is this ecosystem that lets Spark be "one stack to rule them all": it can do batch processing as well as stream processing, which avoids implementing the same logic twice in two code bases. The theoretical foundation of Spark is the RDD:
Core concepts of the RDD
An RDD can be thought of as a set of partitions. Put more simply, it is like a very large list, say List(1, 2, ..., 9), stored in three partitions that each hold three of its elements. Each partition (or split) has a function that computes it, and RDDs can depend on one another. In addition, a key-value RDD can specify a partitioner, and each split of an RDD has its own preferred locations.
Last come the preferred locations, a concept found in many distributed systems today: the computation follows the data. Under normal circumstances, moving the computation takes far less time than moving the data. For Hadoop, where the data lives on disk, disk-local is usually the best locality available; for Spark, where the data can be kept in memory, memory-local has the highest priority.
The figure above illustrates how Spark operates: rdd1 is transformed into rdd2, then rdd3, and so on until another RDD is produced. Note that execution is lazy: a transformation is not executed immediately. Spark only records the step in metadata, and nothing really runs until an action is encountered. Note also that caching an RDD, such as rdd2 above, is lazy as well and only takes effect when an action runs. Here lies a pitfall: although persist and cache belong to the same interface, unpersist is eager. Starting from version 1.1, cache became safer, but that involves too much detail of the core and is not covered further here.
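Lazy execution can be illustrated with a toy class in plain Python. This is a sketch of the idea only, not the Spark API: `LazyRDD` and `traced_double` are hypothetical names, the "RDD" is the nine-element list split into three partitions, `map` merely records the function, and only the action `collect` runs anything.

```python
class LazyRDD:
    """Toy RDD: a list of partitions plus the recorded transformations."""

    def __init__(self, partitions, funcs=()):
        self._partitions = partitions   # list of lists, one per partition
        self._funcs = tuple(funcs)      # recorded lineage, not yet executed

    def map(self, f):
        # Transformation: only remember f and return a new LazyRDD.
        return LazyRDD(self._partitions, self._funcs + (f,))

    def collect(self):
        # Action: apply every recorded function, partition by partition.
        out = []
        for part in self._partitions:
            for x in part:
                for f in self._funcs:
                    x = f(x)
                out.append(x)
        return out


calls = []  # records every element the mapped function actually sees


def traced_double(x):
    calls.append(x)
    return x * 2


rdd1 = LazyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rdd2 = rdd1.map(traced_double)  # nothing has been computed at this point
```

After the last line `calls` is still empty; the function is invoked only once `rdd2.collect()` is called.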
RDD dependencies
Narrow dependencies and wide dependencies are two more important concepts in Spark. Compared with wide dependencies, narrow dependencies have the advantage in both fault tolerance and execution efficiency.
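The difference between the two kinds of dependency can be sketched in plain Python. The helper names below are made up for illustration. In the narrow case each output partition is computed from exactly one parent partition, so a lost partition can be rebuilt from that one parent. In the wide case an output partition may need records from every parent partition, which is what forces a shuffle.

```python
def map_partitions(partitions, f):
    """Narrow dependency: output partition i depends only on input partition i."""
    return [[f(x) for x in part] for part in partitions]


def group_by_key(partitions, num_partitions):
    """Wide dependency: each output partition may read from all input partitions."""
    out = [dict() for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Hash partitioning decides which output partition owns the key.
            target = out[hash(key) % num_partitions]
            target.setdefault(key, []).append(value)
    return out
```

With integer keys, `group_by_key([[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]], 2)` gathers both values for key 1 into the same output partition even though they started in different input partitions.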
ClusterManager: at present, YARN is clearly the more widely adopted choice in China.
SparkContext: created in the code you write, it asks the ClusterManager for resources. The ClusterManager connects to the Worker Nodes to obtain resources, and the executors are what actually run the tasks. Three points are worth noting. First, the ClusterManager is pluggable and can be chosen freely. Second, because the driver program has to send tasks to the Worker Nodes, jobs should not be submitted from a location too far from the Worker Nodes. Third, and importantly, each application has its own independent executors on each Worker Node, and the executors of different applications cannot share data.
PS: YARN encapsulates resources in Containers, so on YARN a Worker corresponds to a Container.
First, a Spark program implicitly builds a logical directed acyclic graph (DAG). The DAGScheduler then splits the DAG into stages, the stages are passed to the TaskScheduler, and the TaskScheduler sends the tasks to the executors on the Workers for execution. The executors run tasks in multithreaded mode.
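How a DAG is cut into stages can be sketched for the simplest case, a linear chain of operations. This is a deliberately simplified illustration, not the DAGScheduler's actual algorithm: the function `split_into_stages` and the boolean "wide" flag are assumptions of this sketch, in which every wide (shuffle) operation starts a new stage and narrow operations are pipelined into the current one.

```python
def split_into_stages(ops):
    """Split a linear chain of operations into stages.

    ops: list of (name, is_wide) pairs in execution order.
    A wide operation (one that needs a shuffle) begins a new stage.
    """
    stages = [[]]
    for name, is_wide in ops:
        if is_wide and stages[-1]:
            stages.append([])        # shuffle boundary: start a new stage
        stages[-1].append(name)      # narrow ops join the current stage
    return stages


chain = [
    ("textFile", False),
    ("map", False),
    ("reduceByKey", True),
    ("filter", False),
]
stages = split_into_stages(chain)
# stages == [["textFile", "map"], ["reduceByKey", "filter"]]
```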
In theory, Spark's shuffle never outperformed MapReduce's until it was reworked. At present, the shuffle uses a pull-based model: intermediate files are written to disk, and a hash map is built in each partition. Note that spilling can happen across keys, but a single key-value pair must fit in the host's memory.
Monitoring: in earlier versions, a task's runtime data could be collected only after the task had finished. This has been improved in the current version.
Spark Streaming: Spark Streaming is still batch processing in essence, except that what used to be one large batch is split into small batches. Spark Streaming also supports rate limiting, so Spark can hold up when traffic is very heavy. In addition, it supports real-time machine learning. In Spark Streaming, data loss generally has two causes: worker failure and driver failure. In earlier versions a small amount of data could be lost; since the 1.2 release, the reliable receiver mode ensures that no data is lost, which is well suited to connecting to Kafka.
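The "small batches" idea can be sketched in a few lines of plain Python. This is not the Spark Streaming API: `micro_batches` is a hypothetical helper that groups timestamped events into consecutive intervals of fixed length, each of which would then be processed as an ordinary small batch job. Intervals with no events are skipped in this sketch.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-length time intervals.

    Returns the values of each non-empty interval, in time order.
    """
    batches = {}
    for ts, value in events:
        index = int(ts // batch_interval)   # which interval the event falls in
        batches.setdefault(index, []).append(value)
    return [batches[i] for i in sorted(batches)]


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
batches = micro_batches(events, 1.0)
# batches == [["a", "b"], ["c"], ["d"]]
```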
MLlib: the set of algorithms is already very rich, covering classification, clustering, regression, collaborative filtering, dimensionality reduction, and more. ML Pipeline can significantly reduce development time by helping developers connect the whole process: data collection, data cleaning, feature extraction, model training, testing, and evaluation.
GraphX: Spark's advantage here is that it can handle both the table view and the graph view.
Spark SQL: the most popular component of the Spark ecosystem. Its purpose is simple: to support standard SQL. Compared with Spark SQL, Hive is based on the MapReduce process model and has many multithreading bugs that have never been fixed. It is worth mentioning that half of the Spark SQL contributors are Chinese.
Tachyon supports almost all frameworks
Tachyon: a distributed memory system that lets different jobs or frameworks share data, bypassing HDFS and running faster. It also avoids recomputing data after a task failure. Finally, Tachyon lets the system avoid repeated GC.
SparkR: lets the R language call Spark. It works by having the R SparkContext call the Java SparkContext through JNI; the executors on the Workers then invoke an R shell to do the execution. The current problem is that an R shell has to be started every time a task executes, so this still needs optimization.
BlinkDB, a wayward database
BlinkDB: a rather wayward database that lets users issue queries with time bounds or error bounds. It works by maintaining a set of multidimensional samples over the original data; it also needs a strategy for dynamically selecting among those samples.
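The idea of querying with an error bound can be sketched in plain Python. This is an illustration of sampling-based approximation in general, not BlinkDB's implementation: BlinkDB maintains pre-built samples, whereas this sketch draws uniform random samples at query time. The function names, the candidate sample sizes, and the use of a 95% normal-approximation interval are all assumptions of the sketch.

```python
import math
import random
import statistics


def approx_mean(data, sample_size, seed=0):
    """Estimate the mean from a random sample; return (estimate, error bound).

    The error bound is the half-width of an approximate 95% confidence
    interval (1.96 standard errors).
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    estimate = statistics.mean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * stderr


def query_with_error_bound(data, max_error, sizes=(100, 1000, 10000)):
    """Try increasingly large samples until the error bound is small enough.

    Returns (estimate, error bound, sample size used). If no size meets the
    bound, the result for the largest size tried is returned.
    """
    for size in sizes:
        size = min(size, len(data))
        estimate, error = approx_mean(data, size)
        if error <= max_error:
            break
    return estimate, error, size


data = list(range(100000))
estimate, error, used = query_with_error_bound(data, max_error=2000)
```

The trade-off is the one described above: a looser error bound lets the query stop at a smaller sample and answer faster, while a tighter bound forces it to scan more data.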
JobServer: provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts, that is, Spark as a Service.