Beginner's Guide: Getting Started with Apache Spark

By Diana Ellis, 2015-12-13 14:55

    This article is an introduction to Apache Spark: where it fits in the big data landscape, how to install and use it, and how some of its common transformations and actions work.

    I. Why use Apache Spark

    We live in an era of "big data": at every moment, data of all kinds is being produced, and both its volume and its rate of growth are increasing dramatically. Broadly speaking, this data includes transaction data, social media content (such as text, images, and video), and sensor data. Why put so much effort into it? Because the insights extracted from massive data sets can usefully guide decisions in both daily life and production practice.

    A few years ago, only a handful of companies had the technical strength and the capital to store and mine large amounts of data for insights. That situation was disrupted when Yahoo open-sourced Apache Hadoop in 2009: by running on clusters of commodity servers, Hadoop dramatically lowered the barrier to processing massive data sets. As a result, many industries (such as health care, infrastructure, finance, insurance, telematics, consumer, retail, marketing, e-commerce, media, manufacturing, and entertainment) set out on the Hadoop path to extract value from their huge volumes of data. Hadoop provides two major capabilities:

    - HDFS provides cheap, fault-tolerant storage for huge amounts of data by scaling out across commodity hosts.

    - The MapReduce computation paradigm provides a simple programming model for mining that data for insights.

    The figure below shows the MapReduce data processing flow; in a typical Hadoop job chain, the output of one Map-Reduce step becomes the input of the next.

    Throughout this process, intermediate results are exchanged via disk, so relative to the computation itself, a large number of Map-Reduce jobs are IO-bound. For ETL, data integration, and data cleansing use cases, the IO constraint matters little, because those scenarios rarely demand short processing times. In the real world, however, there are many latency-sensitive cases, such as:
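To make the Map-Reduce programming model concrete, here is a minimal plain-Python sketch of its phases (the function names and sample lines are illustrative, not Hadoop's API; a real job distributes these phases across a cluster and spills the intermediate map output to disk between them):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# Hypothetical sample input, standing in for files stored in HDFS.
lines = ["spark is fast", "hadoop is reliable"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'reliable': 1}
```

In a real Hadoop job, each phase runs on many hosts at once; the disk writes between phases are precisely the IO bottleneck discussed above.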

    1. Processing streaming data for near-real-time analysis. For example, analyzing clickstream data to make video recommendations improves user engagement; here the developer must strike a balance between accuracy and latency.

    2. Interactive analysis of large data sets, where a data scientist runs ad-hoc queries against the data.

    There is no doubt that, after several years of development, the Hadoop ecosystem and its rich set of tools have won users' affection. Still, several problems make it challenging to use:

    1. Each use case requires a different technology stack to support it, and across varied usage scenarios the many point solutions are often stretched thin.

    2. In production environments, organizations often need expertise across a whole range of technologies.

    3. Many of the technologies have version-compatibility problems.

    4. Concurrent jobs cannot share data with each other quickly.

    Apache Spark solves all of the above! Apache Spark is a lightweight, in-memory cluster computing platform whose components support batch, streaming, and interactive use cases, as shown in the figure below.

    II. About Apache Spark

    Apache Spark is an open-source, Hadoop-compatible cluster computing platform. Developed by AMPLab at the University of California, Berkeley, as part of the Berkeley Data Analytics Stack (BDAS), it is backed by the big data company Databricks and is a top-level Apache project. The figure below shows the different components of the Apache Spark stack.

    Apache Spark's five major advantages:

    1. Higher performance, because data is loaded into the distributed memory of the cluster's hosts. Data can be transformed iteratively very quickly and cached for subsequent frequent access. Readers interested in Spark may have heard the claim that, with all data held in memory, Spark can be up to 100 times faster than Hadoop, and about 10 times faster when all data must be stored on disk.

    2. Standard APIs in Java, Scala, Python, and SQL (for interactive queries) make it usable across all industries, and it ships with a wealth of out-of-the-box machine learning libraries.

    3. Compatibility with the existing Hadoop v1 (via SIMR) and 2.x (via YARN) ecosystems, so organizations can migrate seamlessly.

    4. Easy to download and install. A convenient shell (REPL: Read-Eval-Print-Loop) lets you learn the API interactively.

    5. Its high-level architecture improves productivity, letting developers focus their energy on the computation itself.

    Moreover, Apache Spark is written in Scala, and the resulting code is very concise.

    III. Installing Apache Spark

    The following table lists the key links and prerequisites:

    Current Release                1.0.1
    Downloads Page
    JDK Version (Required)         1.6 or higher
    Scala Version (Required)       2.10 or higher
    Python (Optional)              [2.6, 3.0)
    Simple Build Tool (Required)
    Development Version            git clone git://
    Building Instructions
    Maven (Required)               3.0 or higher

    As shown in Figure 6, Apache Spark can be deployed standalone, on Hadoop v1 via SIMR, or on Hadoop 2 via YARN/Mesos. Working with Apache Spark requires knowledge of Java plus either Scala or Python. Here we focus on configuring, installing, and running it in standalone mode.

    1. Install JDK 1.6+, Scala 2.10+, Python [2.6, 3.0), and sbt.

    2. Download the Apache Spark 1.0.1 release.

    3. Untar and unzip spark-1.0.1.tgz into the chosen directory:

akuntamukkala@localhost~/Downloads$ pwd
/Users/akuntamukkala/Downloads
akuntamukkala@localhost~/Downloads$ tar -zxvf spark-1.0.1.tgz -C /Users/akuntamukkala/spark

    4. Run sbt to build Apache Spark:

akuntamukkala@localhost~/spark/spark-1.0.1$ pwd
/Users/akuntamukkala/spark/spark-1.0.1
akuntamukkala@localhost~/spark/spark-1.0.1$ sbt/sbt assembly

    5. Launch the standalone Scala REPL for Apache Spark:

/Users/akuntamukkala/spark/spark-1.0.1/bin/spark-shell

    For Python, use the equivalent REPL:

/Users/akuntamukkala/spark/spark-1.0.1/bin/pyspark

    6. Check the Spark UI at http://localhost:4040

    IV. How Apache Spark works

    The Spark engine processes data in distributed memory across all hosts in the cluster. The figure below shows a typical Spark job's processing flow.

    The next figure shows how Apache Spark executes a job on the cluster.

    The Master controls how data is partitioned, takes advantage of data locality, and tracks all distributed computation on the Slave hosts. If a Slave becomes unavailable, its stored data is reassigned to the remaining available Slaves. Although at present (version 1.0.1) the Master is a single point of failure, this will be fixed in a later release.

    V. Resilient Distributed Datasets (RDDs)

    The Resilient Distributed Dataset (RDD, largely superseded by DataFrame since Spark 1.3) is the core idea of Apache Spark. An RDD is an immutable, distributed collection of data that supports two kinds of operations: transformations and actions. A transformation, such as filter(), map(), or union() on an RDD, produces another RDD; an action, such as count(), first(), take(n), or collect(), triggers a computation and returns a value to the Master or to a stable storage system. Transformations are lazy: they are not executed until an action runs. The Spark Master/Driver remembers the transformations applied to each RDD, so if an RDD partition is lost (i.e., a Slave crashes), it can quickly and easily be recomputed on a surviving host in the cluster. This is the "resilience" of an RDD.
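The transformation/action distinction can be sketched in plain Python with generators (the names below are illustrative, not Spark's implementation): "transformations" merely build a lazy pipeline, and no work happens until an "action" consumes it.

```python
log = []  # records when the map function actually runs

def transform_map(data, fn):
    # "Transformation": returns a lazy generator; nothing executes yet.
    for item in data:
        log.append(f"map({item})")
        yield fn(item)

def transform_filter(data, pred):
    # Another lazy "transformation".
    for item in data:
        if pred(item):
            yield item

def action_collect(data):
    # "Action": forces the whole pipeline to execute.
    return list(data)

pipeline = transform_filter(transform_map([1, 2, 3, 4], lambda x: x * 2),
                            lambda x: x > 4)
print(log)                       # [] -- no work has happened yet
print(action_collect(pipeline))  # [6, 8] -- computation happens here
print(log)                       # ['map(1)', 'map(2)', 'map(3)', 'map(4)']
```

Spark adds one crucial ingredient this sketch lacks: because the chain of transformations is remembered, a lost partition can be rebuilt by replaying it on another host.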

    The figure below illustrates the laziness of transformations:

    We can understand these concepts through an example: finding the five most commonly used words in a text. The figure below shows one possible solution.

    In the commands below, we read the text file and build an RDD of strings, where each entry represents one line of the text.

scala> val hamlet = sc.textFile("/Users/akuntamukkala/temp/gutenburg.txt")
hamlet: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val topWordCount = hamlet.flatMap(str => str.split(" ")).
         filter(!_.isEmpty).map(word => (word, 1)).reduceByKey(_ + _).
         map { case (word, count) => (count, word) }.sortByKey(false)
topWordCount: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at sortByKey at <console>:14

    1. The commands above show how simple this is: plain Scala API calls chain the transformations and actions together.

    2. Some words may be separated by more than one space, producing empty strings, so filter(!_.isEmpty) is needed to filter them out.

    3. Each word is mapped to a key-value pair: map(word => (word, 1)).

    4. To aggregate all the counts, a reduce step is needed: reduceByKey(_ + _). The _ + _ shorthand conveniently sums the values for each key.

    5. With the words and their respective counts in hand, the next step is to order by count. In Apache Spark you can only sort by key, not by value, so map { case (word, count) => (count, word) } is used to swap each (word, count) pair into (count, word).

    6. We want the five most commonly used words, so sortByKey(false) sorts the counts in decreasing order.

    The pipeline above, completed with take(5) (an action, which triggers the computation), outputs the five most commonly used words in the /Users/akuntamukkala/temp/gutenburg.txt text. The same can be done from the Python shell.
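For comparison, the same top-N word-count logic can be sketched in plain Python over an in-memory list of lines (the sample text and variable names are illustrative; no Spark cluster is involved, and unlike sortByKey this sorts on the whole (count, word) pair, breaking ties by word):

```python
from collections import Counter

# Hypothetical sample "file", one string per line.
text = ["to be or not to be", "that is the question", "to be is to do"]

# flatMap + filter: split lines into words, dropping empty strings
words = [w for line in text for w in line.split(" ") if w]

# reduceByKey(_ + _): count occurrences of each word
counts = Counter(words)

# map { case (word, count) => (count, word) } + sortByKey(false) + take(5)
top5 = sorted(((c, w) for w, c in counts.items()), reverse=True)[:5]
print(top5)
```

The shape of the computation is identical to the Scala pipeline; Spark's contribution is running each stage in parallel across the cluster.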

    An RDD's lineage can be tracked with toDebugString (a handy operation to remember).

scala> topWordCount.take(5).foreach(x => println(x))
    Commonly used transformations:

    filter(func)
    Purpose: returns a new RDD containing only the data elements on which func returns true.

scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

    map(func)
    Purpose: returns a new RDD by applying func to each data element.

scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

    flatMap(func)
    Purpose: similar to map, but func returns a Seq instead of a single value; for example, mapping a sentence into a Seq of words.

scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

    reduceByKey(func, [numTasks])
    Purpose: aggregates the values of each key using a function; "numTasks" is an optional parameter specifying the number of reduce tasks.

scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

    groupByKey([numTasks])
    Purpose: converts (K, V) pairs to (K, Iterable[V]).

scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

    distinct([numTasks])
    Purpose: eliminates duplicates from the RDD.

scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)

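To see concretely what these transformations compute, here is a plain-Python sketch of the same operations on ordinary lists (illustrative only; Spark runs them lazily and distributed across the cluster):

```python
from itertools import chain

data = ["Spark is awesome", "It is fun"]

# flatMap: split each sentence into words and flatten the result
fm = list(chain.from_iterable(s.split(" ") for s in data))

# filter: keep only words containing "s"
filtered = [w for w in fm if "s" in w]

# map + reduceByKey: count occurrences of each word
counts = {}
for w in fm:
    counts[w] = counts.get(w, 0) + 1

# groupByKey on the swapped (count, word) pairs: group words by their count
groups = {}
for w, c in counts.items():
    groups.setdefault(c, []).append(w)

# distinct: eliminate duplicates (a set does not preserve order)
distinct = set(fm)

print(fm)      # ['Spark', 'is', 'awesome', 'It', 'is', 'fun']
print(counts)  # {'Spark': 1, 'is': 2, 'awesome': 1, 'It': 1, 'fun': 1}
print(groups)  # {1: ['Spark', 'awesome', 'It', 'fun'], 2: ['is']}
```

The results match the Scala REPL outputs above; only the execution model differs.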

    Common set operations:

    union()
    Purpose: returns a new RDD containing all elements from the source RDD and the argument RDD.

scala> val rdd1 = sc.parallelize(List('A', 'B'))
scala> val rdd2 = sc.parallelize(List('B', 'C'))
scala> rdd1.union(rdd2).collect()
Result: Array[Char] = Array(A, B, B, C)

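A plain-Python sketch of the same union semantics: as with RDD.union, duplicates are kept rather than eliminated.

```python
rdd1 = ["A", "B"]
rdd2 = ["B", "C"]

# union keeps duplicates, like Spark's RDD.union (use distinct() to drop them)
union = rdd1 + rdd2
print(union)  # ['A', 'B', 'B', 'C']
```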
