Introduction to Apache Spark Big Data Analysis (Part 1)
Apache Spark emerged to give ordinary developers the ability to analyze big data and real-time data. With that in mind, this article uses hands-on demonstrations to help you learn Spark quickly. It is the first part of a four-part introductory Apache Spark tutorial.
The full tutorial consists of four parts:
- Part 1: an introduction to Spark, covering how to use the shell and RDDs
- Part 2: an introduction to Spark SQL and DataFrames, and how to use Spark together with Cassandra
- Part 3: an introduction to Spark MLlib and Spark Streaming
- Part 4: an introduction to graph computation with Spark GraphX
This article covers the first part.
For all the parts and the full outline, visit our website: Apache Spark QuickStart for real-time data-analytics.
On the website you can find more articles and tutorials on related topics, such as Java Reactive Microservice Training and the Microservices Architecture | Consul Service Discovery and Health For Microservices Architecture Tutorial. There is much more content there; if you are interested, go take a look.
An Overview of Spark
Apache Spark is a rapidly growing open-source cluster computing system. Its ecosystem of packages and frameworks is becoming increasingly rich, enabling Spark to perform advanced data analysis. Spark's rapid success is due to its power and ease of use. Compared with traditional MapReduce-based data analysis, Spark is more efficient and runs faster. Apache Spark provides in-memory distributed computing and offers programming interfaces in four languages: Java, Scala, Python, and R. The Spark ecosystem is shown in the figure below:
The entire ecosystem is built on top of the Spark core engine. The core gives Spark its fast in-memory computation and underlies the APIs for all four languages: Java, Scala, Python, and R. Spark Streaming provides real-time stream processing. Spark SQL lets users query structured data in the language they know best; the DataFrame sits at the core of Spark SQL. A DataFrame holds data as a collection of rows with named columns, which makes it very convenient to query, plot, and filter data. MLlib is Spark's machine learning framework. GraphX is Spark's graph computation framework, providing the ability to compute over graph-structured data. This is an overview of the entire ecosystem.
A brief history of Apache Spark
- Originally developed by the AMPLab at the University of California, Berkeley (UC Berkeley) and open-sourced in 2010, it is now a top-level project of the Apache Software Foundation.
- At the time of writing, the codebase comprised 12,500 commits from 630 contributors (see the Apache Spark source repository).
- Most of the code is written in Scala.
- Google search interest in Apache Spark has recently surged, indicating high visibility (Google's advertising tool shows a whopping 108,000 searches in July alone, more than ten times the search volume for Microservices).
- Some Spark source contributors come from IBM, Oracle, DataStax, BlueData, Cloudera, and others.
- Applications built on Spark include: Qlik, Talend, Tresata, AtScale, Platfora, and others.
- Companies using Spark include: Verizon, NBC, Yahoo, Spotify, and more.
The reason for such interest in Apache Spark is that it puts Hadoop's data processing power in the hands of ordinary developers. A Spark cluster is simpler to configure than a Hadoop cluster, and Spark is faster and easier to program. Spark gives most developers the ability to analyze big data and real-time data. With that in mind, this article uses hands-on demonstrations to lead you through learning Apache Spark quickly.
Downloading Spark and using the interactive shell
The best way to experiment with Apache Spark is through its interactive shells; Spark provides two of them, a Python shell and a Scala shell.
Apache Spark can be downloaded from here. When downloading, choose a recent prebuilt version so that you can run the shell immediately.
At the time of writing, the latest Apache Spark release is 1.5.0, released on September 9, 2015. After downloading, extract the archive:
tar -xvzf ~/spark-1.5.0-bin-hadoop2.4.tgz
Run Python Shell
The Python shell will not be used for the demonstrations in this article.
Because the Scala shell runs on the JVM, it lets you use Java libraries.
Run the Scala Shell
To start the Scala shell, run:
./bin/spark-shell
After the shell starts, you will see output like the following Scala shell welcome message:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/24 21:58:29 INFO SparkContext: Running Spark version 1.5.0
Here are some simple exercises to help you get started with the shell. You may not understand everything we do yet; we will analyze it in detail later. In the Scala shell, perform the following.
Use Spark's README file to create an RDD named textFile:
val textFile = sc.textFile("README.md")
Retrieve the first element of the textFile RDD:
textFile.first()
res3: String = # Apache Spark
Apply a filter transformation to the textFile RDD to return all lines containing the keyword "Spark". The operation returns a new RDD; once it completes, count the lines of the returned RDD.
Select the lines containing the keyword "Spark", then count them:
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19
To find the line with the most words in the textFile RDD, you can use the following operations. The map method maps each line in the RDD to a number (its word count), and reduce then finds the line with the largest number of words.
Find the number of words in the longest line of the textFile RDD:
textFile.map(line => line.split(" ").size)
.reduce((a, b) => if (a > b) a else b)
res11: Int = 14
The result shows that the longest line contains 14 words.
You can also bring in other Java packages, for example the Math.max() method, because map and reduce accept Scala function literals as arguments.
Use a Java method in the Scala shell:
textFile.map(line => line.split(" ").size)
.reduce((a, b) => Math.max(a, b))
res12: Int = 14
We can also easily cache data in memory.
Cache the linesWithSpark RDD, then count its lines:
linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD at filter at ...
linesWithSpark.count()
res15: Long = 19
The above is a brief demonstration of how to use the Spark interactive shell.
Resilient Distributed Datasets (RDDs)
Spark executes work in parallel across a cluster; the degree of parallelism is determined by one of Spark's main abstractions, the RDD. A Resilient Distributed Dataset (RDD) is a representation of data in which the data is stored in partitions across the cluster (partitioning is how the data is sharded for storage), and it is this partitioned storage that allows tasks to execute in parallel. The more partitions, the more parallelism. The figure below illustrates an RDD:
Think of each column as a partition; you can easily assign partitioned data to the nodes of a cluster.
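A quick way to see partitioning at work, assuming you are in the Spark shell (where the SparkContext `sc` already exists; the sample data and partition count here are illustrative):

```scala
// Ask Spark to split the collection into 4 partitions explicitly.
val nums = sc.parallelize(1 to 100, 4)

// partitions is the array of partition handles; its size is the
// number of chunks that can be processed in parallel.
println(nums.partitions.size)   // 4
```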
To create an RDD, you can read data from external storage, for example from Cassandra, Amazon Simple Storage Service (Amazon S3), HDFS, or any input format supported by Hadoop. You can also create an RDD by reading files, arrays, or JSON data. On the other hand, if the data is local to your application, you only need to call the parallelize method to bring Spark's features to bear on it and analyze the data in parallel across the Apache Spark cluster. To demonstrate, we use the Scala Spark shell:
Create the RDD thingsRDD from a list of words:
val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at ...
Count the number of items in the thingsRDD RDD:
thingsRDD.count()
res16: Long = 5
To run Spark, you need to create a SparkContext. When you use the Spark shell, a SparkContext is created for you automatically. When you call the SparkContext's parallelize method, you get a partitioned RDD whose data is distributed across all the nodes of the cluster.
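Outside the shell, a standalone application has to create the SparkContext itself. A minimal sketch, assuming Spark 1.5 is on the classpath; the application name and the local master URL are illustrative choices, not fixed values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MiniApp {
  def main(args: Array[String]): Unit = {
    // The shell builds this for you; a standalone app configures it explicitly.
    val conf = new SparkConf().setAppName("MiniApp").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq(1, 2, 3))
    println(rdd.count())   // 3

    sc.stop()
  }
}
```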
What can we do with RDDs?
With an RDD, you can perform transformations as well as actions. Transformations change the data, for example, changing its format, querying it, or filtering it; actions trigger the computation of those changes, extracting data, collecting data, and even counting it.
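The distinction matters because transformations are lazy: Spark only records a plan until an action forces execution. A small sketch, assuming the Spark shell where `sc` is available:

```scala
// Transformation: nothing is computed yet, Spark just records the plan.
val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0)

// Action: this call actually launches the job and returns a result.
println(evens.count())   // 5
```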
For example, we can use Spark's README.md text file to create an RDD called textFile. The file contains several lines of text. When the file is read into the textFile RDD, its lines are partitioned and distributed across the cluster so they can be operated on in parallel.
Create the textFile RDD from the README.md file, then count its lines:
val textFile = sc.textFile("README.md")
textFile.count()
res17: Long = 98
The README.md file has 98 lines of data.
The result is shown in the figure below:
Next, we filter for all the lines containing the keyword "Spark". When the operation completes, it generates a new RDD, linesWithSpark:
Create the filtered RDD linesWithSpark:
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
In the earlier figure we depicted the textFile RDD; the figure below depicts the linesWithSpark RDD:
It is important to note that Spark also has key-value RDDs (pair RDDs), whose data consists of key/value pairs. For example, the data in the table below represents the correspondence between fruits and colors:
Applying the groupByKey() transformation to the data in the table gives the following result:
groupByKey() transformation:
Apple [Red, Green]
This transformation groups the values Red and Green under the key Apple. These are the examples of transformations given so far.
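The grouping above can be sketched with a pair RDD, assuming the Spark shell; the fruit/color pairs below are illustrative data standing in for the table:

```scala
// A pair RDD: every element is a (key, value) tuple.
val fruits = sc.parallelize(Seq(
  ("Apple", "Red"),
  ("Apple", "Green"),
  ("Banana", "Yellow")
))

// groupByKey gathers all values that share the same key.
fruits.groupByKey().collect().foreach { case (fruit, colors) =>
  println(s"$fruit ${colors.mkString("[", ", ", "]")}")
}
// prints each key with its grouped values, e.g. Apple [Red, Green]
```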
Once you have an RDD, after a filter for example, you can collect/materialize its data, sending it to your application; this is an example of an action. After the action runs, the data has left the RDD and gone to your application, but you can still perform operations on the RDD's data, because it remains in memory.
Collect or materialize the data in the linesWithSpark RDD:
linesWithSpark.collect()
It is worth mentioning that every time you perform a Spark action, such as count(), Spark re-runs all the transformations up to that point, computing everything again before the action returns its result, which makes repeated actions slow. To solve this problem and improve speed, you can cache the RDD's data in memory. That way, repeated actions do not have to recompute from scratch; the result is obtained directly from the cached RDD in memory.
Cache the linesWithSpark RDD:
linesWithSpark.cache()
If you want to remove the linesWithSpark RDD from the cache, use the unpersist() method.
Remove linesWithSpark from memory:
linesWithSpark.unpersist()
If you do not remove it manually, then when memory becomes tight, Spark uses a least-recently-used (LRU) eviction policy to remove RDDs cached in memory.
The following summarizes the Spark workflow from start to results:
- Create an RDD of some data type
- Apply transformations to the RDD's data, for example a filter
- Cache the transformed or filtered RDD if it will be reused
- Perform actions on the RDD, such as extracting data, counting, or storing the data in Cassandra
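The steps above can be sketched end to end in the Spark shell (the sample lines are illustrative; storing to Cassandra is omitted, as it requires the separate Spark-Cassandra connector covered in Part 2):

```scala
// 1. Create an RDD, here from a local collection.
val lines = sc.parallelize(Seq("Spark is fast", "Hadoop is mature", "Spark is easy"))

// 2. Transformation: keep only the lines that mention Spark.
val sparkLines = lines.filter(_.contains("Spark"))

// 3. Cache, because the RDD is reused by two actions below.
sparkLines.cache()

// 4. Actions: count the lines, then collect them to the driver.
println(sparkLines.count())   // 2
sparkLines.collect().foreach(println)
```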
Here is a partial list of RDD transformations:
- groupByKey()
- sortByKey()
- combineByKey()
- subtractByKey()
- mapValues()
Here is a partial list of RDD actions:
- countByKey()
- saveAsTextFile()
- reduce()
- collectAsMap()
- lookup(key)
For a listing and description of all RDD operations, see the Spark documentation.