New Visualizations for Understanding Spark Streaming Applications
In an earlier post ("Release 1.4: SparkR and visualization of the Tungsten plan" [Chinese]), we introduced the new visualization features in Spark 1.4.0 that help you better understand the behavior of Spark applications. Continuing that topic, this post highlights the new visualization features introduced for understanding Spark Streaming applications. We have updated the Streaming tab in the Spark UI to display the following information:
- Timeline views and statistics of event rates, scheduling delays, and processing times of previous batches
- Details of all the jobs in each batch
In addition, to make the execution context of the jobs easier to understand, the execution DAG visualization has been augmented with Streaming-specific information.
Let's walk through these new features from beginning to end using an example Streaming application.
Timelines and histograms of processing trends
When debugging a Spark Streaming application, we usually want to see at roughly what rate data is being received and how long each batch takes to process. The new Streaming tab in the UI lets you easily see the current values as well as the trend over the last 1,000 batches. When you run a Streaming application and visit the Streaming tab in the Spark UI, you will see something similar to Figure 1 below (the red letters, e.g. [A], are our annotations and are not part of the UI).
Figure 1: The Streaming tab in the Spark UI
The first line (marked [A]) shows the current state of the Streaming application; in this example, the application has been running with a 1-second batch interval for nearly 40 minutes. Below it, the Input Rate timeline (marked [B]) shows that the Streaming application has been receiving data from all of its sources at a rate of about 49 events per second. In this case, the timeline shows a noticeable dip in the middle (marked [C]), after which the application recovers to the earlier average rate towards the end of the timeline. For more detail, you can click on the drop-down list next to Input Rate (near [B]) to show a separate timeline for each source, as shown in Figure 2 below:
Figure 2 shows that the application has two sources (SocketReceiver-0 and SocketReceiver-1), one of which caused the drop in the overall receive rate because it stopped receiving data for a period of time.
Further down the page (marked [D] in Figure 1), the Processing Time timeline shows that batches are being processed in about 20 milliseconds on average. Since the processing time is much shorter than the batch interval (1 second in this case), the scheduling delay (defined as the time a batch waits for previous batches to finish processing, marked [E]) is almost zero, because each batch is processed as soon as it is created. The scheduling delay is the key indicator of whether a Streaming application is stable, and the new UI features make it easy to monitor.
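To make the scheduling-delay definition concrete, here is a tiny plain-Python simulation (not Spark code; the function name and numbers are hypothetical). Each batch arrives at a multiple of the batch interval and must wait for earlier batches to finish; the wait is its scheduling delay.

```python
# Toy simulation of scheduling delay: the time each batch waits for
# earlier batches to finish before its own processing can start.
def scheduling_delays(batch_interval, processing_times):
    delays = []
    finish = 0.0  # time the previous batch finished processing
    for i, proc in enumerate(processing_times):
        arrival = i * batch_interval      # the batch is created at this time
        start = max(arrival, finish)      # it may have to wait in the queue
        delays.append(start - arrival)    # scheduling delay for this batch
        finish = start + proc
    return delays

# Processing time (20 ms) well below the batch interval (1 s): no delay.
print(scheduling_delays(1.0, [0.02] * 5))  # [0.0, 0.0, 0.0, 0.0, 0.0]

# Processing time above the interval: the delay grows with every batch,
# which is exactly the unstable situation the UI helps you spot.
print(scheduling_delays(1.0, [1.5] * 4))   # [0.0, 0.5, 1.0, 1.5]
```

This is why a near-zero scheduling delay in the timeline means the application is keeping up with its input.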
Looking at Figure 1 again, you may wonder why some of the batches towards the right took longer to complete (note [F] in Figure 1). You can easily analyze this through the UI. First, click on the point in the timeline with the longer batch processing time; this brings up a list at the bottom of the page with details of the completed batches.
It shows all the key information for the batch (highlighted in green in Figure 3 above). As you can see, this batch has a longer processing time than the others. The next obvious question is: which Spark jobs caused the long processing time for this batch? You can click on the Batch Time (the blue link in the first column), which takes you to the details of the corresponding batch, showing its output operations and their Spark jobs, as shown in Figure 4.
Figure 4 shows one output operation, which created three Spark jobs. You can click on a Job ID link to drill further into the stages and tasks for deeper analysis.
Execution DAGs of Streaming RDDs
Once you start analyzing the stages and tasks of a batch's jobs as described above, a deeper understanding of the execution graph becomes very useful. As mentioned in the previous post, Spark 1.4.0 added the execution DAG (directed acyclic graph) visualization, which shows the chain of RDD dependencies and how an RDD is processed through a series of stages. If these RDDs are generated by DStreams in a Streaming application, the visualization shows additional Streaming semantics. Let's start with a simple Streaming word count program that counts the number of words received in every batch: the example program NetworkWordCount. It uses the DStream operations flatMap, map, and reduceByKey to compute the word counts. The execution DAG of a Spark job in one batch will look like Figure 5 below.
In the visualization, the black dots represent the RDDs produced by the DStreams in the batch at 16:06:50. The blue-shaded boxes are the DStream operations used to transform the RDDs, and the pink boxes are the stages in which those transformations are executed. Figure 5 shows the following:
- The data was received through a socket text stream in the batch at 16:06:50.
- The job computed the word counts of the data using the flatMap, map, and reduceByKey transformations in two stages.
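For readers unfamiliar with these three transformations, here is a plain-Python sketch (not Spark code; the helper name is hypothetical) of what they do to a single batch of input lines, mirroring the per-batch semantics of the NetworkWordCount example.

```python
from collections import defaultdict

def word_count_batch(lines):
    """Mimic flatMap -> map -> reduceByKey on one batch of text lines."""
    # flatMap: split each line into words, flattening into one sequence
    words = [w for line in lines for w in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

batch = ["hello spark streaming", "hello world"]
print(word_count_batch(batch))
# {'hello': 2, 'spark': 1, 'streaming': 1, 'world': 1}
```

In the real job, the flatMap and map steps run in the first stage, and reduceByKey forces a shuffle into the second stage, which is why the DAG in Figure 5 has two stages.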
While this is a simple graph, it can become much more complicated with additional input streams and advanced DStream transformations such as window operations and updateStateByKey. For example, if we count words over a moving window spanning three batches (that is, using reduceByKeyAndWindow), with data coming from two socket text streams, then the execution DAG of a batch's job will look like Figure 6 below.
Figure 6 shows a lot of information about a Spark job that counts words across three batches:
- The first three stages each count the words of one of the three batches in the window. They are similar to the first stage of the NetworkWordCount example above, using the flatMap and map operations. However, note the following differences:
- There are two input RDDs, one from each socket text stream. The two RDDs are combined into a single RDD with a union, then transformed further to produce the intermediate word counts for each batch.
- Two of these stages are grayed out because the intermediate results of the two older batches are already cached in memory and do not need to be recomputed; only the latest batch needs to be computed from scratch.
- The last stage on the right uses reduceByKeyAndWindow to combine the per-batch counts into the final "windowed" counts.
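The data flow described above can be sketched in plain Python (not Spark code; the function names and sample data are hypothetical): per-batch counts from the two unioned sources are computed once each, and the window result just re-combines the last three of them, which is why only the newest batch's stage has to run.

```python
from collections import Counter

def batch_counts(lines):
    """Per-batch word counts: the intermediate result Spark caches."""
    return Counter(w for line in lines for w in line.split())

def windowed_count(source1_batches, source2_batches, window=3):
    """Mimic union + reduceByKeyAndWindow over the last `window` batches."""
    # union: merge the two input streams batch by batch
    per_batch = [batch_counts(a) + batch_counts(b)  # Counter '+' sums counts
                 for a, b in zip(source1_batches, source2_batches)]
    # reduceByKeyAndWindow: combine the last `window` per-batch counts
    window_total = Counter()
    for counts in per_batch[-window:]:
        window_total += counts
    return dict(window_total)

s1 = [["a b"], ["a"], ["b b"], ["a a"]]  # four 1-line batches from source 1
s2 = [["c"], ["a c"], ["c"], ["b"]]      # four 1-line batches from source 2
print(windowed_count(s1, s2))  # counts over the last three batches only
```

In Spark, the cached per-batch counts correspond to the grayed-out stages in Figure 6, and only the latest entry of `per_batch` would actually be computed in the current batch.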
These visualizations allow developers not only to monitor the status and trends of their Streaming applications, but also to understand their relationship to the underlying Spark jobs and execution plans.
Future directions
A much-anticipated improvement coming in Spark 1.5.0 is information about the input data of each batch (see the JIRA and PR for more details). For example, if you are using Kafka, the batch details page will display the topics, partitions, and offsets processed by the batch, as previewed in the figure below: