Tachyon: A Distributed In-Memory File System for the Spark Ecosystem
Tachyon is a rapidly emerging new project within the Spark ecosystem. In essence, Tachyon is a distributed in-memory file system: it relieves memory pressure on Spark while giving Spark the ability to read and write large volumes of data at memory speed. By separating in-memory storage from Spark, Tachyon lets Spark concentrate on computation itself, achieving higher execution efficiency through a finer division of labor. This article first introduces Tachyon and its role in the Spark ecosystem, then shares use cases from Baidu's big data platform where Tachyon improved performance, along with some of the problems we encountered in practice and their solutions. Finally, we introduce some of Tachyon's new features.
An introduction to Tachyon
Spark's distributed in-memory computing model achieves high computing performance and has recently drawn wide attention in the industry; its open source community is also very active. At Baidu, for example, Spark clusters of over a thousand nodes have been set up and run internally, and Baidu also offers a Spark computing service through its BMR open cloud platform. Distributed in-memory computing, however, is a double-edged sword: while it improves performance, it must also face the problem of distributed data storage. The main issues are the following:
1. When two Spark jobs need to share data, they must do so by writing to disk. For example, job 1 first writes its output to HDFS, and job 2 then reads that data back from HDFS. Disk read/write speed can easily become the performance bottleneck here.
2. Because Spark caches data inside its own JVM, when a Spark program crashes and the JVM process exits, all cached data is lost; when the job is restarted, the data has to be read from HDFS all over again.
3. When two Spark jobs need to operate on the same data, each job's JVM caches its own copy. This not only wastes resources but also tends to trigger frequent garbage collection, degrading performance.
Careful analysis shows that the root of these problems lies in data storage: because the computing platform also tries to manage storage, Spark cannot concentrate on computation itself, which lowers overall execution efficiency. Tachyon was proposed to solve exactly these problems. In essence, Tachyon is a distributed in-memory file system: it relieves Spark's memory pressure while giving Spark the ability to read and write large volumes of data at memory speed. Tachyon separates storage and data I/O from Spark, letting Spark focus on computation and achieving higher execution efficiency through a finer division of labor.
Figure 1: Tachyon deployment
Figure 1 shows Tachyon's deployment structure. Tachyon sits between the computing platforms (Spark, MapReduce) and the storage platforms (HDFS, S3). By globally isolating the computing platforms from the storage platforms, Tachyon can effectively address the problems listed above:
- When two Spark jobs need to share data, they no longer have to go through disk; they can exchange it through Tachyon's memory, improving computational efficiency.
- When data is cached in Tachyon, it is not lost even if the Spark program crashes and its JVM process exits. After a restart, the Spark job can read the cached data directly from Tachyon.
- When two Spark jobs need to operate on the same data, they can both read it directly from Tachyon instead of each caching its own copy, which reduces JVM memory pressure and the frequency of garbage collection.
Tachyon system architecture
The previous section introduced the motivation behind Tachyon; this section takes a brief look at its architecture and implementation. Figure 2 shows how Tachyon is deployed on a Spark platform. Overall, Tachyon has three main components: the Master, the Client, and the Worker. A Tachyon Worker is deployed on every Spark worker node, and each Spark worker reads and writes data in Tachyon through a Tachyon Client. All Tachyon Workers are managed by the Tachyon Master, which uses periodic heartbeats from each Worker to detect whether a Worker has crashed and to track each Worker's remaining memory space.
Figure 2: Tachyon deployment on a Spark platform
Figure 3 shows the structure of the Tachyon Master. Its main functions are as follows. First, the Tachyon Master is a manager that handles requests from each Client; this work is done by the Service Handler. These requests include obtaining Worker information, reading a file's block information, creating a file, and so on. Second, the Tachyon Master acts as a name node, storing the metadata of all files: each file's information is encapsulated in an Inode, and each Inode records all the blocks belonging to that file. In Tachyon, a block is the file system's smallest unit of storage: assuming each block is 256 MB, a 1 GB file will be split into four blocks. Each block may have multiple replicas stored on multiple Tachyon Workers, so the Master must also record which Workers hold each block. Third, the Tachyon Master manages all the Workers: each Worker periodically sends the Master a heartbeat reporting that it is alive and how much free storage space it has. The Master records each Worker's last heartbeat time, used memory, total storage space, and other information in a Worker Info structure.
Figure 3: Tachyon Master design
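The block-splitting rule described above can be sketched in a few lines. This is an illustrative sketch, not Tachyon's actual code; the function name and the 256 MB default are assumptions for illustration.

```python
BLOCK_SIZE = 256 * 1024 * 1024  # assumed default block size: 256 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of the given size."""
    blocks = []
    offset, index = 0, 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((index, length))
        offset += length
        index += 1
    return blocks

# A 1 GB file is split into four 256 MB blocks, as in the example above:
print(len(split_into_blocks(1024 * 1024 * 1024)))  # 4
```

Note that the last block of a file may be shorter than the block size; only the block length, not the index, changes for such a tail block.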
Figure 4 shows the structure of the Tachyon Worker, which is mainly responsible for storage management. First, the Worker's Service Handler processes requests from Clients, including reading a block's information, caching a block, locking a block, requesting local storage space, and so on. Second, the main component of the Tachyon Worker is the Worker Storage, which manages Local Data (the local in-memory file system) and the Under File System (the disk file system beneath Tachyon, such as HDFS). Third, the Tachyon Worker also runs a Data Server to handle data read and write requests from other Clients. When a request arrives, Tachyon first looks for the data in local memory storage; if it is not found there, it tries the memory storage of other Tachyon Workers. If the data is not in Tachyon at all, it is read from the disk file system (e.g., HDFS) through the Under File System interface.
Figure 4: Tachyon Worker design
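The three-step lookup order just described (local memory, then other Workers' memory, then the Under File System) can be sketched as follows. The names here are assumptions for illustration, not Tachyon's actual API.

```python
def locate_block(block_id, local_mem, peer_mems, under_fs):
    """Report where a block would be served from, per the lookup order above."""
    if block_id in local_mem:            # 1. this Worker's memory storage
        return "local memory"
    for peer in peer_mems:               # 2. other Tachyon Workers' memory
        if block_id in peer:
            return "remote Worker memory"
    if block_id in under_fs:             # 3. the Under File System (e.g., HDFS)
        return "under file system"
    raise KeyError(block_id)

print(locate_block("b7", set(), [{"b7"}], set()))  # remote Worker memory
```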
Figure 5 shows the structure of the Tachyon Client. Its main function is to provide an abstract file system interface that hides the underlying implementation details. First, the Tachyon Client interacts with the Tachyon Master through the Master Client component, for example to ask the Master where a given block of a file is located. The Tachyon Client also interacts with Tachyon Workers through the Worker Client component, for example to request storage space from a Worker. The most important part of the Tachyon Client implementation is the Tachyon File abstraction. Within Tachyon File, the Block Out Stream is mainly used for writing files into local memory, while the Block In Stream is responsible for reading files from memory. Block In Stream has two implementations: Local Block In Stream reads files from local memory, while Remote Block In Stream reads files that are not in local memory. Note that such a file may be in another Tachyon Worker's memory or in the Under File System.
Figure 5: Tachyon Client design
Now let us string all the components together with a simple scenario. Suppose a Spark job issues a read request. It first asks the Tachyon Master, through the Tachyon Client, where the required block is located. If the block is not on the local Tachyon Worker, the Client sends a read request to another Tachyon Worker through a Remote Block In Stream. As the block is being read in, the Client simultaneously writes it to local memory storage through a Block Out Stream, which guarantees that the next request for the same block can be served by the local machine.
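The read-through-with-caching flow above can be sketched like this. This is a minimal sketch under assumed names (plain dicts stand in for the local and remote stores), not Tachyon's actual code.

```python
def read_block(block_id, local_store, remote_store):
    """Read a block; on a remote hit, also cache it locally (write-through)."""
    if block_id in local_store:       # served by the local Worker
        return local_store[block_id]
    data = remote_store[block_id]     # Remote Block In Stream: remote read
    local_store[block_id] = data      # Block Out Stream: cache locally
    return data                       # next read of this block is local

local = {}
remote = {"b1": b"payload"}
read_block("b1", local, remote)       # remote read, cached as a side effect
print("b1" in local)                  # True: next request is served locally
```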
Using Tachyon at Baidu
At Baidu, we use Spark SQL for big data analysis. Since Spark is a memory-based computing platform, we expected most data queries to finish within seconds or tens of seconds, enabling interactive querying. But when running on our Spark computing platform, we found that queries took hundreds of seconds to complete. The reason is shown in Figure 6: our computing resources (Data Center 1) and our data warehouse (Data Center 2) may not be in the same data center, in which case every data query may have to read data from the remote data center. Because of the limited network bandwidth and the latency between the data centers, each query took a long time (over 100 seconds) to complete. Worse, many queries were highly repetitive: the same data might be queried many times, and reading it from the remote data center every time wasted resources.
To solve this problem, we used Tachyon to cache the data locally and avoid cross-data-center reads. Tachyon is deployed in the data center where Spark runs: on every cold query we still fetch the data from the remote data warehouse, but when the same data is queried again, Spark reads it from Tachyon in the local data center, which greatly improves query performance. Experiments showed that reading data from Tachyon brings query time down to 10 to 15 seconds, a tenfold improvement over the original performance. In the best case, when the data is read from Tachyon on the local node, a query takes only about 5 seconds, a thirtyfold improvement, which is a very noticeable effect.
With this optimization, warm-query performance met the requirement for interactive queries, but the cold-query experience was still poor. Analyzing user behavior, we found that users' query patterns are quite fixed: for example, many users run the same query every day, with only the date used to filter the data changing. Exploiting this, we can compute users' recurring queries offline and import the needed data into Tachyon ahead of time, thereby avoiding cold queries altogether.
Figure 6: Tachyon deployment in Baidu's big data platform
We also ran into a few problems while using Tachyon. When we first deployed it, we found that data was not being cached at all: the first query and subsequent queries took the same time. The source code in Figure 7 reveals why: a block is cached only after it has been read in its entirety; otherwise the caching operation is cancelled. For example, with a 256 MB block, if a query reads only 255 MB of it, the block still will not be cached, because the query never needed to read the whole block. At Baidu, much of our data is stored in columnar formats such as ORC and Parquet files, and each query reads only a few of the columns, so it never reads a complete block and block caching fails every time. To solve this, we modified Tachyon: if a block is not too large, then even when a cold query requests only a few columns, we read the whole block so that the entire block can be cached; the next query can then read it directly from Tachyon. With this modified version, Tachyon achieved the effect we had hoped for, and most queries complete within 10 seconds.
Figure 7: Tachyon's data caching logic
Some new features of Tachyon
We use Tachyon as a cache, but each machine has limited memory, so memory is used up quickly. If we have 50 machines, each contributing 20 GB of memory to Tachyon, the total cache is only 1 TB, far from enough for our needs. The latest version of Tachyon has a new feature, Hierarchical Storage, which uses different storage media to cache data in tiers. As shown in Figure 8, the design is analogous to the CPU cache hierarchy: memory has the fastest reads and writes, so it serves as the level 0 cache; SSDs serve as the level 1 cache; and local disk serves as the bottom-level cache. This design gives us much larger cache space: with the same 50 machines, each can now contribute 20 TB of cache, bringing the total cache space to 1 PB, which basically meets our storage needs. As with CPU caches, if Tachyon's block replacement policy is designed well, 99% of requests can be served from the level 0 cache (memory), achieving response times on the order of seconds most of the time.
Figure 8: Tachyon Hierarchical Storage
When Tachyon receives a read request, it first checks whether the data is in level 0. On a hit, the data is returned directly; otherwise Tachyon queries the next level down, and so on until the requested data is found. Once found, the data is returned to the user directly and at the same time promoted to level 0; the block it displaces at level 0 is evicted by an LRU policy to the next level down. Thus, if the user requests the same data again, it is served quickly straight from level 0, fully exploiting the locality property of the cache.
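The tiered read path with promotion and LRU demotion can be sketched as below. This is a minimal stdlib-only sketch under assumed structure (an `OrderedDict` per tier models LRU order), not Tachyon's implementation.

```python
from collections import OrderedDict

class TieredCache:
    def __init__(self, capacities):
        # tiers[0] is level 0 (memory); insertion order in each
        # OrderedDict gives us LRU order (oldest entry first).
        self.tiers = [OrderedDict() for _ in capacities]
        self.capacities = list(capacities)

    def _demote(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # evicted past the bottom tier: dropped
        tier = self.tiers[level]
        tier[block_id] = data
        if len(tier) > self.capacities[level]:
            victim, vdata = tier.popitem(last=False)  # LRU victim
            self._demote(level + 1, victim, vdata)    # pushed one level down

    def read(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                data = tier.pop(block_id)
                if level > 0:
                    self._demote(0, block_id, data)   # promote to level 0
                else:
                    tier[block_id] = data             # refresh LRU position
                return data
        raise KeyError(block_id)  # not cached: must go to the under file system

cache = TieredCache(capacities=[1, 2])
cache.tiers[1]["b1"] = b"x"   # block currently sitting at level 1 (e.g., SSD)
cache.read("b1")              # hit at level 1: promoted to level 0
print("b1" in cache.tiers[0])  # True
```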
When Tachyon receives a write request, it first checks whether level 0 has enough space; if so, it writes the data there directly and returns. Otherwise it queries the next level down until it finds a level with enough space, then uses the LRU policy to move a block from each level to the level below it, one level at a time, until level 0 has enough room for the new data, and then returns. The goal is to guarantee that the data is written to level 0, so that if a read request arrives immediately after the write, the data can be read back quickly. Doing this, however, can make write performance very poor. For example, if the first two levels are full, Tachyon must first move a block from level 1 to level 2, then move a block from level 0 to level 1, and only then can it write the data to level 0 and return to the user.
We made an optimization: instead of cascading blocks down level by level to make room, our algorithm writes the data directly to the first level that has enough space and then returns to the user quickly. If every level is full, a block is evicted from the bottom level by LRU, the data is written to the bottom-level cache, and then we return. Our experiments showed that this optimized flow reduces write latency by about 50%, greatly improving write efficiency. Read efficiency is not hurt either: because Tachyon writes through memory-mapped files, data is first written to memory and only later flushed to disk, so if a read occurs immediately after the write, the data is actually served from the operating system's buffer, i.e., read from memory, and read performance does not degrade.
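The optimized write path can be sketched as follows. This is a sketch under assumed structure (one `OrderedDict` per tier, oldest entry = LRU victim) and reflects our reading of the optimization described above, not Tachyon's actual code.

```python
from collections import OrderedDict

def write_block(tiers, capacities, block_id, data):
    """Write to the first tier with free space; if all tiers are full,
    evict the bottom tier's LRU block and write there instead.
    tiers: list of OrderedDicts, index 0 = memory (top tier)."""
    for level, tier in enumerate(tiers):
        if len(tier) < capacities[level]:
            tier[block_id] = data        # first tier with room wins
            return level
    bottom = tiers[-1]
    bottom.popitem(last=False)           # all full: evict bottom LRU victim
    bottom[block_id] = data              # write to the bottom tier, return
    return len(tiers) - 1

tiers = [OrderedDict(a=1), OrderedDict()]   # level 0 full, level 1 empty
print(write_block(tiers, [1, 1], "b", 2))   # 1: lands directly in level 1
```

The trade-off is that a freshly written block may land below level 0, but the write returns immediately; the promotion-on-read path will pull the block up to level 0 if it turns out to be hot.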
Hierarchical Storage solves our problem of insufficient cache space well, and we will continue to optimize it. For example, it currently offers only an LRU replacement policy, which cannot satisfy every application scenario; we plan to design more efficient replacement policies for different scenarios, to further improve the cache hit ratio.