Scaling the Facebook data warehouse to 300 PB
Facebook faces a unique set of scalability challenges in data warehouse storage. Our Hive-based data warehouse stores more than 300 petabytes of data and is growing by roughly 600 terabytes per day; over the last year, the amount of data stored has tripled. Given this growth trajectory, storage efficiency is, and will remain for some time, a central concern for our data warehouse infrastructure.
We have made many innovations to improve the storage efficiency of the warehouse, such as building a cold-storage data center, applying a RAID-like technique in HDFS to reduce replication overhead while keeping data highly available, and compressing data before it is written to HDFS. At Facebook, Hive is the most widely used system for running transformations over large volumes of raw logs; Facebook's Hive runs on the Corona Map-Reduce framework and is used to process data, build data warehouse tables, serve as a query engine, and more. In this article we discuss the evolution of Hive's table storage format, with the main goal of making it compress raw data as efficiently as possible.
When data is first loaded into tables in our warehouse, it uses a storage format Facebook developed itself: the Record-Columnar File Format (RCFile). RCFile is a hybrid storage format that allows queries by row while providing the compression efficiency of columnar storage. Its core idea is to first partition a Hive table horizontally into row groups, then partition each row group vertically by column, so that the data belonging to one column within a group is stored contiguously on disk.
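As a rough illustration of this hybrid layout, the sketch below partitions rows into groups and then stores each group column-wise. This is a toy Python model (RCFile itself is implemented in Java); the function name and group size are illustrative only.

```python
# Illustrative sketch of an RCFile-style hybrid layout: partition the
# table horizontally into row groups, then store each group's data
# column by column so that a column's values are contiguous.
def to_row_groups(rows, group_size):
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        # regroup by column: one contiguous list per column
        columns = [list(col) for col in zip(*group)]
        groups.append(columns)
    return groups

rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
groups = to_row_groups(rows, group_size=2)
```

Within each group, all values of column 0 sit together, then all values of column 1, which is what makes per-column compression and column skipping possible.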
Once all the columns of a row group have been written to disk, RCFile compresses the data column by column using general-purpose algorithms such as zlib or LZO. When reading column data, it applies a lazy decompression strategy: if a user's query touches only some of a table's columns, RCFile skips the decompression and deserialization of the columns it does not need. On a representative sample of tables selected from our warehouse, RCFile delivered roughly a 5x compression ratio.
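The lazy-decompression idea can be sketched as follows. This is a toy Python model using zlib and JSON serialization for illustration, not RCFile's actual reader.

```python
import json
import zlib

# Toy model of lazy decompression: each column is held compressed and
# is only decompressed the first time a query actually reads it.
class LazyColumn:
    def __init__(self, values):
        self._blob = zlib.compress(json.dumps(values).encode())
        self.was_decompressed = False

    def read(self):
        self.was_decompressed = True
        return json.loads(zlib.decompress(self._blob).decode())

row_group = {
    "id": LazyColumn([1, 2, 3]),
    "payload": LazyColumn(["big", "strings", "here"]),
}
# A query that touches only "id" never pays to decompress "payload".
ids = row_group["id"].read()
```

The untouched column stays compressed in memory, which is exactly the saving lazy decompression is after.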
What comes after RCFile?
As the volume of data in the warehouse continued to grow, engineers on the team began researching techniques to improve compression efficiency. The research focused on column-level encodings such as run-length encoding, dictionary encoding, and frame-of-reference encoding, which reduce logical redundancy at the column level before the general-purpose compression step. We also experimented with new column types (for example, JSON is a widely used format internally; storing JSON data in a structured way both supports efficient queries and reduces the storage redundancy of repeated JSON metadata). Our experiments showed that, applied appropriately, column-level encodings could significantly improve RCFile's compression ratio.
Meanwhile, Hortonworks was experimenting with similar ideas for improving the Hive storage format. The Hortonworks engineering team designed and implemented ORCFile (both the storage format and its read/write interfaces), which gave us an excellent starting point for designing and implementing a new storage format for Facebook's data warehouse.
When Hive writes data to disk in the ORCFile format, the data is partitioned into a series of 256 MB stripes; a stripe is analogous to a row group in RCFile. Within each stripe, ORCFile first encodes each column's data, then compresses all of the columns across the stripe using a general-purpose algorithm such as zlib. String columns are dictionary-encoded, with the dictionary built over all of the rows within the stripe. Each stripe also stores an index entry for every 10,000 rows, recording the minimum and maximum value of each column, so that filter-style queries can skip ranges of rows that cannot match.
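The min/max row index can be sketched like this. The stride of 3 below is only for illustration; as noted above, ORCFile records an entry per 10,000 rows.

```python
# Sketch of an ORCFile-style row index: for each stride of rows,
# record the column's min and max so that a filter can skip strides
# whose value range cannot possibly match the predicate.
def build_row_index(values, stride):
    return [(min(values[i:i + stride]), max(values[i:i + stride]))
            for i in range(0, len(values), stride)]

def matching_strides(index, wanted):
    # keep only strides whose [min, max] range could contain `wanted`
    return [i for i, (lo, hi) in enumerate(index) if lo <= wanted <= hi]

ages = [21, 34, 28, 55, 60, 52, 18, 25, 30]
index = build_row_index(ages, stride=3)
```

A query such as `WHERE age = 57` would consult the index and read only the strides whose range covers 57, skipping the rest entirely.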
Beyond the compression improvements, a significant advantage of the new storage format is that columns and rows are located by recorded offsets, so there is no need for delimiters to mark the end of each row. RCFile, by contrast, reserves certain ASCII values as delimiters, and those values are therefore not allowed to appear in the data stream. In addition, the query engine can use the stripe-level and column-level metadata in the file to optimize query execution.
Adaptive column-level encoding
When we began testing ORCFile on our warehouse, we found that some Hive tables compressed well while others actually grew, so the compression gains on a representative set of test tables were unimpressive. If a column's data has high entropy, dictionary encoding inflates it rather than shrinking it, so dictionary-encoding every string column by default is inappropriate. We considered two ways to decide whether a column should be dictionary-encoded: statically, through user-specified column metadata, or dynamically, by examining the column's values at run time. We chose the latter because it is compatible with the large number of existing tables in our warehouse.
We ran many tests aimed at maximizing the compression ratio without hurting ORCFile's write performance. String types dominate our largest tables, and about 80% of the columns in the warehouse are strings, so optimizing string compression mattered most. For each column in each stripe we set a threshold on the number of distinct values, and we modified the ORCFile write interface so that it dictionary-encodes the data within a stripe only when dictionary encoding would actually improve compression efficiency. We also sample the column's values and examine their character set: if the character set is small, a general-purpose algorithm like zlib already compresses the data well, and dictionary encoding is unnecessary, sometimes even counterproductive.
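The run-time decision described above might be sketched as follows. The threshold values here are illustrative placeholders, not Facebook's actual tuning.

```python
# Heuristic sketch: dictionary-encode a string column in a stripe only
# when (a) the fraction of distinct values is low enough and (b) the
# sampled character set is not already tiny, since zlib alone handles
# small character sets well. Both cutoffs are hypothetical.
def should_dictionary_encode(values,
                             max_distinct_ratio=0.5,
                             min_charset_size=8):
    distinct_ratio = len(set(values)) / len(values)
    charset = set("".join(values))
    return (distinct_ratio <= max_distinct_ratio
            and len(charset) >= min_charset_size)

repetitive = ["mobile", "desktop", "mobile", "mobile", "desktop", "tablet"]
unique_ids = ["u%06d" % i for i in range(100)]
```

A low-cardinality column like `repetitive` passes the check, while a high-entropy column of unique identifiers fails it and falls through to plain zlib compression.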
For large integer columns, we considered both run-length encoding and dictionary encoding. In most cases run-length encoding is only slightly better than general-purpose compression alone, but when a column consists of a small number of distinct values, dictionary encoding performs much better. Based on these results, we use dictionary encoding rather than run-length encoding for large integer data as well. Applying this adaptive-encoding idea to both string and integer types is what gives ORCFile its high compression efficiency.
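To see why dictionary encoding can beat run-length encoding for integer columns with few distinct values but no long runs, compare element counts in a toy example. Element count is a crude proxy for encoded size, used here for illustration only.

```python
# A column with only two distinct values but no runs: RLE degenerates
# to one run per value, while a dictionary stores two entries plus one
# small id per row.
def rle_element_count(values):
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return runs * 2                       # each run stores (value, length)

def dict_element_count(values):
    return len(set(values)) + len(values) # dictionary entries + ids

alternating = [500, 900] * 10             # 20 values, no run longer than 1
```

On this worst case for RLE, the dictionary representation is roughly half the size, matching the observation that dictionary encoding wins when the distinct-value count is small.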
We also tested many other ways to improve the compression ratio. One idea worth mentioning is adaptive run-length encoding: a heuristic that applies run-length encoding only when it would actually improve compression. In the open-source version of ORCFile, this strategy is used to select adaptively among several encoding methods for integer data. It improved the compression ratio but hurt write performance. We also studied the effect of stripe size on the compression ratio. Contrary to our expectations, increasing the stripe size did not significantly improve compression: as the stripe grows, the number of dictionary entries grows with it, which increases the number of bytes each encoded column value occupies. The hoped-for benefit of storing fewer dictionaries did not materialize, so 256 MB remains a good empirical choice of stripe size.
Because at our scale write speed affects query performance, we made many performance improvements to the open-source ORCFile write interface. The key ones eliminate redundant or unnecessary operations and optimize memory usage.
A key issue in our improved write interface is the cost of building a sorted dictionary. To keep the dictionary ordered, the open-source ORCFile writer stores it in a red-black tree. With adaptive dictionary encoding, however, even a column that ultimately proves unsuitable for dictionary encoding pays an O(log(n)) red-black tree insertion for every new key. By storing the dictionary in a memory-efficient hash map instead, and sorting it only when a sorted dictionary is actually needed, we reduced the dictionary's memory footprint by 30% and, more importantly, improved write performance by 1.4x. To make the dictionary faster and more memory-efficient to resize, it initially stored its elements as an array of byte arrays; but because dictionary elements are accessed very frequently, we switched to a Slice class from an underlying library for efficient memory copies, which improved write performance by a further 20% to 30%.
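The hash map plus sort-on-demand idea can be sketched as follows. This is an illustrative Python model; the real writer is Java and its details differ.

```python
# Sketch of the improved dictionary: O(1) hash-map inserts while
# writing, with sorting deferred until the stripe is flushed and a
# sorted dictionary is actually required.
class StripeDictionary:
    def __init__(self):
        self._ids = {}                     # value -> insertion id

    def add(self, value):
        # O(1) amortized, versus O(log n) per red-black tree insert
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]

    def flush(self):
        # sort once per stripe; remap insertion ids to sorted positions
        ordered = sorted(self._ids)
        position = {v: i for i, v in enumerate(ordered)}
        remap = {self._ids[v]: position[v] for v in ordered}
        return ordered, remap

d = StripeDictionary()
encoded = [d.add(v) for v in ["pear", "apple", "pear", "banana"]]
ordered, remap = d.flush()
sorted_ids = [remap[i] for i in encoded]
```

The point of the design is that a column which is later judged unsuitable for dictionary encoding never pays for sorting at all, since `flush` is only invoked when the dictionary is kept.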
Deciding whether to dictionary-encode a column consumes computing resources. Because a column's characteristics tend to repeat from stripe to stripe, we found no advantage in re-evaluating the dictionary-encoding decision for every stripe. We therefore improved the write interface to judge a column's encoding algorithm on a subset of stripes and reuse that decision for the same column in subsequent stripes. In other words, if the writer determines that dictionary encoding does not help a column's values, it intelligently skips the dictionary for that column in the following stripes. Because Facebook's own ORCFile format improved compression efficiency, we were also able to lower the zlib compression level we had been using, improving write performance by another 20%.
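One way to sketch this carry-forward of per-column decisions across stripes is shown below. The probe count, threshold, and structure are all hypothetical, for illustration only.

```python
# Sketch: evaluate the dictionary-encoding decision on the first few
# stripes of a column, then reuse the cached verdict for later stripes
# instead of re-sampling every stripe.
class ColumnEncodingPolicy:
    def __init__(self, probe_stripes=2):
        self.probe_stripes = probe_stripes
        self.seen = 0
        self.verdict = None                  # None while still probing

    def use_dictionary(self, stripe_values):
        if self.verdict is not None:
            return self.verdict              # cached; no re-evaluation
        helps = len(set(stripe_values)) / len(stripe_values) < 0.5
        self.seen += 1
        if self.seen >= self.probe_stripes or not helps:
            self.verdict = helps             # lock in the decision
        return helps

policy = ColumnEncodingPolicy()
first = policy.use_dictionary(["a"] * 9 + ["b"])              # probing
second = policy.use_dictionary(["a", "a", "a", "a", "b"])     # locks in
third = policy.use_dictionary(list("abcdefgh"))               # cached
```

Note that the third stripe is all-distinct and would fail the check on its own, but the cached verdict is reused, which is exactly the cost saving the carry-forward buys.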
Turning to read performance, the first thing that comes to mind is lazy column decompression. Consider, for example, a query that reads many columns but has a filter condition on only one of them. Without lazy decompression, all of the columns would be read and decompressed. Ideally, only the column carrying the filter would be decompressed and decoded, so the query would not waste time decompressing and decoding data that turns out to be irrelevant.
To that end, we used the index strides of the ORCFile storage format to implement lazy decompression and lazy decoding in Facebook's ORCFile read interface. In the example above, all of the data for the column carrying the filter condition is decompressed and decoded. For the other columns, we changed the read interface to seek to the corresponding index stride within the stripe (a metadata operation on the stripe) and to decompress and decode only the rows in that stride that are actually needed. In our tests, this improvement made our Facebook ORCFile version three times faster than the open-source version on simple filter-style (select) queries. Because it avoids lazily decoding data it does not need, Facebook ORCFile also outperforms RCFile on this kind of simple filter query.
With these improvements in place, ORCFile delivers a compression ratio 5x to 8x higher than RCFile on our warehouse data. Testing with a series of typical queries and data selected from the warehouse, we also found that Facebook's ORCFile writer is 3x faster than the open-source version.
We have already applied the new storage format to many tables of tens of petabytes each, and converting their data from RCFile to ORCFile has reclaimed tens of petabytes of storage capacity. We are rolling the new format out to the remaining tables in the warehouse, where it will bring the same improvements in storage efficiency and in read and write performance. Our internal ORCFile code is open source, and we are working closely with the open-source community to merge these improvements into the Apache Hive code base.