Based on the LLVM memory computing
Recent industry there are a lot of technology and product are considered to belong to the category of memory computing, and everybody think memory computing is the next big data aspects of core technology, especially like the Spark and HANA products and technology, make memory computing has in big data technology into the mainstream.But as a memory computing research in 3 years of the author, feeling a lot of people understanding of the memory to calculate only stay in the data cached in memory, or use the latest SIMD instruction set, but I found in this three years of research and development in the process of the biggest is not the case, I personally think memory computing engine compared with traditional data processing engine, the biggest innovation is based on the LLVM compiler dynamic code generation, so this article will introduce you to the products and technologies is how to use the LLVM compiler to dynamically generated code execution, so as to realize the true sense of memory, so before further this technology, want to give you a little about LLVM technology itself.
LLVM - moving behind the hero of The Times
Short for LLVM is Low Level Virtual Machine, first of all, you don't be misled by the name, that it is like VMware and KVM Virtual Machine project, in fact it is a large similar GCC compiler framework.And the name of LLVM itself along with the continuous development of the project, has been unable to fully represent the project, just as the name continued.
As a compiler, LLVM itself and the GCC, is an open source project, the agreement is BSD, rather than the GPL used Linux and GCC.It is the earliest time of the university of Illinois at urbana-champaign champagne a research project, the principal is Daniel Chris Lattner, and the project itself is also very honored to won the award for 2012 ACM software system.
LLVM can as the back-end of multiple languages, it can provide a programmable language independent optimization and code generation for a lot of kinds of CPU function, compared with GCC, its overall architecture is more clear and faster, in addition, LLVM has not only is a programming framework, it also contains a lot of components, such as the most prestigious for C compiler Clang, in addition, there are used to link the GCC front-end parsers Dragonegg, etc.
Although the LLVM is a strange word for many programmers, but played a crucial role in the era of the whole mobile, both Android and iOS platform.
For iOS, Chris Latter now works at Apple, and Apple is one of the main sponsors LLVM current.Before the iOS, the birth of LLVM will bring great help to OSX itself development, because Apple was trapped in the GCC can't provide them the perfect support, especially related to Object - C compiler part, at a time when the presence of the LLVM
project to help them solve urgent needs, including behind the Grand Dispatch the shadow of the core functions are LLVM.Era to iOS, LLVM has is the core of the iOS platform bottom, no matter from the compilation of iOS itself, as well as the XCode debugging support, or the latest language Swift also are made by the father of the LLVM Chris Latter my design.
Although the Android platform is iOS rivals, but the Android platform for LLVM rely on more quickly, although before 4.4 is done with self-built Dalvik VM, using a JIT (Just - in - time Compiler) to perform this instant compiled form, but in terms of performance and user experience, so the Android version 4.4 introduced a mode of ART, ART is directly using the LLVM do AOT compilation Ahead of time, namely when installed directly to the program compiled to machine code, Android users using the App can at ordinary times like iOS users executing native code directly, such making that the overall performance of Android is near the level of the iOS, although at the time of installation, may need to spend more time to do the precompiled.So LLVM is behind the hero of the mobile era, what is the relationship between it and traditional data processing?In dive into LLVM is how to change the traditional data processing engine, let us know about the traditional data processing engine.
Traditional data processing engine
As is known to all, the original traditional data processing engine (such as Postgres, Oracle, etc.) the main bottleneck in the I/O, because at that time, the speed of the bottom hard disk are, for many years no progress, but with more nodes in parallel in recent years, SSD, development of I/O resources, such as large memory, more important is the column technology popularization and evolution, the I/O problems is no longer a great bottleneck, but came in the network and CPU bottlenecks emerge instead, due to network bottleneck is not the focus in this paper, so the main chat data processing engine CPU bottlenecks.
Figure 1 complicated data processing engine code
Although a new generation of computing engine based on memory, I/O performance than traditional performance improvement based on the hard drive is about 10 times, but at the same time, with the CPU bottleneck will be more apparent, based on our previous test results, in the traditional Postgres engine as an example, the operating data are cached in memory Page Cache, it make the simple Count (*) statistics are barely reached around 4 million lines per second, and operation of the real need is very few, in accordance with the 2010 X86 CPU, actually handle these calculation is very simple, but the actual performance is very low, why?Here, and talk to you now first X86 CPU performance characteristics, with the development of the technology itself, the processing capacity of X86 CPU itself is very powerful, but all in the Context performance small landslides will occur.If waiting to be processed commands and data is not cached in advance on the Cache, but need to access the memory, there will be a greater performance of landslides.Now that everyone understand the
bottleneck of the CPU itself, chat here from traditional database processing engine board have?Mainly has the following four points:
One is conditional logic redundancy, the data processing engine code is very complicated, because the SQL statement itself is very complex, so the data engine to support complex SQL statements, making data processing engine need complex conditional logic to handle, as shown in figure 1, or a Switch inside the loop there will be hundreds of thousands of case that choice logic, while the Switch cycle itself will be the compiler to a certain degree of optimization, but at the end of the branch of machine code instructions will to some extent, prevent the pipelining (instruction pipelining) and parallel execution (instruction level parallelism).
Secondly, virtual function calls, and the cause of the first question, because the data processing engine to support complex SQL statements, there are dozens of data types, such as, the application in dealing with the add this logic, the data processing engine needs according to the data source is an INT or BIGINT to choose different function to deal with, so in the actual processing, certainly must be transferred to the implementation of specific functions by virtual functions, the impact on the CPU must be very obvious, because a lot of time running costs of a virtual function calls itself, execution cost is higher than the function itself.Because of this, more inline function is the most common way of performance optimization also cannot be used.
Third is the need to constantly invoke data from memory, not a one-time load data from memory to the Cache, For example, common For loop, although know the next offset data, but still want to read from the memory it, read this expensive and prevent the CPU pipelining operations.
Four is because different x86 CPU s different, so support different extension instruction set, and the new instruction set for many operations can improve the performance of more than 100%, for example, some relatively new CPU support SSE 4.2, and some don't support, in order to ensure that data engine can across different hardware platforms, so rarely support some of the instruction set extension data engine, resulting in wasted could improve the performance of the didn't get the support.
Although these bottlenecks are difficult to overcome, but Google r&d Mr. Tenzing technology based on LLVM compiler framework to realize dynamically generated code inside Codegen this technique, and through this technology is based on the following class SQL system performance graphs distributed framework can also close to commercial charge level of parallel database.Codegen this way, that is before the SQL execution compile specific execution of code, the data before the engine itself is a large and complex, the framework of any computing including 1 + 1, will enter the framework of this vast and do judgment to select the appropriate function to perform the 1 + 1, so the final cost as the front column, is very high.The benefits of using Codegen is as follows:
One is to simplify the conditional branch, because at the time of the generated code, had to learn the information of the runtime, by expanding the for loop (because we already know cycles) and parse the data types, so I can like figure 2 if/switch the branch instruction can optimize away such a statement, it is very simple and effective.
Figure 2 if/branch switch statementsSecondly, memory load, we can use code generation to replace the data load, so that greatly reduce memory read, increase the utilization rate of CPU Cache, it's very helpful for CPU performance.
Figure 3 code generation to replace virtual function calls
A third is inline virtual function calls, so when the type of object instances at run time, we can use the code generation is shown in figure 3 to replace the virtual function call, and do inline, such expressions can be directly without the function call.In addition, after the inline function that the compiler to do further optimization, such as subexpression elimination.
Its 4 it is to use the latest instruction set, at the time of Codegen, because the Codegen itself is in the execution of the nodes, so it is very convenient can perceive its underlying what support the version of the latest CPU instruction set, such as the SSE 4.2 or SSE4.1, so Codegen will completely according to the specific set of instructions to support to compile the specific execution code, make its can use as much as possible the latest instruction set.
Figure 4 is based on the TPCH - Q1 benchmarks
Test results based on the figure 4 (run TPCH - Q1, 2.7 GB of data quantity, single node, Codegen time 150 ms), through such Codegen way indeed can effectively improve the efficiency of data processing engine using CPU, the overall performance to achieve more than 3 times, and the Cache Miss rate and before than big ascension.Codegen technology, of course, like other technologies, it also has its own cost and shortage, that is the code generation operations in the Codegen itself is also need a certain amount of time, including the For loop unrolling and multi-level pass optimization, while the processing performance of LLVM itself is strong, but this operation is very time consuming, typically about 100-200 ms, but had a 10 second query dimensional to 6 seconds, I believe that everyone can understand.
Dynamic code generation based on LLVM
The industry has been more and more memory computing product adopts dynamic code generation based on LLVM, already has a tendency to become memory computing standard, in addition to the above mentioned Mr. Tenzing inside Google on a wide range of use, and
Cloudera Impala and SAP HANA is using Codegen technology, the most popular Spark additional memory areas, Codegen also plans in the next version of the implementation.
Cloudera Impala is implemented
Although the Cloudera Impala are mixed in the industry, also ready to give up now inside the Cloudera seemingly Impala, change direction to Spark the development, but the Impala in terms of performance is recognized by everybody, the main reason is the use of dynamic code generation based on LLVM.Why use Codegen?In addition to good
performance, another factor is the chief designer Impala itself is Google Mr. Tenzing designer.
So let's see how Cloudera Impala concrete realization way.Due to the execution logic code generated dynamically at run time machine code, the cost will be very high, so the Impala at compile time, will first put some specific executive function, first compiled into LLVM IR (Intermediate Representation) among the morphology, resulting in a large, IR file, including all related function of IR code block, after before the start of the specific SQL command execution, data node DataNode will first according to specific SQL statements, starting from the root node recursive initialization execution tree (AST), after began to LLVM can be carried according to tree file inside of the corresponding function to invoke the IR IR code block afterlife cost to executable code, then continue to generate good code optimization to further improve performance.
SAP HANA design ideas
Memory is at least ten years ago there was a wave of agitation, then the enterprise products, the main representative for OLTP transactions accelerating Timeten and Altibase, and in 2010 began to promote the memory computing products, most representative is the SAP HANA.
SAP HANA on the Processing logic, fully USES the Vector calculation (Vector Processing), as far as possible using the latest SSE4.1 and SSE4.2 instructions, as well as in multi-core NUMA scenarios reduce running cost, make its multithreaded performance than close to 1 as possible.In addition, in terms of data structure, in order to as far as possible use of CPU Cache (Cache), and as far as possible to reduce the memory access, so introduced a Cache Sensitive the CSB (Cache Sensitive B +) instead of the traditional B tree tree.
Promotion, in order to progress and improve the overall architecture and performance, HANA also use LLVM to support dynamic compilation, SQL queries or MDX (Multi - Dimensional Expressions, multi-dimensional expression) query, etc., within the HANA will have been translated a public presentation layer, called L language, and before execution, will use LLVM compiled into the binary code and direct execution, the advantage is to avoid the traditional complicated logic inside the database engine, thus improving performance, and such L form of the language, very helpful for HANA to add more features in the future, so HANA in a very short period of time to support a lot of analysis of SQL statements, also
supports a variety of data mining algorithm, also provide a lot of customized script, let users according to their own business needs to expand the function of HANA.
Apache design idea of the Spark
As you all know, now the Apache Spark can be said to be the most popular recently open source data items, even the EMC's Pivotal company specializes in large data also began to ready to abandon its since more than a decade GreenPlum technology research, turned into the Spark technology research and development.And from the point of the industry, the degree of Spark a fire, only the IaaS bound it can compare.So we then straight to its core mechanism.
The Spark has been to use code generation in the SQL and DataFrames expression evaluation (expression evaluation), at the same time they are ready to make the code generation can be applied to all of the built-in expression, and on this basis to support more new instruction set.And they not only from Codegen internal components to improve the CPU efficiency, also hope to push the code generation to a broader range of places, for example, data serialization, also is to Shuffle data redistribution in the process of the data from memory to binary format conversion to wire - protocol stream format of such processes, because the Shuffle usually performance bottlenecks by data series, and through code generation, can significantly improve the serialization throughput, which in turn effect to Shuffle network throughput.
Figure 5 shuffle performance analysis
Figure 5 compares the single thread on the performance of the 8 million line complex do shuffle, respectively using Kryo way before and the latest code generation, the speed is more than 2 times of the former on the latter.