High-performance server architecture
In this article I want to share some of the experience I have gathered over many years of server development. "Server" is only an approximation here: a more accurate definition is a program that handles a very large number of discrete messages or requests per second. Network servers (web servers, for example) are the most common fit, but not every program that fits this definition is a server in the strict sense. "High-Performance Request-Handling Programs" would make a very bad title, though, so for simplicity I will just say "server" from here on.
This article is not about mildly parallel applications, even though handling several tasks at once within a single program is now commonplace. Your browser, for example, probably does some things in parallel, but that level of parallelism poses few interesting challenges. The real challenge appears when the server's own architecture is what limits performance, so that improving the architecture improves the performance of the whole system. A browser running on a gigahertz CPU with a gigabyte of memory, doing several concurrent downloads over a DSL line, is not such a case. The focus here is not on applications that sip through a straw but on those that drink straight from the faucet: programs that run at the very limit of what the hardware can do, where how you do things really matters.
Some readers will inevitably question some of my views and suggestions, or believe they have a better way. That is fine. I am not trying to play God here; what follows is simply my own experience. These methods have worked for me, not only in improving server performance but also in making code easier to debug and extend later. Other systems may differ, and if something else works better for you, that is great. Be warned, though, that almost every suggestion in this article exists as an alternative to something else I once tried, with dismal results. Your own pet idea may well feature in one of those stories, and if you push me into telling them here, innocent readers will suffer. You would not want to hurt them, would you?
The rest of this article is organized around the four big killers of server performance:
1) Data copies
2) Context switches
3) Memory allocation
4) Lock contention
A final section covers some other important factors, but these four are the main ones. If your server can handle most requests without copying data, without a context switch, without allocating memory, and without contending for locks, I can assure you that it will perform very well.
Data Copies
This section will be fairly short, because most people have already learned the lesson about data copies. Almost everyone knows that copying data is bad. It seems obvious, but probably only because you learned it early in your career, and that happened only because people started spreading the word ten years ago. That is certainly true for me. Nowadays almost every college course and almost every how-to document mentions it. Even in marketing brochures, "zero copy" is a popular buzzword.
Obvious as the cost of data copies is, people still overlook them, because the code that copies data is often hidden or disguised. Do you know whether the libraries or drivers you call copy data? The answer is often more than you would think. Guess what "programmed I/O" on a PC refers to. A hash function is an example of a copy in disguise: it has all the memory-access cost of a copy, plus more computation. Once it is pointed out that hashing is effectively "copying plus", it seems obvious that it should be avoided, yet I know some very smart people who found that out the hard way. If you really want to eliminate data copies, whether because they hurt server performance or because you want to show off "zero copy" at a hacker conference, you will have to track down every place where a copy might occur and take nothing on trust.
One way to avoid data copies is to pass buffer descriptors (or chains of buffer descriptors) around instead of bare buffer pointers. Each buffer descriptor should consist of the following:
- A pointer to the whole buffer, and the length of the buffer.
- A pointer (or offset) to the actual data in the buffer, and the length of that data.
- Forward and backward pointers to other buffer descriptors, forming a doubly linked list.
- A reference count.
Now, instead of copying data to keep it in memory, code can simply increment the reference count on the corresponding descriptor. This works quite well under some conditions, including the way a typical network protocol stack operates, but in other cases it becomes a big headache. In general, it is easy to add buffers at the start or end of a chain, to add a reference to a whole buffer, and to free a whole chain at once. Adding buffers in the middle of a chain, freeing a chain piece by piece, or referring to part of a buffer is progressively harder. And trying to split or combine chains will drive you insane.
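A minimal sketch of such a descriptor, in Python, may make the fields concrete. The class and method names (BufferDescriptor, fill, data, ref, unref, append) are illustrative assumptions, not taken from any particular library, and the reference count is not thread-safe as written.

```python
class BufferDescriptor:
    """Hypothetical descriptor; the field names are illustrative only."""

    def __init__(self, capacity):
        self.buffer = bytearray(capacity)  # the whole buffer (length = len(buffer))
        self.offset = 0                    # where the valid data starts
        self.length = 0                    # how many bytes are valid
        self.prev = None                   # links to neighbouring descriptors
        self.next = None
        self.refcount = 1                  # the creator holds the first reference

    def fill(self, data, offset=0):
        """Store incoming bytes once, for example as they arrive from a socket."""
        if offset + len(data) > len(self.buffer):
            raise ValueError("data does not fit in the buffer")
        self.buffer[offset:offset + len(data)] = data
        self.offset, self.length = offset, len(data)

    def data(self):
        """Return a view of the valid bytes; no bytes are copied."""
        return memoryview(self.buffer)[self.offset:self.offset + self.length]

    def ref(self):
        """Keep the data alive for another user instead of copying it."""
        self.refcount += 1
        return self

    def unref(self):
        """Drop one reference; True means the buffer can be recycled."""
        self.refcount -= 1
        return self.refcount == 0


def append(tail, desc):
    """Link desc after tail; adding at the end of a chain is the easy case."""
    tail.next, desc.prev = desc, tail
    return desc


header = BufferDescriptor(16)
header.fill(b"HDR1")
payload = BufferDescriptor(64)
payload.fill(b"hello world", offset=8)
append(header, payload)

retransmit_queue = [payload.ref()]   # a second user shares the same bytes
```

The second user holds a reference rather than a copy, so the payload bytes exist exactly once until both users have called unref.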
I do not recommend this technique for everything, though, because having to walk a descriptor chain every time you want to look at one piece of data is even worse than copying. The best place to apply it is to the large data blocks in a program: allocate those separately, with descriptors as described above, so that they need not be copied and do not affect the rest of the server's work. (Copying a large block of data consumes CPU time and interferes with other concurrently running threads.)
A last point about data copies: do not go to extremes to avoid them. I have seen too much code that avoids a copy by doing something even worse, such as forcing a context switch or breaking up a large I/O request. Data copies are expensive, but avoiding them is subject to diminishing returns. Combing through the code to remove the last few copies, and doubling its complexity in the process, takes time that is usually better spent elsewhere.
Context Switches
While the cost of data copies is obvious to everyone, a lot of people ignore the effect of context switches on performance. In my experience, context switches are behind more complete meltdowns under high load than data copies are: the system starts spending more time switching between threads than it spends inside any thread doing useful work. The surprising thing is that, at one level, what causes excessive context switching is entirely commonplace. The first cause of context switches is having more active threads than CPUs. As the ratio of active threads to CPUs increases, the number of context switches increases too, linearly if you are lucky, but more often exponentially. This simple fact explains why multi-threaded designs that use one thread per connection scale so poorly. The only practical choice for a scalable system is to limit the number of active threads so that it is less than or equal to the number of CPUs. One popular variant of this scheme uses only a single active thread. That avoids context thrashing and the need for locks, but it can never deliver more than one CPU's worth of total throughput, so unless the program is not CPU-bound anyway (usually because it is network-I/O-bound), you should stay with the more practical scheme.
The first thing a thread-frugal program has to work out is how to make one thread manage multiple connections. This usually means a front end built on select/poll, asynchronous I/O, signals, or completion ports, with an event-driven framework behind it. There is a great deal of debate about which front-end API is best; Dan Kegel's C10K problem paper is a good resource in this area. Personally, I find select/poll and signals ugly, so I lean toward AIO or completion ports, but in fact it does not matter much. All of them, except maybe select(), work reasonably well. So do not spend too much energy on what happens at the very outermost layer of your program's front end.
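As a rough illustration of one thread serving several connections, here is a sketch that uses Python's standard selectors module (which wraps select/poll/epoll/kqueue, whichever the platform offers). The function name serve_once, the upper-casing "work", and the use of socket pairs in place of real network connections are assumptions made to keep the example self-contained.

```python
import selectors
import socket


def serve_once(connections):
    """Answer one request on every connection, using a single thread."""
    sel = selectors.DefaultSelector()
    for sock in connections:
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ)
    pending = len(connections)
    while pending:
        ready = sel.select(timeout=1.0)
        if not ready:
            break                           # nothing arrived in time; give up
        for key, _mask in ready:
            sock = key.fileobj
            request = sock.recv(4096)
            sock.sendall(request.upper())   # the "work": an upper-case echo
            sel.unregister(sock)
            pending -= 1
    sel.close()


# Three in-process "connections" stand in for real clients.
pairs = [socket.socketpair() for _ in range(3)]
for i, (client, _server) in enumerate(pairs):
    client.sendall(b"request %d" % i)

serve_once([server for _client, server in pairs])
replies = [client.recv(4096) for client, _server in pairs]

for client, server in pairs:
    client.close()
    server.close()
```

The thread only ever blocks in select, never on an individual connection, which is what lets the thread count stay independent of the connection count.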
The simplest conceptual model of a multi-threaded, event-driven server has a request queue at its center: requests are read by one or more listener threads and put on the queue, and one or more worker threads take them off the queue and process them. Conceptually this is a fine model, and a lot of people implement their code exactly this way. What is wrong with that? The second cause of context switches is handing a request over from one thread to another. Some people even hand the response back to the original thread to send, which is worse still, because every request then costs at least two context switches. It is very important to use a symmetric approach, in which a thread can go from being a listener to being a worker and back to being a listener without any context switch. Whether connections are partitioned among the threads, or all threads take turns acting as listener for the whole set of connections, matters much less.
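One possible shape for the "threads take turns as listener" variant is sketched below. It is an assumption-laden toy: an iterator stands in for accept() or read(), the handler is a simple function, and the names symmetric_thread and listen_lock are made up for the example. The point it tries to show is that the thread that receives a request also processes it, so no request changes hands.

```python
import threading


def symmetric_thread(listen_lock, source, handle, results):
    while True:
        with listen_lock:                  # this thread is now the listener
            request = next(source, None)   # stand-in for accept() or read()
        if request is None:
            return                         # no more requests
        results.append(handle(request))    # the same thread does the work


requests = iter(range(20))
listen_lock = threading.Lock()
results = []

threads = [
    threading.Thread(
        target=symmetric_thread,
        args=(listen_lock, requests, lambda r: r * r, results),
    )
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only the listening role is serialized by the lock; the processing happens outside it, in whichever thread picked the request up.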
It is usually impossible to know how many threads will be active in the server even an instant into the future. After all, a request can arrive on any connection at any moment, and "background" threads dedicated to special tasks may wake up at any time. So if you do not know how many threads are active, how can you limit the number of active threads? In my experience, one of the most effective methods is also the simplest: use an old-fashioned counting semaphore that every thread must hold while it is doing real work. If the semaphore has already reached its limit, a thread in listening mode may incur one extra context switch as it wakes up and then blocks on the semaphore (the listener is woken by an arriving request, finds the semaphore full, and goes straight back to sleep). Once all the listening-mode threads have blocked this way, they no longer compete for resources until one of the working threads releases the semaphore, so the effect of these context switches on the system is negligible. More importantly, this method handles threads that sleep most of the time, and so do not count toward the active thread count, more gracefully than the alternatives.
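The semaphore idea can be sketched in a few lines. ActiveLimiter, its limit of at most four slots, and the sleep used as "real work" are assumptions for illustration; the peak counter exists only so the effect can be observed. Note also that CPython's global interpreter lock already serializes Python bytecode, so this shows the control pattern rather than a real throughput gain.

```python
import os
import threading
import time


class ActiveLimiter:
    """Let at most `limit` threads do real work at the same time."""

    def __init__(self, limit):
        self.limit = limit
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()      # protects the two counters below
        self._active = 0
        self.peak = 0                      # highest number of concurrent workers seen

    def run(self, job):
        with self._slots:                  # blocks while `limit` threads are working
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return job()
            finally:
                with self._lock:
                    self._active -= 1


limiter = ActiveLimiter(min(os.cpu_count() or 1, 4))
threads = [
    threading.Thread(target=limiter.run, args=(lambda: time.sleep(0.01),))
    for _ in range(4 * limiter.limit)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("limit", limiter.limit, "peak", limiter.peak)
```

Threads that are merely waiting (for I/O, or for the semaphore itself) hold no slot, so only threads doing real work count against the limit.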
Once request handling has been split into two stages (listening and working), served by multiple threads, it is natural to split it further into more stages. In the simplest case, a request passes through the stages one after another in one direction, and then back in the other direction (for the response). But things can get more complicated: a stage may be a fork between two different execution paths, or it may generate a response by itself (returning a cached value, for example). Each stage therefore needs to specify what should happen to the request next. There are three possibilities, expressed as return values of the stage's dispatch function:
- The request needs to be passed on to another stage (return an ID or pointer for that stage).
- The request has been completed (return "request done").
- The request is blocked (return "request blocked"). This is like the previous case, except that the request's resources are not released; another thread will continue the request later.
Note that in this model, requests are queued within stages, not between stages: the hand-off from one stage to the next happens inside a single thread rather than between two threads. This avoids putting a request on the next stage's queue only to take it off again immediately and execute it, which is a lot of queue activity and locking for nothing.
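The three return values can be sketched as follows. The stage names (parse, lookup_cache, fetch), the DONE and BLOCKED markers, the CACHE contents, and the dictionary used as a request are all hypothetical; the driver loop runs successive stages in the calling thread, with no queue between them.

```python
DONE = "request done"
BLOCKED = "request blocked"

CACHE = {"a": "cached-a"}


def parse(request):
    request["key"] = request["raw"].strip().lower()
    return lookup_cache                      # pass on to another stage


def lookup_cache(request):
    if request["key"] in CACHE:
        request["reply"] = CACHE[request["key"]]
        return DONE                          # this stage answered by itself
    return fetch


def fetch(request):
    if not request.get("data_ready"):
        return BLOCKED                       # keep the request; resume later
    request["reply"] = "fetched:" + request["key"]
    return DONE


def run(request, stage):
    """Run stages back to back in the current thread.

    Returns (outcome, stage) so that a blocked request can later be
    resumed at the stage where it stopped, possibly by another thread.
    """
    while True:
        outcome = stage(request)
        if outcome in (DONE, BLOCKED):
            return outcome, stage
        stage = outcome


hit = {"raw": " A "}
print(run(hit, parse)[0], hit["reply"])

miss = {"raw": "b"}
outcome, resume_at = run(miss, parse)        # blocks in fetch
miss["data_ready"] = True                    # e.g. set when the I/O completes
print(run(miss, resume_at)[0], miss["reply"])
```

A real server would also need per-stage queues for blocked requests; they are left out here to keep the dispatch logic visible.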
If this way of dividing a complex task into smaller cooperating parts looks familiar, that is because it is very old. My approach has its roots in Communicating Sequential Processes (CSP), described by C.A.R. Hoare in 1978, which in turn builds on ideas from Per Brinch Hansen and Matthew Conway dating back to 1963, before I was born! However, when Hoare coined the term CSP, he meant "process" in the abstract mathematical sense; a CSP process has no necessary relation to the operating-system entity of the same name. In my opinion, the common approach of implementing CSP as thread-like cooperating routines inside a single operating-system thread gives you all the headaches of concurrency with none of the scalability.
A contemporary example is Matt Welsh's SEDA, which shows the staged-execution idea evolving in a more reasonable direction. SEDA is such a good example of "server architecture done right" that it is worth commenting on some of its characteristics:
1. SEDA's batching tends to emphasize processing multiple requests through one stage at once, whereas my approach tends to emphasize processing one request through multiple stages.
2. In my opinion, SEDA's one major flaw is that it gives each stage its own thread pool and reallocates threads between stages only in the "background", in response to load. As a result, causes 1 and 2 of context switching described above are still very much present.
3. For a pure research project, implementing SEDA in Java makes sense. In real-world applications, however, I think that choice is rarely, if ever, appropriate.
Memory Allocation
Allocating and freeing memory is one of the most common operations in an application, and many clever techniques have been invented to make general-purpose memory allocators more efficient. But no amount of cleverness can make up for the fact that, in many situations, a general-purpose allocator is very inefficient. So I have three suggestions for reducing trips to the system memory allocator.
Suggestion one is to preallocate. We all know that static allocation is bad design when it imposes artificial limits on what the program can do, but there are many other forms of preallocation that work well. The general idea is that one trip through the system allocator is better than several, even if some memory is wasted in the process. If you can establish that the program will never use more than a certain number of items at once, preallocating them at program startup is a reasonable choice. Even when you cannot, preallocating everything a request handler might need right at the beginning is better than allocating a little at a time as it is needed. Besides allowing several items to be allocated contiguously in one trip through the system allocator, this often greatly simplifies error-handling code. When memory is very tight, preallocation may not be a good choice, but in all but the most extreme environments it generally turns out to be a win.
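A small sketch of both forms of preallocation follows. RequestContext obtains everything one request handler might need in a single allocation and carves it into regions; ContextPool creates a fixed number of contexts at startup. The class names and the sizes are assumptions, and in Python the interpreter still allocates small objects behind the scenes, so this illustrates the structure more than the savings.

```python
class RequestContext:
    """Everything one request may need, obtained in a single allocation."""

    HEADER, BODY, REPLY = 64, 4096, 1024       # assumed worst-case sizes

    def __init__(self):
        block = memoryview(bytearray(self.HEADER + self.BODY + self.REPLY))
        self.header = block[:self.HEADER]       # views into the one block,
        self.body = block[self.HEADER:self.HEADER + self.BODY]   # not copies
        self.reply = block[self.HEADER + self.BODY:]


class ContextPool:
    """Contexts created once at startup; none are allocated per request."""

    def __init__(self, max_in_flight):
        self._free = [RequestContext() for _ in range(max_in_flight)]

    def acquire(self):
        # Returning None when the pool is empty makes the limit explicit;
        # the caller decides whether to queue or reject the request.
        return self._free.pop() if self._free else None

    def release(self, ctx):
        self._free.append(ctx)


pool = ContextPool(max_in_flight=2)
ctx = pool.acquire()
ctx.body[:5] = b"hello"
pool.release(ctx)
```

Because a handler either gets a complete context or nothing, there is only one place where running out of memory has to be handled.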
Suggestion two is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently freed objects on a list instead of really freeing them, so that when an object of that kind is needed again soon, it can be taken straight off the list instead of being allocated by the system. An additional benefit of lookaside lists is that complex object initialization and cleanup can often be skipped.
It is generally a bad idea to let a lookaside list grow without bound, never freeing anything even when the program is idle. Some kind of periodic "sweep" of inactive objects is therefore necessary, but it should not introduce complicated locking or contention. A good compromise is a lookaside list made of two separately locked lists: a "new" list and an "old" list. Allocation is done preferentially from the new list, and from the old list only as a fallback; objects are always freed onto the new list. The sweeper thread works as follows:
1. Lock both lists.
2. Save the head of the old list.
3. Make the previous new list into the old list by reassigning the list heads.
4. Unlock.
5. At leisure, free all the objects on the old list whose head was saved in step 2.
In a system like this, an object is really freed only when it is genuinely unused: freeing is delayed by at least one sweep interval (the interval at which the sweeper thread runs), but never by more than two. Most importantly, the sweeper does its freeing without holding any lock that ordinary threads contend for. In theory the same method can be generalized to more than two stages of lists, but I have not yet found that useful.
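The five sweeper steps translate almost directly into code. This is a sketch under assumptions: Python lists stand in for linked lists, "freeing" means dropping the last reference, and the names LookasideList, alloc, free, and sweep are invented for the example.

```python
import threading


class LookasideList:
    def __init__(self, factory):
        self._factory = factory            # the fallback "system allocator"
        self._new, self._old = [], []
        self._new_lock = threading.Lock()
        self._old_lock = threading.Lock()

    def alloc(self):
        with self._new_lock:               # prefer the new list
            if self._new:
                return self._new.pop()
        with self._old_lock:               # then fall back to the old list
            if self._old:
                return self._old.pop()
        return self._factory()             # last resort

    def free(self, obj):
        with self._new_lock:               # objects always go onto the new list
            self._new.append(obj)

    def sweep(self):
        with self._new_lock, self._old_lock:   # 1. lock both lists
            doomed = self._old                 # 2. save the old list
            self._old = self._new              # 3. the new list becomes the old list
            self._new = []
        # 4. both locks have been released here
        count = len(doomed)
        doomed.clear()                         # 5. free at leisure, lock-free
        return count


pool = LookasideList(object)
first = pool.alloc()
pool.free(first)
reused = pool.alloc()        # comes back from the new list
pool.free(reused)
pool.sweep()                 # the object moves from the new list to the old list
freed = pool.sweep()         # untouched for a full interval: really freed now
print("freed on second sweep:", freed)
```

Both locks are always taken in the same order (new, then old), so the sweeper cannot deadlock against itself or against alloc, which never holds the two at once.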
One concern with lookaside lists is that every object kept on one needs list pointers (a linked-list node), which may increase memory use. Even when that is the case, the benefits more than make up for the cost of the extra memory.
Suggestion three has to do with locking, which we have not discussed yet, but I will mention it here anyway. Even with lookaside lists, lock contention is often the biggest cost in memory allocation. The solution is to use thread-private lookaside lists, which rules out contention between threads. Going further, one list per processor is even better, but that only works in an environment where threads cannot be preempted. For the most demanding cases, private lookaside lists can even be combined with a shared list.
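A sketch of thread-private lists backed by a shared list is shown below. It relies on threading.local from the standard library; the class name PrivateLookaside and the private_limit parameter (how many objects a thread keeps for itself before spilling to the shared list) are assumptions for illustration.

```python
import threading


class PrivateLookaside:
    def __init__(self, factory, private_limit=32):
        self._factory = factory
        self._limit = private_limit
        self._local = threading.local()    # one private list per thread
        self._shared = []
        self._shared_lock = threading.Lock()

    def _mine(self):
        try:
            return self._local.items
        except AttributeError:             # first use in this thread
            self._local.items = []
            return self._local.items

    def alloc(self):
        mine = self._mine()
        if mine:
            return mine.pop()              # common case: no lock at all
        with self._shared_lock:
            if self._shared:
                return self._shared.pop()
        return self._factory()

    def free(self, obj):
        mine = self._mine()
        if len(mine) < self._limit:
            mine.append(obj)               # common case: no lock at all
        else:
            with self._shared_lock:        # overflow goes to the shared list
                self._shared.append(obj)


pool = PrivateLookaside(object, private_limit=1)
a, b = pool.alloc(), pool.alloc()
pool.free(a)                 # stays private to this thread
pool.free(b)                 # private list is full, so b is shared

seen_elsewhere = []
worker = threading.Thread(target=lambda: seen_elsewhere.append(pool.alloc()))
worker.start()
worker.join()
```

The other thread can only ever receive the shared object; the private one is invisible to it, which is why the common path needs no lock.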
Lock Contention
Efficient locking schemes are very hard to design, because of what I call Scylla and Charybdis (see appendix). On one side, locking that is too simple (coarse-grained) serializes work that could proceed in parallel, sacrificing concurrency and scalability. On the other side, locking that is too complex (fine-grained) erodes performance through the space taken by the locks and the time spent on lock operations. Lean toward coarse-grained locking and you run into deadlock; lean toward fine-grained locking and you run into race conditions. Between the two there is a narrow path that leads to both correctness and efficiency, but where is it?
Because locking tends to be deeply tied to program logic, it is often impossible to design a good locking scheme without changing how the program fundamentally works. That is why people hate locking, and why they make excuses for their non-scalable single-threaded designs.
Almost every locking design starts as "one big lock around everything", along with a hope that performance will not suffer. When that hope is dashed (which is almost inevitable), the big lock is split into several smaller ones and we pray again; then the whole process is repeated, with small locks split into still smaller ones, until performance reaches an acceptable level. Typically, each iteration increases complexity and locking overhead by 20% to 50% in return for a reduction in lock contention of only around 5%. With luck the end result is a modest gain in performance, but an actual loss is not unusual. The designer is left wondering: "I made the locks fine-grained, just as the textbooks said, so why is performance so bad?"
In my experience, this approach is fundamentally wrong. Picture the solution space as a mountain range, where peaks represent good solutions and valleys represent bad ones. The "one big lock" starting point is almost always separated from the high peaks by all sorts of valleys, saddles, lesser hills, and dead ends. It is a classic hill-climbing problem: starting from such a place, it would be easier to go back down the mountain and begin again. So what is the right way to the top?
The first thing to do is to form a map of the locking in your program. The map has two axes:
- The vertical axis represents code. If you use a staged architecture with non-branching stages (the request stages described earlier), you probably already have a diagram of these divisions, much like the familiar diagram of the seven-layer OSI network protocol stack.
- The horizontal axis represents data. At every stage, a request should be assigned to a data set that holds the data needed at that stage.
You now have a grid in which each cell represents a particular data set in a particular processing stage. The most important rule to follow is this: two requests should not contend unless they are in the same stage and need the same data set. If you can stick to that rule strictly, you have already won half the battle.
Once you have defined this grid, every type of lock in your system can be plotted on it. Your next goal is to make sure the plotted locks are spread as evenly as possible along both axes. This part of the work is application-specific. You have to work like a diamond cutter, using your knowledge of the program to find the natural cleavage lines between stages and between data sets. Sometimes they are easy to find; sometimes they are hard to find and only become obvious in retrospect. Dividing code into stages is a complicated matter of program design, and I have no good advice to offer there, but I do have some suggestions for defining data sets:
- If requests carry a sequence number, a hash, or a transaction ID, dividing the data according to that value is hard to beat.
- Sometimes it is better to assign requests to data sets dynamically, based on which data set makes the best use of available resources, rather than on some intrinsic property of the request. Think of the multiple integer units in a modern CPU, whose designers know a thing or two about spreading discrete requests.
- It is very useful to make the data-set assignment different at each stage, so that requests that contend at one stage do not contend at another.
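The grid of locks, with one lock per (stage, data set) cell, might be sketched as below. The stage names, the choice of eight data sets, and the per-stage digit-extraction formula are assumptions. The formula is one simple way to vary the assignment between stages: two requests whose IDs collide at one stage land in different data sets at the next unless their IDs differ by a multiple of a higher power of DATA_SETS, so it makes repeated contention unlikely rather than impossible.

```python
import threading

STAGES = ("parse", "lookup", "reply")      # vertical axis: code
DATA_SETS = 8                              # horizontal axis: data

# One lock per grid cell: GRID[stage][data_set]
GRID = [[threading.Lock() for _ in range(DATA_SETS)] for _ in STAGES]


def data_set(stage, request_id):
    """Pick a data set from a different base-8 digit of the ID at each stage."""
    return (request_id // DATA_SETS ** stage) % DATA_SETS


def lock_for(stage, request_id):
    return GRID[stage][data_set(stage, request_id)]


def process(request_id, work):
    """Run a request through all stages, holding only one cell's lock at a time."""
    for stage in range(len(STAGES)):
        with lock_for(stage, request_id):
            work(STAGES[stage], request_id)


trace = []
process(11, lambda name, rid: trace.append((name, data_set(STAGES.index(name), rid))))
print(trace)
```

Requests 3 and 11 share a cell in the first stage (both map to data set 3) but use different cells in the second, so they cannot contend there.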
If you have divided your locking space both vertically and horizontally, and made sure that locks are spread evenly over the grid, then congratulations: you have a good scheme. You are now at a good starting point for the climb. Metaphorically, you are on a gentle slope that leads to the summit, but you are not at the top yet. Now is the time to collect lock contention statistics and see what can be improved. Split stages and data sets in different ways and measure contention again, until you are satisfied with the division. When you have come that far, a boundless view will open up at your feet.
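Contention statistics can be collected with a thin wrapper around a lock, as in the following sketch. The class name CountingLock and its two counters are assumptions. A non-blocking acquire is tried first; if it fails, somebody else holds the lock, and the acquisition is counted as contended.

```python
import threading


class CountingLock:
    """A lock that records how often callers had to wait for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.contended = 0

    def __enter__(self):
        if not self._lock.acquire(blocking=False):   # already held by someone?
            self._lock.acquire()                     # then wait for it
            self.contended += 1                      # counted while holding the lock
        self.acquisitions += 1
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False


stats_lock = CountingLock()


def worker():
    for _ in range(1000):
        with stats_lock:
            pass


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("acquisitions:", stats_lock.acquisitions, "contended:", stats_lock.contended)
```

Giving each cell of the locking grid its own counting lock shows where contention concentrates, which is the information needed to decide what to split next.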
I have now covered the four main factors that affect performance. There are still some other important matters to address, most of which come down to knowing your platform and environment:
- How does your storage subsystem perform with large versus small requests? With sequential versus random access? How well do read-ahead and write-behind work?
- How efficient is the network protocol you use? Can you improve performance by changing parameters? Are there facilities such as TCP_CORK, MSG_PUSH, or toggling the Nagle algorithm that help you avoid tiny messages?
- Does your system support scatter/gather I/O (readv/writev, for example)? Using it can improve performance and also takes much of the trouble out of using buffer chains (see the section on data copies). (Note: a DMA transfer requires the source and destination physical addresses to be contiguous. On some systems, such as IA, memory that is contiguous in address is not necessarily contiguous physically, so a DMA transfer has to be split into several pieces. In block DMA mode, an interrupt is raised after each physically contiguous piece has been transferred, and the host then sets up the next piece. Scatter/gather mode is different: a linked list describes the physically discontiguous memory, and the address of the list head is passed to the DMA master. After transferring one contiguous piece, the DMA master does not raise an interrupt but moves on to the next piece in the list, and raises a single interrupt at the end. Scatter/gather mode is clearly more efficient than block DMA mode.)
- What is your system's page size? What is its cache line size? Is it worth aligning data on those boundaries? How expensive are system calls and context switches?
- Are your locking primitives prone to starvation? Does your event mechanism suffer from the "thundering herd" problem? Does your sleep/wakeup mechanism have this bad behavior: when X wakes Y, the system switches to Y immediately, even though X has not finished its work?
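As an illustration of the scatter/gather point above, Python exposes writev and readv through the os module on Unix-like systems (they are not available on Windows). The helper names send_reply and receive_into, and the use of a pipe instead of a socket, are assumptions made so that the sketch is self-contained.

```python
import os


def send_reply(fd, header, body):
    """Write header and body with one system call, without joining them first."""
    return os.writev(fd, [header, body])


def receive_into(fd, header_buf, body_buf):
    """Read straight into two separate, preallocated buffers."""
    return os.readv(fd, [header_buf, body_buf])


read_fd, write_fd = os.pipe()

sent = send_reply(write_fd, b"HDR:", memoryview(b"payload"))

header_buf, body_buf = bytearray(4), bytearray(7)
received = receive_into(read_fd, header_buf, body_buf)

os.close(read_fd)
os.close(write_fd)

print(sent, received, bytes(header_buf), bytes(body_buf))
```

With a chain of buffer descriptors, the list passed to writev would simply be the views of each descriptor in the chain, so the chain never has to be flattened into one buffer.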
I could think of many more questions like these, and I am sure you could too. In a particular situation it may not be worth doing anything about some of them, but it is still useful to consider their impact. If you cannot find the answers in the system documentation, go and find them out. Write a test program to get the answers; writing such tests is good practice in any case. If your code has to run on multiple platforms, abstract the relevant code into a per-platform library, so that you can take advantage of any platform that supports the features mentioned here.
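A test program of this kind can be very small. The sketch below times three operations: a plain function call, a cheap system call (os.getppid is assumed here to enter the kernel on each call), and a round trip that hands an item to another thread and back through two queues. The helper names and the iteration count are arbitrary, and the absolute numbers depend entirely on the platform and interpreter, so only the ratios are interesting.

```python
import os
import queue
import threading
import time


def seconds_per_call(fn, iterations=2000):
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def make_round_trip():
    to_worker, from_worker = queue.Queue(), queue.Queue()

    def worker():
        for item in iter(to_worker.get, None):   # None tells the worker to stop
            from_worker.put(item)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def round_trip():
        to_worker.put(1)       # hand the "request" to the other thread
        from_worker.get()      # and wait for it to come back

    def stop():
        to_worker.put(None)
        thread.join()

    return round_trip, stop


round_trip, stop = make_round_trip()
costs = {
    "function call": seconds_per_call(lambda: None),
    "system call (os.getppid)": seconds_per_call(os.getppid),
    "thread round trip": seconds_per_call(round_trip),
}
stop()

for name, cost in costs.items():
    print("%-26s %8.2f microseconds" % (name, cost * 1e6))
```

The thread round trip corresponds to the two hand-offs per request discussed in the section on context switches, which makes it a natural first thing to measure.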
The "know the answers" principle applies to your own code too: identify the important high-level operations in it and measure what they cost under different conditions. This is different from traditional performance analysis; it is about design, not about a specific implementation. Low-level optimization is always the last resort of a poor design.