Travis CI: The Smallest Distributed System (Part 2)
About a year ago, we found that parts of our architecture no longer made sense. The Hub in particular had taken on far too many responsibilities. It received new processing requests, processed and forwarded build logs, synchronized user information with GitHub, and notified users whether their builds had succeeded. It talked to a large number of external APIs, all from within a single process.
The Hub needed to evolve, but it could not be scaled out. It could only run as a single process, which made it our most likely single point of failure.
The GitHub API is an interesting example. We are heavy users of the GitHub API and rely on it to run our builds. Whether we are fetching build configuration, updating the build status, or synchronizing user data, we cannot do without it.
Historically, when one of these API calls failed, the Hub would drop the task it was working on and move on to the next one. So whenever the GitHub API was unavailable, a lot of our builds failed.
We put a lot of trust in this API, and we still do, but in the end it is a resource we cannot control. It is not ours to maintain: it is run by another team, on another network, and it has its own weaknesses.
We did not see it that way in the past. We treated it as a trustworthy friend who would always answer our requests.
We were wrong.
A year ago, a feature of the API was quietly changed. The feature was undocumented, but we depended on it heavily. It was switched off without notice, and that caused problems on our side.
As a result, our system fell into complete disarray. The reason was simple: we treated the API as a friend and waited patiently for it to answer our requests. For every new commit we waited a long time, several minutes each.
Our timeouts were far too generous. On top of that, when a request to the GitHub API finally timed out, our system failed with an error. We spent a long time that night dealing with the problem.
Even small problems, when they come together at the wrong moment, can bring down a system.
We began to isolate the API requests and set much shorter timeouts. To make sure we did not fail builds because of temporary GitHub outages, we also added a retry mechanism. To cope better with longer outages, each retry waits longer than the previous one before trying again.
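The combination of short timeouts, retries, and growing waits between retries can be sketched roughly as follows. This is a minimal illustration, not the code Travis CI runs: the function name call_with_backoff, its parameters, the default values, and the flaky_request stand-in are all assumptions made for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[float], T],
    attempts: int = 4,
    timeout: float = 2.0,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(timeout), retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return request(timeout)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: the caller decides what happens next
            # Wait 0.5s, 1s, 2s, ... so a struggling API gets room to recover.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Hypothetical stand-in for an HTTP call: times out twice, then succeeds.
calls: List[float] = []


def flaky_request(timeout: float) -> str:
    calls.append(timeout)
    if len(calls) < 3:
        raise TimeoutError("no response in time")
    return "build config"


delays: List[float] = []
# Record the waits instead of sleeping, so the example runs instantly.
result = call_with_backoff(flaky_request, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the sketch testable. When the last attempt fails, the exception is re-raised so the caller can decide whether to requeue the build or drop it.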
You have to accept that external APIs outside your control can fail. If you cannot isolate yourself from those failures completely, you need to decide how you will handle them.
How to handle each failure scenario is a business decision. Can we afford to lose a single build? Sure, it is not the end of the world. Can we afford to lose hundreds of builds because of an external system? We cannot, because whatever the cause, failed builds matter to our customers.
Travis CI started out as a well-meaning fellow. It was always optimistic that everything would work correctly.
Unfortunately, that is not the case. Anything can descend into chaos at any time, and our code never accounted for that. We have put in a lot of work, and are still doing so, to change this and to improve how our code handles failures both in our own system and in external APIs.
Back to the Hub: because the tasks it had taken on made it prone to failure, we split it into several smaller applications. Each application has its own purpose and its own responsibilities.
Separating the responsibilities this way also made the system much easier to scale. Most of the tasks simply run straight through from start to finish.
We now had three processes: one to handle new commits, one to send notifications, and one to process build logs.
All of a sudden, we have a new problem.
Although our applications had been separated, they all depended on a shared core called travis-core. It contains most of the business logic for every part of Travis CI. It is a real big ball of mud.
Depending on the core means that a change to the core code can affect every application. Our applications were split by responsibility, but our code was not.
We are still paying for our early architectural decisions. When adding features or modifying code, even a small change to the shared part can cause problems.
To make sure every application still works after a change to travis-core, we have to deploy all of them to verify it.
Separating responsibilities does not only mean separating them at the code level. The responsibilities themselves also need to be physically separated.
Complex dependencies affect deployment, and with it your ability to ship new code and new features.
We are slowly moving toward smaller dependencies, so that each application's responsibilities are truly isolated in code. Fortunately, the code itself is already well isolated, which makes the process much easier.
One application deserves special attention, because it was our biggest scaling challenge.
Log processing has two responsibilities: when a chunk of build log arrives through the message queue, it updates the corresponding row in the database and then pushes the chunk to Pusher for real-time updates in the user interface.
Log chunks stream in from many different processes at the same time and were then handled by a single process, which processed up to 100 messages per second.
Under normal conditions this worked fine, but it made sudden bursts of log messages hard to handle, and this single process became a major obstacle to scaling our system.
The problem was that the process handled messages in the order in which they were put on the message queue, and everything in Travis CI relied on that order.
Updating a log in the database meant updating a single row that contained the entire log. Updating the log in the user interface simply meant appending a new node to the DOM tree.
To solve this problem, we had to change a lot of code.
But first we had to work out what a better solution would look like, one that would let us scale out log processing easily.
We decided to make the order a property of the message itself, rather than relying implicitly on the order in which messages were put on the message queue.
The idea was inspired by Leslie Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System".
In this paper, Lamport describes how an incrementing counter can be used to preserve the order of events in a distributed system. When a message is sent, the sender increments the counter before the message is passed on to the receiver.
We could simplify the idea, because in our scenario the chunks of a log come from a single sender. That process only has to increment the counter, which makes it easy to assemble the log afterwards.
All that remains is to sort the log chunks by their counter values.
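As a rough sketch of this simplified scheme, a single sender can stamp each chunk with the next counter value, and the receiver can sort by that value to rebuild the log. The names LogChunk, ChunkSender, and assemble, as well as the field names, are invented for this illustration and are not taken from the Travis CI code base.

```python
import random
from dataclasses import dataclass
from itertools import count
from typing import Iterable


@dataclass(frozen=True)
class LogChunk:
    job_id: int
    number: int  # position assigned by the sender, not by the queue
    content: str


class ChunkSender:
    """The single sender for one job; stamps chunks with an incrementing counter."""

    def __init__(self, job_id: int) -> None:
        self.job_id = job_id
        self._counter = count(1)

    def emit(self, content: str) -> LogChunk:
        return LogChunk(self.job_id, next(self._counter), content)


def assemble(chunks: Iterable[LogChunk]) -> str:
    """Rebuild the log text regardless of the order in which chunks arrived."""
    return "".join(c.content for c in sorted(chunks, key=lambda c: c.number))


sender = ChunkSender(job_id=42)
chunks = [sender.emit(text) for text in ("$ bundle install\n", "$ rake\n", "Done.\n")]
random.shuffle(chunks)  # simulate out-of-order delivery
log = assemble(chunks)
```

Because only one process emits chunks for a given job, a plain local counter is enough. The full Lamport clock, in which receivers also advance their own counters, is not needed here.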
The hard part was that this design meant writing small log chunks to the database and only aggregating them into the full log once the corresponding job had finished.
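One way to picture this storage pattern is with two tables: one that receives small chunks as cheap inserts, and one that holds the aggregated log. The sketch below uses an in-memory SQLite database purely for illustration. The table names log_parts and logs, the column names, and the helper functions are assumptions, not the actual Travis CI schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log_parts (job_id INTEGER, number INTEGER, content TEXT);
    CREATE TABLE logs (job_id INTEGER PRIMARY KEY, content TEXT);
""")


def write_part(job_id: int, number: int, content: str) -> None:
    # Each chunk is a small insert; no large row is rewritten per message.
    conn.execute(
        "INSERT INTO log_parts VALUES (?, ?, ?)", (job_id, number, content)
    )


def aggregate(job_id: int) -> str:
    # Run once after the job has finished: join the parts in counter order.
    rows = conn.execute(
        "SELECT content FROM log_parts WHERE job_id = ? ORDER BY number",
        (job_id,),
    ).fetchall()
    full_log = "".join(content for (content,) in rows)
    conn.execute("INSERT OR REPLACE INTO logs VALUES (?, ?)", (job_id, full_log))
    conn.execute("DELETE FROM log_parts WHERE job_id = ?", (job_id,))
    conn.commit()
    return full_log


# Chunks may be written in any order; the counter restores the sequence.
write_part(7, 2, "world\n")
write_part(7, 1, "hello ")
full_log = aggregate(7)
remaining = conn.execute("SELECT COUNT(*) FROM log_parts").fetchone()[0]
```

Writers never contend for one ever-growing row, and the ordering work is deferred to a single aggregation step at the end of the job.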
It also directly affected the user interface, which now had to deal with messages arriving out of order. The change was far-reaching, but in return it simplified many other parts of the code.
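On the receiving side, handling out-of-order messages can be as simple as inserting each chunk at its sorted position instead of always appending. The following sketch models that idea in Python rather than in browser code. The LogView class is hypothetical, and skipping duplicate deliveries is an extra assumption not described in the text.

```python
import bisect
from typing import List


class LogView:
    """Keeps displayed chunks sorted by counter, whatever order they arrive in."""

    def __init__(self) -> None:
        self._numbers: List[int] = []
        self._parts: List[str] = []

    def receive(self, number: int, content: str) -> None:
        # Find the position that keeps the counters in ascending order.
        index = bisect.bisect_left(self._numbers, number)
        if index < len(self._numbers) and self._numbers[index] == number:
            return  # assumed behaviour: ignore a chunk that is already shown
        self._numbers.insert(index, number)
        self._parts.insert(index, content)

    def text(self) -> str:
        return "".join(self._parts)


view = LogView()
for number, content in [(3, "c"), (1, "a"), (2, "b"), (3, "c")]:
    view.receive(number, content)
rendered = view.text()
```

In a browser, the same lookup would decide which existing DOM node the new chunk is inserted before.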
On the surface, this may not seem like a big deal. But relying on an ordering that you do not need to rely on brings more potential complexity.
We no longer depend on how messages are transmitted, because we can restore their order at any time.
We had to modify the code because it assumed that messages always arrive in order, and that assumption was simply wrong. In a distributed system, events can arrive in any order at any time. We only need to make sure that we can put the pieces back together.
You can find a more detailed account of this problem on our blog.
By 2013, we were running 45,000 builds a day. We are still paying for early design decisions, but we are slowly improving the design.
We still have problems. All system components share the same database, so if the database has a problem, every component has a problem. We ran into exactly that failure last week.
It also means that the volume of log writes (currently up to 300 per second) affects the performance of our API, so the user interface can feel somewhat slower when users browse it.
In addition, given the growth in the number of builds, our next challenge is to scale out our data storage.
Travis CI now runs on 500 build servers, so it can no longer be called a small distributed system. The problems we are tackling are still on a relatively small scale, but even at that scale you run into plenty of interesting challenges. In our experience, a simple and direct solution is always better than a more complex one.