The architecture of Algolia's distributed search network
Algolia is a company that started out building an offline mobile search engine and, in two years, built a worldwide distributed search network. Today it answers user queries from 12 regions, serving 2 billion queries a month with an average server response time of 6.7 ms; 90% of queries are answered in under 15 ms. Its unavailability rate is below 10^-6, which amounts to less than 3 seconds of downtime per month.
This article summarizes the lessons Algolia learned while building and scaling its REST API: how it keeps data highly available and consistent across locations around the world, and how it uses Anycast DNS to route each query to the server closest to the user. The distinctive parts of the architecture are analyzed in detail.
What follows is a translation of the original article.
Algolia was founded in 2012; its initial business was providing an offline search engine SDK for mobile devices. Julien says that when the company was created, they never imagined they would one day build a distributed search network used worldwide.
Today, Algolia serves more than 2 billion queries a month from users all over the world, across 12 regions. At that scale, Algolia still keeps server response time at 6.7 ms and returns results to users within 15 ms. Its unavailability rate is below 10^-6, that is, downtime is kept under 3 seconds per month.
Because of the mobile context, the technical constraints faced by the offline mobile SDK were amplified. And since some traditional server-side design ideas could not be applied, Algolia had to develop unique strategies to meet these challenges.
The misconception about data volume
Before designing an architecture, you must pin down the usage scenarios of your business. This matters especially when planning for growth: you have to know whether the data your users need indexed is measured in gigabytes, terabytes, or petabytes. The architecture will be completely different depending on which of these use cases it must support.
When search is mentioned, what comes to mind first are the very large cases, such as Google's web index or Facebook's index of trillions of posts. But pause and think about the search boxes you see every day, and you will find that most of them are not backed by huge data sets. For example, Netflix's search covers about 10,000 titles, and Amazon's database contains around 200 million products. In cases like these, a single server could hold all the data! That is not to say that keeping the data on one host is a good idea, but it must be weighed against the fact that synchronizing data across hosts adds a great deal of complexity and performance cost.
The road to high availability
When building a SaaS API, high availability is a major area of focus, because removing every single point of failure (SPOF) is genuinely difficult. After weeks of brainstorming, we finally designed the architecture best suited to us: an architecture oriented toward search.
Master-slave vs. master-master
Temporarily narrow the use case to "the index is stored on a single host"; then building for availability reduces to "place copies of that server in different data centers." With that framing, the first solution that comes to mind is a master-slave architecture: the master receives all indexing operations, which are then replicated to one or more slaves. This way, we can easily load-balance search queries across all the servers.
Then the problem appears: this design only guarantees high availability for search queries. For a service company, funneling all indexing operations to a single master hides a very big risk: as soon as the master goes down, every client gets indexing errors.
So we had to implement a master-master architecture, and the key element of a master-master design is how to keep a group of servers synchronized. We need the system to remain consistent under all circumstances, even when there is a network partition between hosts.
Introducing distributed consistency
For a search engine, achieving distributed consistency requires serializing write operations into a single ordered stream of operations. If several operations arrive at the same time, the system must assign each one a sequence ID (serialization). With these IDs, the system can guarantee that every replica applies the operations in exactly the same order.
To obtain a sequence ID (incremented by one with every incoming job), the hosts must share the global state of the latest sequence ID. The open-source software ZooKeeper is a common choice for this, and we too started with ZooKeeper, following these steps:
Step 1: When a host receives a job, it copies the job to all replicas under a temporary name.
Step 2: That host acquires a distributed lock.
Step 3: It reads the latest sequence ID from ZooKeeper and sends a command to all hosts to commit the temporary file as operation "sequence ID + 1". This step is equivalent to a two-phase commit.
Step 4: If a majority (a quorum) of the hosts acknowledge, it sets "sequence ID + 1" in ZooKeeper.
Step 5: The distributed lock is released.
Step 6: Finally, the result is sent back to the client that submitted the job. In most cases, this produces the intended result.
Unfortunately, this sequencing cannot be used in production: if the host holding the distributed lock crashes or restarts between steps 3 and 4, we can end up in a situation where the job is committed on only some of the hosts. So we needed a more sophisticated sequencing solution.
Moreover, packaging ZooKeeper as an external service reached over a TCP connection effectively raises the barrier to using it, and it also requires a very large timeout (the default setting is 4 seconds). As a result, whenever a failure occurred, whether hardware or software, the system would be frozen for the entire timeout period. That might seem acceptable, but in Algolia's scenario we needed to test production failures at high frequency (similar to Netflix's Monkey testing approach).
The Raft consensus algorithm
Fortunately, just as we were running into these problems, the Raft consensus algorithm was published. It was obvious that this algorithm fit our use case perfectly: Raft's state machine is our index, and its log is the list of indexing jobs to execute. I already knew the PAXOS protocol and its variants, but did not understand them deeply enough to be confident implementing one myself, whereas Raft is much clearer. Although there were no stable open-source implementations of Raft at the time, it clearly matched our needs perfectly, and I was confident enough to base our architecture on it.
When implementing a consensus algorithm, the hardest part is making sure the system has no bugs. To ensure this, I chose the monkey approach to testing: randomly killing processes using sleeps before restarting them. To test it even further, I simulated network partitions and degradations via the firewall. This kind of testing helped us find many bugs, and after the system had run for days without trouble, I was very confident in the implementation.
Replicate at the application or file-system level?
Instead of replicating the final result on the file system, we decided to apply each write job on all hosts locally. We made this choice for two reasons:
- It is faster. Indexing runs in parallel on all hosts, which is obviously faster than replicating the (potentially very large) binary result.
- It is compatible with a multi-region strategy. If we replicated after indexing, we would need a process to rewrite the indexes. That means potentially moving very large amounts of data, and shipping massive data between geographic locations around the globe (out of London, for instance) is clearly not efficient.
Each host receives all write jobs in the correct order and processes them immediately and independently. In other words, all the machines eventually end up in the same state, but not necessarily at the same moment.
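The convergence property just described can be illustrated with a small sketch: if every host replays the same ordered log of write jobs against its own local index, they all reach the same final state, no matter when each one runs. The job format and field names here are invented for illustration.

```python
def apply_log(ops):
    """Replay an ordered stream of write jobs against a fresh local index.
    Because every host receives the jobs in the same sequence order,
    replaying them independently converges to the same state."""
    index = {}
    for op in ops:
        if op["type"] == "add":
            index[op["id"]] = op["doc"]
        elif op["type"] == "delete":
            index.pop(op["id"], None)
    return index

log = [
    {"type": "add", "id": 1, "doc": "iphone case"},
    {"type": "add", "id": 2, "doc": "usb cable"},
    {"type": "delete", "id": 1},
]
# Three hosts replay the same log, each at its own pace...
states = [apply_log(log) for _ in range(3)]
print(states[0])                             # {2: 'usb cable'}
print(states[0] == states[1] == states[2])   # True: identical final state
```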
A compromise on consistency
In a distributed computing environment, the CAP theorem states that a distributed system cannot simultaneously provide all three of the following guarantees:
- Consistency: all nodes see the same data at the same time.
- Availability: every request is guaranteed to receive a response indicating its success or failure.
- Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.
Here, we made our concession on consistency. We do not guarantee that all nodes hold identical data at the same moment, but they are guaranteed to converge in the end. In other words, we tolerate small scenarios where nodes are out of sync. In practice this is not a problem, because when a user performs a write we apply the operation on all hosts, and the gap between the first host to apply an update and the last is under one second, so end users normally cannot perceive it at all. The only possible inconsistency is whether the latest update has been applied yet, and that does not conflict with our use cases.
The overall architecture
Defining a cluster
Distributed consistency between hosts is a prerequisite for building a highly available infrastructure, but unfortunately it is also one of the system's performance bottlenecks. Guaranteeing consistency requires several round trips between hosts, so the number of consistency confirmations achievable per second is tied to the latency between them. In other words, the hosts must be as close together as possible to achieve a high rate of consistency confirmations per second. Therefore, to support several distinct geographic locations without degrading write performance, we needed to run multiple clusters, each with three hosts acting as replicas.
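The latency argument above can be made concrete with a back-of-the-envelope calculation. The round-trip counts and latencies below are illustrative numbers of ours, not Algolia's measurements:

```python
def max_consensus_rounds_per_sec(round_trips, rtt_ms):
    """Upper bound on consistency confirmations per second: each round of
    consensus costs a few host-to-host round trips, so throughput is
    inversely proportional to inter-host latency."""
    return 1000.0 / (round_trips * rtt_ms)

# Hosts in the same data center (~1 ms apart) vs. across an ocean (~80 ms):
print(max_consensus_rounds_per_sec(2, 1))    # 500.0 rounds/sec
print(max_consensus_rounds_per_sec(2, 80))   # 6.25 rounds/sec
```

This is why spreading one consensus group across continents would cripple write throughput, and why each cluster keeps its three hosts close together.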
With at least one cluster per region we get consistency, but that is clearly not enough:
- We cannot cram all user requests onto the same hosts.
- The more users there are, the fewer writes each user can perform per second, because the number of consensus operations per second is fixed.
To solve this problem, we decided to apply the same concept at the region level: each region holds multiple clusters of three hosts. Each cluster can handle one or more customers, the number depending on the customers' data volume. The concept is similar to virtualization on a physical machine: we can place multiple customers on the same cluster, except when one of them grows dynamically or changes its usage. To make this possible, we needed to develop or automate the following operations:
- Migrate a customer to another cluster when a cluster holds too much data or receives too many writes.
- Add a new host to a cluster if the query volume is too large.
- Change the number of partitions, or split the customer across several clusters, if a customer's data volume is too big.
With these strategies in place, a customer is not permanently assigned to one cluster; the allocation depends on the customer's own usage as well as the cluster's load. So we needed a scheme for assigning each customer to a specific cluster.
Assigning a customer to a cluster
The usual way of assigning customers to clusters is to configure one DNS entry per customer, similar to how Amazon CloudFront works: each customer gets a unique DNS entry of the form customerID.cloudfront.net, which then routes to a different set of hosts depending on the customer.
We decided to use this approach as well. Each customer is assigned a unique application ID, with a matching DNS record of the form APPID.algolia.io. The DNS record points to a specific cluster; since all hosts of the cluster are part of the DNS record, we get load balancing via DNS. We also use a health-check mechanism to detect host failures and remove failed machines from DNS resolution.
The health-check mechanism alone cannot provide a good SLA, even with a low TTL on the DNS records (the TTL is how long the customer is allowed to cache the DNS answer). The main problem when a host fails is that users may still have it cached: until the cache expires, they will keep sending queries to that host. And in many cases, the system does not even honor the TTL: in practice, we have seen DNS servers turn a 1-minute TTL into a 30-minute one.
To further improve availability and shield users from host failures, we generate another set of DNS records for each customer: APPID-1.algolia.io, APPID-2.algolia.io and APPID-3.algolia.io. The purpose is that when a TCP connection times out, the API client can retry against another DNS record. Our implementation shuffles these DNS records and then tries them in order.
Compared with using a dedicated load balancer, carefully controlled retry and timeout logic inside the API clients gives the system a more robust and cheaper allocation mechanism.
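The client-side retry logic just described can be sketched as follows. The `send` callable is a stand-in for the HTTPS request, and the hostnames simply mirror the APPID / APPID-1/2/3 record scheme from the article (shown here on algolia.net); everything else is illustrative.

```python
import random

def query_with_retry(app_id, send, rng=None):
    """Try the primary DNS record first, then shuffled fallback records,
    moving on whenever a connection times out."""
    rng = rng or random.Random()
    fallbacks = [f"{app_id}-{i}.algolia.net" for i in (1, 2, 3)]
    rng.shuffle(fallbacks)                    # spread retries across hosts
    for host in [f"{app_id}.algolia.net"] + fallbacks:
        try:
            return send(host)
        except TimeoutError:
            continue                          # TCP timeout: try the next record
    raise RuntimeError("all DNS records exhausted")

# Example: the primary record times out, so a fallback record answers.
def fake_send(host):
    if host == "myapp.algolia.net":
        raise TimeoutError(host)
    return f"200 OK from {host}"

print(query_with_retry("myapp", fake_send))   # answered by one of myapp-1/2/3
```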
Later, we discovered that the popular .io TLD did not perform well: compared with .io, .net has more DNS servers on its anycast network. To fix the DNS slowness caused by the many timeouts on .io, we switched to the algolia.net domain, while keeping backward compatibility with the algolia.io domain.
Scalability of a cluster?
Because clusters are isolated from one another, adding clusters lets us serve more customers without any potential impact on existing ones. But the scalability of a single cluster still had to be considered.
Because of the consistency guarantee on writes, the number of write operations per second is the primary factor limiting cluster scalability. To remove this limit, we added many methods to the API that compress a set of operations into a single write, while still providing the normal consistency confirmation. But a problem remained: some customers still did not batch their write operations, degrading indexing speed for the other users of the cluster.
To reduce the performance degradation in this case, we made the following two changes to the architecture:
- We added a batching strategy: when jobs contend for the consistency confirmation, we automatically aggregate each customer's write operations into one, without weakening the consistency guarantee. In practice, this means reordering the job sequence, but without affecting the semantics of the operations. For example, if 1,000 jobs are contending for the consensus slot and 990 of them come from the same customer, we merge those 990 writes into one, even if jobs from other users are interleaved among them in the sequence.
- We added a consensus scheduler keyed on application ID, to control the number of write operations per second that each customer can push through the consistency confirmation, so that no single customer can monopolize all the bandwidth.
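The per-customer batching strategy above can be sketched as a simple grouping step. This is a minimal illustration of the merging idea only; the queue format and names are invented, and the real system performs this inside the consensus pipeline:

```python
from collections import defaultdict

def batch_jobs(queue):
    """Merge jobs contending for the same consensus round by application ID,
    so that e.g. 990 writes from one customer cost a single consensus
    confirmation instead of 990. Reorders jobs without changing their
    per-customer semantics."""
    batches = defaultdict(list)
    order = []                                # first-seen order of app IDs
    for app_id, op in queue:
        if app_id not in batches:
            order.append(app_id)
        batches[app_id].append(op)
    # One consensus entry per application ID.
    return [(app_id, batches[app_id]) for app_id in order]

queue = [("app1", "w1"), ("app2", "w2"), ("app1", "w3"), ("app1", "w4")]
print(batch_jobs(queue))
# [('app1', ['w1', 'w3', 'w4']), ('app2', ['w2'])]
```

Four contending jobs become two consensus entries; within each customer, the original write order is preserved.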
Before we implemented these improvements, we enforced a rate limit by returning a 429 HTTP status code. But it quickly became clear that this badly hurt the user experience, forcing customers to wait for the response and then retry. Today, our biggest customer performs more than a billion write operations a day on a 3-host cluster: 11,500 per second on average, with peaks of up to 150,000 per second.
The second problem was choosing the most appropriate hardware setup, so that potential bottlenecks such as CPU or I/O would not limit cluster scalability. From the beginning, we chose to run our own physical servers, which gives us full control over the performance of the service and avoids wasting resources. Even so, we repeatedly hit walls in the process of selecting the right hardware.
At the end of 2012, we started with a low-end configuration: Intel Xeon E3 1245v2, 2 x Intel SSD 320 series 120 GB in RAID-0, and 32 GB of RAM. This configuration was very reasonably priced, more powerful than cloud platforms, and allowed us to offer the service in Europe and US-East.
This setup let us tune the kernel and virtual memory for our I/O patterns, which is essential for optimal use of the hardware. Even so, we soon discovered that the service was limited by memory and I/O: at the time, indexing used about 10 GB of memory, leaving only about 20 GB to cache the files used by search queries. Given that our service targets millisecond response times, customer indexes must stay in memory, and 20 GB was too small.
After this first configuration, we tried hosts with different hardware: single and dual CPUs, 128 GB and 256 GB of RAM, and different sizes and kinds of SSDs.
After several attempts, we finally found the best setup: Intel Xeon E5 1650v2, 128 GB of RAM and 2 x 400 GB Intel S3700 SSDs. The SSD model matters a great deal for durability: we wore out many SSDs over the years before finding the right model.
In the end, the architecture we built allows us to expand into any region, with a single condition: resources must be available there at any time. It may seem strange that, even in 2015, we are still considering running physical servers, but if you focus on quality of service and price, it is all worth it. Compared with using AWS, it lets us host a search engine replicated in three different locations, entirely in memory, for better performance.
Keeping the number of processes under control
Each host runs only three processes. The first is an nginx server with all of our query-interpretation code embedded as a module. To answer a query, we memory-map the index files and execute the query directly inside the nginx worker process, avoiding any communication with another process or host. The only rare exception is a customer whose data cannot fit on a single host.
The second process is a redis key-value store, which we use for rate limiting and to store real-time logs and counters for each application ID. These counters power our real-time dashboard, which users see when they connect to their account: it is very helpful for visualizing their latest API calls and for debugging.
The last process is the builder. It is responsible for handling all write operations. When the nginx process receives a write operation, it forwards the operation to the builder to perform the consensus. The builder is also in charge of indexing, and contains a large amount of monitoring code that checks the service for errors such as crashes, slow indexing, indexing errors, and so on. Depending on the severity of the problem, some are reported by SMS via the Twilio API, while others go directly to PagerDuty. Whenever an error is found in production that did not get a corresponding report, we record it and afterwards add handling for that type of error.
Easy to deploy
A simple stack is very easy to deploy. Before any code is deployed, we run a large number of unit tests and non-regression tests. Once they all pass, we gradually deploy to the clusters.
Our testing should have zero impact on the production environment and be transparent to users. At the same time, we also want to generate a host failure in the consensus and check that everything behaves as expected. To achieve both goals, we deploy to each host of a cluster independently, following these steps:
1. Fetch the new nginx and builder binaries.
2. Restart the nginx web server and release the new nginx without losing a single user query.
3. Kill the builder and launch the new one. This triggers a Raft failure on that host's deployment, letting us make sure the failover works as expected.
Throughout the evolution of the architecture, reducing the complexity of system administration has been a constant goal. We do not want deployment to constrain the architecture.
Achieving global coverage
Services are becoming more and more global, and serving queries for all areas of the world from a single region is plainly unrealistic. For example, a service hosted only on US-East hosts will certainly affect usability for users elsewhere: latency for US-East users may be just a few milliseconds, while Asian customers may see several hundred milliseconds, and that is before counting the bandwidth limits caused by saturated transoceanic fiber.
We have seen many companies put a CDN in front of their search engine to tackle this problem. For us, this would cause more problems than benefits: it improves performance only for the small set of frequently repeated queries, while bringing on the nightmare of cache invalidation. The most practical approach for us was to replicate in different locations and load the replicas into memory to speed up queries.
What we needed here is inter-region replication on top of our existing clusters. A replica can be stored on a single host, because it only serves search queries; all write operations still go to the customer's original cluster.
Each customer can select the set of data centers hosting its replicas, so a replica machine in a region can receive data from multiple clusters, and a cluster can send data to multiple replicas.
Like the consistency mechanism, this mechanism is based on the stream of operations. After the consensus confirmation, each cluster transforms its own stream of write operations into a version for each replica, in which every job irrelevant to that replica is replaced by an empty placeholder job. The stream of operations is then sent to all replicas as a batch, to keep latency as low as possible: sending jobs one by one would cause too many round trips of acknowledgement with the replica machines.
On the cluster, write operations are kept on the hosts until they have been acknowledged by all replica machines.
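The per-replica stream rewrite can be sketched as follows. This is a simplified illustration with invented names: jobs for applications the replica does not host become empty placeholder jobs, so that sequence numbers stay aligned, and the whole stream ships as one batch rather than one message per job.

```python
def stream_for_replica(consensus_log, hosted_apps):
    """Rewrite the cluster's confirmed operation stream for one replica,
    replacing jobs unrelated to that replica with no-op placeholders
    that preserve the sequence numbering."""
    batch = []
    for seq, app_id, op in consensus_log:
        if app_id in hosted_apps:
            batch.append((seq, app_id, op))
        else:
            batch.append((seq, None, "noop"))  # irrelevant job: keep the slot
    return batch

log = [(1, "app1", "add doc"), (2, "app2", "add doc"), (3, "app1", "del doc")]
print(stream_for_replica(log, {"app1"}))
# [(1, 'app1', 'add doc'), (2, None, 'noop'), (3, 'app1', 'del doc')]
```

Because the placeholders keep the numbering dense, the replica can acknowledge "everything up to sequence N" with a single response per batch.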
The last part of the process is redirecting users to the closest location via DNS. To support this, we added another DNS record of the form APPID-dsn.algolia.net, which takes care of selecting the closest data center. We used the Route53 DNS service at first, but quickly ran into its limits.
Its latency-based routing is limited to the AWS regions, and we have many locations that AWS does not cover, such as India, Hong Kong, Canada and Russia.
Its geography-based routing is poor: you need to specify, for each country, what the DNS answer should be. Many hosted DNS providers take this classic approach, but in our case it is hard to support well and does not provide enough granularity. For example, we have multiple data centers in the United States.
After a lot of benchmarking and discussion, we settled on NSOne for the following reasons:
- Their Anycast network suits us very well and is better balanced for our needs; for example, they have POPs in both India and Africa.
- Their filtering logic is very good. For each customer we can specify the set of associated hosts (including the replica machines), and apply geographic filters to sort them by distance.
- They support EDNS client subnet, which lets us use the end user's IP rather than the IP of their DNS server.
In terms of performance, we achieved worldwide synchronization at the level of a second. You can try it out on Product Hunt's search (hosted in US-East, US-West, India, Australia and Europe) or on Hacker News' search (hosted in US-East, US-West, India and Europe).