Introduction#
When discussing the internet systems of enterprises, especially companies whose main business is internet-based, we frequently mention the term "high availability" during development.
High availability (HA) is an IT term referring to a system's ability to perform its functions without interruption; it expresses the degree to which a system is available and is one of the guiding principles of system design. A highly available system can stay operational longer than the individual components that make it up.
High availability is usually achieved by improving a system's fault tolerance. What counts as high availability for a particular system generally has to be analyzed case by case.
Quoted from Wikipedia
The main purpose of high availability is to ensure "business continuity," meaning that in the eyes of users the business is always serving normally. High availability does not come for free: it is the product of sound architecture and real costs. For an architect, learning to design highly available systems is therefore essential.
We usually break a large system down into layers, separating it into independent "tiers" such as the application layer, middleware, and the data storage layer, with each layer further divided into finer-grained components. Each component then exposes services to the others; components never exist in isolation, and it is these services that give them meaning.
The high availability of an architecture naturally requires that every component in it, and every service it exposes, follows a "high availability" design. Any component that falls short means the system is at risk of downtime.
To achieve "high availability," every component relies on two core elements: "redundancy" and "automatic failover." Single points of failure are the arch-nemesis of high availability, so a qualified highly available component should be deployed as a cluster (on at least two machines), ensuring that the service does not become unavailable when a single node fails, because the other nodes immediately step in.
Assuming each node's availability is 90%, a two-machine cluster achieves $ 1 - (100\% - 90\%)^2 = 99\% $. So adding redundant machines is a simple and practical way to raise availability.
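To make the arithmetic concrete, here is a minimal Java sketch of the same formula generalized to n redundant nodes; the method name and sample values are just for illustration, and it assumes node failures are independent:

```java
public class Availability {
    // Availability of a cluster of n redundant nodes, assuming each node
    // has independent availability p and any single node can serve alone.
    static double clusterAvailability(double p, int n) {
        return 1 - Math.pow(1 - p, n);
    }

    public static void main(String[] args) {
        System.out.println(clusterAvailability(0.90, 2)); // ~0.99  -> two nines
        System.out.println(clusterAvailability(0.90, 3)); // ~0.999 -> three nines
    }
}
```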
"Redundancy" alone is not enough, though. Switching over by hand after a machine fails is slow, labor-intensive, and error-prone, so we also need tooling for "fully automated failover" to achieve near-real-time switching; this is the main point of high availability.
In the industry, the availability of a system is often measured by the number of "nines":
| Availability Level | System Availability | Annual Downtime | Monthly Downtime | Weekly Downtime | Daily Downtime |
| --- | --- | --- | --- | --- | --- |
| Unavailable | 90% | 876 hours | 73 hours | 16.8 hours | 144 minutes |
| Basic Availability | 99% | 87.6 hours | 7.3 hours | 1.68 hours | 14.4 minutes |
| Higher Availability | 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| High Availability | 99.99% | 52.56 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
Achieving 99% availability is already quite easy, especially in the era of cloud computing. But for companies whose main business lives on the internet, even that much downtime (87.6 hours a year) can severely hurt the business; it may even be a matter of life and death.
The core business of a large enterprise usually demands five nines or more, since even a brief outage can affect tens of thousands of users. (Of course, setting the bar is one thing and clearing it is another, as the recent outages at several major cloud providers show...)
Microservices Architecture#
Today, most internet companies in China adopt a microservices architecture:
Microservices Architecture
As shown, the architecture is divided into the following layers:
- Access Layer: the entry point for all traffic, usually built on F5 hardware or LVS software and typically fronted by traffic-scrubbing appliances against large-scale DDoS attacks.
- Load Balancing Layer: mostly Nginx, mainly responsible for distributing traffic and enforcing rate limits, etc.
- Gateway Layer: mainly responsible for flow control, risk control, protocol conversion, and so on.
- Site Layer: mainly responsible for calling basic services, assembling the JSON data, and returning it to the client.
- Basic Service Layer: strictly speaking, this sits at the same level as the site layer within the microservices; the difference is that basic services are infrastructure that the various upper business layers can all call.
- Storage Layer: databases such as MySQL, Oracle, and Postgres, generally accessed by the services, whose results flow back to the site layer.
- Middleware: Redis, MQ, and the like, mainly used to speed up data access; the role of each component is briefly introduced below.
Access Layer & Load Balancing Layer#
The high availability of these two layers is closely tied to Keepalived, so we can look at them together.
Access Layer & Load Balancing Layer
Two LVS nodes serve traffic in a master-backup (active-standby) configuration: only the master does the work, and the backup takes over once the master fails. Keepalived runs on both machines, so the backup can monitor the master's health in real time.
When the master fails, the virtual IP (VIP) quickly moves over to the backup, which is commonly called "IP drift"; this solves high availability for LVS.
Keepalived's heartbeat detection usually probes via ICMP or TCP port checks, and it can also be pointed at Nginx's port so that abnormal Nginx instances are removed promptly.
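As a rough illustration of what such a TCP port check does under the hood, here is a minimal Java sketch; the host, port, and timeout are placeholder values, not Keepalived's actual implementation:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    // This mirrors the idea behind a TCP health check: a node whose port
    // stops accepting connections is treated as failed and removed.
    static boolean isAlive(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAlive("127.0.0.1", 80, 1000)); // probe a local Nginx
    }
}
```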
Microservices Layer#
Below the load balancer, the "Gateway Layer," "Site Layer," and "Basic Service Layer" together form the core of the microservices architecture. These components communicate with each other through RPC frameworks such as Dubbo or gRPC.
So for the microservices to be highly available, the RPC framework itself must provide the capabilities that make high availability possible. Let's take Dubbo as an example and see how it does so:
Dubbo Architecture
The general idea is as follows:
- Provider registers services with the Registry.
- Consumer subscribes to and pulls the Provider service list from the Registry.
- Consumer selects a Provider based on its load balancing strategy and sends a request.
- When a Provider becomes unavailable, the Registry detects this, removes it from the service list, and pushes the updated list to Consumers.
This achieves automatic failover.
It is not hard to see that the Registry plays a role in this process similar to that of Keepalived.
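A minimal sketch of these steps using Dubbo's programmatic API (classes from org.apache.dubbo.config; the GreetingService interface, its implementation, the application names, and the ZooKeeper address are all hypothetical, and in practice provider and consumer run in separate processes):

```java
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.ReferenceConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.config.ServiceConfig;

public class DubboSketch {
    public static void main(String[] args) {
        ApplicationConfig app = new ApplicationConfig("demo-app");
        RegistryConfig registry = new RegistryConfig("zookeeper://127.0.0.1:2181");

        // Provider side: register GreetingService with the Registry.
        ServiceConfig<GreetingService> service = new ServiceConfig<>();
        service.setApplication(app);
        service.setRegistry(registry);
        service.setInterface(GreetingService.class);
        service.setRef(new GreetingServiceImpl());
        service.export();

        // Consumer side: subscribe to the provider list and call through a proxy;
        // Dubbo applies its load-balancing strategy and fails over when a
        // provider drops out of the registry's list.
        ReferenceConfig<GreetingService> reference = new ReferenceConfig<>();
        reference.setApplication(app);
        reference.setRegistry(registry);
        reference.setInterface(GreetingService.class);
        GreetingService greeting = reference.get();
        System.out.println(greeting.hello("world"));
    }

    interface GreetingService { String hello(String name); }

    static class GreetingServiceImpl implements GreetingService {
        public String hello(String name) { return "hello " + name; }
    }
}
```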
Middleware#
For middleware services like Redis and ZooKeeper, achieving high availability is also quite important.
ZooKeeper#
Taking ZooKeeper as an example:
ZooKeeper Architecture
From the diagram we can see ZooKeeper's main roles:
- Leader (a cluster has exactly one Leader)
  - The sole scheduler and processor of transaction (write) requests: it guarantees that the cluster processes transactions in order, executing all write requests forwarded by Followers to keep transactions consistent.
  - The scheduler of the servers within the cluster: after processing a transaction request, it broadcasts the data to every Follower and counts how many writes succeed. Once more than half succeed, the Leader considers the write successful and tells all Followers to commit it. This guarantees the write survives a cluster crash-recovery or restart.
- Follower
  - Handles non-transaction (read) requests from clients and forwards transaction requests to the Leader.
  - Votes on transaction request proposals.
  - Votes in Leader elections.
The obvious problem here is that there is only one Leader, a single point of failure. To deal with it, ZooKeeper keeps heartbeat connections between the Leader and the Followers; when the Leader fails, the Followers vote to elect a replacement via ZAB (ZooKeeper Atomic Broadcast), a crash-recoverable consistency protocol designed specifically for ZooKeeper.
Besides ZAB, consensus protocols such as Paxos and Raft can also be used for leader election.
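Applications often reuse ZooKeeper for their own leader election rather than implementing a protocol themselves; a minimal sketch with Apache Curator's LeaderLatch recipe follows (the connection string and latch path are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every instance creates a latch on the same path; ZooKeeper ensures
        // exactly one of them holds leadership at any time. If the leader's
        // session dies (heartbeats lost), another instance is elected.
        try (LeaderLatch latch = new LeaderLatch(client, "/demo/leader")) {
            latch.start();
            latch.await(); // blocks until this instance becomes leader
            System.out.println("I am the leader now");
        }
        client.close();
    }
}
```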
Redis#
Redis has two deployment modes: "Master-Slave Mode" and "Cluster Sharding Mode." The high availability of Redis needs to be determined based on its deployment mode.
Master-Slave Mode#
Redis Master-Slave Architecture
Master-Slave Mode means one master with one or more slave nodes. The master handles reads and writes and replicates its data to the slaves; clients can also send read requests to the slaves to relieve pressure on the master.
But, as with ZooKeeper, having a single master is a single point of failure, so a third-party arbitrator has to be introduced to decide whether the master is down and, once it is, to quickly promote a slave to take over as the new master.
In Redis this third-party arbitrator is called "Sentinel." Of course, the Sentinel process itself can also fail, so to be safe multiple Sentinels are deployed, i.e., a Sentinel cluster.
Redis Master-Slave Mode + Sentinel Cluster Architecture
The Sentinel cluster exchanges opinions on whether the master is offline via a gossip protocol and, once the master is judged down, uses a Raft-style election to decide which Sentinel performs the failover and promotes a new master.
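From the client's perspective, failover stays transparent as long as the client asks the Sentinels where the current master is. A minimal sketch with the Jedis client (the master name "mymaster" and the Sentinel addresses are placeholders):

```java
import java.util.HashSet;
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public class SentinelSketch {
    public static void main(String[] args) {
        Set<String> sentinels = new HashSet<>();
        sentinels.add("10.0.0.1:26379");
        sentinels.add("10.0.0.2:26379");
        sentinels.add("10.0.0.3:26379");

        // The pool asks the Sentinels for the address of the master named
        // "mymaster"; after a failover it reconnects to the newly promoted master.
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting"));
        }
    }
}
```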
Cluster Sharding Mode#
While the Master-Slave Mode seems perfect, it has the following issues:
- The write pressure on the master is hard to reduce: only the one master can accept writes, so under high concurrency a flood of write requests can saturate the master's network card and leave it unable to serve at all.
- The master's capacity is capped by a single machine's storage: both master and slaves hold the full data set, so as business grows, the cached data keeps climbing until it hits the storage ceiling.
- Synchronization storms: data is pushed from the master to the slaves, so the more slaves there are, the heavier the load on the master.
To solve these issues, the Cluster Sharding Mode was born.
Data is sharded, with each shard being managed by a corresponding master node for read and write operations, thus distributing the write pressure across multiple master nodes, and each node only stores part of the data, solving the single-machine storage bottleneck issue.
However, it is important to note that each master node still has a single point issue, so high availability must be ensured for each master node.
Redis Cluster Sharding Architecture
When the Proxy receives a read or write command from the Client, it first computes a value from the Key; each possible value is called a slot (Redis Cluster has 16,384 slots in total, the slot being CRC16(key) mod 16384), and the command is sent to whichever Master manages that slot for execution.
It can be seen that each Master handles only a portion of the Redis data, and to avoid each Master being a single point, it too is paired with multiple slave nodes to form a group; when a master fails, the cluster elects a new master from its slaves using a Raft-style algorithm.
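A minimal sketch of slot-aware access with the Jedis cluster client (the node addresses are placeholders); the client hashes the key to one of the 16,384 slots itself and routes the command to the owning master:

```java
import java.util.HashSet;
import java.util.Set;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterSketch {
    public static void main(String[] args) {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("10.0.0.1", 6379));
        nodes.add(new HostAndPort("10.0.0.2", 6379));
        nodes.add(new HostAndPort("10.0.0.3", 6379));

        // JedisCluster maps the key to a slot and sends the command to the
        // master that owns it; after a failover it follows the cluster's
        // redirection to the newly promoted master.
        try (JedisCluster cluster = new JedisCluster(nodes)) {
            cluster.set("user:1001", "alice");
            System.out.println(cluster.get("user:1001"));
        }
    }
}
```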
Of course, the cost balloons accordingly, so much so that I have yet to run a Cluster Sharding deployment myself.
MQ (Kafka)#
MQ, or Message Queue.
MQ high availability is likewise usually achieved through data sharding, which also brings horizontal scalability.
Kafka High Availability Cluster Architecture
As the diagram shows, each Topic's Partitions and their replicas are spread across different brokers, so when a Partition's leader becomes unavailable, a new leader can be elected from its followers and service continues.
Note, however, that unlike Elasticsearch or Redis Cluster, a Follower Partition is a cold standby: under normal circumstances it serves no external traffic, and only after the Leader fails and a new Leader is elected from among the Followers does it start serving.
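On the producer side, how durably a record survives this replication is controlled by the acks setting; a minimal sketch with the Kafka Java client (broker addresses and the topic name are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        // acks=all: the leader waits until all in-sync follower replicas have
        // persisted the record, so a follower promoted to leader won't lose it.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
        }
    }
}
```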
Storage Layer#
The storage layer usually means the database, the foundation that lets a system run and keep its data durable. Here we take MySQL as an example and briefly discuss its high availability design.
Referring to the high availability designs above, you will find that the high availability concept of MySQL is similar to those of the other architectures mentioned; it also consists of master-slave and sharding (commonly referred to as database and table partitioning) architectures.
MySQL Master-Slave Architecture#
The master-slave setup is similar to LVS's and generally relies on Keepalived for high availability:
MySQL Master-Slave Architecture
If the Master fails, Keepalived detects it promptly, the Slave is promoted to Master, and the virtual IP "drifts" over to the original Slave. That is why the MySQL address configured in application code is generally a virtual IP or an internal DNS domain name, which preserves high availability.
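In application code this simply means the JDBC URL points at the virtual IP or DNS name rather than a physical machine; a minimal sketch (the host name, database, and credentials are placeholders, and the MySQL JDBC driver is assumed on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlVipSketch {
    public static void main(String[] args) throws Exception {
        // mysql-vip.internal resolves to whichever node currently holds the
        // virtual IP, so a failover requires no configuration change here.
        String url = "jdbc:mysql://mysql-vip.internal:3306/app?connectTimeout=3000";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            rs.next();
            System.out.println(rs.getInt(1));
        }
    }
}
```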
MySQL Sharding Architecture#
Once the data volume reaches a certain scale, splitting databases and tables becomes a necessary operation. As in Redis's sharded cluster, each master needs to be paired with multiple slaves:
MySQL Sharding Cluster Architecture
Conclusion#
Looking across the components above, you will notice that high availability architectures grow more intricate as data volume grows, moving from one-master-many-slaves clusters toward multi-master clusters. Keeping data in sync between multiple masters, however, remains a hard problem, which is why many components still stick to a single master that replicates to multiple slaves.
Note, though, that even with every component made highly available, the architecture as a whole is still not guaranteed to be. In production, all kinds of surprises will test the system: sudden traffic spikes (flash sales, for example), malicious attacks (DDoS and the like), memory leaks in code (leaving the program unresponsive), botched production deployments (as in Facebook's outage), data-center power failures (where multi-site disaster recovery helps), and so on.
So alongside high availability in the architecture itself, we also need measures such as system isolation, rate limiting, circuit breaking, risk control, degradation, and restricted permissions for critical operations to keep the system available.