Communication and Scaling


Learning Objectives

  • You know how to horizontally scale a system that tracks connected clients for communication.
  • You know of Command Query Responsibility Segregation (CQRS) and Event Sourcing, and their benefits in applications.
  • You know what backpressure means and know of strategies to mitigate it.

In the chapter on State Management, we emphasized the need to keep the server stateless. However, in the examples that used server-sent events and WebSockets, the server had to keep track of the connected clients.

This is an inherent limitation in client-server communication: to be able to send messages to clients, the server needs to keep track of the connections. This means that the server is not stateless, and additional measures are needed to scale.
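
As a concrete illustration, the sketch below keeps connected WebSocket clients in an in-memory set so that messages can be broadcast to them. It assumes a Deno server; the same pattern applies to any WebSocket server library. The set of sockets is exactly the state that prevents the server from being stateless.

```ts
// Minimal sketch: a WebSocket server that tracks its connected clients.
const sockets = new Set<WebSocket>();

Deno.serve((req) => {
  const { socket, response } = Deno.upgradeWebSocket(req);
  socket.addEventListener("open", () => sockets.add(socket));
  socket.addEventListener("close", () => sockets.delete(socket));
  socket.addEventListener("message", (event) => {
    // Broadcasting requires knowing every connection: this set is the
    // state that ties clients to this particular server instance.
    for (const s of sockets) {
      s.send(event.data);
    }
  });
  return response;
});
```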

Horizontal scaling and state

To horizontally scale a system that tracks clients, communication between servers is needed. In a scaled version, clients connect to servers via a load balancer, while the servers are connected to a message queue or a stream. When one server receives a message, it publishes the message to the queue, and all connected servers retrieve the message and forward it to their clients.

Figure 1 illustrates this high-level architecture:

Figure 1 — Scaling a system where clients are connected to an individual server requires communication between the servers.

In advanced deployments, the message queue or stream is often distributed across multiple servers. This distribution increases throughput and redundancy, reducing the risk of a single point of failure. Distributed message queues can handle larger volumes of messages and maintain availability even when individual servers fail.
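
The sketch below shows the idea with Redis publish/subscribe: messages from local clients are published to a shared channel, and messages arriving on the channel are forwarded to this server's own clients. The npm redis client, the channel name chat, and the onClientMessage hook are assumptions made for illustration; any queue or stream with publish/subscribe semantics follows the same pattern.

```ts
import { createClient } from "npm:redis";

// Two connections: a subscribed Redis connection cannot issue other
// commands, so publishing happens on a separate one.
const publisher = createClient();
const subscriber = publisher.duplicate();
await publisher.connect();
await subscriber.connect();

// The clients connected to *this* server instance.
const sockets = new Set<WebSocket>();

// Forward every message on the shared channel to this server's clients.
await subscriber.subscribe("chat", (message) => {
  for (const socket of sockets) {
    socket.send(message);
  }
});

// When a local client sends a message, publish it to the channel so
// that every server, including this one, delivers it to its clients.
async function onClientMessage(data: string) {
  await publisher.publish("chat", data);
}
```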

Further strategies to improve scaling and reliability include:

  • Consumer groups: Rather than having every server process every message, consumer groups allow messages to be distributed among a set of servers based on criteria such as topic or channel. Each server in a consumer group processes only a subset of messages, which balances the workload and minimizes redundant processing.

  • Retry strategies: Failures will happen. Implementing retry mechanisms, either on the server or within the message queue, helps ensure that message deliveries are retried until they succeed. Techniques like exponential backoff, which gradually increases the wait time between retries, help avoid overwhelming the system; a sketch follows this list.

  • Fault tolerance: In the event of a server failure, the message queue should be capable of rerouting messages to an alternative server. Redundant servers can monitor each other and automatically take over when one fails, thereby minimizing system downtime.
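
As an illustration of the retry strategy mentioned above, the following sketch retries an asynchronous operation with exponentially growing delays. The function name and the parameter values are illustrative.

```ts
// Retry an operation with exponential backoff: the delay doubles on
// each failed attempt, up to a maximum number of attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of attempts: surface the failure to the caller.
      if (attempt === maxAttempts - 1) throw error;
      // Wait 100 ms, 200 ms, 400 ms, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}

// Usage: retry delivering a message until it succeeds or attempts run out.
// await withRetry(() => deliverMessage(message));
```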

CQRS and Event Sourcing

Distributed messaging systems can use patterns like Command Query Responsibility Segregation (CQRS) and Event Sourcing. CQRS means separating the responsibilities of handling commands (state-changing operations) and queries (data retrieval). This separation allows each side to be optimized and scaled independently. Event sourcing, on the other hand, means recording every state change as an immutable event. Instead of storing only the latest state, the system maintains a full history of events that have occurred.

These patterns can streamline the way state changes are managed and communicated across multiple servers.

Decoupling reads and writes

When the responsibilities for handling commands and queries are separated, each side can be optimized and scaled independently. For example, while one set of servers handles incoming commands and processes events, another set can be dedicated solely to serving read requests. This division can simplify the internal architecture and helps manage load more effectively by isolating write-heavy operations from read-heavy ones.

Command Query Responsibility Segregation (CQRS) is conceptually similar to primary-secondary replication, which is used when scaling databases: writes are handled in one place, while reads can be served from many.
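
The following sketch illustrates the separation within a single process: the command handler is the only code that mutates state, while the query handler only reads it. The message/channel domain and the type names are illustrative.

```ts
// A sketch of CQRS: commands go through a handler that mutates state,
// queries through a handler that only reads it.
type Command = { type: "addMessage"; channel: string; text: string };
type Query = { type: "listMessages"; channel: string };

// Write model, owned by the command side only.
const messages = new Map<string, string[]>();

function handleCommand(command: Command): void {
  if (command.type === "addMessage") {
    const channelMessages = messages.get(command.channel) ?? [];
    channelMessages.push(command.text);
    messages.set(command.channel, channelMessages);
    // In a distributed setup, the command side would also publish an
    // event here so that read-side servers can update their own views.
  }
}

// Read side: never mutates state, so it can be replicated and scaled
// independently of the write side.
function handleQuery(query: Query): string[] {
  return messages.get(query.channel) ?? [];
}

handleCommand({ type: "addMessage", channel: "general", text: "hello" });
console.log(handleQuery({ type: "listMessages", channel: "general" }));
```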

Immutable state with event sourcing

Event sourcing complements CQRS by recording every state change as an immutable event. Instead of storing only the latest state, event sourcing maintains a full history of events that have occurred.

Whenever an event happens, it is appended to an event log, and sent to a message queue for distribution to other servers. Each server processes the event and updates its local state accordingly. This is illustrated in Figure 2 below.

Figure 2 — Event sourcing flow: events are appended to an event log and distributed to other servers through a message queue.
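
As a sketch, continuing the illustrative message domain from the CQRS example above, the following example appends events to a log and derives the state by replaying them, as a recovering or newly added server would:

```ts
// Event sourcing sketch: each change is an immutable event in an
// append-only log, and state is derived by replaying the log.
type ChatEvent = { type: "messageAdded"; channel: string; text: string };

const eventLog: ChatEvent[] = [];

function append(event: ChatEvent): void {
  eventLog.push(event);
  // In the flow of Figure 2, the event would also be published to a
  // message queue here so that other servers can apply it.
}

// Applies a single event to the state; used both for live updates
// and for replay.
function apply(state: Map<string, string[]>, event: ChatEvent) {
  const channelMessages = state.get(event.channel) ?? [];
  channelMessages.push(event.text);
  state.set(event.channel, channelMessages);
  return state;
}

// A recovering or newly added server rebuilds its state by replaying
// the full log from the beginning.
function replay(events: ChatEvent[]): Map<string, string[]> {
  return events.reduce(apply, new Map<string, string[]>());
}

append({ type: "messageAdded", channel: "general", text: "hello" });
append({ type: "messageAdded", channel: "general", text: "world" });
console.log(replay(eventLog).get("general")); // [ "hello", "world" ]
```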

The benefits of event sourcing include:

  • Auditability and debugging: As all changes are logged as events, it becomes easier to trace the evolution of the system state, which helps in debugging and auditing. As an example, if there is a crash or an erroneous state, the events leading up to it can be replayed to find the cause.

  • Eventual consistency across distributed components: When a change happens, the event is published to all servers through a message queue. This ensures that servers in a horizontally scaled environment will eventually reach the same state, despite processing commands asynchronously.

  • Event replay and recovery: In the event of a failure, the system can rebuild its state by replaying the events from the log. Similarly, if a new server is added to the cluster, it can catch up by replaying the events from the beginning.

Handling peaks and backpressure

Systems that manage a large number of client connections must be prepared to handle peaks in traffic. Peaks in client requests can lead to a state called backpressure, where the rate of incoming messages exceeds the system's capacity to process them. Managing backpressure is necessary to maintain responsiveness and to prevent resource exhaustion.

There are multiple strategies to manage peaks and implement effective backpressure handling:

  • Rate limiting: The system may use rate limiting to cap the number of incoming requests at an acceptable threshold, rejecting or delaying requests beyond it to prevent overload.

  • Circuit breakers: Circuit breaker patterns are a defensive measure to isolate failing components. When a service or server begins to fail, the circuit breaker can temporarily block further requests, preventing the failure from cascading throughout the system. Once the issue is resolved, the circuit breaker resets, and normal operations resume.

  • Buffering and queuing: Introducing buffers or intermediate queues can smooth out traffic surges. Messages are temporarily stored and processed at a controlled rate, ensuring that no single part of the system is overwhelmed. Buffers and queues also make it possible to monitor traffic patterns and react when needed; a sketch of a bounded buffer follows this list.
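
The sketch below shows the buffering and queuing approach with a bounded in-memory buffer: incoming messages are accepted until the buffer is full and drained at a controlled rate. The limits and the handleMessage function are illustrative.

```ts
// Buffering sketch: bursts of incoming messages are stored in the
// buffer and drained at a steady pace regardless of traffic spikes.
const MAX_BUFFER = 1000;
const DRAIN_INTERVAL_MS = 10;
const buffer: string[] = [];

function enqueue(message: string): boolean {
  if (buffer.length >= MAX_BUFFER) {
    // Buffer full: reject the message so the caller can retry later.
    return false;
  }
  buffer.push(message);
  return true;
}

// Process at most one message per interval; the buffer length can
// also be monitored to detect sustained overload.
setInterval(() => {
  const message = buffer.shift();
  if (message !== undefined) {
    handleMessage(message);
  }
}, DRAIN_INTERVAL_MS);

function handleMessage(message: string) {
  // Placeholder for the actual processing, e.g. a database write.
  console.log("processing", message);
}
```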

Rate limiting and circuit breakers are typically implemented in a load balancer; as an example, Traefik offers both rate limiting and circuit breaker functionality. Buffering and queuing, on the other hand, can also be implemented on the server side, although load balancers typically offer buffering capabilities as well.
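
For intuition, the following sketch shows the core of a circuit breaker as it might look when implemented on the server side: after a number of consecutive failures the circuit opens and calls fail fast, and after a cooldown period one call is let through to probe whether the underlying service has recovered. The thresholds and timings are illustrative.

```ts
// Circuit breaker sketch: after FAILURE_THRESHOLD consecutive failures
// the circuit "opens" and calls fail immediately; after RESET_TIMEOUT_MS
// a single call is let through to probe for recovery.
const FAILURE_THRESHOLD = 5;
const RESET_TIMEOUT_MS = 10_000;

let failures = 0;
let openedAt: number | null = null;

async function callWithBreaker<T>(operation: () => Promise<T>): Promise<T> {
  if (openedAt !== null) {
    if (Date.now() - openedAt < RESET_TIMEOUT_MS) {
      throw new Error("circuit open: failing fast");
    }
    openedAt = null; // half-open: allow one probing call through
  }
  try {
    const result = await operation();
    failures = 0; // success closes the circuit
    return result;
  } catch (error) {
    failures++;
    if (failures >= FAILURE_THRESHOLD) {
      openedAt = Date.now();
    }
    throw error;
  }
}
```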

When we discussed message queues with Redis and compared the number of database-related requests a server could process with and without a message queue, we were applying the buffering and queuing strategy described above.
