Client Layer:
- User Action: A user draws a line on the whiteboard using the client application (browser or mobile app).
- Local State Management: The client's local state (in the browser or app) captures the drawing action immediately, providing a smooth user experience.
- WebSocket Connection: The client uses a WebSocket connection to send drawing events (like mouse positions or strokes) to the backend server. This connection ensures low-latency, real-time communication.
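Below is a minimal sketch of this client-side flow, assuming a browser environment; the event shape, server URL, and user ID are illustrative assumptions, not part of the design above.

```typescript
// Client-side sketch (browser). Event shape, URL, and user ID are
// illustrative assumptions.
type StrokeEvent = {
  type: "stroke";
  userId: string;
  points: { x: number; y: number }[];
  timestamp: number;
};

const localStrokes: StrokeEvent[] = []; // local state for instant rendering
const ws = new WebSocket("wss://whiteboard.example.com/session/abc123");

function onUserDraw(points: { x: number; y: number }[]): void {
  const event: StrokeEvent = {
    type: "stroke",
    userId: "user-42",
    points,
    timestamp: Date.now(),
  };
  localStrokes.push(event);       // update local state first (smooth UX)
  renderStroke(event);            // draw optimistically, before any server ack
  ws.send(JSON.stringify(event)); // then ship the event over the WebSocket
}

function renderStroke(event: StrokeEvent): void {
  // Canvas drawing elided; this sketch only shows the data flow.
}
```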
Load Balancer:
- The Cloud Load Balancer receives the incoming WebSocket request from the client and forwards it to one of the available WebSocket servers.
- If multiple WebSocket servers are running (e.g., for scalability and fault tolerance), the load balancer (for example, HAProxy) distributes connections based on factors such as current load or geographic location.
WebSocket Server:
- The WebSocket server processes the incoming drawing event. It acknowledges the request to ensure the client knows that their action was received.
- The WebSocket server broadcasts this event to other connected clients (users) so that everyone can see the drawing update in real-time.
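A minimal sketch of the ack-and-broadcast step, assuming a Node.js server with the `ws` package; the ack message shape is an illustrative assumption.

```typescript
// Server-side sketch using the "ws" package.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    // 1. Acknowledge receipt so the client knows its action arrived.
    socket.send(JSON.stringify({ type: "ack", receivedAt: Date.now() }));

    // 2. Broadcast the drawing event to every other connected client.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data.toString());
      }
    }
  });
});
```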
State Management:
- The State Manager is responsible for maintaining a consistent view of the whiteboard across all clients. When the WebSocket server receives an update, it forwards it to the State Manager.
- The State Manager uses ZooKeeper to coordinate updates if multiple state managers are running in parallel, ensuring no conflicting updates occur.
- The State Manager writes updates to Redis (for quick, in-memory caching) and MongoDB (for long-term storage and historical data).
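A sketch of the State Manager's write path, assuming the `ioredis` and `mongodb` Node.js clients; the key names, capped-list strategy, and document shape are illustrative choices, not prescribed by the design.

```typescript
// State Manager write path: fast cache write plus asynchronous durable write.
import Redis from "ioredis";
import { MongoClient } from "mongodb";

const redis = new Redis("redis://localhost:6379");
const mongo = new MongoClient("mongodb://localhost:27017"); // auto-connects on first use (driver v4+)

async function applyUpdate(boardId: string, event: object): Promise<void> {
  const serialized = JSON.stringify(event);

  // Fast path: push onto a per-board Redis list and cap its length, so
  // newly joining clients can replay recent updates cheaply.
  await redis.lpush(`board:${boardId}:events`, serialized);
  await redis.ltrim(`board:${boardId}:events`, 0, 999);

  // Durable path: asynchronous MongoDB write (eventual consistency);
  // the caller is not blocked on persistence.
  mongo
    .db("whiteboard")
    .collection("events")
    .insertOne({ boardId, event, ts: new Date() })
    .catch((err) => console.error("mongo write failed, needs retry", err));
}
```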
Caching Layer:
- To reduce latency and improve response times, the State Manager first writes drawing updates to the Redis cache. This allows for quick retrieval of recent updates for new clients joining the session.
- For session management, the system might also use a session cache (like ElastiCache) to store temporary session data.
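A small sketch of the session cache, reusing the Redis handle from the write-path sketch above; the session record shape and one-hour TTL are assumptions.

```typescript
// Session-data sketch: temporary per-user state with a TTL, stored in a
// Redis-compatible cache (e.g., ElastiCache).
async function saveSession(sessionId: string, userId: string): Promise<void> {
  const session = { userId, joinedAt: Date.now() };
  // "EX", 3600: expire the record after an hour so stale sessions clean
  // themselves up.
  await redis.set(`session:${sessionId}`, JSON.stringify(session), "EX", 3600);
}
```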
Persistent Storage:
- Once the drawing event is cached, the system performs an asynchronous write to the MongoDB database to ensure that all drawing data is durably stored. This includes user actions, timestamps, and any metadata.
- Larger assets, such as images or file uploads, are stored in an Object Storage system (like S3) for scalability and durability.
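A sketch of the asset-upload step, assuming the AWS SDK v3 S3 client; the bucket name and key layout are illustrative.

```typescript
// Asset upload sketch: large binaries go to object storage, and the
// event stream carries only the returned key, never the raw bytes.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

async function storeAsset(
  boardId: string,
  assetId: string,
  bytes: Uint8Array,
  contentType: string
): Promise<string> {
  const key = `boards/${boardId}/assets/${assetId}`;
  await s3.send(
    new PutObjectCommand({
      Bucket: "whiteboard-assets", // illustrative bucket name
      Key: key,
      Body: bytes,
      ContentType: contentType,
    })
  );
  return key; // clients reference the asset by this key
}
```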
Coordination with ZooKeeper:
- ZooKeeper is used for leader election and coordination between distributed components. For example, if a state manager goes down, ZooKeeper helps elect a new leader to maintain system availability.
- The WebSocket servers and state managers use ZooKeeper to stay in sync and recover from network partitions.
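A sketch of the standard ZooKeeper leader-election recipe (ephemeral sequential znodes), assuming the `node-zookeeper-client` package; the znode paths are illustrative, and the parent path is assumed to already exist.

```typescript
// Leader election: each state manager creates an ephemeral sequential
// znode; the lowest sequence number is the leader. If the leader dies,
// its ephemeral node vanishes and the next candidate takes over.
import zookeeper from "node-zookeeper-client";

const client = zookeeper.createClient("localhost:2181");

client.once("connected", () => {
  client.create(
    "/whiteboard/election/candidate-", // parent znode assumed to exist
    Buffer.from("state-manager-1"),
    zookeeper.CreateMode.EPHEMERAL_SEQUENTIAL,
    (err, myPath) => {
      if (err) throw err;
      client.getChildren("/whiteboard/election", (err2, children) => {
        if (err2) throw err2;
        const lowest = children.sort()[0]; // zero-padded, so sort() works
        const isLeader = myPath.endsWith(lowest);
        console.log(isLeader ? "elected leader" : "standing by");
      });
    }
  );
});

client.connect();
```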
Broadcast to Other Clients:
- After processing the update, the WebSocket server sends the updated drawing data to all other connected clients in real-time. This ensures that everyone sees the latest changes almost instantly.
- The State Manager ensures that even clients who reconnect later can retrieve the latest drawing state from the cache or database.
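A sketch of that reconnect catch-up path, reusing the Redis and MongoDB handles from the earlier sketches: serve the delta from the cache when its window reaches back far enough, otherwise fall back to the durable store. The sync-request shape is an assumption.

```typescript
// Catch-up for a reconnecting client that last saw an event at sinceTs.
async function handleSyncRequest(
  boardId: string,
  sinceTs: number
): Promise<object[]> {
  // Try the cache first: recent events live in the capped Redis list
  // (lpush stores newest-first, so reverse to replay in drawing order).
  const cached = await redis.lrange(`board:${boardId}:events`, 0, -1);
  const events = cached
    .reverse()
    .map((s) => JSON.parse(s) as { timestamp: number });

  if (events.length > 0 && events[0].timestamp <= sinceTs) {
    // The cache window covers the client's gap; serve it from memory.
    return events.filter((e) => e.timestamp > sinceTs);
  }

  // Cache window too short: fall back to MongoDB for the full history.
  return mongo
    .db("whiteboard")
    .collection("events")
    .find({ boardId, "event.timestamp": { $gt: sinceTs } })
    .sort({ "event.timestamp": 1 })
    .toArray();
}
```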
Key Takeaways:
- Real-time Interactions: User actions are captured locally, sent to the server via WebSockets, processed by the WebSocket server, and distributed to all connected clients.
- Eventual Consistency: The State Manager ensures data is written to both the cache (Redis) for fast access and the database (MongoDB) for persistence. The system prioritizes availability and responsiveness, with data consistency ensured over time.
- Fault Tolerance and Coordination: ZooKeeper ensures the system remains operational during network failures, and the system can scale horizontally by adding more WebSocket servers or state managers.
This design strikes a balance between a low-latency user experience and the need for eventual consistency.
Given the architecture, we can analyze how each aspect of the CAP theorem fits.
Partition Tolerance (P):
- In a highly distributed system, where WebSocket servers, state managers, and data stores are spread across multiple nodes, network partitions are inevitable (e.g., server crashes or network delays).
- Partition Tolerance is a must-have for this system since users expect the board to function even if some parts of the system become temporarily unreachable.
- By design, we've incorporated tools like ZooKeeper for coordination and leader election, ensuring that the system can recover from partitions gracefully.
Given that Partition Tolerance is non-negotiable, we're left with choosing between Consistency and Availability.
Consistency (C):
- Ensuring that all users see the same, up-to-date whiteboard content is important, especially in this collaborative application.
- However, enforcing strict consistency in real-time applications can lead to delays (latency) because updates may need to propagate across all nodes before being visible to users.
- The State Managers and the Redis cache help synchronize updates, but due to the distributed nature, strict consistency would be challenging without impacting responsiveness.
Availability (A):
- Users expect the app to be highly responsive and available at all times, especially since it's a real-time experience.
- Prioritizing availability means that during a network partition, the system should still accept drawing inputs from clients, even if those updates cannot be immediately synchronized with all nodes.
- By using local caches (e.g., Redis) and eventual consistency strategies (like writing updates to MongoDB asynchronously), we ensure the system remains available even if some nodes are temporarily unreachable.
Based on the design:
- The system leans towards Availability (A) and Partition Tolerance (P), sacrificing Consistency (C) to some extent.
- Eventual consistency is a common pattern in real-time collaborative applications:
- For example, if one user is drawing while a network partition occurs, other users may see a slight delay before the updates are reflected once the partition resolves.
- Using caches (Redis) and asynchronous writes (MongoDB) helps keep the system responsive while ensuring that all data is eventually consistent.
To better balance availability and consistency:
- Adopt Conflict-free Replicated Data Types (CRDTs) or Operational Transformation (OT), which allow users to continue working offline and then merge changes without conflicts (see the sketch after this list).
- Apply multi-version concurrency control (MVCC) in the state management layer, where each update carries a version that can be reconciled later.
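As a concrete illustration of the CRDT idea, here is a minimal grow-only set (G-Set) of stroke IDs: merging is set union, so replicas converge without coordination. This is a sketch of the simplest possible case; a real whiteboard would need richer CRDTs (or OT) to support deletion and reordering.

```typescript
// G-Set CRDT: local adds need no coordination, and merge (set union) is
// commutative, associative, and idempotent, so replicas converge
// regardless of message order or duplication.
class GSet<T> {
  private items = new Set<T>();

  add(item: T): void {
    this.items.add(item);
  }

  merge(other: GSet<T>): void {
    for (const item of other.items) this.items.add(item);
  }

  values(): T[] {
    return [...this.items];
  }
}

// Two replicas diverge during a partition, then converge after merging.
const a = new GSet<string>();
const b = new GSet<string>();
a.add("stroke-1");
b.add("stroke-2");
a.merge(b);
b.merge(a);
// a.values() and b.values() now both contain stroke-1 and stroke-2.
```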
| CAP Property | Impact on System |
|---|---|
| Consistency (C) | Partially sacrificed for responsiveness and availability. Uses eventual consistency strategies to ensure data synchronization after partitions resolve. |
| Availability (A) | Prioritized to ensure users can keep drawing even during network issues. Handles updates locally and syncs later. |
| Partition Tolerance (P) | Required due to the distributed nature of the system. Uses ZooKeeper for fault detection and leader election. |