
    ### **Scenario: A User Draws on the Whiteboard**

    1. **Client Layer**:
    - **User Action**: A user draws a line on the whiteboard using the client application (browser or mobile app).
    - **Local State Management**: The client's local state (in the browser or app) captures the drawing action immediately, providing a smooth user experience.
    - **WebSocket Connection**: The client uses a **WebSocket connection** to send drawing events (like mouse positions or strokes) to the backend server. This connection ensures low-latency, real-time communication.

    2. **Load Balancer**:
    - The **Cloud Load Balancer** receives the incoming WebSocket request from the client and forwards it to one of the available WebSocket servers.
- If multiple WebSocket servers are running (e.g., for scalability and fault tolerance), the load balancer (**HAProxy**, for example) distributes connections based on factors such as current load or geographic location.

    3. **WebSocket Server**:
    - The WebSocket server processes the incoming drawing event. It acknowledges the request to ensure the client knows that their action was received.
- The WebSocket server broadcasts this event to the other connected clients (users) so that everyone sees the drawing update in real time (a minimal broadcast sketch follows this list).

    4. **State Management**:
    - The **State Manager** is responsible for maintaining a consistent view of the whiteboard across all clients. When the WebSocket server receives an update, it forwards it to the State Manager.
    - The State Manager uses **ZooKeeper** to coordinate updates if multiple state managers are running in parallel, ensuring no conflicting updates occur.
    - The State Manager writes updates to **Redis** (for quick, in-memory caching) and **MongoDB** (for long-term storage and historical data).

    5. **Caching Layer**:
    - To reduce latency and improve response times, the State Manager first writes drawing updates to the **Redis cache**. This allows for quick retrieval of recent updates for new clients joining the session.
    - For session management, the system might also use a session cache (like **ElastiCache**) to store temporary session data.

    6. **Persistent Storage**:
    - Once the drawing event is cached, the system performs an **asynchronous write** to the **MongoDB database** to ensure that all drawing data is durably stored. This includes user actions, timestamps, and any metadata.
    - Larger assets, such as images or file uploads, are stored in an **Object Storage system (like S3)** for scalability and durability.

    7. **Coordination with ZooKeeper**:
    - **ZooKeeper** is used for **leader election** and coordination between distributed components. For example, if a state manager goes down, ZooKeeper helps elect a new leader to maintain system availability.
    - The WebSocket servers and state managers use ZooKeeper to stay in sync and recover from network partitions.

    8. **Broadcast to Other Clients**:
- After processing the update, the WebSocket server sends the updated drawing data to all other connected clients in real time, so everyone sees the latest changes almost instantly.
    - The State Manager ensures that even clients who reconnect later can retrieve the latest drawing state from the cache or database.
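
None of the server code appears in the gist, but a minimal sketch of steps 1, 3, and 8 helps make the flow concrete. It assumes Node's `ws` package, and the event name and payload shape are invented for illustration:

```ts
// Minimal WebSocket acknowledge-and-broadcast sketch (assumes the `ws`
// npm package; event names and payload shape are illustrative).
import { WebSocketServer, WebSocket } from "ws";

interface StrokeEvent {
  type: "stroke";
  boardId: string;
  userId: string;
  points: { x: number; y: number }[];
  timestamp: number;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", (raw) => {
    const event: StrokeEvent = JSON.parse(raw.toString());

    // Acknowledge receipt so the client knows its action landed.
    socket.send(JSON.stringify({ type: "ack", timestamp: event.timestamp }));

    // Broadcast to every *other* connected client in real time.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(JSON.stringify(event));
      }
    }

    // The event would also be handed to the State Manager
    // (cache + persistence), sketched separately below.
  });
});
```

A real server would also authenticate the socket and validate payloads; the sketch shows only the acknowledge-and-fan-out path.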

    ### **Summary of Data Flow in the System**
    - **Real-time Interactions**: User actions are captured locally, sent to the server via WebSockets, processed by the WebSocket server, and distributed to all connected clients.
    - **Eventual Consistency**: The State Manager ensures data is written to both the cache (Redis) for fast access and the database (MongoDB) for persistence. The system prioritizes availability and responsiveness, with data consistency ensured over time.
    - **Fault Tolerance and Coordination**: ZooKeeper ensures the system remains operational during network failures, and the system can scale horizontally by adding more WebSocket servers or state managers.

Overall, the design strikes a balance between a **low-latency user experience** and the need for **eventual consistency**; the sketch below the diagram illustrates the resulting write path through the State Manager.

    <img width="919" alt="Screenshot 2024-11-08 at 06 24 27" src="https://gist.github.com/user-attachments/assets/be96f4ee-3be7-41cf-aecc-099778fc1e0a">
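
To make the cache-then-persist ordering concrete, here is a hedged sketch of the State Manager's write path. It assumes the `ioredis` and `mongodb` npm packages; the key names, database names, and document shape are all invented for the example:

```ts
// State Manager write-path sketch (assumes `ioredis` and `mongodb`;
// key names, collection names, and payload shape are illustrative).
import Redis from "ioredis";
import { MongoClient } from "mongodb";

const redis = new Redis(); // localhost:6379 by default
const mongo = new MongoClient("mongodb://localhost:27017");
await mongo.connect(); // top-level await assumes an ESM module

async function recordStroke(boardId: string, event: object): Promise<void> {
  const serialized = JSON.stringify(event);

  // Fast path: append to a capped per-board list in Redis so that
  // newly joining clients can replay recent strokes with low latency.
  await redis
    .multi()
    .rpush(`board:${boardId}:events`, serialized)
    .ltrim(`board:${boardId}:events`, -1000, -1) // keep the last 1000
    .exec();

  // Slow path: durable write to MongoDB, fired asynchronously so the
  // caller (and the user) never waits on disk I/O. Errors are logged
  // rather than surfaced -- eventual consistency, not strict.
  mongo
    .db("whiteboard")
    .collection("events")
    .insertOne({ boardId, event, ts: Date.now() })
    .catch((err) => console.error("durable write failed", err));
}
```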



### Applying the CAP Theorem to the Distributed Whiteboard System

    Given the architecture, we can analyze how each aspect of the CAP theorem fits:

    #### 1. **Partition Tolerance (P)**
    - In a **highly distributed system**, where WebSocket servers, state managers, and data stores are spread across multiple nodes, **network partitions are inevitable** (e.g., server crashes or network delays).
    - **Partition Tolerance is a must-have** for this system since users expect the board to function even if some parts of the system become temporarily unreachable.
- By design, we've incorporated tools like **ZooKeeper** for coordination and leader election, ensuring that the system can recover from partitions gracefully (the election recipe is sketched below).
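
The standard ZooKeeper election recipe (ephemeral sequential znodes, watch your predecessor) is worth spelling out. The sketch below is written against a deliberately simplified, hypothetical client interface rather than a real library's API, so the shape of the recipe is the point:

```ts
// Leader-election recipe sketch. `ZkClient` is a hypothetical,
// simplified stand-in for a real ZooKeeper client library.
interface ZkClient {
  // Creates an ephemeral sequential znode and returns its full path.
  createEphemeralSequential(prefix: string): Promise<string>;
  // Lists the children of a znode.
  getChildren(path: string): Promise<string[]>;
  // Watches a znode; invokes the callback once when it is deleted.
  watchDelete(path: string, onDelete: () => void): Promise<void>;
}

async function runForLeadership(zk: ZkClient, onElected: () => void) {
  // Each candidate creates an ephemeral sequential node; ZooKeeper
  // appends a monotonically increasing suffix (e.g. ...-0000000042).
  const me = await zk.createEphemeralSequential("/election/candidate-");

  const tryBecomeLeader = async () => {
    const children = (await zk.getChildren("/election")).sort();
    const myName = me.split("/").pop()!;

    if (children[0] === myName) {
      onElected(); // Lowest sequence number wins.
      return;
    }

    // Otherwise watch only the next-lower node. Watching just one
    // predecessor (instead of the leader) avoids a thundering herd
    // when the leader dies. A real client would also re-check in
    // case the predecessor vanished between the two calls.
    const predecessor = children[children.indexOf(myName) - 1];
    await zk.watchDelete(`/election/${predecessor}`, tryBecomeLeader);
  };

  await tryBecomeLeader();
}
```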

    Given that Partition Tolerance is non-negotiable, we're left with choosing between **Consistency** and **Availability**.

    #### 2. **Consistency vs. Availability Trade-off**

    **Consistency (C)**:
    - Ensuring that all users see the same, up-to-date whiteboard content is important, especially in this collaborative application.
    - However, enforcing strict consistency in real-time applications can lead to delays (latency) because updates may need to propagate across all nodes before being visible to users.
    - The **State Managers** and the **Redis cache** help synchronize updates, but due to the distributed nature, strict consistency would be challenging without impacting responsiveness.

    **Availability (A)**:
- Users expect the app to be highly **responsive and available** at all times, especially since it is a real-time collaboration tool.
    - Prioritizing availability means that during a network partition, the system should still accept drawing inputs from clients, even if those updates cannot be immediately synchronized with all nodes.
- By using **local caches (e.g., Redis)** and **eventual consistency** strategies (like writing updates to MongoDB asynchronously), we ensure the system remains available even if some nodes are temporarily unreachable (see the client-side buffering sketch after this list).
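
A concrete way to realize "keep accepting input during a partition" is a client-side outbox: strokes queue locally while the socket is down and flush on reconnect. A minimal browser-side sketch (class and field names are illustrative):

```ts
// Client-side outbox sketch: buffer strokes while disconnected,
// flush on reconnect. Uses the browser WebSocket API.
class StrokeOutbox {
  private pending: string[] = [];
  private socket: WebSocket | null = null;

  connect(url: string): void {
    this.socket = new WebSocket(url);
    this.socket.onopen = () => this.flush();
    // Simple fixed-delay reconnect; production code would back off.
    this.socket.onclose = () => setTimeout(() => this.connect(url), 1000);
  }

  send(event: object): void {
    const msg = JSON.stringify(event);
    if (this.socket && this.socket.readyState === WebSocket.OPEN) {
      this.socket.send(msg);
    } else {
      this.pending.push(msg); // Partition: stay available, sync later.
    }
  }

  private flush(): void {
    while (this.pending.length > 0) {
      this.socket!.send(this.pending.shift()!);
    }
  }
}
```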

    #### **Which Trade-off Does this System Choose?**

    Based on the design:
- The design leans towards **Availability (A)** and **Partition Tolerance (P)**, sacrificing **Consistency (C)** to some extent.
    - **Eventual consistency** is a common pattern in real-time collaborative applications:
    - For example, if one user is drawing while a network partition occurs, other users may see a slight delay before the updates are reflected once the partition resolves.
    - Using **caches (Redis)** and **asynchronous writes (MongoDB)** helps keep the system responsive while ensuring that all data is eventually consistent.

    ### Strategies to Achieve Balance
    To better balance availability and consistency:
- Adopt **Conflict-free Replicated Data Types (CRDTs)** or **Operational Transformation (OT)**, which let users continue working offline and then merge changes without conflicts (a minimal CRDT sketch follows this list).
- Apply **multi-version concurrency control (MVCC)** in the state management layer, so each update carries a version that can be reconciled later.
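
Because strokes are only ever added (deletions can be modeled separately as tombstones), the whiteboard maps naturally onto one of the simplest CRDTs: a grow-only set keyed by unique stroke IDs, whose merge is set union. A minimal sketch of that idea, with invented types rather than a real CRDT library:

```ts
// Grow-only set (G-Set) CRDT sketch for whiteboard strokes.
// Merge is set union, which is commutative, associative, and
// idempotent -- replicas converge regardless of delivery order.
interface Stroke {
  id: string; // globally unique, e.g. `${userId}:${counter}`
  points: { x: number; y: number }[];
}

class StrokeSet {
  private strokes = new Map<string, Stroke>();

  add(stroke: Stroke): void {
    this.strokes.set(stroke.id, stroke);
  }

  // Merging another replica's state can never conflict: the same id
  // always denotes the same immutable stroke, so union suffices.
  merge(other: StrokeSet): void {
    for (const [id, stroke] of other.strokes) {
      this.strokes.set(id, stroke);
    }
  }

  all(): Stroke[] {
    return [...this.strokes.values()];
  }
}
```

Supporting erasure would call for a second tombstone set (a 2P-Set) or per-stroke versions, which is where the MVCC idea above comes back in.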

    ### Summary

    | CAP Property | Impact on System |
    |----------------------|-------------------------------------------------------|
    | **Consistency (C)** | Partially sacrificed for responsiveness and availability. Uses eventual consistency strategies to ensure data synchronization after partitions resolve. |
    | **Availability (A)** | Prioritized to ensure users can keep drawing even during network issues. Handles updates locally and syncs later. |
    | **Partition Tolerance (P)** | Required due to the distributed nature of the system. Uses ZooKeeper for fault detection and leader election. |