    # Valet Vision Architecture Overview

    Valet Vision is a flexible, open-source automation platform built on the Raspberry Pi. It enables precise control and observation of mobile devices through a combination of computer vision, hardware control, and network-accessible APIs.

    ---

    ## 📐 Architecture

    At its core, **Valet Vision** runs an HTTP server on a Raspberry Pi. This server exposes a simple JSON-over-HTTP API to:

    - Access the **camera feed**
    - Simulate **virtual mouse**, **keyboard**, and **stylus** inputs
    - Capture **screenshots** and stream **live MJPEG video**

    All requests can be made locally or over the network. The API is platform-agnostic, allowing automation scripts to run:

    - **Locally on the Valet** itself
    - On another machine within the same network

    🗺️ Architecture diagram: [valetnet.dev/overview](https://valetnet.dev/overview/)
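
For example, fetching a screenshot from another machine on the network takes a single HTTP request. The sketch below uses Python's `requests` library; the host address and the `/api/screenshot` path are illustrative assumptions (only the tap endpoint shown later in this document is confirmed).

```python
import requests

# Hypothetical address of the Valet Vision HTTP server on the LAN.
HOST = "http://valet.local:8080"

# Fetch the current screen as a JPEG. The endpoint path is an
# assumption for illustration; check the server for the real route.
resp = requests.get(f"{HOST}/api/screenshot",
                    headers={"Accept": "image/jpeg"}, timeout=10)
resp.raise_for_status()

with open("screen.jpg", "wb") as f:
    f.write(resp.content)
```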

    ---

    ## 🔌 USB Gadget Protocol

    Valet Vision uses Linux’s **USB Gadget** protocol to emulate USB peripherals to a connected mobile device. This includes:

    - Virtual **touch stylus**
    - USB **keyboard** and **mouse**
- Optional **Ethernet gadget** mode to share the Pi's network connection with the mobile device, or to deliberately restrict the device's network access

    This allows both input simulation and network configuration of the attached phone or tablet, all via a single USB connection.
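
Under the hood, Linux builds composite gadgets through configfs. The sketch below shows how a virtual HID keyboard is typically assembled on a Pi; the configfs paths follow the standard kernel gadget layout, while the gadget name, vendor/product IDs, and descriptor choice are illustrative assumptions, not Valet Vision's actual configuration.

```python
from pathlib import Path

# Standard USB boot-keyboard HID report descriptor.
KBD_REPORT_DESC = bytes([
    0x05, 0x01, 0x09, 0x06, 0xA1, 0x01, 0x05, 0x07, 0x19, 0xE0,
    0x29, 0xE7, 0x15, 0x00, 0x25, 0x01, 0x75, 0x01, 0x95, 0x08,
    0x81, 0x02, 0x95, 0x01, 0x75, 0x08, 0x81, 0x03, 0x95, 0x05,
    0x75, 0x01, 0x05, 0x08, 0x19, 0x01, 0x29, 0x05, 0x91, 0x02,
    0x95, 0x01, 0x75, 0x03, 0x91, 0x03, 0x95, 0x06, 0x75, 0x08,
    0x15, 0x00, 0x25, 0x65, 0x05, 0x07, 0x19, 0x00, 0x29, 0x65,
    0x81, 0x00, 0xC0,
])

# Requires root and `modprobe libcomposite`; the gadget name is an assumption.
g = Path("/sys/kernel/config/usb_gadget/valet")
g.mkdir(parents=True, exist_ok=True)
(g / "idVendor").write_text("0x1d6b")   # placeholder vendor/product IDs
(g / "idProduct").write_text("0x0104")

# One HID keyboard function inside one configuration.
fn = g / "functions/hid.usb0"
fn.mkdir(parents=True, exist_ok=True)
(fn / "protocol").write_text("1")        # 1 = keyboard
(fn / "subclass").write_text("1")        # boot interface subclass
(fn / "report_length").write_text("8")
(fn / "report_desc").write_bytes(KBD_REPORT_DESC)

cfg = g / "configs/c.1"
cfg.mkdir(parents=True, exist_ok=True)
(cfg / "hid.usb0").symlink_to(fn)

# Bind to the Pi's USB device controller so the phone enumerates the gadget.
udc = next(Path("/sys/class/udc").iterdir()).name
(g / "UDC").write_text(udc)
```

Once bound, keystrokes are injected by writing 8-byte HID reports to the `/dev/hidg0` device that the kernel creates for the function.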

    ---

    ## 👁️ Vision + AI Capabilities

    - Screenshots can be fetched as `image/jpeg` or `image/png`
    - A **live video stream** is available as an MJPEG feed over HTTP

    Valet Vision includes:

    - **OpenCV** for computer vision and object detection
    - **Tesseract OCR** for text recognition

    These tools allow automation scripts to detect and act on visual UI elements in screenshots.
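
As a concrete illustration, the sketch below grabs one frame from the MJPEG feed with OpenCV, locates a UI element by template matching, and reads on-screen text with Tesseract. The host address, stream path, template filename, and match threshold are assumptions for the example.

```python
import cv2
import pytesseract

HOST = "http://valet.local:8080"  # hypothetical address

# OpenCV can open MJPEG-over-HTTP streams directly; the path is an assumption.
cap = cv2.VideoCapture(f"{HOST}/api/video.mjpeg")
ok, frame = cap.read()
cap.release()
assert ok, "could not read a frame from the MJPEG stream"

# Locate a UI element (a button screenshot saved earlier) by template matching.
template = cv2.imread("button.png")
result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
_, score, _, top_left = cv2.minMaxLoc(result)
if score > 0.8:
    h, w = template.shape[:2]
    center = (top_left[0] + w // 2, top_left[1] + h // 2)
    print(f"button found at {center} (score {score:.2f})")

# OCR the whole frame; Tesseract expects RGB, OpenCV delivers BGR.
text = pytesseract.image_to_string(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
print(text)
```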

    For more demanding AI workloads, you have options:
- Add the [Raspberry Pi AI Kit (Hailo-8L neural accelerator)](https://www.raspberrypi.com/products/ai-kit/)
    - Run ML inference remotely on a more powerful machine
    - Use hosted services to offload heavy image processing

    Importantly, **Valet Vision operates fully offline by default**—it does not require or depend on any cloud services.

    ---

    ## 📲 Push Button Module (PBM)

    For full-device automation, Valet Vision can control hardware side buttons (e.g., power, volume up/down) via an optional **Push Button Module** (PBM):

    - PBM actuators are **digitally controlled servos**
    - Connected via **Dynamixel Protocol 2.0** over serial from the Pi
    - Placement is flexible—PBM arms can be positioned on either side of the device

    Support for PBM control via HTTP API is on the roadmap.

    🔗 More on the Dynamixel Protocol: [emanual.robotis.com](https://emanual.robotis.com/docs/en/dxl/protocol2/)
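
For reference, a button press driven over Protocol 2.0 might look like the sketch below, using ROBOTIS's official `dynamixel_sdk` Python package. The serial port, servo ID, baud rate, goal positions, and X-series control-table addresses are all assumptions; the PBM's actual servo model and wiring may differ.

```python
import time
from dynamixel_sdk import PortHandler, PacketHandler  # pip install dynamixel-sdk

# Assumed wiring and servo settings; adjust to the actual PBM hardware.
PORT, BAUD, DXL_ID = "/dev/ttyUSB0", 57600, 1
ADDR_TORQUE_ENABLE, ADDR_GOAL_POSITION = 64, 116  # X-series control table

port = PortHandler(PORT)
packet = PacketHandler(2.0)  # Dynamixel Protocol 2.0
assert port.openPort() and port.setBaudRate(BAUD)

# Enable torque, swing the arm to press the side button, then release.
packet.write1ByteTxRx(port, DXL_ID, ADDR_TORQUE_ENABLE, 1)
packet.write4ByteTxRx(port, DXL_ID, ADDR_GOAL_POSITION, 2200)  # press (example)
time.sleep(0.5)
packet.write4ByteTxRx(port, DXL_ID, ADDR_GOAL_POSITION, 2048)  # release (center)

port.closePort()
```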

    ---

    ## 🧪 Developer Tools & Open Source

    The software that powers Valet Vision is fully open source:

    - Core Server: [checkbox-server on GitHub](https://github.com/tapsterbot/checkbox-server)
    - Python Client: [checkbox-client-python](https://github.com/tapsterbot/checkbox-client-python)

    ### Example: Simulate a Tap
```bash
curl -X POST "$HOST/api/touch/tap" \
  -H "Content-Type: application/json" \
  -d '{"x": 0, "y": 0}'
```
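
The same call from Python, using plain `requests` rather than the checkbox Python client (whose exact interface isn't shown here); the host address is an assumption:

```python
import requests

HOST = "http://valet.local:8080"  # hypothetical address of the Pi

# Same payload as the curl example above.
resp = requests.post(f"{HOST}/api/touch/tap", json={"x": 0, "y": 0}, timeout=5)
resp.raise_for_status()
```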

    ---

    ## 🧩 Modular, Local-First Automation

    Valet Vision is designed to be adaptable:

    - Fully autonomous (all logic and inference on-device)
    - Or controlled remotely from any machine on the same network
    - No mandatory cloud infrastructure

    This makes it ideal for labs, local QA environments, and regulated industries where network control and data locality matter.