@manueldeprada
Created September 2, 2025 15:34
    # Cached CI for HF proposal: Record‑and‑Bake HTTP Cache

    ---

    ## 0) Problem statement (today’s flakes)

We’re seeing repeated CI failures in fresh containers whenever tests make live HTTP calls. Today alone there were roughly five rerun failures on:

    * `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_expected_patches`
    * `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_call_vqa`

    Typical errors:

    * `HTTPError('429 Client Error: Too Many Requests for url: https://huggingface.co/ybelkada/fonts/resolve/main/Arial.TTF')`
    * `PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x...>` when doing `Image.open(requests.get(..., stream=True).raw)` for
    `https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/australia.jpg`.

    We want fully deterministic, **offline** CI runs without touching library/test code.

    ---

    ## 1) Proposed approach (high-level)

    Adopt a **record → bake → replay** workflow:

    1. **Record HTTP requests (one-off or on demand)**: execute the tests once with network access, while transparently recording all HTTP traffic to a local cache.
    2. **Bake caches into the Docker image**: copy the recorded caches into the image as a layer.
    3. **Replay offline in CI**: tests read exclusively from the baked caches. No network calls, no flakes.

    This covers both kinds of traffic we have:

* **Hugging Face Hub traffic** (fonts, model/dataset files): relies on the Hub’s own on-disk cache directories.
    * **Arbitrary `requests.get(...)` traffic** (e.g. direct image URLs): captured and replayed by a global HTTP cache installed via `sitecustomize.py`.

    ---

    ## 2) Key pieces & exactly what we use

    We use **both** of the following because our tests rely on Hub downloads **and** direct `requests.get(...)` calls:

    ### 2.1 `sitecustomize.py` + `requests-cache`

    * Python auto-imports `sitecustomize` at startup if present on the path.
    * We install a global `requests-cache` that transparently caches and replays **any** `requests` traffic (e.g., direct image URLs used by PIL).
    * We also force body materialization so `stream=True` responses get fully cached, avoiding truncated content errors.
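The auto-import mechanism is easy to verify in isolation: any directory on the module search path containing `sitecustomize.py` gets imported at interpreter startup. A minimal sketch (the temp directory and the printed marker are purely illustrative):

```python
import os
import subprocess
import sys
import tempfile

# Drop a trivial sitecustomize.py into a temp dir, put that dir on
# PYTHONPATH, and start a fresh interpreter: the file runs automatically.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "sitecustomize.py"), "w") as f:
        f.write('print("sitecustomize loaded")\n')
    out = subprocess.run(
        [sys.executable, "-c", "pass"],
        env={**os.environ, "PYTHONPATH": d},
        capture_output=True, text=True,
    )
    print(out.stdout.strip())  # → sitecustomize loaded
```

This is the same hook the baked image exploits by dropping the recorder into `site-packages/`.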

    ### 2.2 Hugging Face Hub on-disk caches

    * We set cache env vars so Hub blobs (fonts, small images, model shards) land in known directories we can bake into the image:

    * `HF_HOME=/opt/hf_cache`
    * `HF_HUB_CACHE=/opt/hf_cache/hub`
    * `TRANSFORMERS_CACHE=/opt/hf_cache/transformers`
    * `HF_DATASETS_CACHE=/opt/hf_cache/datasets`

    ---

    ## 3) Implementation sketch

    ### 3.1 `sitecustomize.py`

    Place this file into the image at `site-packages/sitecustomize.py` so it auto-loads:

```python
# sitecustomize.py
import atexit, json, os, time

try:
    import requests
    import requests_cache
except Exception:
    requests = None
    requests_cache = None

if requests and requests_cache:
    cache_dir = os.environ.get("HTTP_CACHE_DIR", "/opt/http_cache")
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, "requests_cache")
    log_path = os.path.join(cache_dir, "http_log.ndjson")

    # Never expire; regenerate the cache explicitly when needed
    requests_cache.install_cache(cache_path, backend="sqlite", expire_after=None)

    def _log_and_materialize(r):
        # Ensure streamed responses are fully cached; log basic metadata
        try:
            _ = r.content
        except Exception:
            pass
        try:
            with open(log_path, "a") as f:
                f.write(json.dumps({
                    "ts": time.time(),
                    "method": getattr(r.request, "method", None),
                    "url": r.url,
                    "status": getattr(r, "status_code", None),
                    "from_cache": getattr(r, "from_cache", False),
                }) + "\n")
        except Exception:
            pass

    # install_cache() swaps requests.Session for a CachedSession subclass,
    # and requests.get() creates a fresh session per call, so hooks attached
    # to any single session would not apply globally. Wrap Session.send
    # instead, so every response (cached or live) passes through the logger.
    _orig_send = requests.Session.send

    def _send(self, request, **kwargs):
        response = _orig_send(self, request, **kwargs)
        _log_and_materialize(response)
        return response

    requests.Session.send = _send

    @atexit.register
    def _print_log_loc():
        print(f"[http-recorder] Log: {log_path} Cache: {cache_path}.sqlite")
```
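Since the recorder appends one JSON object per response, the NDJSON log can be summarized offline to audit what was recorded and what replayed from cache. A minimal sketch (the helper name is ours, not part of the recorder):

```python
import json
from collections import Counter

def summarize_http_log(path):
    """Tally requests per URL and how many were served from the cache."""
    hits, cached = Counter(), Counter()
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            hits[entry["url"]] += 1
            if entry.get("from_cache"):
                cached[entry["url"]] += 1
    return hits, cached
```

A URL with `hits > 0` but `cached == 0` on a replay run is one that slipped past the cache and should be re-recorded.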

    ### 3.2 Generate the caches (record once, online)

    Run the test suite **once** with network access to populate both the Hub caches and the `requests-cache` DB. Since slow tests are skipped by default, this will pull only small assets.

    ```bash
    # Choose stable cache locations
    export HF_HOME=/opt/hf_cache
    export HF_HUB_CACHE=/opt/hf_cache/hub
    export TRANSFORMERS_CACHE=/opt/hf_cache/transformers
    export HF_DATASETS_CACHE=/opt/hf_cache/datasets
    export HTTP_CACHE_DIR=/opt/http_cache

    # Optional: faster first pull
    export HF_HUB_ENABLE_HF_TRANSFER=1

    # Run tests to record everything needed
    pytest -q

    # Package caches as build artifacts
    tar -C /opt -czf hf_cache.tgz hf_cache
    tar -C /opt -czf http_cache.tgz http_cache
    ```
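Before baking, it is worth confirming the artifacts actually contain the expected top-level directories (an empty or mis-rooted tarball would silently produce a cacheless image). A small stdlib check, assumed to run next to the tarballs:

```python
import tarfile

def tar_top_level(path):
    """Return the set of top-level entries inside a cache tarball."""
    with tarfile.open(path) as tf:
        return {m.name.split("/", 1)[0] for m in tf.getmembers()}

# Expected: {"hf_cache"} for hf_cache.tgz, {"http_cache"} for http_cache.tgz,
# matching the `tar -C /opt ...` invocations above.
```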

    ### 3.3 Bake caches into the Docker image

    Add the recorded caches as layers and include `sitecustomize.py` so runtime replays from cache automatically.

    ```dockerfile
    FROM python:3.11-slim

    # Cache locations (must match recording step)
    ENV HF_HOME=/opt/hf_cache \
    HF_HUB_CACHE=/opt/hf_cache/hub \
    TRANSFORMERS_CACHE=/opt/hf_cache/transformers \
    HF_DATASETS_CACHE=/opt/hf_cache/datasets \
    HTTP_CACHE_DIR=/opt/http_cache

    # Minimal deps used by tests
    RUN pip install --no-cache-dir requests requests-cache huggingface_hub pillow

    # Auto-loads and patches requests globally
    COPY sitecustomize.py /usr/local/lib/python3.11/site-packages/sitecustomize.py

    # Bring in recorded caches (created in 3.2)
    ADD hf_cache.tgz /opt/
    ADD http_cache.tgz /opt/

    # Ensure readability for non-root users in CI
    RUN chmod -R a+rX /opt/hf_cache /opt/http_cache

    WORKDIR /workspace
    # COPY your repo here in the real Dockerfile
    ```

    ### 3.4 CI usage (stay online, rely on cache)

No special flags are required. CI can stay online as a fallback for unrecorded URLs; at runtime, lookups hit the baked caches first:

    * **Hub blobs from `HF_*_CACHE`** for models/datasets/fonts/images.
    * **`requests-cache`** for any direct `requests.get(...)` (e.g., the `australia.jpg` URL), served from the baked SQLite.

    If a new URL appears in tests, it will be fetched live during CI; consider periodically re-running step **3.2** to refresh caches.
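To confirm the baked SQLite actually contains recorded responses rather than an empty database, a schema-agnostic row count is enough (this avoids depending on requests-cache internals). A sketch, assuming the default `.sqlite` suffix that requests-cache appends to the cache name:

```python
import sqlite3

def cache_row_counts(db_path):
    """Row count per table in the requests-cache SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        tables = [name for (name,) in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        return {t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        con.close()

# e.g. cache_row_counts("/opt/http_cache/requests_cache.sqlite")
# An all-zero result means the recording step captured nothing.
```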

    ---

    ## 4) Notes

    * This design eliminates the observed flakes by ensuring both Hub assets and direct HTTP resources are already cached. Remaining network calls are rare metadata checks and new URLs.
    * Because slow tests are disabled by default, the baked caches should remain small (fonts, tiny images, small model shards).
    * If future flakes reappear due to new URLs, just regenerate caches (3.2) and rebuild the image.