Created
September 2, 2025 15:34
manueldeprada created this gist
Sep 2, 2025
# Cached CI for HF proposal: Record-and-Bake HTTP Cache

---

## 0) Problem statement (today’s flakes)

We’re seeing repeated CI failures in a fresh container when tests make live HTTP calls. Example from today: about five reruns failed on:

* `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_expected_patches`
* `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_call_vqa`

Typical errors:

* `HTTPError('429 Client Error: Too Many Requests for url: https://huggingface.co/ybelkada/fonts/resolve/main/Arial.TTF')`
* `PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x...>` when doing `Image.open(requests.get(..., stream=True).raw)` for `https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/australia.jpg`.

We want fully deterministic, **offline** CI runs without touching library/test code.

---

## 1) Proposed approach (high-level)

Adopt a **record → bake → replay** workflow:

1. **Record HTTP requests (one-off or on demand)**: execute the tests once with network access, while transparently recording all HTTP traffic to a local cache.
2. **Bake caches into the Docker image**: copy the recorded caches into the image as a layer.
3. **Replay offline in CI**: tests read exclusively from the baked caches. No network calls, no flakes.

This covers both kinds of traffic we have:

* **Hugging Face Hub traffic** (fonts, model/dataset files): relies on the Hub’s own on-disk cache directories.
* **Arbitrary `requests.get(...)` traffic** (e.g. direct image URLs): captured and replayed by a global HTTP cache installed via `sitecustomize.py`.

---

## 2) Key pieces & exactly what we use

We use **both** of the following because our tests rely on Hub downloads **and** direct `requests.get(...)` calls:

### 2.1 `sitecustomize.py` + `requests-cache`

* Python auto-imports `sitecustomize` at startup if present on the path.
* We install a global `requests-cache` that transparently caches and replays **any** `requests` traffic (e.g., direct image URLs used by PIL).
* We also force body materialization so `stream=True` responses get fully cached, avoiding truncated content errors.

### 2.2 Hugging Face Hub on-disk caches

* We set cache env vars so Hub blobs (fonts, small images, model shards) land in known directories we can bake into the image:
  * `HF_HOME=/opt/hf_cache`
  * `HF_HUB_CACHE=/opt/hf_cache/hub`
  * `TRANSFORMERS_CACHE=/opt/hf_cache/transformers`
  * `HF_DATASETS_CACHE=/opt/hf_cache/datasets`

---

## 3) Implementation sketch

### 3.1 `sitecustomize.py`

Place this file into the image at `site-packages/sitecustomize.py` so it auto-loads:

```python
# sitecustomize.py
import atexit
import json
import os
import time

try:
    import requests
    import requests_cache
except Exception:
    requests = None
    requests_cache = None

if requests and requests_cache:
    cache_dir = os.environ.get("HTTP_CACHE_DIR", "/opt/http_cache")
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, "requests_cache")
    log_path = os.path.join(cache_dir, "http_log.ndjson")

    # Never expire; regenerate the cache explicitly when needed
    requests_cache.install_cache(cache_path, backend="sqlite", expire_after=None)

    def _log_and_materialize(r, *_, **__):
        # Ensure streamed responses are fully cached; log basic metadata
        try:
            _ = r.content
        except Exception:
            pass
        try:
            with open(log_path, "a") as f:
                # One JSON object per line (NDJSON)
                f.write(json.dumps({
                    "ts": time.time(),
                    "method": getattr(r.request, "method", None),
                    "url": r.url,
                    "status": getattr(r, "status_code", None),
                    "from_cache": getattr(r, "from_cache", False),
                }) + "\n")
        except Exception:
            pass

    # Best effort: if this lookup fails (for example, because the installed
    # requests-cache version has no such helper), the hook is silently
    # skipped and only the logging/materialization step is lost.
    try:
        sess = requests_cache.get_session()
        sess.hooks.setdefault("response", []).append(_log_and_materialize)
    except Exception:
        pass

    @atexit.register
    def _print_log_loc():
        print(f"[http-recorder] Log: {log_path} Cache: {cache_path}.sqlite")
```

### 3.2 Generate the caches (record once, online)

Run the test suite **once** with network access to populate both the Hub caches and the `requests-cache` DB. Since slow tests are skipped by default, this will pull only small assets.

```bash
# Choose stable cache locations
export HF_HOME=/opt/hf_cache
export HF_HUB_CACHE=/opt/hf_cache/hub
export TRANSFORMERS_CACHE=/opt/hf_cache/transformers
export HF_DATASETS_CACHE=/opt/hf_cache/datasets
export HTTP_CACHE_DIR=/opt/http_cache

# Optional: faster first pull (requires the hf_transfer package to be installed)
export HF_HUB_ENABLE_HF_TRANSFER=1

# Run tests to record everything needed
pytest -q

# Package caches as build artifacts
tar -C /opt -czf hf_cache.tgz hf_cache
tar -C /opt -czf http_cache.tgz http_cache
```

### 3.3 Bake caches into the Docker image

Add the recorded caches as layers and include `sitecustomize.py` so the runtime replays from cache automatically.

```dockerfile
FROM python:3.11-slim

# Cache locations (must match recording step)
ENV HF_HOME=/opt/hf_cache \
    HF_HUB_CACHE=/opt/hf_cache/hub \
    TRANSFORMERS_CACHE=/opt/hf_cache/transformers \
    HF_DATASETS_CACHE=/opt/hf_cache/datasets \
    HTTP_CACHE_DIR=/opt/http_cache

# Minimal deps used by tests
RUN pip install --no-cache-dir requests requests-cache huggingface_hub pillow

# Auto-loads and patches requests globally
COPY sitecustomize.py /usr/local/lib/python3.11/site-packages/sitecustomize.py

# Bring in recorded caches (created in 3.2); ADD unpacks local tar archives
ADD hf_cache.tgz /opt/
ADD http_cache.tgz /opt/

# Ensure readability for non-root users in CI
RUN chmod -R a+rX /opt/hf_cache /opt/http_cache

WORKDIR /workspace
# COPY your repo here in the real Dockerfile
```

### 3.4 CI usage (stay online, rely on cache)

No special flags are required. Keep CI online; the runtime will prefer:

* **Hub blobs from `HF_*_CACHE`** for models/datasets/fonts/images.
* **`requests-cache`** for any direct `requests.get(...)` (e.g., the `australia.jpg` URL), served from the baked SQLite.

If a new URL appears in tests, it will be fetched live during CI; consider periodically re-running step **3.2** to refresh caches.

---

## 4) Notes

* This design eliminates the observed flakes by ensuring both Hub assets and direct HTTP resources are already cached. The only remaining network calls are rare metadata checks and fetches of new URLs.
* Because slow tests are disabled by default, the baked caches should remain small (fonts, tiny images, small model shards).
* If flakes reappear because of new URLs, just regenerate the caches (3.2) and rebuild the image.
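
To decide when the caches need regenerating, one option is to inspect the `http_log.ndjson` file that `sitecustomize.py` writes. The sketch below is not part of the proposal itself: `summarize_http_log` is a hypothetical helper name, and it assumes the log is writable during the run and that each line is a JSON object with the `url` and `from_cache` fields shown in 3.1. Entries without a truthy `from_cache` are counted as live fetches.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_http_log(log_path):
    """Summarize an NDJSON HTTP log (hypothetical helper, not from the proposal).

    Returns (hits, misses): hits is the number of entries served from cache,
    misses is a Counter mapping each live-fetched URL to its fetch count.
    """
    hits = 0
    misses = Counter()
    for line in Path(log_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that were only partially written
            continue
        if entry.get("from_cache"):
            hits += 1
        else:
            misses[entry.get("url")] += 1
    return hits, misses
```

For example, calling `summarize_http_log("/opt/http_cache/http_log.ndjson")` after a CI run and finding a non-empty `misses` counter would suggest that new URLs have appeared and that step 3.2 is due for a rerun.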