@manueldeprada
Created September 2, 2025 15:34
    # Cached CI for HF proposal: Record‑and‑Bake HTTP Cache

    ---

    ## 0) Problem statement (today’s flakes)

We’re seeing repeated CI failures in fresh containers whenever tests make live HTTP calls. Today alone there were roughly five rerun failures on:

    * `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_expected_patches`
    * `tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_call_vqa`

    Typical errors:

    * `HTTPError('429 Client Error: Too Many Requests for url: https://huggingface.co/ybelkada/fonts/resolve/main/Arial.TTF')`
    * `PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x...>` when doing `Image.open(requests.get(..., stream=True).raw)` for
    `https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/australia.jpg`.

    We want fully deterministic, **offline** CI runs without touching library/test code.

    ---

    ## 1) Proposed approach (high-level)

    Adopt a **record → bake → replay** workflow:

    1. **Record HTTP requests (one-off or on demand)**: execute the tests once with network access, while transparently recording all HTTP traffic to a local cache.
    2. **Bake caches into the Docker image**: copy the recorded caches into the image as a layer.
    3. **Replay offline in CI**: tests read exclusively from the baked caches. No network calls, no flakes.

    This covers both kinds of traffic we have:

* **Hugging Face Hub traffic** (fonts, model/dataset files): relies on the Hub’s own on-disk cache directories.
    * **Arbitrary `requests.get(...)` traffic** (e.g. direct image URLs): captured and replayed by a global HTTP cache installed via `sitecustomize.py`.

    ---

    ## 2) Key pieces & exactly what we use

    We use **both** of the following because our tests rely on Hub downloads **and** direct `requests.get(...)` calls:

    ### 2.1 `sitecustomize.py` + `requests-cache`

    * Python auto-imports `sitecustomize` at startup if present on the path.
    * We install a global `requests-cache` that transparently caches and replays **any** `requests` traffic (e.g., direct image URLs used by PIL).
    * We also force body materialization so `stream=True` responses get fully cached, avoiding truncated content errors.
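The auto-import mechanism is easy to verify in isolation: any directory on the module search path containing `sitecustomize.py` gets imported at interpreter startup. A minimal sketch (the temp directory and the printed marker are purely illustrative):

```python
import os
import subprocess
import sys
import tempfile

# Drop a trivial sitecustomize.py into a temp dir, put that dir on
# PYTHONPATH, and start a fresh interpreter: the file runs automatically.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "sitecustomize.py"), "w") as f:
        f.write('print("sitecustomize loaded")\n')
    out = subprocess.run(
        [sys.executable, "-c", "pass"],
        env={**os.environ, "PYTHONPATH": d},
        capture_output=True, text=True,
    )
    print(out.stdout.strip())  # → sitecustomize loaded
```

This is the same hook the baked image exploits by dropping the recorder into `site-packages/`.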

    ### 2.2 Hugging Face Hub on-disk caches

    * We set cache env vars so Hub blobs (fonts, small images, model shards) land in known directories we can bake into the image:

    * `HF_HOME=/opt/hf_cache`
    * `HF_HUB_CACHE=/opt/hf_cache/hub`
    * `TRANSFORMERS_CACHE=/opt/hf_cache/transformers`
    * `HF_DATASETS_CACHE=/opt/hf_cache/datasets`

    ---

    ## 3) Implementation sketch

    ### 3.1 `sitecustomize.py`

    Place this file into the image at `site-packages/sitecustomize.py` so it auto-loads:

```python
# sitecustomize.py
import atexit, json, os, time

try:
    import requests
    import requests_cache
except Exception:
    requests = None
    requests_cache = None

if requests and requests_cache:
    cache_dir = os.environ.get("HTTP_CACHE_DIR", "/opt/http_cache")
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, "requests_cache")
    log_path = os.path.join(cache_dir, "http_log.ndjson")

    # Never expire; regenerate the cache explicitly when needed
    requests_cache.install_cache(cache_path, backend="sqlite", expire_after=None)

    def _log_and_materialize(r):
        # Ensure streamed responses are fully cached; log basic metadata
        try:
            _ = r.content
        except Exception:
            pass
        try:
            with open(log_path, "a") as f:
                f.write(json.dumps({
                    "ts": time.time(),
                    "method": getattr(r.request, "method", None),
                    "url": r.url,
                    "status": getattr(r, "status_code", None),
                    "from_cache": getattr(r, "from_cache", False),
                }) + "\n")
        except Exception:
            pass

    # install_cache() swaps requests.Session for a CachedSession subclass,
    # and requests.get() creates a fresh session per call, so hooks attached
    # to any single session would not apply globally. Wrap Session.send
    # instead, so every response (cached or live) passes through the logger.
    _orig_send = requests.Session.send

    def _send(self, request, **kwargs):
        response = _orig_send(self, request, **kwargs)
        _log_and_materialize(response)
        return response

    requests.Session.send = _send

    @atexit.register
    def _print_log_loc():
        print(f"[http-recorder] Log: {log_path} Cache: {cache_path}.sqlite")
```
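Since the recorder appends one JSON object per response, the NDJSON log can be summarized offline to audit what was recorded and what replayed from cache. A minimal sketch (the helper name is ours, not part of the recorder):

```python
import json
from collections import Counter

def summarize_http_log(path):
    """Tally requests per URL and how many were served from the cache."""
    hits, cached = Counter(), Counter()
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            hits[entry["url"]] += 1
            if entry.get("from_cache"):
                cached[entry["url"]] += 1
    return hits, cached
```

A URL with `hits > 0` but `cached == 0` on a replay run is one that slipped past the cache and should be re-recorded.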

    ### 3.2 Generate the caches (record once, online)

    Run the test suite **once** with network access to populate both the Hub caches and the `requests-cache` DB. Since slow tests are skipped by default, this will pull only small assets.

    ```bash
    # Choose stable cache locations
    export HF_HOME=/opt/hf_cache
    export HF_HUB_CACHE=/opt/hf_cache/hub
    export TRANSFORMERS_CACHE=/opt/hf_cache/transformers
    export HF_DATASETS_CACHE=/opt/hf_cache/datasets
    export HTTP_CACHE_DIR=/opt/http_cache

    # Optional: faster first pull
    export HF_HUB_ENABLE_HF_TRANSFER=1

    # Run tests to record everything needed
    pytest -q

    # Package caches as build artifacts
    tar -C /opt -czf hf_cache.tgz hf_cache
    tar -C /opt -czf http_cache.tgz http_cache
    ```
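Before baking, it is worth confirming the artifacts actually contain the expected top-level directories (an empty or mis-rooted tarball would silently produce a cacheless image). A small stdlib check, assumed to run next to the tarballs:

```python
import tarfile

def tar_top_level(path):
    """Return the set of top-level entries inside a cache tarball."""
    with tarfile.open(path) as tf:
        return {m.name.split("/", 1)[0] for m in tf.getmembers()}

# Expected: {"hf_cache"} for hf_cache.tgz, {"http_cache"} for http_cache.tgz,
# matching the `tar -C /opt ...` invocations above.
```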

    ### 3.3 Bake caches into the Docker image

    Add the recorded caches as layers and include `sitecustomize.py` so runtime replays from cache automatically.

    ```dockerfile
    FROM python:3.11-slim

    # Cache locations (must match recording step)
    ENV HF_HOME=/opt/hf_cache \
    HF_HUB_CACHE=/opt/hf_cache/hub \
    TRANSFORMERS_CACHE=/opt/hf_cache/transformers \
    HF_DATASETS_CACHE=/opt/hf_cache/datasets \
    HTTP_CACHE_DIR=/opt/http_cache

    # Minimal deps used by tests
    RUN pip install --no-cache-dir requests requests-cache huggingface_hub pillow

    # Auto-loads and patches requests globally
    COPY sitecustomize.py /usr/local/lib/python3.11/site-packages/sitecustomize.py

    # Bring in recorded caches (created in 3.2)
    ADD hf_cache.tgz /opt/
    ADD http_cache.tgz /opt/

    # Ensure readability for non-root users in CI
    RUN chmod -R a+rX /opt/hf_cache /opt/http_cache

    WORKDIR /workspace
    # COPY your repo here in the real Dockerfile
    ```

    ### 3.4 CI usage (stay online, rely on cache)

No special flags are required. CI can stay online as a fallback for unrecorded URLs; at runtime, lookups hit the baked caches first:

    * **Hub blobs from `HF_*_CACHE`** for models/datasets/fonts/images.
    * **`requests-cache`** for any direct `requests.get(...)` (e.g., the `australia.jpg` URL), served from the baked SQLite.

    If a new URL appears in tests, it will be fetched live during CI; consider periodically re-running step **3.2** to refresh caches.
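To confirm the baked SQLite actually contains recorded responses rather than an empty database, a schema-agnostic row count is enough (this avoids depending on requests-cache internals). A sketch, assuming the default `.sqlite` suffix that requests-cache appends to the cache name:

```python
import sqlite3

def cache_row_counts(db_path):
    """Row count per table in the requests-cache SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        tables = [name for (name,) in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        return {t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        con.close()

# e.g. cache_row_counts("/opt/http_cache/requests_cache.sqlite")
# An all-zero result means the recording step captured nothing.
```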

    ---

    ## 4) Notes

    * This design eliminates the observed flakes by ensuring both Hub assets and direct HTTP resources are already cached. Remaining network calls are rare metadata checks and new URLs.
    * Because slow tests are disabled by default, the baked caches should remain small (fonts, tiny images, small model shards).
    * If future flakes reappear due to new URLs, just regenerate caches (3.2) and rebuild the image.