    # LiteDB PR #2649 — Cache Profiles & On-Demand Cleanup

    **Scope touched:** `Engine/Disk/MemoryCache.cs`, `Engine/Structures/PageBuffer.cs`, `LiteDB.Tests/Internals/Cache_Tests.cs`
    **Theme:** Make the in-memory page cache tunable (profiles) and keep its free list in check with an on-demand cleaner guarded by a semaphore and per-page eviction flags.

    ---

    ## Executive summary

    Promising direction: explicit cache “profiles” (Mobile/Desktop/Server) and a safer cleanup handshake (`TryBeginEvict`/`MarkReusable`) add real tuning and safety hooks. However, the current implementation wires cleanup into hot paths and repeatedly calls `ConcurrentQueue<T>.Count`, which is effectively O(n) on .NET. That risks CPU overhead and read-path tail-latency spikes. The profile defaults are conservative and may over-trim the free list, increasing LOH churn during bursts. The unit tests don’t verify behavior; they only check that nothing throws.

    I’d land the concept after tightening the hot-path costs, fixing cleanup triggering semantics, and adding behavior-level tests. Details (with references) below.

    ---

    ## What changed (high-level)

    * **Profiles & tuning knobs**

    * New enum and API: `MemoryCache.CacheProfile { Mobile, Desktop, Server }` and `ApplyProfile(profile)`, with per-instance thresholds `_maxFreePages`, `_idleBeforeEvict`, `_minCleanupInterval`, `_cleanupBatchSize`, plus an ops counter `_opsSinceLastCleanup` (a sketch of this surface follows this list).
    * Defaults: `DEFAULT_MAX_FREE_PAGES = 200`, `DEFAULT_IDLE = 60s`, `DEFAULT_INT = 30s`, `DEFAULT_BATCH = 128`, and `OPS_CLEANUP_STEP = 256`.

    * **Eviction handshake and timestamps**

    * `PageBuffer` gains an atomic eviction flag: `TryBeginEvict()` and `MarkReusable()`.
    * When a page is freed, its `Timestamp` is set to `UtcNow.Ticks` (used as “freed at” time while in the free list).

    * **On-demand cleanup**

    * `TryCleanupOnDemand(force: bool)` decides when to clean based on: min interval, `_free.Count` vs `_maxFreePages`, `_cleanupSignal`, and `_opsSinceLastCleanup`.
    * `Cleanup()` drains up to `_cleanupBatchSize` free pages that have been idle longer than `_idleBeforeEvict`, under `_cleanLock` (a non-reentrant `SemaphoreSlim`).

    * **Triggers**

    * The read hot path (`GetReadablePage`) increments `_opsSinceLastCleanup` and runs cleanup every 256 operations.
    * Free-list producers (`DiscardPage`, `Extend`) set `Timestamp`, `Enqueue`, and check `_free.Count` to potentially force cleanup.

    * **Tests**

    * New tests exercise `ApplyProfile` and simple read paths but don’t assert cache state invariants; they mostly ensure “no throw” / counters non-negative.
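
    A minimal sketch of that surface, inferred from the names and defaults above (only the defaults and the Server cap of 360 are stated in the diff; the Mobile value is a placeholder):

    ```csharp
    using System;

    internal class MemoryCache
    {
        public enum CacheProfile { Mobile, Desktop, Server }

        // defaults named in the diff
        private const int DEFAULT_MAX_FREE_PAGES = 200;
        private const int DEFAULT_BATCH = 128;
        private const int OPS_CLEANUP_STEP = 256;
        private static readonly TimeSpan DEFAULT_IDLE = TimeSpan.FromSeconds(60);
        private static readonly TimeSpan DEFAULT_INT = TimeSpan.FromSeconds(30);

        // per-instance thresholds
        private int _maxFreePages = DEFAULT_MAX_FREE_PAGES;
        private TimeSpan _idleBeforeEvict = DEFAULT_IDLE;
        private TimeSpan _minCleanupInterval = DEFAULT_INT;
        private int _cleanupBatchSize = DEFAULT_BATCH;
        private int _opsSinceLastCleanup;

        public void ApplyProfile(CacheProfile profile)
        {
            switch (profile)
            {
                case CacheProfile.Mobile:
                    _maxFreePages = 100; // placeholder: tighter than Desktop
                    break;
                case CacheProfile.Desktop:
                    _maxFreePages = DEFAULT_MAX_FREE_PAGES;
                    break;
                case CacheProfile.Server:
                    _maxFreePages = 360; // the cap the PR’s Server profile uses
                    break;
            }
        }
    }
    ```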

    ---

    ## The good

    1. **Profiles make behavior explicit and portable.**
    Different footprints (mobile/desktop/server) can be expressed without scattering “magic numbers” across the engine. This is real extensibility, and the numbers are at least defensible starting points.

    2. **Eviction flag = safer cleanup under concurrency.**
    `TryBeginEvict`/`MarkReusable` gives the cleaner a handshake before touching shared buffers, reducing the odds of reclaiming a page that a reader or writer is still using.

    3. **Opportunistic cleanup beats global sweeps.**
    Moving from “periodic global sweep” toward “small, bounded batches when needed” is the right design to reduce contention and keep the free list from growing unbounded.

    4. **Per-instance tuning.**
    The thresholds are instance fields, not globals — great for testing and for hosting scenarios that need different policies.

    ---

    ## The bad

    1. **Hot-path O(n) counting on `ConcurrentQueue`.**
    Multiple sites use `_free.Count` to decide whether to signal/force cleanup (e.g., `DiscardPage`, `Extend`, `TryCleanupOnDemand`). On modern .NET, `ConcurrentQueue<T>.Count` has to snapshot and walk the queue’s segments, so it is effectively O(n). Once the free list grows to thousands of pages, those counts add real CPU overhead right inside hot paths.

    2. **Cleanup work can run inline on read callers.**
    `GetReadablePage` increments a per-cache op counter and, every 256 reads, calls `TryCleanupOnDemand()`. If `Cleanup()` proceeds, the reader’s thread can spend time dequeuing and zeroing up to 128 pages. That’s a recipe for tail-latency spikes under read-heavy load.

    3. **Trigger semantics can drop signals.**
    `TryCleanupOnDemand` clears `_cleanupSignal` **before** it tries to run cleanup. If `Cleanup()` bails (lock busy, early return), the signal is already cleared and the queue can remain above the cap until some later incidental trigger. This creates an intermittent “stuck full” state.

    4. **Tests don’t verify outcomes.**
    The new tests check that `ApplyProfile` runs and that certain calls don’t throw, but they don’t assert that cleanup actually reduces the free list under the right conditions, that idle thresholds are respected, or that readers aren’t blocked too long.

    ---

    ## The risky / “evil” bits (potential regressions)

    These are not moral judgments; they’re the places where performance or correctness is most at risk.

    1. **Allocation churn after aggressive trims.**
    With `_maxFreePages` at 200 by default (≈1.6 MiB of 8 KiB pages), bursty workloads that occasionally need thousands of pages will frequently drop below the working-set floor and then re-allocate large arrays on the next burst. That defeats cache locality, increases LOH/GC pressure, and can slow write bursts noticeably.

    2. **CPU spikes from bulk zeroing.**
    Cleanup zeroes each page (`page.Clear()`) before dropping it: an 8 KiB memset per page × up to 128 pages per batch, i.e., up to 1 MiB of zeroing per run. If this runs inline on a request thread, you can see visible hiccups even when the system isn’t memory-constrained.

    3. **Double counting → cleanup thrash.**
    Because `_free.Count` gates both the *decision* to trigger and the forced path, a queue hovering around the threshold can cause repeated O(n) counts and small, frequent cleanups — paying overhead without meaningfully reducing pressure.

    4. **Eviction flag lifecycle is easy to get wrong.**
    The new handshake is good, but any path that drops a page without calling `MarkReusable()` will leave `_evicting` set and may cause later code to skip that page or treat it as permanently “in eviction”. The diff shows uses of `TryBeginEvict`/`MarkReusable`, but tests don’t assert flag symmetry.

    ---

    ## Performance & stability impact to expect

    * **Throughput:** Lower on heavy read/write contention due to extra accounting + cleanup attempts in hot paths.
    * **Latency:** Occasional spikes when a reader thread pays the cleanup bill.
    * **GC/LOH:** More churn when bursts follow a just-trimmed free list; higher allocation rate translates to more frequent GCs.
    * **Predictability:** The “clear signal then maybe bail” pattern makes cleanup somewhat nondeterministic under load.

    ---

    ## Recommendations (prioritized, low-risk first)

    1. **Replace `_free.Count` with a cheap, accurate counter.**
    Maintain an `int _freeCount` updated with `Interlocked.Increment/Decrement` on enqueue/dequeue. Only fall back to `_free.IsEmpty`/`TryDequeue` where needed. This removes the O(n) scans from hot paths.

    2. **Don’t run cleanup inline on request threads.**

    * Option A: Move cleanup to a background work item via `ThreadPool.UnsafeQueueUserWorkItem` and coalesce triggers (e.g., set a flag and a timestamp; a single worker runs until the queue falls below the cap or the batch quota is met).
    * Option B: Keep it inline but cap the wall time spent (stop after N pages **or** M milliseconds, whichever comes first); see the time-boxed sketch under “Suggested code tweaks” below.
    Either option mitigates tail-latency spikes.

    3. **Fix signal semantics.**
    In `TryCleanupOnDemand`, only clear `_cleanupSignal` after you’ve actually **entered** cleanup (i.e., after acquiring `_cleanLock`). If the lock is busy, keep the signal set so the next opportunity still sees the pending work.

    4. **Debounce threshold checks.**
    Cache `_freeCount` snapshots and use hysteresis (e.g., “trigger above cap + 10%” and “stop below cap − 10%”). This prevents thrashing when the free list hovers around the boundary.

    5. **Make profile selection explicit and reversible.**
    Keep `ApplyProfile` but don’t call it implicitly inside hot constructors. Let hosts opt in (or supply a callback/setting). If you do pick a default, thread it from `EngineSettings` so existing hosts can override without forking.

    6. **Tune defaults to deployment realities.**
    Consider higher `DEFAULT_MAX_FREE_PAGES` (or tie to segment sizes / memory budget). For servers, the current `Server` profile raises the cap to 360; many workloads will still benefit from a deeper baseline.

    7. **Bound cleanup cost per run.**
    Keep `_cleanupBatchSize`, but also pause between batches if `_cleanLock` is contended or if a max time slice is exceeded. Make cleanup cooperative, not monopolizing.

    8. **Clarify eviction-flag lifecycle.**
    Ensure every successful `TryBeginEvict()` path reaches `MarkReusable()` on the success/keep path; only the final “evict and drop” path should leave the page out of circulation. Add debug asserts in DEBUG builds.
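
    A minimal sketch of the handshake, assuming `_evicting` is an `int` used as an atomic flag on `PageBuffer` (the asserts are illustrative):

    ```csharp
    using System.Diagnostics;
    using System.Threading;

    internal class PageBuffer
    {
        private int _evicting; // 0 = in circulation, 1 = eviction in progress

        // returns true only for the single caller that wins the flag
        public bool TryBeginEvict()
        {
            return Interlocked.CompareExchange(ref _evicting, 1, 0) == 0;
        }

        // success/keep path: put the page back into circulation
        public void MarkReusable()
        {
            var prev = Interlocked.Exchange(ref _evicting, 0);
            Debug.Assert(prev == 1, "MarkReusable called without a matching TryBeginEvict");
        }
    }
    ```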

    ---

    ## Test coverage gaps (and what to add)

    1. **Behavioral assertions** (a test sketch follows this list)

    * Arrange: populate the free list above `_maxFreePages` and set timestamps to simulate idleness.
    * Act: trigger cleanup.
    * Assert:
      * pages idle < threshold remain;
      * pages idle ≥ threshold are removed;
      * `_freeCount` falls toward the cap; no negative counts; no stuck `_cleanupSignal`;
      * `TryBeginEvict` → `MarkReusable` symmetry holds.

    2. **Latency guardrails**

    * Use a fake time source / stopwatch to assert cleanup does not block reads for > X ms.
    * Inject a “fake clock” to control `_minCleanupInterval` without sleeping.

    3. **Thrash scenario**

    * Hover `_freeCount` around the cap and assert you don’t run cleanup more often than N times per M operations.

    4. **Profile correctness**

    * After `ApplyProfile(Server)`, assert the four knobs equal the expected values.
    * Parameterize tests across profiles.

    5. **Dispose semantics**

    * Ensure `Dispose()` stops cleanup cleanly (no deadlock on `_cleanLock`, no NREs if a concurrent read triggers after disposal).
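
    As a concrete sketch for item 1 (xUnit-style; `FillFreeList`, `ForceCleanup`, `GetFreePageCount`, and the clock-taking constructor are hypothetical test hooks the cache would need to expose):

    ```csharp
    using System;
    using Xunit;

    // minimal fake time source (a hypothetical seam for _idleBeforeEvict / _minCleanupInterval)
    public sealed class FakeClock
    {
        public DateTime UtcNow { get; private set; }
        public FakeClock(DateTime start) => UtcNow = start;
        public void Advance(TimeSpan by) => UtcNow += by;
    }

    public class Cache_Cleanup_Tests
    {
        [Fact]
        public void Cleanup_Trims_Idle_Pages_Toward_The_Cap()
        {
            var clock = new FakeClock(DateTime.UtcNow);
            using var cache = new MemoryCache(clock);     // hypothetical: cache accepts a time source

            cache.FillFreeList(300);                      // hypothetical: free list above the 200 cap
            clock.Advance(TimeSpan.FromSeconds(61));      // all pages now idle > DEFAULT_IDLE (60s)

            cache.ForceCleanup();                         // hypothetical: run one bounded batch

            var free = cache.GetFreePageCount();
            Assert.True(free <= 200, "free list should be trimmed toward the cap");
            Assert.True(free >= 300 - 128, "no more than one batch (128 pages) per run");
        }
    }
    ```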

    ---

    ## Suggested code tweaks (sketches)

    * **Cheap counter:**

    ```csharp
    // field
    private int _freeCount;

    // enqueue
    _free.Enqueue(page);
    Interlocked.Increment(ref _freeCount);

    // dequeue loop in Cleanup
    if (_free.TryDequeue(out var page)) {
        Interlocked.Decrement(ref _freeCount);
        // ...
    }

    // threshold checks
    if (Volatile.Read(ref _freeCount) > _maxFreePages) { /* ... */ }
    ```

    * **Safe signal clearing:**

    ```csharp
    private void TryCleanupOnDemand(bool force = false)
    {
        if (_disposed) return;

        // rate limit: respect the minimum interval unless forced
        if (!force && DateTime.UtcNow - _lastCleanupRunUtc < _minCleanupInterval) return;

        // run only when a trigger is actually pending
        var triggered = force
            || Volatile.Read(ref _freeCount) > _maxFreePages
            || Volatile.Read(ref _cleanupSignal)
            || Volatile.Read(ref _opsSinceLastCleanup) >= OPS_CLEANUP_STEP;
        if (!triggered) return;

        if (!_cleanLock.Wait(0)) {
            // lock busy: keep the signal set so the next caller still sees pending work
            Volatile.Write(ref _cleanupSignal, true);
            return;
        }

        try {
            // clear the signal only after we have actually entered cleanup
            Volatile.Write(ref _cleanupSignal, false);
            Interlocked.Exchange(ref _opsSinceLastCleanup, 0);
            _lastCleanupRunUtc = DateTime.UtcNow; // stamp the run so the rate limit holds
            Cleanup(); // bounded batch
        }
        finally {
            _cleanLock.Release();
        }
    }
    ```

    * **Offload cleanup:**

    ```csharp
    // coalesce triggers: at most one cleanup work item in flight
    if (Interlocked.Exchange(ref _cleanupScheduled, 1) == 0) {
        ThreadPool.UnsafeQueueUserWorkItem(_ => {
            try { TryCleanupOnDemand(force: true); }
            finally { Volatile.Write(ref _cleanupScheduled, 0); }
        }, null);
    }
    ```
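
    * **Time-boxed cleanup (Option B from recommendation 2):**
    A sketch under stated assumptions: `_free` is roughly FIFO by freed-at time, `page.Timestamp` holds the freed-at ticks, and the 2 ms budget is illustrative.

    ```csharp
    private void Cleanup()
    {
        const int maxMs = 2; // illustrative wall-time budget per run
        var budget = Stopwatch.StartNew();
        var removed = 0;

        while (removed < _cleanupBatchSize &&
               budget.ElapsedMilliseconds < maxMs &&
               _free.TryDequeue(out var page))
        {
            Interlocked.Decrement(ref _freeCount);

            if (DateTime.UtcNow.Ticks - page.Timestamp < _idleBeforeEvict.Ticks)
            {
                // queue is roughly ordered by freed-at time: everything behind is newer, so stop
                _free.Enqueue(page);
                Interlocked.Increment(ref _freeCount);
                break;
            }

            if (!page.TryBeginEvict()) continue; // another party owns the page; skip it

            page.Clear(); // zero the 8 KiB buffer before dropping it
            removed++;
        }

        _lastCleanupRunUtc = DateTime.UtcNow;
    }
    ```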

    * **Hysteresis:**
    Trigger when `freeCount > capHigh`, stop when `freeCount < capLow` (e.g., `capLow = capHigh * 0.8`).
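
    A minimal sketch, assuming hypothetical `_capHigh`/`_capLow` fields:

    ```csharp
    // derived in ApplyProfile, using this bullet’s 0.8 factor
    _capHigh = _maxFreePages;
    _capLow  = (_capHigh * 8) / 10;

    // producers: trigger only above the high-water mark
    if (Volatile.Read(ref _freeCount) > _capHigh) TryCleanupOnDemand(force: true);

    // inside the Cleanup loop: stop once below the low-water mark
    if (Volatile.Read(ref _freeCount) < _capLow) break;
    ```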

    ---

    ## Compatibility / upgrade notes

    * Current code appears to keep defaults unless a host calls `ApplyProfile`. That’s good. If you later wire a default profile in the engine, expose a setting so embedded/mobile hosts can opt out.

    * The cleanup semantics (dropping pages after zeroing) are behaviorally safe but may reduce cache hit rate. Document the trade-off and expose the knobs clearly.

    ---

    ## Final verdict

    **Direction: 👍** The cache needs tuning hooks and bounded, concurrent-safe cleanup. This PR introduces both.
    **Ship now? Not yet.** Fix the hot-path O(n) counts, avoid running heavy cleanup on request threads, correct the cleanup-signal semantics, and add behavioral tests. With those tightened, you’ll have a solid and maintainable foundation for future tuning.