Review #2649

LiteDB PR #2649 — Cache Profiles & On-Demand Cleanup

Scope touched: Engine/Disk/MemoryCache.cs, Engine/Structures/PageBuffer.cs, LiteDB.Tests/Internals/Cache_Tests.cs

Theme: Make the in-memory page cache tunable (profiles) and keep its free list in check with an on-demand cleaner guarded by a semaphore and per-page eviction flags.


Executive summary

Promising direction: explicit cache “profiles” (Mobile/Desktop/Server) and a safer cleanup handshake (TryBeginEvict/MarkReusable) add real tuning and safety hooks. However, the current implementation wires cleanup into hot paths and repeatedly calls ConcurrentQueue<T>.Count (which is O(n) on .NET). That risks CPU overhead and read-path tail latency. The profile defaults are conservative and may over-trim the free list, increasing LOH churn during bursts. Unit tests don’t verify behavior, only “doesn’t throw”.

I’d land the concept after tightening the hot-path costs, fixing cleanup triggering semantics, and adding behavior-level tests. Details (with references) below.


What changed (high-level)

  • Profiles & tuning knobs (sketched after this list)

    • New enum and API: MemoryCache.CacheProfile { Mobile, Desktop, Server } and ApplyProfile(profile) with per-instance thresholds: _maxFreePages, _idleBeforeEvict, _minCleanupInterval, _cleanupBatchSize, and an ops counter _opsSinceLastCleanup.
    • Defaults: DEFAULT_MAX_FREE_PAGES = 200, DEFAULT_IDLE = 60s, DEFAULT_INT = 30s, DEFAULT_BATCH = 128, and OPS_CLEANUP_STEP = 256.
  • Eviction handshake and timestamps

    • PageBuffer gains an atomic eviction flag: TryBeginEvict() and MarkReusable().
    • When a page is freed, its Timestamp is set to UtcNow.Ticks (used as “freed at” time while in the free list).
  • On-demand cleanup

    • TryCleanupOnDemand(force: bool) decides when to clean based on: min interval, _free.Count vs _maxFreePages, _cleanupSignal, and _opsSinceLastCleanup.
    • Cleanup() drains up to _cleanupBatchSize free pages that have been idle longer than _idleBeforeEvict, under _cleanLock (a non-reentrant SemaphoreSlim).
  • Triggers

    • The read hot path (GetReadablePage) increments _opsSinceLastCleanup and calls TryCleanupOnDemand every 256 operations (OPS_CLEANUP_STEP).
    • Free-list producers (DiscardPage, Extend) set Timestamp, Enqueue, and check _free.Count to potentially force cleanup.
  • Tests

    • New tests exercise ApplyProfile and simple read paths but don’t assert cache state invariants; they mostly ensure “no throw” / counters non-negative.
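
Taken together, the new surface looks roughly like the sketch below. It is reconstructed from the description above: the identifiers are the ones quoted in this review, but the Mobile and Desktop values are placeholders (only the 200-page default and the 360-page Server cap appear anywhere in this review), so treat it as an illustration rather than the PR’s code.

    public enum CacheProfile { Mobile, Desktop, Server }

    public void ApplyProfile(CacheProfile profile)
    {
        switch (profile)
        {
            case CacheProfile.Mobile:
                _maxFreePages = 100;                          // placeholder, tighter than default
                _idleBeforeEvict = TimeSpan.FromSeconds(30);  // placeholder
                break;
            case CacheProfile.Server:
                _maxFreePages = 360;                          // cap cited in recommendation 6
                _idleBeforeEvict = TimeSpan.FromSeconds(60);  // DEFAULT_IDLE
                break;
            default:                                          // Desktop keeps the documented defaults
                _maxFreePages = 200;                          // DEFAULT_MAX_FREE_PAGES
                _idleBeforeEvict = TimeSpan.FromSeconds(60);  // DEFAULT_IDLE
                break;
        }
    }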

The good

  1. Profiles make behavior explicit and portable. Different footprints (mobile/desktop/server) can be expressed without scattering “magic numbers” across the engine. This is real extensibility, and the numbers are at least defensible starting points.

  2. Eviction flag = safer cleanup under concurrency. TryBeginEvict/MarkReusable gives the cleaner a handshake before touching shared buffers, reducing the odds of reclaiming a page that a reader/writer still races to use (a sketch of the handshake follows this list).

  3. Opportunistic cleanup beats global sweeps. Moving from “periodic global sweep” toward “small, bounded batches when needed” is the right design to reduce contention and keep the free list from growing unbounded.

  4. Per-instance tuning. The thresholds are instance fields, not globals — great for testing and for hosting scenarios that need different policies.
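
For point 2, a plausible shape of that handshake: a sketch using an Interlocked flag on PageBuffer (the PR’s actual implementation may differ in detail):

    // PageBuffer (sketch): 0 = reusable, 1 = claimed by the cleaner
    private int _evicting;

    // The cleaner must win this CAS before it may zero or drop the buffer;
    // false means a reader (or another cleanup pass) already owns the page.
    public bool TryBeginEvict()
    {
        return Interlocked.CompareExchange(ref _evicting, 1, 0) == 0;
    }

    // Every "claim but keep" path must call this, otherwise the flag sticks
    // (see "risky bits" #4 below).
    public void MarkReusable()
    {
        Volatile.Write(ref _evicting, 0);
    }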


The bad

  1. Hot-path O(n) counting on ConcurrentQueue. Multiple sites use _free.Count to decide whether to signal/force cleanup (e.g., DiscardPage, Extend, TryCleanupOnDemand). ConcurrentQueue<T>.Count walks the queue. Once the free list grows to thousands, those counts add real CPU overhead right inside hot paths.

  2. Cleanup work can run inline on read callers. GetReadablePage increments the shared _opsSinceLastCleanup counter and, every 256 reads, runs TryCleanupOnDemand(). If Cleanup() proceeds, the reader’s thread can spend time zeroing and dequeuing up to 128 pages. That’s a recipe for tail-latency spikes under read-heavy load.

  3. Trigger semantics can drop signals. TryCleanupOnDemand clears _cleanupSignal before it tries to run cleanup. If Cleanup() bails (lock busy, early return), the signal is already cleared and the queue can remain above the cap until some later incidental trigger. This creates an intermittent “stuck full” state.

  4. Tests don’t verify outcomes. The new tests check that ApplyProfile runs and that certain calls don’t throw, but they don’t assert that cleanup actually reduces the free list under the right conditions, that idle thresholds are respected, or that readers aren’t blocked too long.


The risky / “evil” bits (potential regressions)

These are not moral judgments; they’re the places where performance or correctness is most at risk.

  1. Allocation churn after aggressive trims. With _maxFreePages at 200 by default (≈1.6 MiB of 8 KiB pages), bursty workloads that occasionally need thousands of pages will frequently drop below the working-set floor and then re-allocate large arrays on the next burst. That defeats cache locality, increases LOH/GC pressure, and can slow write bursts noticeably.

  2. CPU spikes from bulk zeroing. Cleanup zeroes each page (page.Clear()) before dropping it. That’s an 8 KiB memset per page × up to 128 pages per batch (≈1 MiB of zeroing per pass). If this runs inline on a request thread, you can see visible hiccups even when the system isn’t memory-constrained.

  3. Double counting → cleanup thrash. Because _free.Count gates both the decision to trigger and the forced path, a queue hovering around the threshold can cause repeated O(n) counts and small, frequent cleanups — paying overhead without meaningfully reducing pressure.

  4. Eviction flag lifecycle is easy to get wrong. The new handshake is good, but any path that drops a page without calling MarkReusable() will leave _evicting set and may cause later code to skip that page or treat it as permanently “in eviction”. The diff shows uses of TryBeginEvict/MarkReusable, but tests don’t assert flag symmetry.


Performance & stability impact to expect

  • Throughput: Lower on heavy read/write contention due to extra accounting + cleanup attempts in hot paths.
  • Latency: Occasional spikes when a reader thread pays the cleanup bill.
  • GC/LOH: More churn when bursts follow a just-trimmed free list; higher allocation rate translates to more frequent GCs.
  • Predictability: The “clear signal then maybe bail” pattern makes cleanup somewhat nondeterministic under load.

Recommendations (prioritized, low-risk first)

  1. Replace _free.Count with a cheap, accurate counter. Maintain an int _freeCount updated with Interlocked.Increment/Decrement on enqueue/dequeue. Only fall back to _free.IsEmpty/TryDequeue where needed. This removes the O(n) scans from hot paths.

  2. Don’t run cleanup inline on request threads.

    • Option A: Move cleanup to a background work item via ThreadPool.UnsafeQueueUserWorkItem and coalesce triggers (e.g., set a flag and timestamp; a single worker runs until the queue falls below cap or batch quota is met).
    • Option B: Keep inline but cap wall-time spent (e.g., stop after N pages or M milliseconds, whichever comes first). Either option mitigates tail-latency spikes.
  3. Fix signal semantics. In TryCleanupOnDemand, only clear _cleanupSignal after you’ve actually entered cleanup (i.e., after acquiring _cleanLock). If the lock is busy, keep the signal set so the next opportunity still sees the pending work.

  4. Debounce threshold checks. Cache _freeCount snapshots and use hysteresis (e.g., “trigger above cap + 10%” and “stop below cap − 10%”). This prevents thrashing when the free list hovers around the boundary.

  5. Make profile selection explicit and reversible. Keep ApplyProfile but don’t call it implicitly inside hot constructors. Let hosts opt in (or supply a callback/setting). If you do pick a default, thread it from EngineSettings so existing hosts can override without forking.

  6. Tune defaults to deployment realities. Consider higher DEFAULT_MAX_FREE_PAGES (or tie to segment sizes / memory budget). For servers, the current Server profile raises the cap to 360; many workloads will still benefit from a deeper baseline.

  7. Bound cleanup cost per run. Keep _cleanupBatchSize, but also pause between batches if _cleanLock is contended or if a max time slice is exceeded. Make cleanup cooperative, not monopolizing (a time-sliced sketch follows this list).

  8. Clarify eviction-flag lifecycle. Ensure every successful TryBeginEvict() that ends up keeping the page also calls MarkReusable(); only the final “evict and drop” path should leave the page out of circulation. Add debug asserts in DEBUG builds.
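
To illustrate items 2 (option B) and 7 together, here is a sketch of a Cleanup pass bounded by both the batch quota and a wall-clock slice. MAX_CLEANUP_SLICE_MS and the _freeCount counter (from item 1) are assumptions, not names from the PR.

    private const int MAX_CLEANUP_SLICE_MS = 2;   // made-up per-pass time budget

    private void Cleanup()
    {
        var sw = Stopwatch.StartNew();
        var removed = 0;

        while (removed < _cleanupBatchSize &&
               sw.ElapsedMilliseconds < MAX_CLEANUP_SLICE_MS &&
               _free.TryDequeue(out var page))
        {
            Interlocked.Decrement(ref _freeCount);

            // Not idle long enough: put it back and stop, since the queue is
            // roughly FIFO by "freed at" time and later entries are even newer.
            if (DateTime.UtcNow.Ticks - page.Timestamp < _idleBeforeEvict.Ticks)
            {
                _free.Enqueue(page);
                Interlocked.Increment(ref _freeCount);
                break;
            }

            page.Clear();   // the 8 KiB zeroing happens here, hence the time slice
            removed++;
        }
    }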


Test coverage gaps (and what to add)

  1. Behavioral assertions (a test sketch follows this list)

    • Arrange: Populate free list over _maxFreePages, set timestamps to simulate idleness.

    • Act: Trigger cleanup.

    • Assert:

      • Pages idle < threshold remain.
      • Pages idle ≥ threshold are removed.
      • _freeCount falls toward cap; no negative counts; no stuck _cleanupSignal.
      • TryBeginEvict/MarkReusable symmetry holds.
  2. Latency guardrails

    • Use a fake time source / stopwatch to assert cleanup does not block reads for > X ms.
    • Inject a “fake clock” to control _minCleanupInterval without sleeping.
  3. Thrash scenario

    • Hover _freeCount around the cap and assert you don’t run cleanup more often than N times per M operations.
  4. Profile correctness

    • After ApplyProfile(Server), assert the four knobs equal the expected values.
    • Parameterize tests across profiles.
  5. Dispose semantics

    • Ensure Dispose() stops cleanup cleanly (no deadlock on _cleanLock, no NREs if a concurrent read triggers after disposal).
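
As a concrete example of point 1, a behavior-level test might look like the sketch below (xUnit, mirroring the existing Cache_Tests style). FillFreeListWithIdlePages, FreePageCount and CleanupPending are hypothetical helpers/hooks, and the sketch assumes the cleanup trigger (or an equivalent entry point) is reachable from the test assembly, e.g. via InternalsVisibleTo.

    [Fact]
    public void Cleanup_Trims_Idle_Pages_And_Clears_The_Signal()
    {
        var cache = new MemoryCache(new int[] { 10 });
        cache.ApplyProfile(MemoryCache.CacheProfile.Mobile);

        // Arrange: overfill the free list and back-date every Timestamp so all
        // pages look idle longer than _idleBeforeEvict (hypothetical helper).
        FillFreeListWithIdlePages(cache, count: 500, idleFor: TimeSpan.FromMinutes(5));
        var before = cache.FreePageCount;

        // Act: force one cleanup pass, bypassing the min-interval gate.
        cache.TryCleanupOnDemand(force: true);

        // Assert: the free list shrank, never went negative, and no signal is left stuck.
        Assert.True(cache.FreePageCount < before);
        Assert.True(cache.FreePageCount >= 0);
        Assert.False(cache.CleanupPending);
    }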

Suggested code tweaks (sketches)

  • Cheap counter:

    // fields
    private int _freeCount;
    
    // enqueue
    _free.Enqueue(page);
    Interlocked.Increment(ref _freeCount);
    
    // dequeue loop in Cleanup
    if (_free.TryDequeue(out var page)) {
        Interlocked.Decrement(ref _freeCount);
        ...
    }
    
    // threshold checks
    if (Volatile.Read(ref _freeCount) > _maxFreePages) { ... }
  • Safe signal clearing:

    private void TryCleanupOnDemand(bool force = false)
    {
        if (_disposed) return;
    
        var now = DateTime.UtcNow;
        var since = now - _lastCleanupRunUtc;
    
        if (!force && since < _minCleanupInterval &&
            Volatile.Read(ref _freeCount) <= _maxFreePages &&
            !Volatile.Read(ref _cleanupSignal) &&
            Volatile.Read(ref _opsSinceLastCleanup) < OPS_CLEANUP_STEP) return;
    
        if (!_cleanLock.Wait(0)) {
            // leave _cleanupSignal set if we intended to run
            Volatile.Write(ref _cleanupSignal, true);
            return;
        }
    
        try {
            Volatile.Write(ref _cleanupSignal, false);
            Interlocked.Exchange(ref _opsSinceLastCleanup, 0);
            _lastCleanupRunUtc = now; // record the run so the min-interval gate is honored
            Cleanup(); // bounded
        }
        finally {
            _cleanLock.Release();
        }
    }
  • Offload cleanup:

    if (Interlocked.Exchange(ref _cleanupScheduled, 1) == 0) {
        ThreadPool.UnsafeQueueUserWorkItem(_ => {
            try { TryCleanupOnDemand(force: true); }
            finally { Volatile.Write(ref _cleanupScheduled, 0); }
        }, null);
    }
  • Hysteresis: Trigger when freeCount > capHigh, stop when freeCount < capLow (e.g., capLow = capHigh * 0.8).
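
A minimal sketch of those bands, using the ±10% figures from recommendation 4 (the helper names are illustrative, not from the PR):

    // hysteresis bands derived from the configured cap
    private int CapHigh => _maxFreePages + _maxFreePages / 10;   // cap + 10%
    private int CapLow  => _maxFreePages - _maxFreePages / 10;   // cap - 10%

    // trigger cleanup only once the free list exceeds the high band...
    private bool NeedsCleanup() => Volatile.Read(ref _freeCount) > CapHigh;

    // ...and keep draining until it falls back under the low band
    private bool KeepDraining() => Volatile.Read(ref _freeCount) > CapLow;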


Compatibility / upgrade notes

  • Current code appears to keep defaults unless a host calls ApplyProfile. That’s good. If you later wire a default profile in the engine, expose a setting so embedded/mobile hosts can opt out.

  • The cleanup semantics (dropping pages after zeroing) are behaviorally safe but may reduce cache hit rate. Document the trade-off and expose the knobs clearly.


Final verdict

Direction: 👍 The cache needs tuning hooks and bounded, concurrent-safe cleanup. This PR introduces both. Ship now? Not yet. Fix the hot-path O(n) counts, avoid running heavy cleanup on request threads, correct the cleanup-signal semantics, and add behavioral tests. With those tightened, you’ll have a solid and maintainable foundation for future tuning.
