Scope touched: Engine/Disk/MemoryCache.cs, Engine/Structures/PageBuffer.cs, LiteDB.Tests/Internals/Cache_Tests.cs
Theme: Make the in-memory page cache tunable (profiles) and keep its free list in check with an on-demand cleaner guarded by a semaphore and per-page eviction flags.
Promising direction: explicit cache “profiles” (Mobile/Desktop/Server) and a safer cleanup handshake (TryBeginEvict/MarkReusable) add real tuning and safety hooks. However, the current implementation wires cleanup into hot paths and repeatedly calls ConcurrentQueue<T>.Count (which is O(n) on .NET). That risks CPU overhead and read-path tail latency. The profile defaults are conservative and may over-trim the free list, increasing LOH churn during bursts. Unit tests don’t verify behavior, only “doesn’t throw”.
I’d land the concept after tightening the hot-path costs, fixing cleanup triggering semantics, and adding behavior-level tests. Details (with references) below.
- Profiles & tuning knobs (sketched below, after this summary)
  - New enum and API: `MemoryCache.CacheProfile { Mobile, Desktop, Server }` and `ApplyProfile(profile)` with per-instance thresholds: `_maxFreePages`, `_idleBeforeEvict`, `_minCleanupInterval`, `_cleanupBatchSize`, and an ops counter `_opsSinceLastCleanup`.
  - Defaults: `DEFAULT_MAX_FREE_PAGES = 200`, `DEFAULT_IDLE = 60s`, `DEFAULT_INT = 30s`, `DEFAULT_BATCH = 128`, and `OPS_CLEANUP_STEP = 256`.
- Eviction handshake and timestamps
  - `PageBuffer` gains an atomic eviction flag: `TryBeginEvict()` and `MarkReusable()`.
  - When a page is freed, its `Timestamp` is set to `UtcNow.Ticks` (used as "freed at" time while in the free list).
- On-demand cleanup
  - `TryCleanupOnDemand(force: bool)` decides when to clean based on: min interval, `_free.Count` vs `_maxFreePages`, `_cleanupSignal`, and `_opsSinceLastCleanup`.
  - `Cleanup()` drains up to `_cleanupBatchSize` free pages that have been idle longer than `_idleBeforeEvict`, under `_cleanLock` (a non-reentrant `SemaphoreSlim`).
- Triggers
  - The read hot path (`GetReadablePage`) increments `_opsSinceLastCleanup` and runs cleanup every 256 operations.
  - Free-list producers (`DiscardPage`, `Extend`) set `Timestamp`, `Enqueue`, and check `_free.Count` to potentially force cleanup.
- Tests
  - New tests exercise `ApplyProfile` and simple read paths but don't assert cache-state invariants; they mostly ensure "no throw" / counters non-negative.
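To make the summary concrete, here is a minimal sketch of the surface described above, reconstructed from this description rather than from the diff itself. The type and member names (`CacheProfile`, `ApplyProfile`, `TryBeginEvict`, `MarkReusable`, `_evicting`, the tuning fields) all appear in the PR summary; the method bodies and the Mobile profile numbers are assumptions.

```csharp
// Hypothetical reconstruction of the additions described above -- not the actual diff.
// Only the members discussed in this review are shown.
using System;
using System.Threading;

internal class MemoryCache
{
    public enum CacheProfile { Mobile, Desktop, Server }

    // per-instance tuning knobs, defaults as listed in the summary above
    private int _maxFreePages = 200;                                 // DEFAULT_MAX_FREE_PAGES
    private TimeSpan _idleBeforeEvict = TimeSpan.FromSeconds(60);    // DEFAULT_IDLE
    private TimeSpan _minCleanupInterval = TimeSpan.FromSeconds(30); // DEFAULT_INT
    private int _cleanupBatchSize = 128;                             // DEFAULT_BATCH
    private int _opsSinceLastCleanup;                                // compared against OPS_CLEANUP_STEP (256)

    public void ApplyProfile(CacheProfile profile)
    {
        switch (profile)
        {
            case CacheProfile.Mobile:
                _maxFreePages = 100;      // illustrative only: the summary gives no Mobile numbers
                _cleanupBatchSize = 64;   // illustrative only
                break;
            case CacheProfile.Server:
                _maxFreePages = 360;      // the review notes the Server profile raises the cap to 360
                break;
            default:
                break;                    // Desktop keeps the defaults
        }
    }
}

internal class PageBuffer
{
    public long Timestamp;   // set to DateTime.UtcNow.Ticks when the page is freed

    private int _evicting;   // 0 = in circulation, 1 = claimed by the cleaner

    // the cleaner claims a page before touching it; false means someone already claimed it
    public bool TryBeginEvict() => Interlocked.CompareExchange(ref _evicting, 1, 0) == 0;

    // every "keep the page" path must hand it back to circulation
    public void MarkReusable() => Interlocked.Exchange(ref _evicting, 0);
}
```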
- Profiles make behavior explicit and portable. Different footprints (mobile/desktop/server) can be expressed without scattering "magic numbers" across the engine. This is real extensibility, and the numbers are at least defensible starting points.
- Eviction flag = safer cleanup under concurrency. `TryBeginEvict`/`MarkReusable` gives the cleaner a handshake before touching shared buffers, reducing the odds of reclaiming a page that a reader/writer still races to use.
- Opportunistic cleanup beats global sweeps. Moving from "periodic global sweep" toward "small, bounded batches when needed" is the right design to reduce contention and keep the free list from growing unbounded.
- Per-instance tuning. The thresholds are instance fields, not globals — great for testing and for hosting scenarios that need different policies.
- Hot-path O(n) counting on `ConcurrentQueue`. Multiple sites use `_free.Count` to decide whether to signal/force cleanup (e.g., `DiscardPage`, `Extend`, `TryCleanupOnDemand`). `ConcurrentQueue<T>.Count` walks the queue. Once the free list grows to thousands, those counts add real CPU overhead right inside hot paths.
- Cleanup work can run inline on read callers. `GetReadablePage` increments a global op counter and, every 256 reads, runs `TryCleanupOnDemand()`. If `Cleanup()` proceeds, the reader's thread can spend time zeroing and dequeuing up to 128 pages. That's a recipe for tail-latency spikes under read-heavy load.
- Trigger semantics can drop signals. `TryCleanupOnDemand` clears `_cleanupSignal` before it tries to run cleanup. If `Cleanup()` bails (lock busy, early return), the signal is already cleared and the queue can remain above the cap until some later incidental trigger. This creates an intermittent "stuck full" state.
- Tests don't verify outcomes. The new tests check that `ApplyProfile` runs and that certain calls don't throw, but they don't assert that cleanup actually reduces the free list under the right conditions, that idle thresholds are respected, or that readers aren't blocked too long.

These are not moral judgments; they're the places where performance or correctness is most at risk.
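To make the first two concerns tangible, the free-list producer path being described looks roughly like the following; this is a reconstruction from the summary, not the diff, and the method body is assumed.

```csharp
// Rough shape of the producer path described above (reconstructed, not the actual code).
public void DiscardPage(PageBuffer page)
{
    page.Timestamp = DateTime.UtcNow.Ticks;   // "freed at" marker
    _free.Enqueue(page);

    // ConcurrentQueue<T>.Count walks the queue to produce its answer, so this
    // check gets more expensive as the free list grows ...
    if (_free.Count > _maxFreePages)
    {
        // ... and a forced cleanup can then zero and drop a whole batch inline,
        // on whatever thread happened to discard the page
        TryCleanupOnDemand(force: true);
    }
}
```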
- Allocation churn after aggressive trims. With `_maxFreePages` at 200 by default (≈1.6 MiB of 8 KiB pages), bursty workloads that occasionally need thousands of pages will frequently drop below the working-set floor and then re-allocate large arrays on the next burst. That defeats cache locality, increases LOH/GC pressure, and can slow write bursts noticeably.
- CPU spikes from bulk zeroing. Cleanup zeroes each page (`page.Clear()`) before dropping it. That's an 8 KiB memset per page × up to 128 pages per batch. If this runs inline on a request thread, you can see visible hiccups even when the system isn't memory-constrained.
- Double counting → cleanup thrash. Because `_free.Count` gates both the decision to trigger and the forced path, a queue hovering around the threshold can cause repeated O(n) counts and small, frequent cleanups — paying overhead without meaningfully reducing pressure.
- Eviction flag lifecycle is easy to get wrong. The new handshake is good, but any path that drops a page without calling `MarkReusable()` will leave `_evicting` set and may cause later code to skip that page or treat it as permanently "in eviction". The diff shows uses of `TryBeginEvict`/`MarkReusable`, but tests don't assert flag symmetry.
- Throughput: Lower on heavy read/write contention due to extra accounting + cleanup attempts in hot paths.
- Latency: Occasional spikes when a reader thread pays the cleanup bill.
- GC/LOH: More churn when bursts follow a just-trimmed free list; higher allocation rate translates to more frequent GCs.
- Predictability: The “clear signal then maybe bail” pattern makes cleanup somewhat nondeterministic under load.
- Replace `_free.Count` with a cheap, accurate counter. Maintain an `int _freeCount` updated with `Interlocked.Increment`/`Decrement` on enqueue/dequeue. Only fall back to `_free.IsEmpty`/`TryDequeue` where needed. This removes the O(n) scans from hot paths.
- Don't run cleanup inline on request threads.
  - Option A: Move cleanup to a background work item via `ThreadPool.UnsafeQueueUserWorkItem` and coalesce triggers (e.g., set a flag and timestamp; a single worker runs until the queue falls below cap or batch quota is met).
  - Option B: Keep inline but cap wall-time spent (e.g., stop after N pages or M milliseconds, whichever comes first); a time-capped sketch appears with the code examples below.
  Either option mitigates tail-latency spikes.
- Fix signal semantics. In `TryCleanupOnDemand`, only clear `_cleanupSignal` after you've actually entered cleanup (i.e., after acquiring `_cleanLock`). If the lock is busy, keep the signal set so the next opportunity still sees the pending work.
- Debounce threshold checks. Cache `_freeCount` snapshots and use hysteresis (e.g., "trigger above cap + 10%" and "stop below cap − 10%"). This prevents thrashing when the free list hovers around the boundary.
- Make profile selection explicit and reversible. Keep `ApplyProfile` but don't call it implicitly inside hot constructors. Let hosts opt in (or supply a callback/setting). If you do pick a default, thread it from `EngineSettings` so existing hosts can override without forking.
- Tune defaults to deployment realities. Consider a higher `DEFAULT_MAX_FREE_PAGES` (or tie it to segment sizes / memory budget). For servers, the current `Server` profile raises the cap to 360; many workloads will still benefit from a deeper baseline.
- Bound cleanup cost per run. Keep `_cleanupBatchSize`, but also pause between batches if `_cleanLock` is contended or if a max time slice is exceeded. Make cleanup cooperative, not monopolizing.
- Clarify eviction-flag lifecycle. Ensure every successful `TryBeginEvict()` path reaches `MarkReusable()` on the success/keep path; only the final "evict and drop" path should leave the page out of circulation. Add debug asserts in DEBUG builds (a sketch follows this list).
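For that last item, a debug guard in the cleaner's drain loop could look like the following. This is a sketch, not the PR's code: `ShouldKeep` is a hypothetical stand-in for the "not idle long enough" check, and the surrounding loop is assumed.

```csharp
// Sketch of the cleaner's keep/evict decision with the handshake made explicit.
if (_free.TryDequeue(out var page))
{
    Interlocked.Decrement(ref _freeCount);

    // a page sitting in the free list should never already be mid-eviction
    var claimed = page.TryBeginEvict();
    System.Diagnostics.Debug.Assert(claimed, "free page already had its eviction flag set");

    if (ShouldKeep(page))               // hypothetical: "not idle long enough"
    {
        page.MarkReusable();            // every keep path hands the page back
        _free.Enqueue(page);
        Interlocked.Increment(ref _freeCount);
    }
    else
    {
        page.Clear();                   // only the evict-and-drop path leaves the flag set
    }
}
```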
- Behavioral assertions (see the sketch after this test list)
  - Arrange: populate the free list over `_maxFreePages`, set timestamps to simulate idleness.
  - Act: trigger cleanup.
  - Assert:
    - Pages idle < threshold remain.
    - Pages idle ≥ threshold are removed.
    - `_freeCount` falls toward the cap; no negative counts; no stuck `_cleanupSignal`.
    - `TryBeginEvict` → `MarkReusable` symmetry holds.
- Latency guardrails
  - Use a fake time source / stopwatch to assert cleanup does not block reads for > X ms.
  - Inject a "fake clock" to control `_minCleanupInterval` without sleeping.
- Thrash scenario
  - Hover `_freeCount` around the cap and assert you don't run cleanup more often than N times per M operations.
- Profile correctness
  - After `ApplyProfile(Server)`, assert the four knobs equal the expected values.
  - Parameterize tests across profiles.
- Dispose semantics
  - Ensure `Dispose()` stops cleanup cleanly (no deadlock on `_cleanLock`, no NREs if a concurrent read triggers after disposal).
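Here is one way the behavioral-assertions case could look. To stay self-contained it tests a tiny stand-in model of the policy (cap, idle threshold, batch size) rather than the real `MemoryCache`, so the `CleanupPolicyModel` type and its members are hypothetical; the point is the shape of the assertions, not the exact API. LiteDB.Tests already uses xUnit and FluentAssertions, so the sketch follows that style.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using FluentAssertions;
using Xunit;

// Stand-in for the cleanup policy: cap + idle threshold + bounded batch.
public class CleanupPolicyModel
{
    public int MaxFreePages = 200;
    public TimeSpan IdleBeforeEvict = TimeSpan.FromSeconds(60);
    public int CleanupBatchSize = 128;

    public readonly ConcurrentQueue<long> Free = new();   // stores "freed at" ticks
    public int FreeCount;                                  // cheap counter (see recommendations)

    public void Enqueue(long freedAtTicks)
    {
        Free.Enqueue(freedAtTicks);
        FreeCount++;
    }

    // drains at most one batch of pages that have been idle longer than the threshold
    public void Cleanup(DateTime now)
    {
        var drained = 0;
        var keep = new ConcurrentQueue<long>();

        while (drained < CleanupBatchSize && Free.TryDequeue(out var freedAt))
        {
            if (now - new DateTime(freedAt, DateTimeKind.Utc) >= IdleBeforeEvict)
            {
                FreeCount--;             // evict-and-drop
                drained++;
            }
            else
            {
                keep.Enqueue(freedAt);   // too fresh, keep it
            }
        }

        foreach (var t in keep) Free.Enqueue(t);
    }
}

public class Cache_Cleanup_Policy_Tests
{
    [Fact]
    public void Cleanup_Removes_Idle_Pages_But_Keeps_Fresh_Ones()
    {
        var model = new CleanupPolicyModel();
        var now = DateTime.UtcNow;

        // Arrange: 300 idle pages and 100 fresh ones, well above the cap of 200
        for (var i = 0; i < 300; i++) model.Enqueue(now.AddMinutes(-5).Ticks);
        for (var i = 0; i < 100; i++) model.Enqueue(now.Ticks);

        // Act
        model.Cleanup(now);

        // Assert: one bounded batch of idle pages is gone, all fresh pages survive,
        // and the counter stays consistent
        model.FreeCount.Should().Be(400 - model.CleanupBatchSize);
        model.Free.Count(t => t == now.Ticks).Should().Be(100);
    }
}
```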
- Cheap counter:

```csharp
// fields
private int _freeCount;

// enqueue
_free.Enqueue(page);
Interlocked.Increment(ref _freeCount);

// dequeue loop in Cleanup
if (_free.TryDequeue(out var page))
{
    Interlocked.Decrement(ref _freeCount);
    // ...
}

// threshold checks
if (Volatile.Read(ref _freeCount) > _maxFreePages) { /* ... */ }
```
- Safe signal clearing:

```csharp
private void TryCleanupOnDemand(bool force = false)
{
    if (_disposed) return;

    var now = DateTime.UtcNow;
    var since = now - _lastCleanupRunUtc;

    // nothing signals work: too soon, under the cap, no pending signal, few ops since last run
    if (!force &&
        since < _minCleanupInterval &&
        Volatile.Read(ref _freeCount) <= _maxFreePages &&
        !Volatile.Read(ref _cleanupSignal) &&
        Volatile.Read(ref _opsSinceLastCleanup) < OPS_CLEANUP_STEP)
    {
        return;
    }

    if (!_cleanLock.Wait(0))
    {
        // leave _cleanupSignal set if we intended to run
        Volatile.Write(ref _cleanupSignal, true);
        return;
    }

    try
    {
        // clear the signal only now that we have actually entered cleanup
        Volatile.Write(ref _cleanupSignal, false);
        Interlocked.Exchange(ref _opsSinceLastCleanup, 0);

        Cleanup(); // bounded
    }
    finally
    {
        _cleanLock.Release();
    }
}
```
- Offload cleanup:

```csharp
// _cleanupScheduled is an int used as a 0/1 flag so concurrent triggers
// coalesce into a single scheduled work item at a time
if (Interlocked.Exchange(ref _cleanupScheduled, 1) == 0)
{
    ThreadPool.UnsafeQueueUserWorkItem(_ =>
    {
        try
        {
            TryCleanupOnDemand(force: true);
        }
        finally
        {
            Volatile.Write(ref _cleanupScheduled, 0);
        }
    }, null);
}
```
- Hysteresis: trigger when `freeCount > capHigh`, stop when `freeCount < capLow` (e.g., `capLow = capHigh * 0.8`).
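- A possible shape for that hysteresis check; the `CapHigh`/`CapLow` helpers are assumptions and the 0.8 factor mirrors the example above:

```csharp
// high/low water marks derived from the existing cap (0.8 factor from the example above)
private int CapHigh => _maxFreePages;
private int CapLow  => (int)(_maxFreePages * 0.8);

private bool ShouldStartCleanup() => Volatile.Read(ref _freeCount) > CapHigh;

// the drain loop keeps going (within its batch/time budget) only while the
// free list is still above the low-water mark
private bool ShouldKeepDraining() => Volatile.Read(ref _freeCount) > CapLow;
```

- Time-capped inline cleanup (Option B from the recommendations); the 2 ms budget is illustrative, not a measured number:

```csharp
var budget = System.Diagnostics.Stopwatch.StartNew();
var drained = 0;

// stop at the batch quota or the time budget, whichever comes first
while (drained < _cleanupBatchSize &&
       budget.ElapsedMilliseconds < 2 &&
       _free.TryDequeue(out var page))
{
    Interlocked.Decrement(ref _freeCount);
    // ... idle check, evict-or-keep handshake as sketched earlier ...
    drained++;
}
```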
- Current code appears to keep defaults unless a host calls `ApplyProfile`. That's good. If you later wire a default profile in the engine, expose a setting so embedded/mobile hosts can opt out.
- The cleanup semantics (dropping pages after zeroing) are behaviorally safe but may reduce cache hit rate. Document the trade-off and expose the knobs clearly.
Direction: 👍 The cache needs tuning hooks and bounded, concurrent-safe cleanup. This PR introduces both. Ship now? Not yet. Fix the hot-path O(n) counts, avoid running heavy cleanup on request threads, correct the cleanup-signal semantics, and add behavioral tests. With those tightened, you’ll have a solid and maintainable foundation for future tuning.