Building an offline realtime sync engine

So you want to write a sync system for a web app with offline and realtime support? Good luck. You might find the following resources useful.

Overview articles

Database in a browser, a spec (Stepan Parunashvili)

What problem are we trying to solve with a sync system?
The web of tomorrow (Nikita Prokopov)

Lots of ideas. Published in 2015, maybe it was ahead of its time?

Sync engines in the industry

How Figma’s multiplayer technology works (Evan Wallace)

Fantastic summary of Figma's sync system and why they ultimately rejected fully decentralized CRDTs while still using some ideas from the CRDT literature.
Linear's realtime sync system (Tuomas Artman)

Linear has a sophisticated sync system with offline capabilities. This talk describes how it works.
Muse's Sync (Adam Wulf, Adam Wiggins, Mark McGranaghan)

In-depth discussion of the why and how behind Muse's realtime sync system.

Existing sync systems

Every app has slightly different needs. So my guess is that you will need to build your own system. But it's worth taking a look at pre-built sync engines, either as buy-over-build or to steal ideas:

Firebase

Confusingly, Google offers two realtime databases under the Firebase brand: Cloud Firestore and Firebase Realtime Database. Lots of good ideas, but they're hard to disentangle from the marketing copy.
CouchDB / PouchDB

CouchDB is a database built around replication. The CouchDB book, while a bit dated, is well written and worth reading.

PouchDB offered a local JavaScript database with server replication before it was cool. It's still pretty impressive today, and free software.

Databases

When you're building a sync engine, you're essentially building a database with replication - whether you realize it or not. So it's a good idea to review some of the literature on databases and replication.

Datomic

Learn as much as you can about Datomic - Datalog vs SQL, inserts/updates as pure data structures, pull syntax, EAV tuples, immutable facts, database as a value, unbundling the database. There's so much to learn. Datomic may not be the right database for your backend (although maybe it is? Check it out) but it's without a doubt one of the best-designed systems out there.
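To make a few of these ideas concrete (EAV tuples, immutable facts, database as a value), here is a minimal sketch in TypeScript. It is not Datomic's API or storage format: the `Datom` tuple shape, `transact`, `entity`, and `asOf` are names I made up for illustration, and it assumes every attribute holds a single value.

```typescript
// A datom is an immutable fact: entity, attribute, value, transaction id,
// and whether the fact is asserted (true) or retracted (false).
// Hypothetical shape for illustration, not Datomic's actual format.
type Value = string | number | boolean;
type Datom = readonly [e: number, a: string, v: Value, tx: number, added: boolean];

// The "database" is just an append-only log of datoms.
type Db = readonly Datom[];

type Fact = readonly [e: number, a: string, v: Value, added: boolean];

// Transacting returns a new database value; the old one is left untouched.
function transact(db: Db, tx: number, facts: ReadonlyArray<Fact>): Db {
  return [...db, ...facts.map(([e, a, v, added]): Datom => [e, a, v, tx, added])];
}

// Current view of one entity: replay its datoms in log order.
// Assumes every attribute is single-valued.
function entity(db: Db, e: number): Record<string, Value> {
  const out: Record<string, Value> = {};
  for (const [de, a, v, , added] of db) {
    if (de !== e) continue;
    if (added) out[a] = v;
    else if (out[a] === v) delete out[a];
  }
  return out;
}

// "Database as a value": the state as of a transaction is a filter over the log.
function asOf(db: Db, tx: number): Db {
  return db.filter(([, , , t]) => t <= tx);
}

// An update is a retraction plus an assertion, expressed as plain data.
const db0: Db = [];
const db1 = transact(db0, 1, [[1, "task/title", "Write docs", true]]);
const db2 = transact(db1, 2, [
  [1, "task/title", "Write docs", false],
  [1, "task/title", "Ship docs", true],
]);
```

Because nothing is overwritten, `entity(asOf(db2, 1), 1)` still sees the old title while `entity(db2, 1)` sees the new one. That property is handy in a sync engine, where you often need to ask what a client had seen at a given point.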

Datomic with Rich Hickey

The docs are excellent.

Datascript is a client-side version of Datomic (but there's no built-in sync engine).
Postgres is the most mature RDBMS out there (but doesn't help you replicate data to the client). Learn as much as you can from the decades of research and practical wisdom that went into Postgres. The chapter in the Postgres manual on isolation levels is excellent.

SaaS in the sync space

Some SaaS products are trying to offer multiplayer, syncing, caching, etc. as a service. No clear winner yet. But even if you're not going to use these, it's worth reading the API docs for inspiration.

DDIA

Kleppmann, Designing Data-Intensive Applications, O'Reilly (2017)

If you haven't read this book, drop everything and start by reading it. Every chapter is full of insights and summaries of how to use and build databases (which is essentially what you're doing).

Local-first software

Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan: “Local-first software: You own your data, in spite of the cloud”. ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! ’19), October 2019 (link)

Conflict-free replicated data types (CRDTs)

I recommend learning about CRDTs. Not because a sync system should be built on CRDTs (it probably shouldn't unless you're building a truly decentralized, peer-to-peer system), but because everyone's talking about them and it's useful to understand their limitations (and strengths). Also, the literature contains many concepts that are helpful even if you don't need costly decentralization because you can rely on a centralized server.

Martin Kleppmann, Alastair R. Beresford: "A Conflict-Free Replicated JSON Datatype". IEEE Transactions on Parallel and Distributed Systems 28(10):2733–2746, April 2017 (link)

Very accessible paper on what ended up being published as Automerge.

Many more papers and other resources are available on crdt.tech.
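To get a feel for what a CRDT looks like, here is a sketch of one of the simplest ones, a last-writer-wins register, in TypeScript. It is not taken from any of the papers or libraries above; `LwwRegister`, `write`, and `merge` are names I made up, and the sketch assumes each replica has a unique id and only writes on top of its own latest state.

```typescript
// State of a last-writer-wins register: the value plus the metadata
// needed to pick a winner deterministically on every replica.
interface LwwRegister<T> {
  readonly value: T;
  readonly timestamp: number; // logical timestamp, not wall-clock time
  readonly replicaId: string; // tie-breaker when timestamps are equal
}

// A local write bumps the logical timestamp and records who wrote.
function write<T>(reg: LwwRegister<T>, value: T, replicaId: string): LwwRegister<T> {
  return { value, timestamp: reg.timestamp + 1, replicaId };
}

// Merge picks the state with the higher timestamp, then the higher replica id.
// The result does not depend on argument order or on how often states are merged.
function merge<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.replicaId >= b.replicaId ? a : b;
}

// Two replicas edit the same register while offline, then exchange states.
const base: LwwRegister<string> = { value: "draft", timestamp: 0, replicaId: "server" };
const onA = write(base, "from A", "a");
const onB = write(write(base, "first try", "b"), "from B", "b");
const merged = merge(onA, onB); // same winner as merge(onB, onA)
```

Note the limitation this makes visible: the register converges, but it does so by silently discarding one of the concurrent writes. Whether that is acceptable depends on your app.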

Distributed systems

CS students study this topic in college. If, like me, you skipped this part of your education, it's worth learning the basic theory to get a better overview of the problem space.

I recommend Lindsey Kuper's lectures at UC Santa Cruz, which she's generously made available on YouTube. The course also has a website.

It's fun and you'll learn about Lamport diagrams and consistency models.
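As a taste of the kind of thing these lectures cover, here is a minimal Lamport clock in TypeScript. It is my own sketch rather than code from the course; the `LamportClock` class and its method names are assumptions.

```typescript
// A Lamport clock assigns logical timestamps so that if one event
// happened before another, its timestamp is smaller.
class LamportClock {
  private time = 0;

  // Local event or message send: advance the clock and return the new timestamp.
  tick(): number {
    this.time += 1;
    return this.time;
  }

  // Message receipt: move past both the local time and the sender's timestamp.
  receive(remote: number): number {
    this.time = Math.max(this.time, remote) + 1;
    return this.time;
  }
}

// Client A does some work, then sends a message that client B receives.
const clockA = new LamportClock();
const clockB = new LamportClock();
const localEdit = clockA.tick();            // A: local event
const sent = clockA.tick();                 // A: send, timestamp travels with the message
const received = clockB.receive(sent);      // B: receive
const reply = clockB.tick();                // B: send a reply
const replySeen = clockA.receive(reply);    // A: receive the reply
```

The timestamps respect causality (`sent < received < reply < replySeen`), but the converse does not hold: a smaller timestamp alone does not prove that one event happened before another.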

There's also aphyr's braindump of interesting ideas in distributed systems. He also has a nice page describing consistency models.

IndexedDB

If you're storing data in a browser for offline use, it's probably going to end up in IndexedDB. Here's a lot more information about this janky corner of the web platform.

ImSingee/building-sync-systems.md