Skip to content

Instantly share code, notes, and snippets.

@pbailis
Last active April 27, 2020 11:46
Show Gist options
  • Save pbailis/5660980 to your computer and use it in GitHub Desktop.
Save pbailis/5660980 to your computer and use it in GitHub Desktop.

Revisions

  1. pbailis revised this gist May 28, 2013. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -6,6 +6,8 @@ Notes from courses like Lorenzo Alvisi's [Distributed Computing](http://www.cs.u

    There are a bunch of classics on [causality](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/time-clocks.pdf), [Paxos](http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf) (and more [practical takes on Paxos](http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf)), and [distributed snapshots](http://ece842.com/S13/readings/chandy85.pdf).

    **Edit: aside from these below, Alex Feinberg and Henry Robinson's lists at [this Quora post](http://www.quora.com/What-are-the-seminal-papers-in-distributed-systems-Why) contain a bunch of good practically-oriented but theoretically grounded papers.**

    Practical databases:

    * **Consistency in Partitioned Networks** [PDF](http://delab.csd.auth.gr/~dimitris/courses/mpc_fall05/papers/invalidation/acm_csur85_partitioned_network_consistency.pdf) [ACM](http://dl.acm.org/citation.cfm?id=5508) A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This survey predates CAP by several decades but is well-written and summarizes several important ideas.
  2. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -12,7 +12,7 @@ Practical databases:
    * **Megastore: Providing Scalable, Highly Available
    Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf) Megastore gives a reasonable example of a Paxos-based database architecture.
    * **Consistency Tradeoffs in Modern Distributed Database System Design** [PDF](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf) [IEEE](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6127847) is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
    * There are many remnants of the **Bayou** project in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is [The Bayou Architecture: Support for Data Sharing among Mobile Users](http://www.dpi.ufv.br/~nacif/cmovel/bayou.pdf). Also good is [Managing update conflicts in Bayou, a weakly connected replicated storage system](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/bayou.pdf). Definitely a more practically oriented paper. *Optimistic Replication* [PDF](http://www.ysaito.com/survey.pdf) [ACM](http://dl.acm.org/citation.cfm?id=1057980) is a great survey of similar techniques.
    * There are many remnants of the **Bayou** project in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is [The Bayou Architecture: Support for Data Sharing among Mobile Users](http://www.dpi.ufv.br/~nacif/cmovel/bayou.pdf). Also good is [Managing update conflicts in Bayou, a weakly connected replicated storage system](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/bayou.pdf). Definitely a more practically oriented paper. **Optimistic Replication** [PDF](http://www.ysaito.com/survey.pdf) [ACM](http://dl.acm.org/citation.cfm?id=1057980) is a great survey of similar techniques.

    More formal stuff:

  3. pbailis revised this gist May 28, 2013. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -11,11 +11,11 @@ Practical databases:
    * **Consistency in Partitioned Networks** [PDF](http://delab.csd.auth.gr/~dimitris/courses/mpc_fall05/papers/invalidation/acm_csur85_partitioned_network_consistency.pdf) [ACM](http://dl.acm.org/citation.cfm?id=5508) A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This survey predates CAP by several decades but is well-written and summarizes several important ideas.
    * **Megastore: Providing Scalable, Highly Available
    Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf) Megastore gives a reasonable example of a Paxos-based database architecture.
    * *Consistency Tradeoffs in Modern Distributed Database System Design* [PDF](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf) [IEEE](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6127847) is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
    * There are many remnants of the *Bayou* project in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is [The Bayou Architecture: Support for Data Sharing among Mobile Users](http://www.dpi.ufv.br/~nacif/cmovel/bayou.pdf). Also good is [Managing update conflicts in Bayou, a weakly connected replicated storage system](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/bayou.pdf). Definitely a more practically oriented paper. *Optimistic Replication* [PDF](http://www.ysaito.com/survey.pdf) [ACM](http://dl.acm.org/citation.cfm?id=1057980) is a great survey of similar techniques.
    * **Consistency Tradeoffs in Modern Distributed Database System Design** [PDF](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf) [IEEE](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6127847) is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
    * There are many remnants of the **Bayou** project in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is [The Bayou Architecture: Support for Data Sharing among Mobile Users](http://www.dpi.ufv.br/~nacif/cmovel/bayou.pdf). Also good is [Managing update conflicts in Bayou, a weakly connected replicated storage system](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/bayou.pdf). Definitely a more practically oriented paper. *Optimistic Replication* [PDF](http://www.ysaito.com/survey.pdf) [ACM](http://dl.acm.org/citation.cfm?id=1057980) is a great survey of similar techniques.

    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * (Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * (Even better, **A short introduction to failure detectors for asynchronous distributed systems** [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  4. pbailis revised this gist May 28, 2013. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -16,5 +16,6 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa

    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549) (Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * (Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  5. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -17,4 +17,4 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa
    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549) (Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  6. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 3 deletions.
    4 changes: 1 addition & 3 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -16,7 +16,5 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa

    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549) (Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806))
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  7. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -18,4 +18,5 @@ More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  8. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 2 deletions.
    3 changes: 1 addition & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -17,6 +17,5 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa
    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  9. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -17,6 +17,6 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa
    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  10. pbailis revised this gist May 28, 2013. 1 changed file with 1 addition and 2 deletions.
    3 changes: 1 addition & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -17,7 +17,6 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa
    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)

    ** Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)
    * Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  11. pbailis revised this gist May 28, 2013. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -17,5 +17,7 @@ Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/pa
    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    ** Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    ** Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)

    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)
  12. pbailis created this gist May 28, 2013.
    21 changes: 21 additions & 0 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,21 @@
    Context: I was asked for a list of interesting reading relating to "distributed databases, behavior under partitions and failures, failure detection." Here's what I came up with in about an hour.

    For textbooks, "Introduction to Reliable and Secure Distributed Programming" is a superb introduction to distributed computing from a formal perspective; it's really not about "programming" or "engineering" but about distributed system fundamentals like consensus, distributed registers, and broadcast. Used in Berkeley's [Distributed Computing course](http://www.cs.berkeley.edu/~alig/cs294-91/) (and [HT to @lalithsuresh](https://twitter.com/lalithsuresh/status/339089590159814656)) [Book Site](http://www.distributedprogramming.net/)

    Notes from courses like Lorenzo Alvisi's [Distributed Computing](http://www.cs.utexas.edu/users/lorenzo/corsi/cs371d/13S/notes/) class can be great.

    There are a bunch of classics on [causality](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/time-clocks.pdf), [Paxos](http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf) (and more [practical takes on Paxos](http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf)), and [distributed snapshots](http://ece842.com/S13/readings/chandy85.pdf).

    Practical databases:

    * **Consistency in Partitioned Networks** [PDF](http://delab.csd.auth.gr/~dimitris/courses/mpc_fall05/papers/invalidation/acm_csur85_partitioned_network_consistency.pdf) [ACM](http://dl.acm.org/citation.cfm?id=5508) A nice, practical discussion of techniques database systems can employ to ensure consistency under partitions. This survey predates CAP by several decades but is well-written and summarizes several important ideas.
    * **Megastore: Providing Scalable, Highly Available
    Storage for Interactive Services** [PDF](http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf) Megastore gives a reasonable example of a Paxos-based database architecture.
    * *Consistency Tradeoffs in Modern Distributed Database System Design* [PDF](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf) [IEEE](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6127847) is a great paper from Daniel Abadi reminding us that, aside from behavior during failures, highly available ("AP") systems also achieve low latency.
    * There are many remnants of the *Bayou* project in many "AP" systems today. The project was aimed at disconnected operation in a proto-smartphone/mobile computing era; a good overview is [The Bayou Architecture: Support for Data Sharing among Mobile Users](http://www.dpi.ufv.br/~nacif/cmovel/bayou.pdf). Also good is [Managing update conflicts in Bayou, a weakly connected replicated storage system](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/bayou.pdf). Definitely a more practically oriented paper. *Optimistic Replication* [PDF](http://www.ysaito.com/survey.pdf) [ACM](http://dl.acm.org/citation.cfm?id=1057980) is a great survey of similar techniques.

    More formal stuff:

    * **Unreliable failure detectors for reliable distributed systems** [PDF](http://www.ecommons.cornell.edu/bitstream/1813/7192/1/95-1535.pdf) [ACM](http://dl.acm.org/citation.cfm?id=226647) A very theoretical but highly celebrated paper relating the problem of failure detection and consensus; together with **The Weakest Failure Detector for Solving Consensus** make for a great if tough tutorial on failure detectors (may be better off reading a textbook) [PDF](http://courses.csail.mit.edu/6.852/05/papers/jacm96.pdf) [ACM](http://dl.acm.org/citation.cfm?id=234549)
    ** Even better, *A short introduction to failure detectors for asynchronous distributed systems* [PS.GZ](ftp://ftp.irisa.fr/pub/techreports/2004/PI-1613.ps.gz) [ACM](http://dl.acm.org/citation.cfm?doid=1052796.1052806)
    * **The Byzantine Generals Problem** [PDF](http://www.eecs.harvard.edu/cs262/Readings/byzantineGenerals.pdf) [ACM](http://dl.acm.org/citation.cfm?id=357176) introduces the problem of byzantine fault tolerance, albeit in typical Lamport style (i.e., with a cute but sometimes distracting story)