Mesos 0.17.0 Released, Featuring Autorecovery for the Replicated Log

The latest Mesos release, 0.17.0 was made available for download last week, introducing a new autorecovery feature for the replicated log that handles additional disk failure and operator error scenarios.

What is the replicated log?

The replicated log is a distributed append-only log that stores arbitrary data as entries. Replication comes into play as at least a quorum of log replicas are stored on different machines, protecting the data from individual disk or network failures. The replicated log uses the Paxos algorithm. We will describe additional implementation details in both an upcoming blog post and project documentation.

Mesos has had replicated log since version 0.9.0 and Apache Aurora, a framework used in production at Twitter, uses replicated log for its persistent storage. Additionally, the Mesos master will start using the replicated log to persist some cluster state with the introduction of the registrar.

New feature in Mesos 0.17.0: autorecovery

Before Mesos version 0.17.0, the replicated log could become inconsistent if a replica’s disk was replaced or the data was accidentally deleted by an operator. That meant that an operator couldn’t safely, or easily, replace a replica’s disk while the other replicas kept running. Instead, an operator had to stop all replicas, replace the disk, copy a pre-existing log, then restart all the replicas.

As of this release, a replica with a new disk can automatically recover without operator involvement. As long as a quorum of replicas are available, applications using the replicated log won’t notice a thing!

Future work

The new autorecovery feature enables replacing a single replica at a time, but it does not allow an operator to easily add or remove replicas, known as reconfiguration. You can follow MESOS-683 for the latest on this future work.