If the mesos-slave process on a host exits (perhaps due to a Mesos bug or because the operator kills the process while upgrading Mesos), any executors/tasks that were being managed by the mesos-slave process will continue to run. When mesos-slave is restarted, the operator can control how those old executors/tasks are handled:
1. By default, all executors/tasks that were being managed by the old mesos-slave process are killed.
2. If a framework enabled checkpointing, its executors can reconnect to the new mesos-slave process and continue running uninterrupted.
Hence, enabling framework checkpointing allows tasks to tolerate Mesos slave upgrades and unexpected mesos-slave crashes without experiencing any downtime.
Slave recovery works by having the slave checkpoint information (e.g., Task Info, Executor Info, Status Updates) about the tasks and executors it is managing to local disk. If a framework enables checkpointing, any subsequent slave restarts will recover the checkpointed information and reconnect with any executors that are still running.
Note that if the operating system on the slave is rebooted, all executors and tasks running on the host are killed and are not automatically restarted when the host comes back up.
A framework can control whether its executors will be recovered by setting the checkpoint flag in its FrameworkInfo when registering with the master. Enabling this feature results in increased I/O overhead at each slave that runs tasks launched by the framework. By default, frameworks do not checkpoint their state.
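As a rough illustration of the opt-in described above, the sketch below models the checkpoint flag with a plain Python dataclass. The FrameworkInfo class here is a hypothetical stand-in, not the real Mesos protobuf message; only the checkpoint field and its off-by-default behavior come from the text, and the other field names and the helper function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FrameworkInfo:
    """Hypothetical stand-in for the Mesos FrameworkInfo message."""
    user: str
    name: str
    # Mirrors the checkpoint flag described above; frameworks do not
    # checkpoint unless they ask for it.
    checkpoint: bool = False


def make_framework_info(name: str, user: str = "", checkpoint: bool = True) -> FrameworkInfo:
    """Build registration info that opts in to checkpointing (illustrative helper)."""
    return FrameworkInfo(user=user, name=name, checkpoint=checkpoint)


info = make_framework_info("example-framework")
print(info.checkpoint)  # True: this framework's executors are eligible for recovery
```

With a real Mesos client library, the equivalent step would be setting the checkpoint field on the actual FrameworkInfo object before the scheduler registers with the master.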
Three configuration flags control the recovery behavior of a Mesos slave:
strict: Whether to do slave recovery in strict mode [Default: true].
recover: Whether to recover status updates and reconnect with old executors [Default: reconnect].
NOTE: If no checkpointing information exists, no recovery is performed and the slave registers with the master as a new slave.
recovery_timeout: Amount of time allotted for the slave to recover [Default: 15 mins].
If the slave takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the slave will self-terminate.
NOTE: If none of the frameworks have enabled checkpointing, the executors and tasks running at a slave die when the slave dies and are not recovered.
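To make the three flags concrete, here is a minimal sketch that assembles a mesos-slave command line with the documented defaults spelled out explicitly. The helper name recovery_flags is hypothetical, and writing the timeout as "15mins" is an assumption about how the duration is spelled on the command line; check the configuration reference for your Mesos version before relying on it.

```python
from typing import List


def recovery_flags(strict: bool = True,
                   recover: str = "reconnect",
                   recovery_timeout: str = "15mins") -> List[str]:
    """Return the slave recovery flags as command-line arguments.

    Defaults mirror the values listed above; the "15mins" spelling
    is an assumption about the accepted duration format.
    """
    return [
        f"--strict={'true' if strict else 'false'}",
        f"--recover={recover}",
        f"--recovery_timeout={recovery_timeout}",
    ]


command = ["mesos-slave", *recovery_flags()]
print(" ".join(command))
# mesos-slave --strict=true --recover=reconnect --recovery_timeout=15mins
```

Passing the defaults explicitly is optional, but it documents the intended recovery behavior in the same place where the slave is launched.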
A restarted slave should re-register with the master within a timeout (75 seconds by default: see the --slave_ping_timeout configuration flags). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g., using a process supervisor such as monit or systemd).
systemd and POSIX isolation
There is a known issue when using systemd to launch mesos-slave while also using only posix isolation mechanisms that prevents tasks from recovering. The problem is that the default KillMode for systemd processes is control-group, and hence all child processes are killed when the slave stops. Explicitly setting KillMode to process allows the executors to survive and reconnect.
The following excerpt of a systemd unit configuration file shows how to set the flag:
[Service]
ExecStart=/usr/bin/mesos-slave
KillMode=process
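One way to guard against this setting being dropped from a unit file is a small check such as the sketch below. It uses Python's standard configparser to read the [Service] section and report the effective KillMode. The function name kill_mode is hypothetical, and the parsing is deliberately simplified: it does not handle systemd drop-in overrides or every unit-file syntax feature.

```python
import configparser

# Sample unit text matching the excerpt above.
UNIT_TEXT = """\
[Service]
ExecStart=/usr/bin/mesos-slave
KillMode=process
"""


def kill_mode(unit_text: str) -> str:
    """Return the KillMode set in [Service], or systemd's default."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(unit_text)
    # systemd falls back to control-group when KillMode is not set.
    return parser.get("Service", "KillMode", fallback="control-group")


print(kill_mode(UNIT_TEXT))  # process
```

A deployment script could refuse to start the slave under systemd unless this check returns process.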
NOTE: There are also known issues with using cgroups-based isolation; for now, the suggested non-POSIX isolation mechanism is Docker containerization.