Trading automatic failover for really good manual failover
A couple of months ago we decided to stop using mysql-mmm on one of our client sites. We had been burned by a couple of incidents in which automatic failover caused more problems than we had imagined it would solve. A few times, data ended up being written to the wrong server, and merging the forked databases after the fact was a bit of a nightmare. Although we had other concerns about mysql-mmm, a major factor in our decision was our assessment that, in our specific case of assigning MySQL master-slave IP addresses, human-triggered manual failover was essentially as good as automatic failover and carried far fewer unexpected risks.
A couple of days later, we felt relieved to read about GitHub's reaction to its own site hiccup:
"There are many situations in which automated failover is an excellent strategy for ensuring the availability of a service. After careful consideration, we've determined that ensuring the availability of our primary production database is not one of these situations. To this end, we've made changes to our Pacemaker configuration to ensure failover of the 'active' database role will only occur when initiated by a member of our operations team."
While our situation was different, and that quote is best understood within its larger context, the sentiment resonated with us. We are often wooed by the possibilities of technology: automating this and super-optimizing that. We are obsessive about putting controls and procedures in place to minimize the risk of systems failing, and to minimize any adverse effects if and when they do fail. The reality, however, is that the possibilities are just that: possibilities. We test our failover solutions extensively, but we are also prone to forgetting that the "if everything goes well" failover procedures are being applied in cases where everything obviously didn't go well.
In the case of our MySQL master-slave setup, we still have virtual IP addresses, so the web application configuration doesn't need to change when a database fails. The virtual IP addresses simply need to be re-assigned, which is a very simple task that either our hosting partner or we can carry out. Both parties are guaranteed to be immediately notified that something has gone wrong anyway!
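To make the re-assignment step concrete, here is a minimal sketch of what a human-triggered move of a virtual IP might look like on Linux hosts. It is not our actual procedure: the address, interface name, and host names are made up for illustration, and it assumes the iproute2 `ip` command and the iputils `arping` command are available. The script only builds and prints the commands, so an operator can read the plan before running anything.

```python
import ipaddress
import shlex


def release_vip_commands(vip, prefix_len, iface):
    """Commands to run on the old master to drop the virtual IP."""
    addr = ipaddress.ip_interface(f"{vip}/{prefix_len}")  # validates the address
    return [["ip", "addr", "del", str(addr), "dev", iface]]


def claim_vip_commands(vip, prefix_len, iface):
    """Commands to run on the new master to take over the virtual IP."""
    addr = ipaddress.ip_interface(f"{vip}/{prefix_len}")
    return [
        ["ip", "addr", "add", str(addr), "dev", iface],
        # Assumption: iputils arping; -U sends unsolicited ARP so that
        # neighbours refresh their caches for the moved address.
        ["arping", "-c", "3", "-U", "-I", iface, str(addr.ip)],
    ]


def failover_plan(vip, prefix_len, iface, old_host, new_host):
    """Ordered (host, command) pairs: release on the old host, then claim."""
    plan = [(old_host, cmd) for cmd in release_vip_commands(vip, prefix_len, iface)]
    plan += [(new_host, cmd) for cmd in claim_vip_commands(vip, prefix_len, iface)]
    return plan


def format_plan(plan):
    """Render the plan as lines an operator can review and run by hand."""
    return [f"[{host}] {shlex.join(cmd)}" for host, cmd in plan]


if __name__ == "__main__":
    # Hypothetical values: 192.0.2.10 is a documentation-only address,
    # and db1/db2 are placeholder host names.
    for line in format_plan(failover_plan("192.0.2.10", 24, "eth0", "db1", "db2")):
        print(line)
```

The point of printing rather than executing is the same as the point of the whole post: a person looks at the plan, confirms that the old master really is down or no longer accepting writes, and only then runs the commands. If the old master is unreachable, the release step cannot be run at all, which is exactly the kind of judgment call we would rather leave to a human.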