I've recently been collecting a list of MySQL configuration variables whose defaults have proven problematic in a high-volume production environment. What they all have in common is that a network blip or two can trigger some very undesirable behavior.
If a client is having trouble connecting to MySQL, the server will give up waiting after connect_timeout seconds and increment the counter which tracks the number of connect errors it has seen for that host. Then, when that value reaches max_connect_errors, the client will be locked out until you issue a FLUSH HOSTS command. Worse yet, if you have occasional network blips and never need to restart your MySQL boxes, these errors can accumulate over time and eventually cause you middle-of-the-night pain.
See "Host 'host_name' is blocked" in the MySQL docs. Sadly, there is no way to disable this check entirely. Setting the variable to 0 doesn't accomplish that. Your only real solutions are (a) setting it to a very high value (max_connect_errors=1844674407370954751), and (b) running an occasional FLUSH HOSTS command.
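For what it's worth, option (a) might look something like this in /etc/my.cnf. This is only a sketch: the value is the one quoted above, and it assumes a 64-bit server build and that your server settings live in the [mysqld] section.

```ini
# /etc/my.cnf
[mysqld]
# Effectively never lock a host out for accumulated connect errors.
# Assumes a 64-bit build; a 32-bit server has a much lower ceiling.
max_connect_errors = 1844674407370954751
```

For option (b), one way to do it is a cron job that runs mysqladmin flush-hosts every so often.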
This is related to the above problem. In situations of network congestion (either at the client or server), it's possible for an initial connection to take several seconds to complete. But the default value for connect_timeout is 5 seconds. When you trip over that, the max_connect_errors problem above kicks in.
To avert this, try setting connect_timeout to a value more like 15 or 20. Also consider making thread_cache_size a non-zero value. That will help in situations when the server occasionally gets a high number of new connections in a very short period of time.
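As a rough sketch, those two settings could go in the [mysqld] section of /etc/my.cnf like so. The thread_cache_size number here is purely an illustrative guess on my part, not a tuned recommendation; pick one that fits your own connection churn.

```ini
# /etc/my.cnf
[mysqld]
# Give slow or congested clients more time to finish the initial handshake.
connect_timeout = 20
# Keep some threads around for reuse when connections arrive in bursts.
# 16 is an arbitrary example value (assumption), not a measured one.
thread_cache_size = 16
```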
MySQL does a reverse DNS lookup on every incoming connection by default. This sucks. It seems that no matter how good your infrastructure is, there are blips in DNS service. MySQL's host cache exists to keep those lookups to a minimum. Yet I've seen this cause pain off and on for eight years now. I can only assume there's a bug in the host cache or the resolver library when this happens.
I recommend adding skip-name-resolve to your /etc/my.cnf to skip DNS entirely. Just use IP addresses or ranges for your GRANTs. It seems that slow replies from DNS servers can also help you trip over connect_timeout. Imagine having 2 or 3 DNS servers configured but the first one is unavailable.
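In config form that is a one-liner. A minimal sketch, again assuming the [mysqld] section of /etc/my.cnf:

```ini
# /etc/my.cnf
[mysqld]
# Skip DNS lookups for incoming connections entirely.
# Accounts must then be granted by IP address or range rather than hostname.
skip-name-resolve
```

Before you flip this on, check your existing GRANTs: any account defined with a hostname should be expected to stop matching once name resolution is off.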
When the network connection between a master and slave database is interrupted in a way that neither side can detect (like a firewall or routing change), you must wait until slave_net_timeout seconds have passed before the slave realizes that something is wrong. It'll then try to reconnect to the master and pick up where it left off. That's awesome.
However, the default value is 3600 seconds. That's a full hour! FAIL.
Who wants their slaves to sit idle for that long before checking to see if something might be wrong? I can't think of anyone who wants that.
My suggestion, if you're in a busy environment, is that you set that to something closer to 30 seconds.
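On the slave, that might look like the following. It's a sketch that assumes the setting goes in the [mysqld] section of /etc/my.cnf; 30 is just the ballpark suggested above.

```ini
# /etc/my.cnf (on the slave)
[mysqld]
# Seconds to wait for data from the master before the slave decides
# the connection is broken and tries to reconnect.
slave_net_timeout = 30
```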