Full-stack Philosophies

James Morle's Blog


Simplicity Is Good

Posted on 7:57 am November 14, 2011 by James Morle

This is a post about the importance of appropriately simplistic architectures. I frequently get involved with the creation of full-stack architectures, and in particular the architecture of the database platform. There are some golden rules when designing such systems, but one of the most important ones is to keep the design as simple as possible. This isn't a performance enhancement; it is an availability enhancement. Complexity, after all, is the enemy of availability.

Despite it being a sensible goal, it is incredibly common to come up against quite stubborn resistance to simplicity. Frequently, the objections rest on the principle that the complex solution is a 'better way' to do things. I have two closely linked examples of this in action.

Case 1: Real Application Cluster Interconnects

A cluster interconnect is an incredibly important component of the architecture. The cluster exists, after all, as an availability feature (and possibly a scalability feature), and so the foundations of the cluster must be robust in order for it to deliver that availability. The cluster interconnect is the life blood of the cluster. And yet, it has such a very simplistic set of requirements:

  • Point-to-point communication between all nodes of the cluster
  • Low latency
  • n+1 availability of network paths
  • Multicast support between the nodes
  • (optionally) Jumbo frames support

It explicitly does not need any of the more 'fancy' networking features, such as:

  • routing of any kind
  • spanning tree support
  • VLANs
  • access to any other networks

It just needs a dedicated pair (or more) of discrete, layer 2 networks. They don't need to be bonded, the networks do not even need to be aware of each other - they are completely independent (at least, that is certainly the case since the HAIP functionality of Oracle 11gR2). They do need real switches though - crossover cables fail ungracefully in the event of a peer host losing power. But they don't need anything high end - just something better than crossover cables, and with enough bandwidth for the required traffic rates. The latency difference between the majority of switches is barely a consideration. The switches don't really even need redundant power supplies, though it's not a terrible idea to insulate yourself from this type of failure and it brings no detriment apart from a marginal cost increase.
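The requirements above are small enough to write down as a checklist. The following is a minimal sketch in Python, not a real tool: the `InterconnectNetwork` class, its field names, and the `check_interconnect` function are all hypothetical, made up purely to show how short the list of rules is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterconnectNetwork:
    """One private interconnect network (all field names are hypothetical)."""
    name: str
    switch: str                          # physical switch id; "" means a crossover cable
    routed: bool = False                 # True if traffic can leave the layer 2 segment
    shared_with_corporate: bool = False  # True if the switch is part of the corporate LAN


def check_interconnect(networks):
    """Return a list of problems; an empty list means the design passes."""
    problems = []
    if len(networks) < 2:
        problems.append("need at least two networks for n+1 path availability")
    switches = [n.switch for n in networks if n.switch]
    if len(set(switches)) < len(switches):
        problems.append("networks share a physical switch, so they are not independent")
    for n in networks:
        if not n.switch:
            problems.append(f"{n.name}: crossover cable instead of a real switch")
        if n.routed:
            problems.append(f"{n.name}: routing is not needed on an interconnect")
        if n.shared_with_corporate:
            problems.append(f"{n.name}: attached to the corporate network")
    return problems


# Two discrete layer 2 networks, each on its own switch.
simple_design = [
    InterconnectNetwork("priv1", switch="sw-a"),
    InterconnectNetwork("priv2", switch="sw-b"),
]
print(check_interconnect(simple_design))  # prints []
```

Note that nothing in the checklist asks for bonding, VLANs, or spanning tree: the simple design passes without any of them.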

So, something like a pair of unmanaged, layer 2 Gigabit Ethernet switches is the perfect solution. Something like a Netgear JGS516 would probably do the job, from a brief scan of the specification. They are about £100 ($150) each, for a net cost of £200 for a nice, robust solution. Or if you wanted to really push the boat out, something like a fully managed L2 switch with redundant power such as the HP E2810-24G will set you back all of £700 ($1000) each. Cisco shops might spend a bit more and go for something like a 3750G for about £2800 each.

But. Somebody will always push back on this. They will plumb the cluster nodes into the full core/edge corporate dream stack topology with fully active failover between a pair of core switches. Surely, at a cost more than four orders of magnitude higher than the bargain basement Netgear solution, this must be better, right? Wrong.

There are numerous aspects that are incorrect in this assumption:

  1. The assumption that higher cost means 'better'
  2. That there will be an increase in availability
  3. That every networking requirement is the same as every other

First of all, these network topologies are not designed for cluster interconnects. They are designed for corporate networks, connecting thousands of ports into a flexible and secure network. RAC interconnects are tiny, closed networks and need none of that functionality. More precisely, they need none of that complexity. Corporate networks also have a different level of failure sensitivity to cluster interconnects; if a user's PC goes offline for a couple of minutes, or even half an hour, the recovery from that failure is instant once the fault is rectified - the user is immediately back in action. Cluster interconnects are not so forgiving; if a cluster's networks go AWOL for a few minutes, the best you can hope for is a single node of the cluster still standing when the fault is rectified. That is how clusters are designed to operate: If the network disappears, the cluster must assume it is unsafe to allow multiple nodes to access the shared storage. The net result of this failure behaviour is that a relatively short network outage can result in a potentially lengthy full (and manual) restart of the whole cluster, restart of the application, balancing of services, warming of caches, and so on. It would not be an exaggeration for this to be a one hour or greater outage. Not terrific for a highly available cluster.
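To make that failure behaviour concrete, here is a toy model in Python. It is emphatically not Oracle Clusterware's actual eviction algorithm: the function name `surviving_nodes` and the tie-break rule (largest group wins, lowest node id breaks ties) are assumptions chosen for illustration. The only point it makes is that once nodes cannot talk to each other, at most one group may keep touching the shared storage.

```python
def surviving_nodes(partitions):
    """Pick the one group of nodes allowed to keep accessing shared storage.

    `partitions` is a list of non-empty sets of node ids that can still
    talk to each other. Toy rule (an assumption): the largest group wins,
    and ties go to the group containing the lowest node id. Every node
    outside the winning group must be evicted.
    """
    if not partitions:
        return set()
    return max(partitions, key=lambda group: (len(group), -min(group)))


# Healthy interconnect: one partition, everybody survives.
print(surviving_nodes([{1, 2, 3, 4}]))        # prints {1, 2, 3, 4}

# Interconnect gone completely: every node is alone, so one node is left.
print(surviving_nodes([{1}, {2}, {3}, {4}]))  # prints {1}
```

The second call is the "single node still standing" outcome described above, and everything after it (restarts, service rebalancing, cache warming) is manual recovery work.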

But hang on a minute - this über-expensive networking technology never goes down, right? Not true. What exactly is this active/active core switch topology? Think about it. It's a kind of cluster itself, with each switch running complex software to determine the health of its peer and managing a ton of state information between them. The magic word in that sentence was the word software - anything that is running software has a great deal of failure potential. Not only that, but clustered software has a great deal of potential to fail on all nodes concurrently. This is a unique attribute of distributed software and one that does not exist in discrete hardware designs. In discrete hardware designs it is incredibly unlikely that more than one component will fail concurrently. Software is great at catastrophic failure, most particularly when it is combined with some element of human error during upgrades, reconfiguration, or just plain tinkering. Not even humans can make two independent hardware switches fail concurrently, unless they are being creative with power supply.
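A back-of-envelope calculation shows why the shared software matters so much. The sketch below is in Python and every number in it is invented for illustration; neither the probabilities nor the function names come from any vendor or measurement. It assumes two discrete switches fail independently, while a clustered pair has an additional common-mode event (a software bug or a bad change) that takes out both at once.

```python
def both_paths_down_independent(p_switch):
    """Probability that both discrete switches are down at once,
    assuming their failures are independent."""
    return p_switch * p_switch


def both_paths_down_clustered(p_switch, p_common):
    """Same question for a clustered pair, with an extra common-mode
    event (software bug, bad change) that fails both switches together."""
    independent = p_switch * p_switch
    # Either the common-mode event happens, or it does not and both
    # switches happen to fail on their own.
    return p_common + (1 - p_common) * independent


# Hypothetical figures, chosen only to show the shape of the result.
p_switch = 0.001   # chance a single switch is down in some time window
p_common = 0.0005  # chance of a cluster-wide software failure in that window

discrete = both_paths_down_independent(p_switch)
clustered = both_paths_down_clustered(p_switch, p_common)
print(f"discrete pair : {discrete:.2e}")
print(f"clustered pair: {clustered:.2e}")
print(f"ratio         : {clustered / discrete:.0f}x")
```

With these made-up inputs the common-mode term dominates completely: the chance of two independent switches failing together is so small that almost any shared software failure mode swamps it.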

Just to highlight this point, I should state here that I have personally witnessed failures of entire core/edge switch topologies on three occasions in the last five years. It does not matter that the cluster nodes are connected to the same edge switches when this kind of failure occurs, because every component in the network is a logical contributor to the larger entity and will become unavailable as part of a larger meltdown. If you are a Blackberry user, you have experienced one yourself recently. The Blackberry issue proves the potential, but in their case the topology was at least appropriate - they have a requirement to interconnect thousands of devices. In our clusters, we have no such requirement and we should not be implementing overly complex and thus unreliable network topologies accordingly.

Case 2: The Great SAN Splurge

Now let's think about Storage Area Networking. And let's not restrict this thought to Fibre Channel, because the same principles apply to an Ethernet-based SAN. In fact, let me just clear off the Ethernet SAN piece first: Don't use your corporate network for storage connectivity. It's the wrong thing to do for all the reasons stated in the first case on this page.

So, now we can focus on Fibre Channel SANs. Fibre Channel has become the backbone of the data centre, allowing storage devices to be located in sensible locations, perhaps in different rooms to the servers, and for everything to be able to be connected to everything else with optimised structured cabling. The zoning of the fabric then determines which devices are allowed to see other devices. All very well and good, but how is this implemented? Unsurprisingly, this is implemented using an exactly analogous solution to the core/edge Ethernet network design in the previous case. Two active core switches lie at the heart of a multi-tier network and provide failover capability for each other. A cluster. This cluster can (and does) fail for exactly the same reasons given in the former case, and yes, I have also seen this occur in real life - twice in the last five years.

The failure implications for a SAN meltdown can be even more serious than a cluster meltdown. All I/O will stop and, if the outage goes on long enough, all databases in the data centre will crash and need to be restarted.

There are a few other implications with this topology in large data centres. Notably, it is common for the storage arrays to be connected via different physical switches than the servers, implying that there are a number of Inter Switch Links (ISLs) to go through. These ISLs can become congested and cause severe bottlenecks in throughput which can be extremely tricky to track down. In extreme cases, ISLs can be the cause of multi-minute I/O response times, which will also cause clusters and databases to crash.
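One rough way to reason about ISL congestion is the oversubscription ratio: how much host-side bandwidth could, in the worst case, be funnelled through the links between two switches. The Python sketch below uses hypothetical port counts and speeds, and the function name `isl_oversubscription` is made up for this example; real fabrics need measurement, not just arithmetic.

```python
def isl_oversubscription(host_ports, host_port_gbps, isl_count, isl_gbps):
    """Ratio of total host-side bandwidth to total ISL bandwidth
    between two switches. A ratio of 1.0 means no oversubscription."""
    if isl_count <= 0 or isl_gbps <= 0:
        raise ValueError("need at least one ISL with positive bandwidth")
    return (host_ports * host_port_gbps) / (isl_count * isl_gbps)


# Hypothetical edge switch: 48 host ports at 8 Gb/s sharing 4 ISLs at 8 Gb/s.
ratio = isl_oversubscription(host_ports=48, host_port_gbps=8, isl_count=4, isl_gbps=8)
print(ratio)  # prints 12.0
```

A database platform with its own pair of switches connected straight to the array has no ISLs in the path at all, so this ratio never enters the picture.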

So that preamble paints the SAN picture, and sets the stage for the following questions:

Why are all devices in the SAN connected to all other devices? Why are the handful of nodes that make up your critical database part of a SAN of thousands of other devices? Why are they not just connected via simple switches to the storage array?

There is only one reason, and that is data centre cabling. But it doesn't really follow: If your database servers are in a rack, or a few racks next to each other, put a pair of physically and logically discrete switches into the top of the rack, attach all the nodes, and then connect the storage array using the same number of ports that you would have connected to the switches if they had been edge switches. The destination of those cables would be the storage array rather than the core switches, but the number of cable runs is pretty much the same, and results in a more robust solution. There is no exposure to catastrophic loss of service in the SAN, because there are two completely discrete SANs between the servers and the storage.

Fibre Channel networks are vertical in nature: server nodes do not communicate with other server nodes over the SAN, they only communicate with the storage array. Server nodes do not need to be connected to thousands of storage arrays, either. The connectivity requirement for a given platform is actually rather simplistic.

Note: I am writing from the viewpoint of a typical RDBMS implementation, not from the viewpoint of massively parallel HPC or big data systems. Clearly, if there truly are thousands of devices that do need to be connected, this argument does not apply.


The common theme between these two cases is this: Don't connect things that don't need to be connected. Yes, it is easier to cable up, and arguably easier to manage, but it has a knock-on effect of dictating an implementation that does not suit the requirement. It results in a less reliable, more complex solution, with the cart very much before the horse. Don't trade off administrative simplicity against architectural simplicity: It will sneak up and bite you.

As Albert Einstein said, "Make things as simple as possible, but not simpler." Wise words indeed.

6 comments on “Simplicity Is Good”

    • It's a good point, though a little off topic for the point I was trying to make. The fencing mechanism in RAC is indeed a little Heath Robinson/Rube Goldberg...

  1. Interesting points. Surely the reason all servers are connected to all other servers on the SAN fabric (database server or not) is because they are all sharing the same array(s)? You can't connect your database servers directly to the array (or, as you say, indirectly via fabric dedicated to database servers) simply because there's a load of other junk on the array. After all, how would you get your MS Exchange Server data to interfere with your batch run or RMAN backup otherwise 😉

    • Hi Simon,

      You definitely can do this. And should. Your database needs a certain amount of port bandwidth in and out of the array - just use those ports for the dedicated fabric. The other ports can be used for all that other junk!

