James Morle's Blog
Sane SAN2010: Storage Arrays – Ready, Aim, Fire
Posted on 1:03 pm September 6, 2010 by James Morle
OK, this one might be contentious, but what the heck - somebody has to say it. Let's start with a question:
Raise your hand if you have a feeling, even a slight one, that storage arrays suck?
Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is it that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os while the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.
Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually - it is an ex-format, it is pushing up the lilies. More on that in another blog post, though - I'm not done with the array yet!
The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where ESCON was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic - an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children's television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight year old knows that you can't share raw data that way; it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also carries the implication that all the types of data are the same - database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is still offered under the guise of having all the data in one place - just on cheap or expensive drives.
So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this? Everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel's back (it's been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:
- 10 megabytes per second write, split into around 1300 IOPs
- 300 reads per second
If you don't have a good feel for I/O rates, let me tell you: that is a very moderate amount of I/O. And yet the array is consistently returning I/Os that take several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let's have a little analysis of those numbers.
First, the writes: half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let's assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let's conservatively go for 100 - that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that writes don't even need to make it to disk before they can be acknowledged to the host/database. Why, then, is this array struggling to deliver low-latency I/O when it has sixty-four drives inside it?
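If you want to check that arithmetic yourself, here is a minimal sketch of it. The per-drive figures (100 random IOPs per 15k drive, one drive absorbing the whole sequential redo stream) are the conservative assumptions stated above, not measurements from this particular array:

```python
import math

# Back-of-envelope drive count for the workload described above.
# Per-drive figures are conservative assumptions, not measured values.
write_iops_total = 1300      # total write IOPs demanded by the database
redo_fraction = 0.5          # roughly half are sequential redo log writes
random_writes = write_iops_total * (1 - redo_fraction)   # ~650 random DBWR writes/s
reads = 300                  # random reads per second
per_drive_random_iops = 100  # conservative random IOPs per 15k drive

redo_drives = 1              # one drive easily absorbs ~5MB/s of sequential redo
dbwr_drives = math.ceil(random_writes / per_drive_random_iops)  # 7
read_drives = math.ceil(reads / per_drive_random_iops)          # 3

print(redo_drives + dbwr_drives + read_drives)  # 11 drives, before any cache magic
```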
The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:
(number of features) x (complexity of feature) = (total complexity)
Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: 'no'.
Yes, we can go right through the I/O stack, including all the components and software features of the array, and fix them up. We can make sure that Fibre Channel ports are private to the host, and remove all other workloads from the array so that there are no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can lay out the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It's called Direct Attached Storage (DAS): Welcome to the 1990s.
Now let me combine this reality with the other important aspect: semiconductor-based storage. What happens when the pent-up frustrations of thousands of storage array owners meet the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore's Law? As my business partner Jeff describes it: It's gonna be a bloodbath.
I think that we will now see a sea change in the way we connect and use storage. It's already started with products such as Oracle's Exadata. I'm not saying that because I am an Oracle bigot (I won't deny that), but because it is the right thing to do - it's focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don't think it's plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS - a Windows virtual machine running a virus scan won't slow down your critical invoicing run.
I think that the way we use the storage will have to change too - storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.
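To put rough numbers on that last point - and these are purely illustrative assumptions, not figures from any particular system - suppose the host, protocol and array software between the database and the medium add about half a millisecond to every request. Against a spinning disk that overhead is noise; against flash it becomes most of the I/O:

```python
# Illustrative only: an assumed fixed request overhead against assumed media service times.
request_overhead_ms = 0.5    # assumed host + protocol + array software cost per I/O

for medium, service_ms in [("15k disk", 8.0), ("flash", 0.1)]:
    total_ms = request_overhead_ms + service_ms
    share = 100 * request_overhead_ms / total_ms
    print(f"{medium}: {total_ms:.1f}ms per I/O, {share:.0f}% of it spent making the request")
```

With those assumed figures, the same overhead goes from around 6% of a disk I/O to over 80% of a flash I/O - which is why the request path itself now matters so much.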
With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don't think so.