Full-stack Philosophies

James Morle's Blog


Sane SAN2010: Storage Arrays – Ready, Aim, Fire

Posted on 1:03 pm September 6, 2010 by James Morle

OK, this one might be contentious, but what the heck - somebody has to say it. Let's start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than what they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is it that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os while the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.
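
One way to see the host's side of that discrepancy is to time single-block reads yourself, bypassing the filesystem cache, and look at the percentiles rather than the average. The sketch below is a minimal, Linux-only illustration in Python; the device path, the 1 GiB sampling range and the 8KB block size are all assumptions for the example, and in real life you would lean on the database wait interface and the usual OS tools anyway.

    import os, mmap, time

    # Assumptions: Linux, Python 3, a quiet target at least 1 GiB in size,
    # and enough privilege to open it. PATH is hypothetical - substitute your own.
    PATH    = "/dev/sdb"
    BLOCK   = 8192                    # one 8KB database block
    SAMPLES = 1000
    RANGE   = (1 << 30) // BLOCK      # spread offsets across the first 1 GiB

    fd  = os.open(PATH, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
    buf = mmap.mmap(-1, BLOCK)                       # page-aligned buffer, required by O_DIRECT

    lat_ms = []
    for i in range(SAMPLES):
        os.lseek(fd, ((i * 1237) % RANGE) * BLOCK, os.SEEK_SET)
        t0 = time.perf_counter()
        os.readv(fd, [buf])                          # one synchronous single-block read
        lat_ms.append((time.perf_counter() - t0) * 1000)
    os.close(fd)

    lat_ms.sort()
    print("median %.2f ms  99th %.2f ms  max %.2f ms" %
          (lat_ms[len(lat_ms) // 2], lat_ms[int(SAMPLES * 0.99)], lat_ms[-1]))

Run that against the local drive in your desktop and then against an array LUN, and the 99th percentile and maximum are usually where the arguments start.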

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and in the array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel, actually - it is an ex-format, it is pushing up the lilies. More on that in another blog post, though; I'm not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where ESCON was available. Fibre Channel, and subsequently Ethernet and InfiniBand, enabled the array to become a big shared storage resource. For clustered database systems this was fantastic - an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children's television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight-year-old knows that you can't share raw data that way; it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That carries the implication that all types of data are the same - database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is still offered under the guise of having all the data in one place - just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this? Everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel's back (it's been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don't have a good feel for I/O rates, let me tell you: that is a very moderate amount of I/O. And yet the array is consistently returning I/Os that take several hundred milliseconds, both reads and writes. Quite rightly, the customer does not find this acceptable. Let's have a little analysis of those numbers.

First, the writes: half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let's assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let's conservatively go for 100 - that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that writes don't even have to reach the disks before they are acknowledged to the host/database. Why, then, is this array struggling to deliver low-latency I/O when it has sixty-four drives inside it?
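
For anyone who wants to poke at that arithmetic, here it is as a trivial sketch. The per-drive figures are the same deliberately conservative assumptions as in the paragraph above, not measurements from this particular array.

    # Back-of-envelope drive count from the figures above (assumptions, not measurements)
    write_iops      = 1300   # total write IOPs demanded by the database
    redo_fraction   = 0.5    # roughly half of the writes are sequential redo
    read_iops       = 300    # random reads per second
    drive_rand_iops = 100    # deliberately conservative for a single 15k drive

    random_writes = int(write_iops * (1 - redo_fraction))    # 650
    write_drives  = -(-random_writes // drive_rand_iops)     # ceiling division -> 7
    redo_drives   = 1                                        # one drive streams the redo comfortably
    read_drives   = -(-read_iops // drive_rand_iops)         # ceiling division -> 3

    print(write_drives + redo_drives + read_drives)          # 11 drives, versus the 64 in the array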

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. The more complex the system, the more likely it is that performance problems will arise. To make things worse, storage arrays have increasingly encouraged the admin to turn on more and more magic in the software, to the point where it is now frequently impossible for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does with performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: 'no'.

Yes, we can go right through the I/O stack, including all the components and software features of the array, and fix them up. We can make sure that Fibre Channel ports are private to the host, and remove all other workloads from the array so that there are no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can lay out the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It's called Direct Attached Storage (DAS): welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based storage. What happens when the pent-up frustration of thousands of storage array owners meets the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore's Law? As my business partner Jeff describes it: it's gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It's already started with products such as Oracle's Exadata. I'm not saying that because I am an Oracle bigot (I won't deny that), but because it is the right thing to do - it's focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don't think it's plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS - a Windows virtual machine running a virus scan won't slow down your critical invoicing run.

I think that the way we use the storage will have to change too - storage just took a leap up the memory hierarchy. Low-latency connectivity such as InfiniBand will become more important, as will low-latency request APIs such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.
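
To put some rough numbers on that last sentence, here is a toy comparison. Every figure in it is an order-of-magnitude assumption of mine - roughly 7ms for a random read from a 15k drive, 100 microseconds from flash, a few hundred microseconds through a typical FC and array software path, and a few tens of microseconds over InfiniBand with SRP - not a benchmark of any product.

    # Illustrative latency budgets, in microseconds; assumed order-of-magnitude figures only
    media = {
        "15k disk random read": 7000,
        "NAND flash read":       100,
    }
    request_path = {
        "FC + array software stack": 500,
        "InfiniBand + SRP":           20,
    }

    for m_name, m_us in media.items():
        for r_name, r_us in request_path.items():
            share = 100.0 * r_us / (m_us + r_us)
            print("%-22s via %-26s: request path is %4.1f%% of the total"
                  % (m_name, r_name, share))

With a spinning disk behind it, the request path is noise; put flash behind the same path and it suddenly accounts for most of the round trip. That is why the plumbing now matters.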

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don't think so.


10 comments on “Sane SAN2010: Storage Arrays – Ready, Aim, Fire”

  1. Hi James, your thinking is quite right; however, SAN has an advantage that is just starting to be properly used. Imagine that you have 5x1TB mission-critical applications/databases. You need a lot of resource to support those (let's assume you need a Test env, Dev env, Sandbox env, etc.). So in the old style that's 4x5x1 TB = 20 TB, even if in reality 99% of the dev, sandbox, test and production data are identical; in reality you just need something like 5x1+1 TB of data (that's 6TB, not 20TB). Start thinking about performance and words like IOPS appear (IOPS not capacity; IOPS in 2010 means SSD/Flash/NVRAM, not FC/SAS disks).

    There is also one more aspect to this: if your server goes down you are just able to connect the SAN from a different server. Easier than re-attaching the DAS... it looks like clusters and SAN typically go with each other.

    Now the FC argument is right; it looks like technologies such as NFS-RDMA or, on 10GigE, (Direct)NFS and pNFS are the future... but you are not able to boot from NFS (so back to some iSCSI/FC/FCoE)... You might ask why boot from SAN? I reply with another question: why lose 2 disks to RAID-1 on 100 servers that don't do anything?

    -Jakub

    • Jakub,
      I think the world is changing. There are other ways to provision 42 Dev databases these days, notably Delphix, that do a better job than an all-purpose storage array. And I think you might have missed the point about IOPS in my article: Storage Arrays are not good at latency and therefore are not the best tool for IOPS, regardless of the technology used for the persistent storage.
      Your point about DAS is very valid, and I did not mean to imply that we should physically go back to DAS, just functionally. Multi-server attachment of storage is still mandatory!
      I'll get back to the FC issue in another blog post, currently half finished... :)

      James

  2. Hi James,

    A really interesting article, thanks.

    My hand is raised! :-)

    Just a small point: your calculation in the c.9th para regarding the theoretical disks required to deliver good IO omits any kind of resilience/redundancy, which possibly weakens the argument citing 11 vs 64 drives.

    Perhaps you could include a wikipedia/reference link for "semiconductor-based storage"?

    Another point that gets missed when SANs are sold as providing "xxTB capacity", as well as IOPS, is throughput. In a DW environment where big volumes of data are being shifted on and off disk this can really cause problems.

    • Hi Robin,

      Thanks for stopping by. Yes, I would concede that some allowance needs to be made for redundancy, but let's be generous and mirror them - it's still only 22 drives of the 64, and we would have way more read IOPs available to us at that point :)
      When I refer to "semiconductor-based storage" I am referring to my own adoption of that term in my flash storage article to generally refer to storage that is based on some kind of semiconductor technology such as NAND flash, DRAM, and forthcoming technologies.
      DW bandwidth, yes, I agree. High bandwidth is yet another specialised problem that general-purpose arrays tend to do badly. It is quite possible to build an array to deliver good DW performance, but it takes plenty of planning and design to do so, and the array is then essentially a private device for the warehouse. Full circle back to the original premise :)

      Cheers

      James

  3. I/O performance analysis is hell, but, coming from an organisation which runs DBs in the 100s of TB and total storage in the 10s of PB, I'd say that shared storage systems can be made to work without too much overhead.

    A few points

    1) talk of InfiniBand vs FC for saving a few 10s of microseconds is generally an irrelevance (keep that for cluster interconnects). For the vast majority of current apps, the difference between an I/O taking 10 and 100 microseconds will be irrelevant. By that time the bottleneck will be somewhere else entirely.

    2) spread data as widely as possible and don't think you can second-guess the behaviours of apps or know what they will be doing in a year's time. Dedicating fixed disks to particular functions is asking for trouble.

    3) looking through the large number of AWRs and system stats that we have, virtually all the DB I/O problems we see could become insignificant with a reduction in random I/O latency from 5ms to 0.5ms. That's perfectly achievable with even quite complex FC setups. Beyond that, almost all our DB apps choke on something else; typically CPU.

    4) get the basic designs, principles and policies of your infrastructure and detailed deployment worked out and stick to them. Don't, except for very special projects, do bespokes.

    5) Datacentres are colonies of different age equipment, standards and so on. In the real world, standards have to be widely supported over a long period of time.

    6) I'd certainly agree that data sharing is about software, not about having the data all in one box (although that can help - it's what NAS is all about and, after all, what else is a DB server but a way of sharing data). That bit of EMC advertising was always tripe, save for some specialist things like mounting SNAPs.

    7) don't forget about geography. The speed of light is a finite thing, and once you have to synchronise an I/O over any distance it becomes increasingly important. It's even significant in large data centres, and if you are doing sync replication (of any sort) to a DR site at any distance, then you can forget (effective) I/O write times in 10s of microseconds - it's 10s of milliseconds. The only way round these problems is through good application design (well, there are others, with staging boxes, but they are horrific).

    Now it's just possible that Oracle might well come up with the appliance that will do the lot (at least to support their product set). We might yet have this virtualised mega-scalable box which can run ERP apps, middleware and databases with integrated storage, and we will never again worry about infrastructure. But I doubt it - especially if they build it with T-series processors that have their own (processing) latency issues.

    Needless to say, the cause of 500ms latencies reported by DBs is nothing at all to do with the fundamental capabilities of FC or Ethernet, but all about either inadequate design/planning or implementation (in which I include the array manufacturers, who too often have built boxes incapable of getting within an order of magnitude of their theoretical capability). However, these things will be (and are being) solved, just so long as we can stop the accountants thinking that the only metric worth knowing is the £s per GB one.

    • Steve,

      Thanks for the comments, and I more or less agree with most of your points. I'm assuming that your comments are against a number of my recent posts, rather than just the storage array one your comments are logged against? Most of your points seem to relate to my Fibre Channel post rather than the storage array post.
      The part of your comment that I want to really underline is the talk of the application - I had carefully avoided this in the articles, because I didn't want to run off down a (lengthy) side topic. So yes, a well thought out application should do little I/O in the first place, and probably has scaling issues elsewhere in the stack such as latch contention, CPU overload or latency between the tiers. The fact still remains that some applications, notably ones in the financial sector, really are susceptible to even single I/Os that are out of normal range. Another thing to ponder is: what kind of applications can be written if the storage is dramatically faster and more scalable? We are currently limited greatly in our capabilities by the latency of I/O.
      And nice point on the T series, I wish they would drop that thing.
      Cheers

      James

  4. Pingback: Rise of the appliances? « Irrelevant thoughts of an oracle DBA

  5. Pingback: OSP #2b: Build a Standard Platform from the Bottom-Up | Ardent Performance Computing
