James Morle's Blog
SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough
Posted on 2:51 pm August 23, 2010 by James Morle
I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.
The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for 'free buffer waits' and 'log file parallel write' during the problematic period. They were quite impressive, too - sixteen second latencies for some of the writes.
The customer was not oblivious to this fact, of course - it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.
Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.
Let's go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given 'lump of storage' would contain. All well and good.
The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that's where my story begins.
I'm going to cut to the chase here, as most of us don't have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:
- SRDF/S can only have one outstanding write per hypervolume - that's a serialisation point.
- The filesystem in use was not deployed with any kind of ODM, Quick I/O, or other UNIX file locking bypass technology - that's a serialisation point.
- The database was not using Async I/O - that's (pretty much) a serialisation point.
There you go - 1, 2, 3: a serialisation point from each of the three teams, none of which were understood by the other teams. Let's step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on 'free buffer waits', so let's start there):
- DML creates dirty buffers
- Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
- Oracle posts the DBWR process(es) to write out dirty buffers
(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)
- DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the lack of Async I/O configuration in the database
- The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
- The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That's the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.
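The steps above can be put into a rough back-of-envelope model. This is only a sketch under stated assumptions, not a measurement from the system described: the function name, the write count, the 5 ms latency, the DBWR count, the number of busy datafiles and the hyper count are all hypothetical, and the model ignores queueing, caching and variable latency. It simply treats each layer as a cap on the number of writes in flight and lets the narrowest cap win.

```python
import math

def drain_time_ms(n_writes, latency_ms, dbwr_procs, async_io,
                  files, file_locking, hypers):
    """Rough time to flush n_writes when each layer caps writes in flight.

    All parameters are illustrative assumptions, not values from the post.
    """
    # Database layer: without async I/O, each DBWR has one write in flight.
    db_limit = math.inf if async_io else dbwr_procs
    # Filesystem layer: a per-file write lock allows one write per datafile.
    fs_limit = files if file_locking else math.inf
    # Array layer: SRDF/S allows one outstanding write per hyper.
    array_limit = hypers
    # The narrowest layer decides how many writes really run in parallel.
    in_flight = min(db_limit, fs_limit, array_limit)
    return math.ceil(n_writes / in_flight) * latency_ms

# Hypothetical batch: 10,000 dirty buffers, 5 ms per replicated write,
# 4 DBWRs, 2 busy datafiles, 64 hypers behind the metavolumes.
base = dict(n_writes=10_000, latency_ms=5, dbwr_procs=4, files=2)

as_found = drain_time_ms(**base, async_io=False, file_locking=True, hypers=64)
more_hypers = drain_time_ms(**base, async_io=False, file_locking=True, hypers=128)
no_locking = drain_time_ms(**base, async_io=False, file_locking=False, hypers=64)
all_fixed = drain_time_ms(**base, async_io=True, file_locking=False, hypers=64)

print(as_found, more_hypers, no_locking, all_fixed)  # 25000 25000 12500 785
```

With these made-up numbers, adding hypers changes nothing while the file lock is in place, because the array-level limit is never the one being hit.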
It's easy to see that, when all the write I/Os are being fed to the DMX one at a time, the additional latency of having SRDF in the equation makes a really big difference. It's also easy to see that, by turning off SRDF, the problem will get less severe. I'm not defending EMC here; they deserve everything they get when it's their fault. It just isn't primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it's still fixing a downstream bottleneck.
The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin's design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files is just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I've seen in 3 years... I'm not saying that the people that implement the systems this way are stupid, but it's pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.
The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don't believe that this filesystem would support Async anyway until the file locks were bypassed, so it's two for the price of one here.
With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.
What's the lesson here? There are two, really:
- Just knowing 'your area' isn't enough.
- If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.
EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason is that file locking enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, thus eliminating any parallelism that might be achievable across the hypers under SRDF/S. FYI, metavolumes have a stripe width of 960KB, so it's really likely that any single write will only go to one hyper.
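To make the stripe arithmetic concrete, here is a minimal sketch of which hypers a single write would touch. It assumes plain round-robin striping across the hypers of a metavolume, takes the 960KB stripe width from the text, and uses a hypothetical 8-hyper meta and 8KB database block size; the function name is made up for illustration.

```python
STRIPE_BYTES = 960 * 1024   # metavolume stripe width quoted in the post
BLOCK_BYTES = 8 * 1024      # assumed database block size (hypothetical)

def hypers_touched(offset, length, n_hypers):
    """Return the hyper indices covered by a write, assuming round-robin striping."""
    first_stripe = offset // STRIPE_BYTES
    last_stripe = (offset + length - 1) // STRIPE_BYTES
    return sorted({s % n_hypers for s in range(first_stripe, last_stripe + 1)})

# A block-aligned 8KB write never straddles a stripe, since 960KB = 120 * 8KB.
print(hypers_touched(119 * BLOCK_BYTES, BLOCK_BYTES, 8))   # [0]
print(hypers_touched(120 * BLOCK_BYTES, BLOCK_BYTES, 8))   # [1]
# Only a large or misaligned write spreads over more than one hyper.
print(hypers_touched(0, 1024 * 1024, 8))                   # [0, 1]
```

Under these assumptions, one write per file at a time means one busy hyper per file at a time, however many hypers sit behind the metavolume.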