TOC PREV NEXT INDEX

Scale Abilities Ltd Logo


Chapter 2
Hardware Architectures and I/O Subsystems
his chapter will provide an overview of UNIX server hardware architectures. In the interests of greater scalability and lower price/performance, the architectures available today are not as straightforward as they were even three years ago. It is important to understand the distinctions among these platforms and how to configure them for maximum performance.
2.1 Introduction to Hardware Architectures
Gone are the days of the uniprocessor. Going are the days of the shared-bus symmetric multiprocessor. Building more and more powerful systems at lower and lower cost is very challenging for the hardware architect. Moore's Law1 has been holding very true from a processor perspective, with virtually no impact on the retail cost of the hardware. The rest of the system, however, has been lagging behind the processor advances. For example, memory speeds are certainly not doubling every 18 months, and

disk speeds are lagging even farther behind. The traditional method of connecting multiple processors and memory cards together over a shared bus started to hit the practical limits of known physics sometime around 1996 (more on the specifics in "Bus Architecture" in Section 2.1.2).
Mismatches in performance among system components, combined with the economic driving force toward production of ever cheaper hardware, have forced hardware architecture in certain directions. In order to walk our way through these new architectures, it makes sense to sort them by level of complexity, and so our path goes as follows:
1. Single-processor architectures
2. Symmetric multiprocessors, including shared-bus and crossbar switch systems
3. Clustered Symmetric Multiprocessors
4. Massively Parallel Architectures
5. NonUniform Memory Architectures
6. Hybrid Systems
For each of these architectures, we will look at the hardware and software modes that comprise it and at how Oracle is implemented on such a system.
Before we jump in, there are several concepts that you need to understand in order for the descriptions of each architecture to be the most useful.
2.1.1 System Interconnects
Even within the most fundamental modern computer there is some kind of system interconnect. What is an interconnect? It's a communication mechanism shared by several components, designed to allow any-point to any-point communication among the components. By definition, the system interconnect is very much a shared device, so it often becomes the subject of scaling conversations.
There are essentially two different reasons for system interconnects, and all systems have to have at least one of them:
1. Connecting components within the system
2. Connecting systems together
When we talk about interconnects connecting systems together, we do not include LAN connections. Interconnects provide more of an intelligent link between the systems than a LAN provides, typically at lower latency and greater bandwidth.
The first type of interconnect, one that connects the components within the system, is the one in which we are primarily interested in this book, although we will address the second type when we discuss of clustered systems (Section 2.5) and massively parallel processor (MPP) systems (Section 2.6).
For many years, the principal method of connecting components together within a system has been the shared system bus. A bus is the simplest type of interconnect and is the one on which most high-end UNIX kernels have been engineered to work. Therefore, many of the operations found in modern interconnects are directly synonymous with their old shared bus equivalents. For this reason, we will concentrate on a bus architecture for the concepts, diversifing where necessary to incorporate newer interconnect technology.
2.1.2 Bus Architecture
A bus must support several concurrently connected devices, such as a CPU, memory, and I/O controllers. In order to achieve this, and to allow the devices to communicate without interfering with each other, communication across the bus is handled by means of bus transactions. A bus transaction is very similar to a database transaction in that it is an atomic packet of work that is initiated and completed without interruption. All devices on the bus communicate across it in this way.2 For example, if a CPU on the system needs to load a cache line from main memory, it will initiate a read across the bus. The bus logic will turn this into a bus transaction to perform the read and will send the request over the bus. At this stage, it's safe to think of the bus as being arranged as shown in Figure 2.1.
The CPU will then wait while the bus logic sends both the request to the memory controller and the data back to the CPU from the appropriate memory locations.
I have used the term "Magic Transport" in Figures 2.1 and 2.2 to allow these illustrations to represent bus-based interconnects or indeed any type of interconnect method. In the case of a bus-based interconnect, the magic is simply a backplane within the system that has all devices connected to it. In this way, any device is connected to all other devices.
This is where the "practical limits of physics" mentioned earlier come into play. A bus is essentially a load of wires, one for each bit of the n-bit width of the bus, plus some extras. These wires (or tracks on a circuit board) all run in straight lines, parallel to each other, with periodic "breakouts" to connectors that allow devices to be connected to the bus.
If we ignore all the other electrical challenges of building such a bus, a limitation still exists. That limitation is the speed of an electron. All electrical flow is made up of electrons passing through a conductor, and the electrons are capable of traveling only up to a fixed maximum speed. Within a bus, all messages (electrical signals) must arrive at all devices within the clock frequency of the bus. Therefore, the electron must be capable of traveling the full distance of the bus in the time window specified by the clock rate of the bus.
To highlight how limiting this is, we will go through an example that, instead of the speed of an electron, uses the fastest known speed-the speed of light (the speeds are similar anyway). Let's assume that we have a "light-based computer," which is identical to today's shared-bus computers but uses light instead of electrons to transport the signals. The light all travels in a vacuum, and there is no overhead introduced anywhere, so the speed of light within the machine is 2.99792458 ¥ 1010 cm s-1, or around 300,000 Km per second. If the bus is running at a rate of 200MHz (a mere fraction of today's CPU core speeds), then the maximum width of this bus would be 150 cm, or about 5 feet.
Clearly, we are some way from a light-based computer, and even if we had one, it doesn't look promising for the long-term future of a shared-bus architecture. Five feet is not very wide for an absolute maximum size backplane, especially once you start to reduce it due to the internal signal path length within each card.

Now that we have a simplistic view of a bus from a point-to-point perspective, we can complicate the issue by adding more devices to the bus. The bus now becomes a many-to-many communication device and thus needs some way to manage multiple requests at once, in order to prevent the devices from corrupting each other.
This can be achieved in either of two ways. The first, simplest, cheapest, and least scalable way is to have a single bus master. A bus master is a device capable of initiating reads and writes across the bus. A CPU has to be a bus master, and so to limit ourselves to a single bus master is to limit ourselves to a single CPU-clearly not very scalable. In order to get around this restriction, the bus needs to support bus arbitration.
Bus arbitration allows multiple bus masters to exist on the bus but allows only one device to perform a transaction at any one time. The arbitration logic within the bus prevents concurrent transactions and shares out the access among the bus masters in a (randomized) fair-share manner. A good way to view the bus at this point is shown in Figure 2.2.
Now we have a concurrent, multiple bus master configuration. All access to the bus is now serialized, allowing atomic memory operations. These atomic memory operations are essential for implementing any kind of locking mechanisms in the software layer.
Test and Set
One such locking mechanism uses the test-and-set machine code instruction available on many CPUs3. This is an atomic instruction that tests a memory location for a value and, depending on the result, changes the value to something else. This is used as "test to see if the lock is free, and if so, give it to me," and is an implementation specific that allows processes to "spin on a latch" during active waits.
2.1.3 Direct Memory Access (DMA)
In order to increase the concurrency of the system and decrease the load on the CPUs, there is the concept of "intelligent" I/O controllers that do not require intervention from the CPU to do their work. All that the CPU is required to do is to set up the memory required to do the work and initiate a request to the controller to execute the work.
Let's use a SCSI (Small Computer Systems Interface) controller as an example of this. Figure 2.3 shows a non-DMA-based write operation from memory through the I/O controller.
All of the data needs to go through the CPU before being written to disk. The I/O controller is just a dumb controller that waits for data submissions and requests.
The DMA controller eliminates this requirement by building some intelligence into the controller itself. The card is configured with full bus mastering logic and maintains a memory map of the system memory. When the CPU needs to initiate an I/O request, it simply posts the DMA I/O controller with the location of the data and then resumes work on other tasks, as shown in Figure 2.4. Once the I/O is complete, the controller sends a hardware interrupt to the CPU in order to signal completion.
2.1.4 Cache Coherency
Cache coherency is an essential attribute in shared-memory multiprocessors. A system exhibits cache coherency when it is guaranteed that a read of any memory location, from any processor, will always return the most recent value of that location.
The big deal regarding cache coherency is that because of the way most modern processor caches operate, and because there are multiple processors potentially reading and writing from the same physical memory location at any one time, it is very possible to get different values from the same memory location by different processors unless steps are taken.
We have seen that caches form an essential part of all high-performance systems, because they mitigate the differences between the memory speed and the core speed of the CPU. However, when there are multiple caches in the system, each of these caches could potentially have different values for a given memory location. To compound this, caches are typically operated in write-back mode to gain the most benefit from the cache. In this mode, when a processor changes the value of a memory location, the cache does not necessarily write the new value back to memory at that time, in order to minimize the wait time on slow memory devices. The main memory is normally updated at a later time by the operating system. So at any one time it could be possible for main memory, and all the caches, to have different values for the same memory location.
To demonstrate the effect of this problem, assume that we have a system with two processors, each with a write-back-based cache, but no measures taken to maintain cache coherency. Running on this system is a single program that increments a counter four times, with an initial value of zero for the counter. Assume that when the program starts, all caches are empty. I have also taken the liberty of making the situation as bad as possible by executing each counter increment on alternate CPUs each time. This is done purely to demonstrate the problem in an exaggerated way for clarity. The result is shown in Table 2.1.
Walking through the example, during the first increment, a cache miss occurs in CPU 1, and the value (0) is loaded from main memory and incremented to become the value 1. The next increment is executed on CPU 2, where a cache miss also occurs. Due to the use of write-back caching, the value in main memory is still 0, and so this is what is loaded into the cache of CPU 2 and incremented to become the value 1. Already the problem is evident: two increments of the counter have yielded a value of 1 in both the processor caches, and a value of zero in main memory. At this point, the system is in an inconsistent state.
In order to prevent this from happening, the caches on the system must be kept coherent. To achieve this, and to support write-back caching, a fundamental rule must be put in place on the system: the value of main memory location can be stale (old) in relation to caches, but caches cannot contain stale data. Once this rule is in place, the memory and cache consistency can be maintained with some simple rule, normally implemented in hardware.
When a value is changed by a CPU, its cache contains the most recent value of the given memory location. Using the rule presented above, this means that any other caches in the system that contain an old version of this location must either
· Be sent the new value (known as write-update) in order to conform to the rule, or
· Have the cache line invalidated, enforcing a reload on the next use (write-invalidate)
A system designer will adopt one of these methods for exclusive use within the system.
Now that the caches are kept in sync, there is only the main memory that could potentially be out of step as a result of write-back caches.
If the operating system has not yet written back a cache line to main memory when another CPU needs the most recent value, the CPU cache containing the most recent value intervenes and initiates a cache-to-cache transfer of the most recent value from its cache, thus effecting the final piece in the cache consistency resolution.
The hardware can be designed in one of two different ways in order to be "aware" of the load and stores in the system that will require write-updates, write-invalidates, or cache-to-cache transfers. The two methods of achieving this are the called the snoopy bus protocol and the directory protocol.
The snoopy bus protocol (see Figure 2.5) is a broadcast-based method. This is the method used in virtually all bus-based symmetric multiprocesor (SMP) systems and represents the most straightforward technique. It uses the fact that all bus traffic is broadcast-based: all traffic is visible to all components. This being the case, logic is built into each component to "snoop" on the addresses of the memory loads and updates/invalidates occurring across the bus.
When an address is picked up on a snoop, it is compared with those addresses in cache and appropriate action is taken. In the case of a write-back cache picking up a load request for an address that is still dirty (unwritten to main memory) in its cache, it initiates a cache-to-cache transfer of the most recent value from its cache. In the case of a cache line invalidate, the corresponding cache line is marked invalid in the snooper's cache. As every message is broadcast, the entire bus is unusable during the operation, thus limiting the scalability of the bus as more processors are added.
The directory-based protocol (see Figure 2.6) is normally associated with distributed memory systems. Each "node" maintains a separate directory of which caches contain lines from the memory on that "node." Within this directory, a bit vector is kept for each line of main memory, reflecting which processors in the system have that line in their cache. The major reason for implementing such a scheme is to remove the need for a broadcast-based protocol, because broadcast-based protocols have been found to be inherently unscalable.
Using the directory protocol, the directory for a node (and thus its memory) is used to determine which caches need to be made consistent (through invalidate or update) in the event of a write. At this point the individual caches can be made consistent, using some kind of point-to-point communication rather than a broadcast, thus allowing other parts of the interconnect to be used concurrently. This is much more efficient than the broadcast protocol, where all traffic exists on all parts of the bus at all times; this is the reason that snoopy bus protocols do not scale very effectively.
2.2 Single Processor Architectures (Uniprocessors)
A uniprocessor machine (see Figure 2.7) has the simplest architecture of all UNIX servers. It includes a single processor plus cache, connected to some memory and I/O controllers through some kind of shared bus. In many implementations, there is a separate bus on the system for connection of I/O controllers and peripheral devices. Devices on this bus communicate with the CPU and memory bus by means of a bridge between the two busses.
2.2.1 Advantages
As there is only one CPU in the system, there are no inter-CPU cache coherency issues to be handled by the operating system. This is not to say that there is only one bus master in the system-the system still needs to be concerned about the cache coherency issues arising as a result of DMA I/O controllers. This can be handled solely by a bus snooping device to invalidate the CPU cache lines where appropriate when DMA writes take place in main memory. This is much simpler to manage than having multiple processors all writing to shared memory areas and invalidating each other's caches.
In addition, no operating system locking is required and so there is no contention for kernel resources, because the operating system can guarantee that no other processors are executing kernel code. For example, only the single processor in the system could be writing to the process table at the same time.
However, it is still possible within a uniprocessor architecture for a kernel data structure to be corrupted through the processing of interrupts. When this happens, it is the same CPU that is corrupting the data structure, only operating in a different context. For example, if the CPU is performing a complex memory update in kernel mode, and a hardware interrupt is received (from an ethernet card, for example), the CPU must stop processing and switch to interrupt handler code and process it. Within this interrupt handler could be a memory update to the same region that is partially completed, and therefore data corruption could be produced.
This eventuality is prevented through the use of interrrupt priority levels, which represent the only type of short-term exclusion required on a uniprocessor system. This works by setting the current priority level of the processor to the highest level available on the system, effectively temporarily disabling interrupts. Once the critical update is complete, the interrupts can be reenabled by resetting the interrupt priority level to the prior value. This kind of temporary blocking of interrupts is adequate for kernel data protection in a uniprocessor system.
In reality, many vendors run the same version of the operating system on both uniprocessors and multiprocessors, therefore slightly invalidating this advantage. This also applies only to kernel mode data structures. Oracle must still provide its own user mode locking, because any process could be preempted from the processor before Oracle memory updates are complete.
In addition to the locking that is required for safe concurrent operation in a uniprocessor system, there is locking that is highly desirable for increased concurrency. Communicating with peripheral devices, for example, takes a comparatively long time. It is clearly not an option simply to ignore interrupts and be nonpreemptable for what could be several milliseconds, and so another method is required to allow the processor to be safely used by another process while the response from the peripheral device is outstanding. This is achieved through the use of long-term locks, which protect the data structures from corruption by subsequent processes but allow the kernel to be safely preempted by another process. This allows the overall throughput of the system to be unaffected by access to slower components.
So, in summary, the uniprocessor potentially requires only two types of locks: the interrupt-disabled, short-term exclusion method and the long-term, preemptable method. Both of these methods are considerably simpler to implement than the locking required in a scalable multiprocessor environment.
2.2.2 Oracle on Uniprocessors
Oracle is fundamentally not suited to a uniprocessor architecture. By its very nature as a multiple-process software architecture (DBWR, PMON, shadow processes, etc.), serious compromises in efficiency are made when these multiple processes are run in a uniprocessor environment. This is attributable to the required context switching (see Section 2.2.3) among all the processes in the Oracle RDBMS, even if the work is being submitted by a single process. There have been many different twists in the Oracle architecture that could optimize running on a uniprocessor system, most notably the single-task connection architecture designed to reduce the number of processes for the user connection, and the "single-process Oracle" architecture that combines all the functions into a single process (actually targeted for the MS-DOS platform). However, single-processor UNIX systems simply do not make very good database systems, regardless of what you do to them.
Finally, at the end of the day, a uniprocessor system can perform only one task at a time, however good it is at pretending otherwise. Therefore, only one person can parse an SQL statement, read a block from buffer, execute a PL/SQL package, and so on, at any one time. This inherent lack of concurrency in the uniprocessor makes it less than ideal in a multiuser environment.
2.2.3 Other Disadvantages
Nobody in their right mind would suggest building a large-scale Oracle database server on a uniprocessor server. There are numerous reasons why the day of the uniprocessor in the data center is very much over.
With only one processing engine in the system, high-context switching ties up a larger percentage of available processor time in wait state when compared with an equal workload on a multiprocessor system. For example, if ten processes are executing on a uniprocessor system, each of the ten processes must be switched onto the processor, allowed to execute for the quantum period, and then switched off the process once more. On a five-way multiprocessor system, the same ten processes would generate exactly one-fifth of the context switches as the uniprocessor, because there are only two processes competing for any one processor. Each of the context switches takes a finite amount of time, during which the processor cannot be executing any useful work. The time will be spent preserving and restoring the CPU registers, the stack pointer, the program counter, and the virtual memory mappings.
The increased context switching creates another problem. As all processes run on the same processor, there is therefore only one cache by implication. This means that there are no cache affinity4 choices available to the operating system. Therefore, all processes are competing directly for the same cache lines, and, for the same number of executing processes, the cache warmth will always be less than that achievable on a multiprocessor system.
With only one processor in the system, it is obviously not scalable beyond the speed of a single CPU. Although the single processor could be upgraded, the system remains only as scalable as the single processor within it. Even with Moore's Law being achieved, this means that the system could only double in performance every 18 months, unlike a multiprocessor system, which may have the option of doubling its capacity every week (for a few weeks, at least).
Finally, the thing that absolutely kills the uniprocessor in the datacenter is the fact that a single CPU failure impacts the entire business by the machine being unavailable until hardware is physically replaced. That is, if a processor board expires, the system cannot be restarted until a hardware engineer has been called, travels to the site, and installs the board. This takes around two hours even under the most comprehensive service contract. It is rare for a business to find this acceptable-especially because the failures always seem to happen during the peak period.
2.2.4 Summary
It can be seen that despite the potential advantages gained through reduced locking administration (which are very slight in comparison with modern lightweight, multithreaded SMP kernels), the disadvantages are severe. This makes a uniprocessor fundamentally unsuitable for high-end enterprise computing. However, there will always be a place for these machines in the low end of the market, in addition to their use as development platforms.
2.3 Symmetric Multiprocessors (SMPs)
A symmetric multiprocessor extends the capability of the uniprocessor by increasing the capacity of the system bus and allowing the addition of multiple processors and memory subsystems. All of the memory in the system is directly accessible by all processors as local memory, a configuration known as a tightly coupled configuration.
Examples: Sequent Symmetry, HP T-class, Siemens E600, Sun E6500, SGI Challenge

The most common SMP architecture is that shown in Figure 2.8-the shared global bus architecture, where all components exist on the same, shared bus. While currently the most common, this architecture is increasingly being replaced by either a crossbar-switch-type bus (discussed in Section 2.4) or the NUMA architecture (discussed in Section 2.7). However, we will start off with the shared-bus model, because it represents the most simplistic view.
2.3.1 SMP Advantages
The SMP architecture allows additional scalability to be gained through the addition of processor cards to the system. Each of the processors added can be used by the kernel for the processing of presented work. In this way, the effective bandwidth of the system is increased each time a processor is added.
The ability to add additional processing capacity to a system at will allows significant flexibility in operating an SMP system. It also allows far greater performance gains to be made than by purely upgrading the processor in the system to a faster one.
One thing needs to be noted here. I have just stated that SMP allows "performance gains to be made." It is important to clarify this point before we get any further into the book. Performance gains will not be gained simply by adding more processors to the system. Rather, the processing bandwidth of the system is increased, allowing performance gains to be made.
For example, if a system with a single processor can run a given task in one hour, this task would take exactly the same time to complete on a ten-way SMP system with the same processors. The reason for this is that the task is serialized on one processor and the addition of extra processors will not make the serial stream of instructions complete any faster. In fact, it is likely to make it slightly worse, depending on the scalability of the SMP platform.
However, if the same task is split into ten pieces that can be run concurrently, the ten-way SMP system could provide close to ten times the "speed" of the single-processor system, if the system scaled efficiently. Likewise, if the single-processor system were running 1,000 processes at any one time, the addition of nine more processors would add significant computational bandwidth to the system and individual processes would wait less time before being scheduled to run. In fact, the run queue for the system would be only 10 percent of the depth of the uniprocessor system, and all of the processes would appear to "run faster," although they are simply being allowed to run more often.
The final advantage of an SMP architecture (whether it has a single shared bus or otherwise) is that the programming model does not change from the user (non-UNIX/Oracle kernel programmer) perspective. The system is described as having a single system image or a single instance of the operating system. This is a very important point, in that it allowed the SMP architecture to be introduced in a virtually transparent way, because all the existing user code ran without change. It also means that the programming model remains very simplistic and allows easy programming on a multiprocessor system. This contrasts with the massively parallel processor (MPP) model (see Section 2.6), where the programmer needs to be personally aware of the hardware configuration and is required to assign processing tasks to the processors individually.
2.3.2 Kernel Challenges
In order to make an SMP system scale effectively, the system designer faces several challenges. Ultimately, a large number of these challenges need to be met by the operating system engineer.
The presence of multiple processors in the system automatically gives us multiple bus masters, as discussed at the beginning of this chapter. It also gives us multiple caches that need to be maintained, as discussed in Section 2.1.4. Finally, it opens up a whole bunch of load balancing and affinity options to the process scheduler within the UNIX kernel.
We have already discussed the first two of these issues (multiple bus masters and caches), but what about the last ones (load balancing and affinity)? Really, these two things are fairly closely related. When a process becomes runnable on the system, the kernel must associate the process with a processor and enable it to run. With a uniprocessor system, there is only one choice of where this should go, because there is only one processor. On an SMP system, however, the kernel needs to make a choice as to which processor (or engine) is optimum for executing that particular process.
The most important thing that the kernel must do is to ensure that the work on the system is well distributed across the engines. For example, if there is enough work to keep all of the processors busy, then all of the processors, rather than just a subset of them, should be kept busy. This is the load-balancing part of the problem.
Next, the kernel must decide whether it is worth attempting to get the process assigned to any particular processor. This part of the problem may seem a little strange at first, but there is good reason to assign a process back to the last processor on which it ran, provided that there is some benefit to be gained from the residual cache warmth from the last time the process ran. The kernel can make educated guesses as to whether this is the case, based on the number of processes that have executed on that CPU since this one last ran. This kind of processor preference is known as cache affinity, because the process has an affinity for one particular cache on the system.
Modern UNIX kernels are very good at managing the administration burden of multiple processors. The ability of the kernel to handle the additional administration directly affects the scalability of the system, and some vendors are better than others at achieving this.
2.3.3 Oracle on SMP Architectures
The Oracle architecture is probably most suited to an SMP-like hardware architecture. This is no accident, because SMP has made Oracle a datacenter reality for large-scale systems. The multiprocess architecture fits perfectly into a system with multiple processors to run these processes on.
The fast, uniform access to main memory found in SMP architectures allows Oracle to use bus primitives for implementing very fast latch operations (i.e., test and set). This is vital for operation of very large single-instance implementations of Oracle, because all operations share common latches. While multi-instance implementations of Oracle (using Oracle Parallel Server) are not subject to the same intense contention for the same latches, they are subject to contention on significantly slower methods of synchronization than those used in a single-instance configuration.
However, for reasons that are separate from any bandwidth issues associated with the bus, SMP still has inherent limitations for scaling Oracle infinitely. For the same reason that SMP architectures are good Oracle platforms, they also have a very definite upper boundary: the speed of latch operations.
For a latch operation to occur, a CPU must use test and set or a similar instruction to attempt to gain control over the latch (a value in shared memory). Only when the control is gained can the processor perform the real work that it is required to do. If the memory location is in the cache of other CPUs, then the processor and/or operating system must ensure that the caches are kept coherent on those other processors.
If a large number of user sessions require a common latch, it becomes more and more likely that all processing engines in the system will be executing test and set against the memory location. Every time the latch value changes, all of these operations need to be serialized by the bus arbiter, and so there is a finite number of latch operations possible on any SMP system. Therein lies the inherent limitation. At the end of the day, it really doesn't matter if the bus has another 10GB/s of bandwidth left unused if the system spends all of its time serializing down the bus on required latch operations.
All this being said, being bound by the speed of a latch is not as big a problem as it used to be, because Oracle has spent considerable time breaking nearly all of the single latches out into multiple latches, thus reducing the actual contention (it's the contention that hurts most). As more and more vendors produce nonuniform memory access (NUMA) architectures (see Section 2.7), Oracle is likely to start making more and more private latch allocations, in addition to implementing algorithmic alternatives to taking out the lock in the first place.
2.3.4 Shared-Bus Limitations
Due to all the system traffic going across a single shared bus, the system is limited by the capacity of the bus, in addition to the rate of arbitration and the capabilities of any bus snooping logic. As discussed earlier, the bus itself is limited by physics, and the upper limit on the number of processors that can be usefully attached is something of a religious war in the SMP industry. Many SMP designers believe that building an SMP of more than eight processors is a waste of time. Others can demonstrate reasonable scalability of SMPs of 16 to 20 processors. The real answer is very dependent on the way the bus is designed, the speed of the individual processors, the effectiveness of the cache, and so on, and there is no simple answer.
The one thing that is certain, however, is that the upper bounds for simplistic SMP scaling using a shared bus are pretty much at an end as of today. This is affirmed by the hardware vendors, who are all converting to some alternative type of SMP-like technology, using either NUMA, which we cover Section 2.7, or some kind of intelligent, point-to-point interconnect fabric to replace the global shared bus.
2.3.5 Summary
The shared-bus SMP architecture is the architecture that really changed the viability of UNIX-based Oracle database servers to deliver world-class, high-performance solutions. Although there are a decreasing number of hardware manufacturers still producing vanilla SMP architectures of this design, virtually all of the current products present that same, or very similar, programming model.
2.4 Point-to-Point SMP
One of the approaches taken to extend the useful life of a pure SMP architecture is to provide greater bandwidth and more interconnect concurrency through the use of some kind of point-to-point interconnect between components (see Figure 2.9).
Examples: Sun E10000 (a.k.a Starfire), HP V2500

The actual architecture of the interconnect fabric varies slightly among systems, but the common goal of the systems is to allow point-to-point communication among all of the processors and memory in the system. This allows for full connection bandwidth between any two points at any time, in addition to increasing the concurrency of the "bus" as a result of multiple connects being active at any one time (remember that the shared global bus allows only one at a time).
The typical sustainable bandwidth for such systems is in the 16GB/s range, with each connection capable of around 2GB/s.
Figure 2.9 shows a schematic view of such a system, with all of the components connected to one another. To my knowledge, there isn't actually a real system out there that looks precisely like the one in this example, but I took interesting features from all of them and made a more useful example.
The first thing to note is that my system is constructed from 16 system boards, all interconnected to one another. The actual method of interconnection demonstrated here is direct point-to-point connection, which was chosen for reasons of path demonstration rather than to show any particular technology. Most likely, the components would be interconnected by some kind of crossbar or other switch fabric (see Figure 2.10).
On each of these boards resides a single processor, a processor cache, cache coherency logic, and either a portion of system memory or a bus converter and I/O controllers. The trade-off between the memory controller and the I/O controllers was done on purpose to make the hypothetical system have the same kind of trade-offs as those of a real system.
For the sake of clarity in the illustration, none of the I/O controllers has been connected to any disk. Just assume that half of these controllers are connected to different disks in the disk cabinet and that the other half are providing alternate I/O paths to the same disks for the sake of redundancy. None of the I/O controllers is "private" to the board that it resides on: all I/O controllers are shared among all processors, as is the memory.
2.4.1 Cache Coherency
The important omission from the illustration is that of some kind of cache coherency mechanism. With point-to-point communication within the system, the standard SMP bus snooping method no longer works, because there is no shared bus. There are two main approaches to providing the solution to this requirement: a broadcast-based address-only bus with a snoopy bus protocol, and a directory-based coherency protocol designed for point-to-point communication.
By building a global address-only bus, systems such as the Sun E10000 benefit from the simplicity of the snoopy bus protocol. Greater scalability is achieved when compared with a full shared (address-plus-data) bus, because the system does not need to know the details of the activity (the data itself), just the locations. This allows the most appropriate component-be it a memory controller or the cache of another CPU-to service requests for memory, and allows cache lines to be invalidated accordingly when memory locations are changed by another CPU. It is also possible for a system to have multiple address busses in addition to multiple data connections, because an address bus is very readily partitioned by address ranges into any number of subaddress busses.
All access to other components, whether or not they are resident on the same system board, follows the same logic path. This ensures that all the access times to any particular component from anywhere else in the system are the same (uniform). This kind of fair sharing of access is critical in the design of these systems, because software changes need to be made as soon as the memory access becomes nonuniform. It is easy to see how a CPU with faster access to a certain piece of memory could "starve" the other CPUs on the system from accessing that component through unfair speed of access-a bit like having a football player on the field who can run ten times faster than all of the other players. This subject will be covered in more detail in Section 2.7, where it will become necessary to worry about such things.
As far as the operating system and Oracle are concerned, a crossbar-type system is a shared-bus system. The changes in operation all occur in the hardware itself and are therefore transparent.5 Therefore, the introduction of such a system does not impose as great a risk to stability and performance as does a more radical one such as MPP or NUMA.
The increase in bandwidth of the "bus" allows for immediate performance improvements in non-latch-intensive workloads. The increase in concurrency, and the ability to have multiple address busses and therefore arbitration bandwidth, allows for greater performance in latch-intensive workloads.
2.4.2 Summary
This type of SMP system has bought the SMP architecture some time. The introduction of multiple interconnects has significantly increased the performance capability of the SMP software model, and this model probably has some way to go before it is exhausted.
However, scaling of SMP beyond the capabilities of crossbar SMP is a difficult task. For example, if we already have point-to-point data connections that are running as fast as physics will allow, how do we increase the throughput of the system? At some stage, the open system world needs to break out of the SMP mold once and for all, but until the software support is there, this cannot happen.
2.5 Clustered SMP
Clustered SMP configurations extend the capabilities of SMP further by clustering multiple SMP nodes together, as shown in Figure 2.11.
A cluster is also known as a loosely coupled configuration, in that the systems that are clustered have no direct access to each other's memory or CPUs. All the systems run a discrete copy of the operating system and maintain a memory map containing only local memory addresses. The systems are coupled by a low-latency interconnect between all the systems and share only message information, which is why the coupling is termed "loose."
With this coupling, the systems can perform intersystem coordination but not much else because of the limited amount of information that they share. However, when connections are made from each system to a shared disk array, the usefulness of the cluster becomes far greater. Now the systems can share common data and coordinate the sharing through the use of messages passed over the interconnect.
2.5.1 Clustering Types
In the UNIX world, a cluster is currently useful for one or both of the following:
· Provision of high availability
· Oracle Parallel Server (OPS)
A high availability cluster is one in which the cluster interconnect is used solely as a "heartbeat" monitor between the systems. When the heartbeat from one of the systems ceases as a result of system failure, another system takes appropriate action to support the workload of the deceased system. This could include mounting of the dead system's file systems (thus the need for a shared disk), a change of IP address to make the system look like its peer, and the start-up of software subsystems.
There is no concurrent sharing of data in a high-availability solution, only sharing of mortality information.
The Oracle Parallel Server (OPS) cluster is more complex. In this type of cluster, full access is granted to the shared disk from all systems via the database. Note that it is still only the database that is shared, not file systems or any other UNIX resources. Any system in the cluster can bring up an Oracle instance and use it as required.
Clearly, some kind of synchronization is required to prevent all the Oracle instances from writing all over the same pieces of disk and to maintain the same consistent view of the data. This synchronization is provided by the Distributed Lock Manager, or DLM.
The DLM is a piece of software that was, for Oracle7, written by the hardware vendor6 to provide the interface between Oracle and the hardware communications. For Oracle8, the DLM is incorporated into the Oracle product itself, and the vendor-specific lock manager has been retired.
The specifics of the DLM and how it integrates into Oracle will be covered in Chapter 6, but in order to get the basics right here, we can assume that whenever Oracle needs to read a block from disk, it checks in with the DLM. The DLM will send a request to any other Oracle instances that have the block and will ensure that no outstanding writes are pending by flushing the block from cache to disk (this is known as a ping). The requesting system will then be notified that the block is safe to read from disk, and normal operations will continue. This is a dramatic simplification of a complex issue, but should suffice to demonstrate the basic operation of an OPS cluster.
The DLM lock management is not transparent to the application, whatever people may try to tell you. Actually, it's fair to say that enough people have been severely burned by that claim that it has become a rare and uninformed statement.
A good deal of work is required to alter the design of the application to operate in this type of configuration. Specifically, anything that could cause communication with other nodes in the cluster by means of the DLM should be minimized wherever possible within the application. Examples of this are
· "Hot block" data access between the nodes, causing excessive pinging
· On-disk sorting
Specifically, a successful SMP cluster implementation will demonstrate good partitioning within the application. This could mean different areas of the application getting processed on different nodes or, even better, "private" data access per node through the use of a transaction processing (TP) monitor.
2.5.2 Summary
The implementation of a clustered system is a special requirement. This is not to say that you shouldn't do it-if you are building a very large scalable system, a clustered system of one of the types described above becomes a necessity. It's important, however, that you choose the most suitable method of clustering for your needs.
When deciding which type of clustering to use, you should keep your choice as simple as possible. Unless it is absolutely necessary, an OPS cluster should not be considered. The difference between the two methods of clustering is acute: high-availability clustering is reasonably simple whereas OPS clustering is very complex. This should be prime in your mind when making the decision.
One particularly bad reason for using OPS is for scalability, when the other option is to buy a larger machine. My advice on this is simple: although buying a second, smaller machine may be the most attractive proposition from a cost perspective, just remember that cost comes in many different forms. For example, don't forget to factor in extensive application modification (perhaps a total redesign in extreme cases), a large team of very experienced DBAs, and potentially more unplanned downtime owing to complex, multinode problems.
That having been said, the very best reason for going the OPS route is when you need scalability-and I really do mean need. When there is no system big enough to house the entire database processing requirement, OPS is the only way to go. Just remember that it will be an expensive ride. Do your homework, and never, ever, underestimate the undertaking.
2.6 Massively Parallel Processors (MPPs)
2.6.1 Definition
A massively parallel processor (MPP) is known as a shared-nothing configuration (see Figure 2.12).
Examples: Siemens RM1000, nCube, IBM SP/2

Despite the grand title, there is nothing very complex about the architecture of an MPP system. Traditionally, the building block of an MPP system has been a uniprocessor node, comprising a single CPU, some memory (16 to 1024MB), a proprietary interconnect, and a finite number of private disks.
Note that all of the disk is private; this is the reason that MPP systems are called "shared-nothing" configurations. Not only does each node run its own instance of the OS, but it does not even physically share any of its disk with other nodes. Think of an MPP system as a bunch of uniprocessors in a network, and you are pretty close to the truth of things.
The sole distinguishing feature of an MPP system (and the only one that makes it more than just "a bunch of uniprocessors in a network") is the proprietary interconnect between the nodes and the software layers that use it.
Each node runs a private copy of the OS, and thus MPP systems are comparatively simple to produce, because there are no cache coherency issues to deal with. Each processor has its own memory, its own disk controllers, and its own bus.
Typically, MPP is considered when high system bandwidth and good scalability up to potentially hundreds of nodes are required. The reason that MPP is well suited to such applications is that the nodes do not interfere with each other's resource and thus can scale very effectively, depending on the type of workload.
Workload is the key: MPP is suited to workloads that can be partitioned among the nodes in a dedicated and static fashion, with only messages (as opposed to data) traveling across the interconnect among the nodes. In this way, almost 100 percent scalability is possible because there is no data sharing among nodes. Examples of workloads that have been scaled effectively on MPP systems include
· Digital video streams for video on demand
· Complex mathematical problems
· Some decision support/data warehouse applications
Only some decision support workloads scale effectively, because they are often prone to data skew, thus breaking the static partitioning rule. For example, if a parallel query is to scale effectively, it is important that all the processors in the system are processing the query. In a shared-nothing environment, this would mean that the pieces of database that are to be scanned should be evenly distributed across all the nodes in the MPP system. As soon as the physical distribution of the data changes, the system is suffering from data skew.
As the data is skewed in this way, the query can be serviced only by a subset of the nodes in the system, because in a 100 percent shared-nothing environment a node can process only data that it is physically storing.
As an example, let's assume that there is a table of historical information that is populated by daily feeds and purged by date range, with the oldest first. When it was first loaded up, there was an even spread of last name values in the table, and so it was decided that the data would be loaded onto the system in "last name" partitions. Each range of names would be stored locally on the respective nodes, as shown in Figure 2.13.
When the first queries are run (full table scans, in parallel), all nodes are busy. However, as time goes on, old data is deleted and new data is added. After a few weeks, the data distribution looks like that shown in Table 2.2.
If the same query were to be run against the new data distribution, node 4 would finish 40 times faster than node 2 and then would sit idle while the other nodes continued processing.
In fact, this particular problem is easy to fix, because the partitioning criterion (last name) was clearly incorrect even from a logical standpoint. However, unlike this very simplistic example, a real data warehouse has many hundreds of tables, or many tables of many billion rows each. This is a very serious problem-a problem that is difficult to resolve without significant periodic downtime.
Examples of workloads that definitely do not scale effectively across MPP systems (no matter what anybody tries to tell you) are
· Transaction processing
· Transaction processing
· Transaction processing
Hopefully this is clear. Transactional workloads do not scale well on an MPP architecture, because it is very unusual to be able to partition the workload beyond a handful of nodes. For the same reason that it is difficult to write scalable applications for clustered SMP systems running Oracle Parallel Server, MPP systems are even more complex. When there are tens or hundreds of nodes out there, it is virtually impossible to make a logical partition in the application that corresponds to the physical partitioning of the system.
2.6.2 Oracle on MPP Systems
You might have been wondering how, on a system that consists of physically private disk, Oracle can be run as a single database. This is a valid question, because one of the clear assumptions that the Oracle architecture makes is that it resides on a system with the entire database accessible through standard I/O system calls. Oracle is often termed a shared-disk database for this reason.
The answer is that Oracle can't be run as a single database. Or rather it can, but not without help from the operating system. Without this special help, Oracle would not be able to open many instances across all the nodes, because each node would be able to see only a tiny portion of the database. Most nodes would not even be able to open the SYSTEM tablespace.
The special support that the operating system provides is a Virtual Disk Layer within the kernel. This layer of software traps all I/O destined for remote disk and performs the I/O using remote calls to the node that is physically connected to the required disk. The data is then passed over the interconnect and returned to Oracle as if it had just been read from a local disk. At this point, the system looks like a very large cluster, with shared disk on all nodes.
You might be thinking that this is a terrible, nonscalable thing to do, and you would be correct. Passing data over the interconnect breaks the fundamental rule of MPP scaling: pass only messages over the interconnect. In this way, the scenario described above, with the data skew, changes form a little on an Oracle database hosted on an MPP system. Instead of nodes becoming idle in the event of a data skew, Oracle does a fairly good job of keeping them all busy by reassigning the query slaves when they become idle. However, when this is done, it is likely that the "local disk affinity" aspect is then destroyed. The local affinity aspect is where Oracle attempts to execute the query slaves on the nodes with the most preferential disk transfer rates: the local nodes. When Oracle can no longer maintain this affinity because the local nodes are still busy, the data needs to be read over the interconnect. So, the system is kept busy, but the system is potentially more limited than an SMP system, because the interconnect does not have as much bandwidth as an SMP interconnect.
The final requirement for running Oracle on an MPP system is a DLM. The DLM provides exactly the same services as on a clustered SMP system, because the system is effectively a large cluster. However, the cluster software needs to be more robust than ever, because the number of nodes to manage (32 to 1,024) is dramatically larger than a clustered SMP system (typically a maximum of eight).
2.6.3 Summary
The use of MPP architecture has always been a niche piece of even the high-end marketplace. Although many customers have been sold down the river by their salespersons, the word is now getting out that MPP is definitely not the way to go if
· You need fewer than 32 processors of total computing capacity
· You do not have sufficient resource to manage an n-node system, where a separate instance of the operating system and a separate instance of Oracle need to be maintained on each node
· You have a transactional system
With the large leaps in SMP bandwidth and, potentially more significantly, with the introduction of NUMA, the MPP is forced even more into a corner. It is my opinion that MPP had a single shot at the DSS goal, as the bus-based SMPs ran out of steam, but failed to make a big impression owing to the management overhead of implementing such a system.
2.7 Cache Coherent Nonuniform Memory Access (ccNUMA)
2.7.1 Definition
ccNUMA (NUMA from now on) could be described as a cross between an SMP and an MPP system, in that it looks like an SMP to the user and to the nonkernel programmer, and yet is constructed of building blocks like an MPP system. In Figure 2.14, the building block comprises four processors, similar to the Sequent NUMA-Q 2000 system.
Examples: Sequent NUMA-Q 2000, Silicon Graphics Origin, Data General ccNUMA

These building blocks are not nodes-at least not by my definition. My definition of a node is an atomic entity that executes an operating system, and this does not apply to the building blocks of a NUMA system. Different manufacturers use different names for their building blocks (see Table 2.3).
For ease of explanation, I will adopt the Sequent NUMA-Q 2000 system as the primary example for the description, with a breakout later on to describe the key differences between the Sequent approach and the SGI (formerly Silicon Graphics, Inc.) approach found on the Origin platform.
2.7.2 Sequent NUMA-Q 2000
Sequent uses a four-processor Intel Pentium II Xeon board, similar to that shown in Figure 2.14, as the building block for its NUMA-Q system. This component is known as a "quad," and for simplicity's sake I will use this term for the rest of the description.
With a single quad in the system, the remote memory controller in the quad would not be needed for system operation. The system would look and behave (in this case) exactly like a four-processor shared-bus SMP machine.
The difference becomes apparent when multiple quads are linked together through the remote memory controller, as shown in Figure 2.15.
When a multiple-quad system such as this is booted, the presence of the additional quads is detected and the memory from all of the quads is added to the global memory address map. The addition of remote memory to the system introduces a new level to the memory hierarchy (see Figure 2.16).
For the time being, don't worry about the "remote memory cache" that has been introduced as a peer to the local memory. This is a Sequent-specific implementation detail, which will be discussed in more detail when we get into how NUMA delivers cache coherency.
The NUMA-Q handles the references to remote memory locations with the remote memory controller hardware (known as IQ-link), making the circled levels of the hierarchy appear from the software perspective to be one, local memory level. With the hardware hiding the implementation of the remote portion, the system looks a lot like an SMP, even from the operating system perspective.
However, this is only the start of the design of a NUMA system, because without significant software work the system above would perform poorly under high load. Although the IQ-link logic can successfully hide the fact that there is remote memory, the operating system still needs to be very intimately aware of this fact. The reason for this is that a load or store (read or write) from/to a remote memory address take longer than a local one,7 and so it is vital that the operating system take careful action to minimize the remote references within the system.
This difference in latency between local memory and remote memory is the reason for the "nonuniform" part of the NUMA name. It is also the reason that the architecture of these systems is absolutely fascinating, and subsequently why Stanford and MIT have ongoing NUMA projects. It is also the reason why this section is longer than the others in this chapter. Let's get back to the operating system.
The first big gain to be made within the operating system is to ensure that processes have their resident set all located on the same quad and that this same quad is used when scheduling the process for execution. These two simple changes guarantee that all of the code and data associated with the process are local, and therefore fast. The one exception to this occurs when the process is using shared memory segments, because there is no direct correlation between the location of a shared memory segment and the location of the processes that execute against it. This is a major issue in an Oracle environment and will be covered in Section 2.7.4.
With the user processes executing mostly within local memory, the number of remote references is substantially decreased. Assuming for a moment that Oracle is not installed on the machine, the remainder of the remote memory references occur when executing in kernel mode.
In kernel mode, processors on all quads must access common memory structures that comprise the kernel address space. Examples of kernel structures that are shared in a standard SMP kernel include
· Process table
· Run queue
· Streams buffers
· File system buffer cache
Potentially more important than the structures themselves are the locks that protect the structures. If read locks are prevalent in the kernel, a good deal of remote memory activity can be felt just by reading some of these structures.
In Chapter 1, we covered some of the issues associated with locks and how they affect scaling. In a NUMA environment, even fine-grained locking can cause problems if the locks in use are not based on the local quad. Therefore, to minimize remote memory references in kernel mode, the OS engineer must implement
· Distributed kernel memory structures
· Quad local kernel locks
In the same way that SMP kernels have taken many years to mature into fine-grained, scalable kernels, it likely to be several generations before the NUMA kernel engineer has located and changed all the significant areas of the kernel that decrease the scalability of NUMA.
In addition to the kernel-induced remote references, DMA-based I/O that occurs on nonlocal I/O controllers also increases the number of remote references, because the data must be passed over the interconnect between the quads. For example, if the SCSI bus with the swap disk on it were hosted by quad-0, the kernel would need to copy pages of memory over the interconnect every time it needed to page out memory on any quad other than quad-0.
For this reason, Sequent adopted fibre channel I/O for the NUMA-Q rollout. With fibre channel, all I/O can then be made quad-local by adding a fibre channel card to each quad. The fibre channel on each card can be configured to "see" all the drives in the disk array, and therefore each quad can perform I/O locally. Again, this is a kernel value add, because the kernel must instruct the card as to which quad to perform the I/O, and is one that is unique to the NUMA architecture.
With the potential for many remote memory references to be always present on a NUMA system, the system needs to provide optimizations to lower the impact of this condition. In the Sequent implementation, the impact is mitigated by the use of the remote cache component within the IQ-link (see Figure 2.17).
When an address is requested by a processor as a result of a cache miss, the request goes across the bus to the local memory controllers and to the IQ-link card concurrently. If the address is a remote address, the remote cache component acts a little like a CPU cache-it searches its cache for the memory region and, if a hit is found, returns it to the CPU's L2 cache. If no hit is found, a request is issued to the remote cache component of the quad that has the most recent copy of the cache line. When the remote memory is received, it is sent on to the CPU's L2 cache, and a copy is kept in the cache for future reference.
In order to optimize the cache coherency requirement among all the caches in the system, the designers of the NUMA-Q system decided to adopt both main types of coherency protocol: snoopy bus and directory. The snoopy bus protocol is used within a quad component, where its broadcast-based approach is very effective because there are a small number of processors within a quad. This snoopy bus protocol ensures cache coherency within each quad, with the IQ-link simply acting as another device on the bus.
To ensure coherency between the quads, a directory-based protocol is used, because the interconnect between the quads is essentially a point-to-point connection, implemented as a unidirectional ring. For example, for quad 1 to communicate with quad 4, the communication will "hop" through quads 2 and 3. Although the traffic is going through quads that are not involved in the communication, the data is simply passed through without incurring any significant latency, because the IQ-link card does not need to be concerned with the content if it is not the intended recipient, and therefore the overhead is kept to a minimum. If the cache coherency protocol used over the interconnect were a broadcast-based protocol, all IQ-link cards would need to look at the content, and therefore a more significant delay would be incurred in forwarding the traffic to the next quad in the ring. This delay would be in addition to any decrease in capacity resulting from the IQ-link controller logic. The directory-based coherency protocol is therefore the most suitable in order to minimize the traffic across the shared interconnect.
The IQ-link card provides the critical link between these two protocols and allows the most suitable protocol to operate in the appropriate area of the system.
2.7.3 SGI Origin 2000
The SGI Origin 2000 more closely resembles the Stanford DASH ccNUMA developed at Stanford University, in that there is no remote cache component within the system. Instead, the designers of the Origin 2000 decided to go for a full directory-based system, with the directory maintaining the states of all blocks (cache lines) of memory in the system. In order to gain greater scalability from a directory-based protocol, the designers also chose to distribute the directory evenly across all the node cards in the system.
The key components of the Origin 2000 include the HUB ASIC and the router to the CrayLink interconnect fabric (see Figure 2.18).
The HUB ASIC is a crossbar switch that handles high-bandwidth point-to-point communications within the node card. The router chip provides a similar function for traffic between cards, but provides multiple paths between any two points.
The directory memory on each node card maintains the status for the local memory only. This is done so that no single part of the interconnect is bearing the burden of directory interrogation.
The Origin 2000 also differs from the Sequent NUMA-Q 2000 in that it offers dynamic memory page relocation based on use. When a page is referenced, a counter is incremented on the basis of where the request came from. If the page is found to be referenced from one particular remote node card at greater frequency than from the local card, the page is relocated to the remote node card.
This relocation occurs at the operating system level in that the page of memory is simply copied to a free page on the remote card and the virtual address map of the mapping process is altered accordingly. That is, the contents of the page are relocated, not the physical address itself.
2.7.4 Oracle on NUMA Systems
During this explanation, the Sequent architecture will be used for examples. Oracle is currently architected to run most efficiently on an SMP architecture. Evidence of this includes the following:
· All disk must be accessible by any process at any time, through standard UNIX system calls.
· Shared memory is central to operation.
· All latches are implemented using atomic test-and-set (or similar) operations on shared memory addresses.
We have already seen that the Oracle architecture can be fairly well adapted to run on MPP, and even better on clustered SMP. However, because of the items listed above, Oracle runs best on an SMP architecture.
We know that NUMA is a lot like SMP from a user perspective but not from an operating system standpoint, so into which of these categories does Oracle fit?
Oracle is pretty close to being an operating system. If one views the Oracle System Global Area (SGA) as the kernel memory region and the shadow processes as "user processes executing in kernel mode," then Oracle looks a good deal like an operating system.
It's no coincidence, therefore, that Oracle faces the same kind of scaling challenges that the UNIX kernel faced, except that with Oracle the challenges are heavily compounded by the use of a shared memory buffer cache and by the very nature of database software. The following aspects of Oracle operation cause varying degrees of difficulty on a NUMA system:
· Shared buffer cache
· Shared redo log buffer
· Global (nonquad local) latches
The implication of the shared buffer cache is not much different from that in an SMP system. The very existence of a shared-memory cache means that a significant number of cache misses and cache invalidations are inevitable. The two big differences with a NUMA system are as follows.
1. It is more expensive to miss if you need to do a remote reference to get the buffer.
2. It is important to increase the statistical chance of the buffer being in local memory, not remote memory or even the local remote cache component, when you miss in the L2 cache.
Because of reason 1, reason 2 becomes important-not so much because of access speed as because of cache line turnover. If the SGA were to be created on a single quad of an evenly loaded eight-quad system, there would be seven remote accesses of the SGA for every eight attempts. This means that the remote cache components of the other seven quads would be replacing lines very rapidly to keep the cache updated with the artificially increased remote reference loading. It also means that the quad that owns the SGA is spending a good deal of its resource processing memory requests on behalf of the other quads.
Sequent's approach to this problem was to equally distribute the buffer cache portion of the SGA across all the quads in the system. This ensures, by virtue of the random access patterns of an Oracle buffer cache, that all remote cache components are equally loaded and that every quad has a one in eight chance of getting a local hit.
The redo log buffer is a different animal. All user sessions writing to the database need to write to the redo buffer. Therefore, there is no optimal location for this buffer as far as making it fair for all users. It is already equally unfair to all users, except the ones executing on the quad where it is located. Striping of the redo buffer over the quads is not an option, however, because in this case it is definitely preferable to give priority access of the redo buffer to one process on the system, the LGWR process.
If the redo buffer fills because the LGWR process is unable to keep up (because of long access times to the buffer), the database will prevent any more writes until LGWR flushes some space. Therefore, in a NUMA environment, it is important that the LGWR process be given a larger slice of the pie. Sequent supports this through the use of (a) an initialization parameter that builds the buffer on a quad of choice and (b) run queues that bind the log writer to the same quad.
Global, shared latches are potentially a larger problem factor than anything else in a NUMA environment. This is where the biggest impact of nonuniform memory access times is felt.
Oracle relies on latches to protect nearly every memory structure in the SGA. These latches are physically implemented using padded data structures8 throughout both the fixed and variable portions of the SGA. The usage profile of these latches is as follows.
1. Get the latch.
2. Do some manipulation within shared memory.
3. Release the latch.
We know from Chapter 1 that when a latch is allocated, this equates to a memory update. We also know that when memory is updated, all the caches in the system must be made coherent, either by invalidating the lines in the caches of other processors, or by updating the caches of other processors. Combining this information with the knowledge that there is a high probability of both the latch and the protected memory structure being in remote memory, the potential for problems can be seen.
The first problem in this environment is that latch acquisition is no longer fair. Imagine the situation where two processes are spinning, trying to acquire the same latch. Each of these processes is on a different quad, but one of them happens to be on the same quad as the current latch holder. The process that is sharing the quad with the latch holder will always "see" the release of the latch faster than the remote process, due to its locality to the (now dirty after the release) cache line. Therefore, the local process will be able to acquire the latch while the remote request is still in transit.
In a latch-intensive Oracle environment, which is essentially any very-large-scale system, this can lead to false sleeps on the latch as a result of the remote quads failing to acquire the latch within their specified spin periods, causing starvation on the latch. A system with a high proportion of sleeps on a latch will demonstrate poor response time.
This particular issue of latch unfairness is one that can be addressed through special software algorithms. Sequent currently utilizes patent-pending locking algorithms within their operating system and is working with software partners such as Oracle to get such support into their products.
There is still a further challenge with latch allocation in a NUMA environment. Latch throughput for a given latch could be measured as the number of (time to acquire latch plus time spent holding plus time to release latch) quantums per second. Therefore, if the acquire, hold, and release times are all artificially high as a result of all the memory references being remote, the maximum number of latch operations per second is proportionally lower than if all the references were on local memory.
This all assumes that the request has missed in all caches, but this becomes a reasonably likely event when latch contention is evident on the system and the cache lines are frequently invalidated by the latch changing hands.
In fact, the more quads you add, the more likely it becomes that the next latch requester will be a processor on a remote quad. If the next requester is on the same quad, the cache line either will be refreshed by local cache-to-cache transfer or, if requested by the same CPU, will still be valid. If it is on a remote quad, the cache line will be invalid on the quad at that time, removing the beneficial effect of the remote cache, because the line needs to be refreshed from a remote location.
When this occurs on the system, the symptom is likely to be a very high latch sleep count, with the CPUs potentially not busy. Adding additional quads in this situation not only will fail to improve the system performance, but could make it worse.
2.7.5 Summary
The types of latch problems described above mean that despite all the efforts of the system designer, NUMA is still not the incredibly scalable system that it has the potential to be. Although many things have been done to alleviate other types of remote memory reference, this problem eludes all of these optimizations. At the end of the day, shared global latches are absolutely dependent on the latency to memory.
For workloads that do not rely on very frequent updates of shared memory, NUMA already promises to be vastly scalable. This is the key point-the Oracle architecture, as it exists today, is retarding the scaling potential of NUMA systems. Until Oracle provides complete base code support for NUMA, this will continue to be the case. As of Oracle release 8.1, Oracle has started to cater for NUMA in the kernel, and these changes are covered in Chapter 8.
Nevertheless, NUMA systems are competing well with other high-end architectures. By increasing the power of each quad, for example, the number of quads can be reduced for a given load, and the statistical likelihood of memory references being remote is reduced accordingly. Likewise, the latency of remote memory accesses is decreasing as the architectures develop-Sequent has already halved its remote memory latency. This in itself directly raises the system's "Oracle ceiling."
For many, the time for NUMA is already here. Depending on the work profile of the system, NUMA can already beat many of the latest SMP architectures. As time goes on, and Oracle's architecture lends itself more to this architecture, the upper bounds of system performance stand to be dramatically increased.
2.8 Storage Systems
It is a surprising fact, but the truth is that in many large-scale database systems insufficient attention is paid to the I/O subsystem. Plenty of time is spent tuning processor and memory utilization, whereas the impact of potentially badly configured disk is ignored.
For example, a database administrator may closely monitor the wait states from within Oracle and determine that any users waiting for physical reads are doing OK because one has to go to disk sometimes, anyway. This is true, but it should not be assumed that the waits for disk activity are short ones.
To highlight the effect of a relatively slow device such as a disk in a query, let's look at an example. Assume that the example system already has an optimally tuned cache and that no further disk accesses can be prevented from further tuning of the cache. For a given query it is found that a cache hit ratio close to 100 percent is achieved, with only 500 reads necessary from physical disk. The total time spent retrieving buffers from memory is 2 seconds, and so doubling the speed of the system memory and processor will improve the response time by 1 second. In contrast, the 500 reads required from the disk are performed from a single disk, one at a time, adding approximately 8 seconds to the response time of the query. So, even the small amount of I/O still required by the query has an overpowering effect on the response time, accounting for a full 80 percent of the total time. Doubling the speed of the memory and CPU was clearly not the way to improve this system, because it improved the response time of the query by only 10 percent. A further doubling of speed in these components would yield an additional improvement of only 5 percent. A change in the system that would be more effective in improving the response time of the query would be to find a way to double the speed of the physical I/O, thus reducing the query response time by 40 percent. This is something that is easiest to do during the storage design phase, in order to obtain the greatest impact in performance with the least system interruption.
With most databases being many times the size of all the caches combined, it is clear that significant physical disk access is likely, even if everything is optimally tuned. With this certain knowledge, it is of obvious importance that the disk configuration in a database system be designed, built, and maintained as carefully as any other component in the system.
In addition, the disk on the system is the sole form of persistent storage for the data and is therefore the single most critical component. Although a system can typically be rebooted immediately after a processor or memory failure without further problem,9 a terminal failure in the disk subsystem needs to be repaired before the system can safely be used once again by the business, even though the UNIX system is likely to keep running. If there is no form of redundancy built into the I/O subsystem, this repair is likely to include the dreaded restore from tape.
2.8.1 I/O Busses
We have already touched on the concept of an I/O bus. Owing to the number of connections available on the CPU and memory interconnect, it is not practical to "waste" these specialized connections with potentially tens or hundreds of I/O controllers. While the traffic from the I/O controllers ultimately travels across the system bus, it does not make financial or architectural sense to attach these devices to it directly.
To implement I/O connectivity, most systems have the concept of an I/O or peripheral bus, which is designed to connect several I/O controllers, Ethernet cards, and the like to a single slot on the system bus by means of a bus adaptor. Examples of I/O buses include PCI, S-Bus, VME, and Microchannel.
I/O busses are typically designed for a different goal than that of a processor-memory bus. For an I/O bus, it is more important to be able to have an accessible bus, to ensure that components can be replaced with minimal interruption of system operation. For this reason, the I/O bus is typically longer, narrower (fewer bits), and subsequently slower than the processor-memory bus.
It is also typical for a system to support multiple I/O busses, in order to allow for connection of many disk controllers, network cards, and so on. Although this has been quite necessary in the past owing to the capacity limitations of SCSI busses, it is getting less important as fibre channel becomes the common form of I/O adapter for high-end systems.
2.8.2 Controllers
Several controllers are available for the connection of disk arrays to a UNIX system. The one used most commonly is some form of SCSI, with an increasing trend toward fibre channel.
SCSI Controllers
SCSI (pronounced "scuzzy") stands for Small Computer Systems Interface. SCSI has long been the I/O controller of choice for UNIX systems, due to its comparatively low cost and high performance in a highly concurrent, I/O-intensive environment.
The first implementation of SCSI, now known as SCSI-1, is an 8-bit-wide bus operating at a theoretical maximum bandwidth of 5MB/s. The bus protocol allows a maximum of eight targets (0 to 7) to be connected to the bus, one of which is always the host adapter/controller itself. The inclusion of the host adapter as a target is because the SCSI protocol is, at least in design, a peer-to-peer protocol, with the host simply existing as a unit on the bus. For all practical purposes, however, the host adapter acts very much as the master, with the other devices acting as slaves.
A target typically is the same thing as a physical device, such as a disk. More accurately, a target is an entity that can be physically selected by the bus protocol. In the case of directly attached devices such as disk drives and tape devices, this means that a maximum of seven disks or tapes can be attached to each SCSI bus.
In the case of RAID-based disk arrays, the RAID controller within the disk array could be the target, with multiple logical unit numbers (LUNs) assigned within the array equating to physical disks. The access to the disks within the array is handled by the RAID controller hardware itself.
The limitation of eight targets on the bus comes from the target selection method used in the bus protocol. The target is selected by electrically selecting the physical line (1 to 8) to which the device has been configured on the bus and is therefore limited to the physical width of the bus-in this case, 8 bits. When bus conflicts arise that need to be arbitrated, the highest target number is chosen over the lower ones. For this reason, the highest target ID (ID7) is chosen for the host adapter.
SCSI-2, also known as Fast SCSI, increased the frequency of the bus from 5MHz to 10MHz, increasing the throughput to 10MB/s in synchronous mode. In addition, SCSI-2 introduced the concept of command queuing, which gives SCSI-2 devices the ability to process certain SCSI commands in a sequence different from that in which they were received. This allows the drive to optimize certain operations and minimize excessive head movement where possible. The most important of these optimizations is the ability for a device to accept a second write instruction before the prior write instruction is complete. This allows the communication latency of the write instructions to be overlapped/hidden in the actual write phase of another write. With a SCSI-1 protocol, the second write could not be dispatched until the first one had been signaled complete.
SCSI-2 also comes in a Wide SCSI-2 format, more commonly known as Fast/Wide SCSI. This has been the most commonly available SCSI format for high-end systems for several years. The Wide version of SCSI-2 increases the bus width to 16 bits and allows the concentration of up to 15 devices. The doubling of the data path allows the maximum bandwidth of the bus to go up to 20MB/s.
Even with SCSI-2, systems were rapidly becoming full of SCSI controllers to allow for the required throughput from the disk subsystem. The latest version of SCSI is Ultra SCSI, or SCSI-3. Ultra SCSI provides another doubling in the bus clock speed, taking the frequency up to 20MHz and the capacity of a Wide Ultra SCSI bus up to 40MB/s. The SCSI-3 standard also provides an extension in command set, allowing the host to negotiate optimal behavior from the drive.
At the time of this writing, another standard of SCSI is reaching the marketplace-the confusingly named Ultra2 SCSI. This doubles the data rate once again to a maximum of 80MB/s in synchronous mode. Ultra2 SCSI needs to be implemented over differential SCSI cabling, as opposed to single-ended SCSI cabling. The differential interface has been used for some time in high-end UNIX systems, because it allows longer cables owing to the use of twisted pairs, which eliminates a proportion of electrical noise picked up by the cable.
One of SCSI's most limiting factors for building large systems is the maximum cable length on a SCSI bus. Even using a differential interface to get the maximum distance, the limit for a SCSI-2 implementation is 25 meters. This seems like quite a lot to begin with, but this distance includes all the cables between the drives, the cable between the system and the drives, and a good deal of the circuitry inside the drives themselves. Once all this is factored in, it very quickly becomes a major engineering feat to connect a large amount of disk to a system. This is further complicated when the SCSI bus includes more than one host system, as is the case in a shared disk cluster configuration. In this case, it is common for the clustered systems to be physically clustered around the shared disk cabinets, because the physical proximity reduces the cable lengths required for the connections.
SCSI Throughput
The real-world throughput of a SCSI bus never actually gets to the theoretical bandwidth of the bus. This is attributable to the overhead incurred from the SCSI protocol. The actual throughput that can be observed varies depending on the access patterns.
For a sequential disk access pattern, the bus can be quickly saturated using a small number of drives, because the drives spend most of the time transferring data rather than waiting for seeks and rotational delays. The actual number of drives that will saturate the bus depends on the size of the I/O units requested. In the case of 2KB reads, the overhead of the bus protocol dominates the bus and the effective bandwidth of the bus drops by as much as 60 percent. For larger reads of 32KB, 64KB, or greater, the SCSI overhead is far less, and a bandwidth closer to the theoretical value will be achieved.
For random access patterns, the bus becomes less of a problem because the drives spend a large proportion of the time waiting for seeks and rotational delays before sending data. Even with a modern drive performing 100 reads/s, it would take around 40 drives on a single SCSI bus to saturate the bus with 2KB reads.
In real life, however, it is unusual for a database system to be either sequential or random. Rather, databases tend to use a combination of the two, particularly in transaction-processing environments. The user transactions are typically all small, random units of transfer, whereas the batch cycle has a completely different profile, using large, sequential I/O transfers to process large amounts of data. For this reason, it is very unusual to find a SCSI bus with 40 or more drives attached to it, because the system must be built to process both types of workloads effectively. Oracle helps out with multiblock reads for full table scans, which batch together several smaller reads into one large request. Typically, this larger unit of I/O is returned by the drive in a comparable time as a single block of its constituent parts, owing to the amount of latency incurred regardless of the I/O size requested. Once Oracle is performing multiblock reads, a bus with too many drives on it will quickly become saturated, and queues will build up and all users of disk on that channel will be adversely affected. For this reason, it is rarely a good idea to economize on the number of channels at the cost of the batch process.
Fibre Channel
Fibre channel is a fiber optic link between the host and the disk subsystem. With a theoretical maximum bandwidth of 100MB/s, fibre channel has a large bandwidth advantage over any of the SCSI standards. In addition, the physical attributes of fibre channel make it attractive in the data center:
· Optical transmission ensuring freedom from electrical interference
· Maximum cable length increased to 500 meters
· Thin, flexible cable
These advantages allow far greater flexibility than ever before, including campus-style (in another building) disaster recovery strategies without specialized disk hardware, flexibility over physical disk cabinet placement, and a good deal less cable to keep tidy under the raised floor.
There are two main ways of implementing fibre channel I/O subsystems: point-to-point and arbitrated loop. In a point-to-point configuration, a direct connection is made between the host and each individual disk device. This allows for the full bandwidth of the fibre channel connection to be available between the two points, but in reality no single device will be able to realize the full fibre chanel bandwidth, and it is not practical to connect each device in this way.
In an arbitrated loop configuration or Fibre Channel-Arbitrated Loop (FC-AL), disks and hosts are connected together in a full duplex ring, similar to fiber distributed data interface (FDDI). This configuration (see Figure 2.19) allows fully shared access among all "loop nodes," including hosts, disks, and other peripherals.
Although this topology provides a very flexible and simple way of stringing the system together, there are several drawbacks to this approach. First of all, the full bandwidth of the fibre channel has to be shared among all "nodes" in the loop. While 100MB/s is a good starting budget, this is very quickly consumed in a large-scale system. Second, fault diagnosis becomes significantly more difficult, because failures of devices and/or fiber interconnects are more difficult to locate and could affect the entire system.
A preferable alternative to both of these methods is to create a fibre channel fabric using fibre channel switches and/or hubs (see Figure 2.20).
When this kind of topology is used, all devices within the fabric are visible to others, and the bandwidth is scalable as required by the addition of more fiber connections into the fabric from the host. When a failure occurs in the fabric, the fault can be quickly isolated to the relevant part of the fabric and resolved accordingly. Using multiple switches and alternate paths, a fully redundant I/O system can be built, without the electrical headaches that have long been associated with SCSI.
The switched fibre channel topology is rapidly becoming the preferred method for large systems, despite the higher cost resulting from the use of expensive fibre channel switches. At the time of this writing, native fibre channel disk drives are only just becoming available. As a temporary measure, vendors have been using existing SCSI disks within the fibre channel fabric through the use of fibre channel-to-SCSI bridges.
2.8.3 Disk Drives
At the end of the controller chain lies the individual disk drive. Because the disk drive is the fundamental building block of the system, it's important to have a good understanding of disk drive mechanics in order to build the system most effectively. The disk drive is ultimately the slowest component in the system, and so any gain in disk drive performance will have a significant effect on the performance of the overall system.
A disk drive (see Figure 2.21) is composed of several disk platters stacked one on top of another with enough of a gap between them to fit in a read/write head.
The data is stored on the platter in a series of concentric rings called tracks. Each of the tracks is composed of several sectors. In modern SCSI disks, each of these sectors has a capacity of 512 bytes and represents the smallest unit of data transfer that the disk will process. A group of tracks, all with the same offset but on a different platter, is known as a cylinder because of the shape formed by the stacked tracks.
The platters themselves rotate at a constant speed of 5,400 rpm to 10,000 rpm, while the heads move laterally across the platters in response to commands from the I/O controller.
To access a particular block on disk, the disk head needs to be moved to the track that contains the requested data. This operation is called a seek. After the seek is complete, the head waits for the requested block of data to pass underneath it. This wait period is called rotational delay or rotational latency. Once the data to be read is in position, the head initiates the read, and the data is sent back across the I/O bus to the requesting host. If this were a write operation, the new data would be written to disk instead of being read.
The two aspects described above, seek and rotational delay, constitute the vast majority of most I/O operations and should be the focus of attention in the design of any I/O system.
Seek Times
Several aspects affect how long you spend seeking with your disk drives. The first of these aspects is the usage profile of the disk. At one extreme, if the disk usage pattern is sequential-that is, accessing each block in sequence-then seeking will not be a problem, because the only seeking performed will consist of moving the disk head onto the next track once the current one is fully read. This type of track-to-track seek is typically an order of magnitude lower in latency than a quoted "average seek" time.
The opposite extreme is a random access pattern. Welcome to transaction processing systems. The Oracle buffer cache has already "stolen" any sequentiality out of your disk access (to your benefit, I might add), and the disks are left to supply single rows using rowids gleaned from index lookups. The one thing you can be sure of here is that the block needed will not be on the same track that the head is currently on. Therefore, a certain amount of seeking is guaranteed.
The key here is to control the amount of seeking that is performed. The obvious thing to consider, therefore, is keeping the data on the disk grouped as closely as possible, thus keeping the seek distances minimized. There is one aspect of disk drive mechanics that is essential knowledge when planning your disk placement strategy. Modern disk drives read and write constant bit densities across the entire platter, using a recording format called zone bit recording. For any given square centimeter on the disk, the same number of bits can be stored, regardless of location on the disk. What this means is that tracks near the outer edge of a disk are capable of storing more data than tracks near the inner edge, because more of the 512-byte sectors can be fitted onto each track (see Figure 2.22 ). The implication of this is that more data can be stored within a given number of tracks near the outer edge than can be stored near the center of the disk. It is therefore statistically more likely that if you place your high-activity data at the edges of the disks, fewer and smaller seeks will occur than if you store it near the center of the disk.
In addition to the improved seek times near the outer edge of the disk, the increased density also allows far greater data transfer rates, potentially 50 percent greater or more.
Rotational Delay
Rotational delay is affected by one thing alone-the rotational speed of the drive. If you take a look at a data sheet for a disk drive, the "average latency" statistic is the one describing rotational delay. If this isn't quoted, don't worry: divide 60 seconds by the rotational speed of the drive (in revolutions per minute) and then halve the result. This is the same number-that is, assuming that there is an average of one-half turn of the disk platter on each read. Clearly, the faster the drive spins, the quicker the required sector will arrive under the head.
2.8.4 Disk Drive Sizing
We have seen several disk metrics. As a demonstration of how these disk metrics are employed, let's use a sizing example in which we choose between two drives with the specifications listed in Table 2.4. For this example, we will be using unprotected, plain disk.
The proposed system is believed to exhibit a peak load of 12,500 reads per second on the I/O system, evenly distributed across the 400GB database. A decision needs to be made as to which disk drives to use. Conventional wisdom from the uninformed would dictate a purchase of 22 of the 18GB drives, because this would be the cheapest way to get the required capacity. Herein lies the problem: many people still use the storage capacity of a drive to determine how many drives are required. This would be absolutely the worst choice that could be made.
Assuming now that we are buying according to the number of I/Os per second that the drives are needed to perform, which do we choose? First, we need to know how many reads per second each drive is capable of. To determine this, we calculate how long each read request will take, as follows:

Once we have this time, we can calculate how many reads can be performed per second:

The results of these calculations are shown in Table 2.5.
It can be seen that, although the instinct may be to go with the smaller drive to obtain more spindles, the newer drive is not only faster but also requires fewer controllers and offers a total cost reduction of $64,500. The part of this that most bean counters find hard to swallow is that about 1.5TB of storage space will not be used if drive 2 is chosen. The solution is simple, of course-don't tell them.
The additional benefit of choosing drive 2 in this situation is that only 19 percent of the drive is needed, as opposed to 48 percent of the 4GB drive. This means that significant performance gains can be achieved by using only the outer 19 percent of the drive.
2.8.5 Redundancy
The provision of full redundancy in the storage system rapidly became a hot topic when people started to implement systems with several hundred disk drives. For example, if a system has 500 disk drives in its storage subsystem, each rated at 1,000,000 hours mean time between failures (MTBF), the effective MTBF of all the disks in the system becomes 2,000 hours, or 83 days. It is clearly not acceptable to have the database exposed to a crash every 83 days on average.
The term redundancy is used to describe the use of spare, redundant parts that allow the system to keep running in the event of a component failure. For every fan, cable, power supply, and so on in the system, there needs to be redundant capacity from another part to allow the system to run unaffected by the failure of the primary. Examples of this include fans that can speed up to compensate for other failed fans, twin power supplies that are each capable of supplying all required power, and twin power cables each rated at full current capacity and connected to its own power source.
In the I/O world, redundancy is an art form. Almost every component in a modern disk system can sustain at least one failure. This includes the host I/O controllers (and therefore the cables attached to them), the power supplies of the disk cabinet, and the disks themselves. One of the most common ways of providing disk-level redundancy is the use of RAID.
2.8.6 RAID Levels
In 1987, Patterson, Gibson, and Katz of the University of California, Berkeley published a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)." This paper was written to demonstrate the case for using cheaper, commodity disk drives as an alternative to the very expensive disk drives found in large mainframe and minicomputer systems.
The authors proposed that, by combining several these cheaper devices, the performance and reliability of the combination can match or exceed those of the more expensive devices. At the time the paper was written, the commodity disk drive was a 100MB disk drive costing $1,100, compared with an IBM 3380 disk drive with 7.5GB capacity and a price of more than $100,000.
The real driving force behind the research that went into this paper was that while CPU and memory speeds were increasing dramatically year after year, the speeds of disk systems were not advancing at anything approaching the same rate. As already described earlier in this chapter, any increase in processing speed will ultimately be restricted if the operation has to include a slower device.
The idea behind the research was to configure a large number of comparatively inexpensive SCSI disks to look like one very fast, high-capacity disk, rather like the IBM disk. Once many disks are combined in this way, however, the aggregate MTBF for the disk system is substantially reduced. The example in the paper shows the drop in mean time to failure (MTTF) to 2 weeks for the combined array of 100MB disks. Clearly this is not acceptable, and so the paper goes on to describe different techniques (now known as RAID levels) for increasing the reliability while still retaining the performance.
These RAID levels are identified by number, 0 through 5, and have varying degrees of success in balancing price, performance, and reliability. RAID levels 2, 3, and 4 are not commonly implemented, although some RAID disk arrays use a RAID-3 variant to obtain the results they require. For the sake of simplicity, I will concentrate on RAID levels 0, 1, and 5, because these are the most commonly used levels.
RAID-0
RAID-0 is not formally defined in the Patterson et al. paper. However, it is named as such because it conforms to the spirit of the paper-using multiple disks to achieve higher aggregate performance. In a RAID-0 configuration, multiple disks are configured together as a set, or a "bank," and data from any one datafile is spread, or striped, across all the disks in the bank. For the sake of example, we will use 64KB stripes over a six-disk stripe bank, as shown in Figure 2.23.
This type of organization has made RAID-0 more commonly known as striping, because of the way the data exists in stripes across the physical disks. Using striping, a single data partition is physically spread across all the disks in the stripe bank, effectively giving that partition the aggregate performance of all the component disks combined.
The unit of granularity for spreading the data across the drives is called the stripe size or chunk size. Typical settings for the stripe size are 32K, 64K, and 128K.
If one were to perform a 384KB read from the file in Figure 2.23, starting at zero offset (i.e., at the beginning of the file), this would be translated as six physical reads from the respective disks. Disk 1 supplies bytes 0 to 65,535, disk 2 supplies 65,536 to 131,071, disk 3 supplies 131,072 to 196,607, and so on. The impact of this differs (yet again) depending on the access pattern.
If the disk access is sequential, it is bound by the transfer time of the actual data. That is, most of the time spent servicing the request is spent transferring data rather than waiting for a physical data rendezvous. In a striped configuration, all the disks that compose the stripe can transfer data concurrently, making the transfer rate the aggregate rate of all the drives. For example, if a file on one disk can be sequentially scanned at 3MB/s, then the same file striped across six disks can be sequentially scanned at a rate of 18MB/s. The achievable bandwidth is still subject to other constraints, such as SCSI bandwidth limitations, but, with careful planning, significant gains in throughput can be made using striped disk.
With a random access pattern (a.k.a. transaction processing), the gains are less quantitative, but significant nonetheless. However, if a single user were to make random read requests against a striped file, virtually no difference in performance would be observed. This is because the gains that are made in a random access environment have much more to do with concurrency than with bandwidth.
The first benefit that striping provides in a random access environment is load balancing. In a database environment, some files are always busier than others, and so the I/O load on the system needs to be balanced. Unfortunately, this is never as easy as it would first appear, owing to the way database files are accessed. It is very typical for a file to be busy in just a small portion (maybe 1MB) of its total volume. This is true even if the file contains only one database object. In a nonstriped environment, this will cause a hot spot on the disk and result in poor response times.
Using a striped disk array, the exposure to hot spots is significantly reduced, because each file is very thinly spread across all the disks in the stripe bank. The effect of this spreading of files is random load balancing, which is very effective in practice.
The second advantage of striping in a random access environment is that the overall concurrency of each of the files is increased. If a file were to exist on a single disk, only one read or write could be executed against that file at any one time. The reason for this is physical, becasue the disk drive has only one set of read/write heads. In a RAID-0 configuration, many reads and writes can be active at any one time, up to the number of disks in the stripe set.
The effects of striping in both sequential and random access environments make it very attractive from a performance standpoint. There are very few situations in which striping does not offer a significant performance benefit. The downside of striping, however, is that a striped configuration has no built-in redundancy (thus the name RAID-0) and is highly exposed to failure.
If a single disk in the stripe set fails, the whole stripe set is effectively disabled. What this means is that in our set of six disks, the MTBF for each disk is now divided by 6, and the amount of data that is not available is 6 times that of a single disk. For this reason, RAID-0 is not often found in plain vanilla RAID-0 form, but is combined with RAID-1 to provide the necessary protection against failure.
RAID-1
RAID-1 (see Figure 2.24) is more commonly known as mirroring. This was not a new concept in the paper by Patterson et al., but rather a traditional approach to data protection. Mirroring involves taking all writes issued to a given disk and duplicating the write to another disk. In this way, if there is a failure of the first disk, the second disk, or mirror, can take over without any data loss.
RAID-1 can be implemented using volume management software on the host computer or using a dedicated intelligent disk array. The details of the implementation vary slightly depending on the method used, and thus need to be described separately.
A full host-based RAID-1 implementation uses dedicated controllers, cables, and disk drives in order to implement the mirror. This provides two positive benefits to the configuration. First, it protects the mirrored disk from failure of any of the primary components. If a controller, cable, or physical disk fails in the primary disk array, the mirrored disk remains isolated on its own controllers. This type of configuration is a full RAID-1 configuration-all components in the I/O infrastructure are protected by the presence of a redundant peer. The host ensures that both sides of the mirror are consistent by issuing all writes to both disks that comprise the mirror.10 This does not mean that all writes take twice as long, because the writes are issued in parallel from the host, as shown in Figure 2.25.
When the write is ready to be issued, it is set up and sent to the first disk. Without waiting for that write to complete (as would be the case in a sequential write), the write to the second disk is initiated. Therefore, the "write penalty" for mirrored disk is significantly less than the expected 100 percent.
When a host needs to read a mirrored disk, it takes advantage of the fact that there are two disks. Depending on the implementation, the host will elect either to round-robin all read requests to alternate disks in the mirrored set or to send the request to the drive that has its head closest to the required track.
Either way, a significant benefit is gained by using the disks on an independent basis for reads. If the reads are sent round-robin, a 100 percent gain in read capacity is possible.11 If they are optimized to minimize head movement, the gain could potentially be greater than 100 percent.
Failure of any component on one side of the mirror will cause the system to start using the other side of the mirror exclusively for all read and write requests. During this period, the read capability of the system is substantially reduced, because only one disk is available for reading.
Once a failed side of the mirror is restored, by replacing or repairing the failed component, the contents of the new or repaired disk are considered stale in comparison with the active side. The writes that have occurred while one side of the mirror was unavailable have to be applied to the newly restored side in order to bring the mirror back online.
Typically in host-based solutions, bringing a disk back online will involve physically copying of the surviving mirrored disk onto the replaced peer. This is necessary because the system has no record of what has changed since the disk failed and so must take a brute-force approach.
This process is called resilvering, to further the mirror analogy, and can take a considerable amount of time when the system is still servicing its normal workload. This can also present response time problems, because the active side of the mirror is servicing all user requests and supplying the data for the full copy. Often, there are options available as to how quickly the resilvering is performed. This is a fine balance, because it is undesirable to be unprotected for any extended period (as you are when the other side of the mirror is stale), but it is also undesirable to resilver too aggressively, because the system will suffer poor response times as a result.
There is another situation that can cause resilvering in a host-based solution. When the host system crashes, there is no guarantee that writes were performed to both disks that constitute the mirror. Therefore, the system cannot rely on the integrity of the data between the two sides of the mirror and must take action to ensure that the data between the two sides is clean. The approach taken to achieve this is simply to designate one side of the mirror as clean and the other side as stale. This forces a resilvering, and the mirrored disk can once again be guaranteed to be consistent with itself. This can be a catastrophic occurrence in a production system, because it is likely that most disks in the system will require resilvering, and you must pay this penalty after getting your system restarted after the crash.
Some of the problems associated with host-based mirroring are resolved when the mirroring is performed within an intelligent disk array (see Figure 2.26).
In this configuration, a slightly different approach needs to be taken from the host's perspective. All of the mirroring is taken care of inside the disk array, with the host aware of only a single target to read from and write to. As two connections are no longer necessary between the host and the disk, the failure of the single connection means that both the mirrored disks are unavailable.
To get around this problem, the concept of an alternate path is introduced. During normal operation, both of these channels are used to read from and write to the disk. On failure of one of these channels, the other channel takes on the entire workload until the failed channel is restored. In this way, the connections to the disk are once again fully protected, and the full bandwidth of two channels is available once more.
Within the disk array, the read and write operations are converted into discrete requests to the relevant disks, without the host being aware that there is more than one disk. The write activity is still subject to the overlapped write penalty as the host-based mirroring, unless caching is used within the disk array.
There are essentially two benefits of performing the mirroring within the disk array itself. First, there is no processing overhead incurred by the host to manage the mirrored I/O. Although this is not a very high overhead, in a very-large-scale system it is important to maximize the available processing capacity wherever possible.
Second, the mirror is not exposed to problems resulting from crashing of the host. If the host crashes, the mirror does not become stale, because the state information is not stored on the host and therefore is unaffected by any host operations, including crashes. This single attribute makes disk-array-based (commonly called hardware-based) mirroring very attractive for building a highly available system.
Other advantages of using a hardware-based mirroring policy come into play when the disk array includes a quantity of cache memory. These advantages will be discussed in Section 2.8.8.
The downside of disk mirroring is that the cost of the entire disk infrastructure is exactly 100 percent greater, with no gain in storage capacity. This having been said, in a large-scale database environment, the storage capacity is often a concern secondary to the service capacity of the disk system. Mirroring does provide a significant uplift in the read capability of the system and so provides performance improvements in addition to absolute protection of the data.
RAID-0+1
RAID-0+1 is another RAID level that was not described by Patterson et al. Based on the "made up" RAID-0, RAID-0+1 is exactly what its name implies: striped and mirrored disks.
Figure 2.27 shows a RAID-0 implementation comprising a set of six striped disks mirrored over discrete SCSI channels to an identically configured mirror of the stripe set. This is the most common configuration used in high-end transaction processing environments, because it presents excellent performance and reliability.
The RAID-0/RAID-1 combination has started to be called RAID-10. This has nothing to do with RAID levels, of course, but with the marketing departments of RAID hardware vendors. Still, it's easier to say than "0+1," so we'll adopt its use in this book.
RAID-10 can be implemented using various combinations of hardware-based RAID and host system software. For the purposes of this book, we will concentrate on the two most common implementations, because they also demonstrate an important operational difference between them.
The most common environment for the first of these implementations is a 100 percent software solution, with the striping and mirroring both handled by software on the host. In this configuration, the stripe set is first created and then mirrored. In this configuration, it is the entire stripe that is mirrored, and any disk failure in the stripe set will take that entire side of the mirror offline. For example, if any disk in stripe set A in Figure 2.28 fails, the entire A side of the mirror will become unavailable.
If a failure were then to occur anywhere in the remaining side of the mirror (side B), the entirety of side B would also be taken offline. At this stage, all of the data (A and B) is unavailable and logically corrupt.
The other common implementation of RAID-10 involves a combination of hardware-based RAID and host software. The RAID-1 component is handled by the hardware, and the RAID-0 striping is taken care of in software on the host. The important difference between this configuration and the 100 percent software solution is that in this configuration it is the disks themselves that are mirrored, not the stripe. What this means is that data loss can occur only if the corresponding disk on the other side of the mirror goes bad, not just any disk. For example, in Figure 2.28, if disk 1A fails, it would take a failure in disk 1B for data loss to occur. No disk other than 1B would make any data unavailable.
Obviously, this issue comes into play only in a double failure situation and is therefore not very statistically likely to cause a problem. However, in the stripe-centric configuration, there is an n-fold greater likelihood of data loss if a double failure occurs, where n is the number of disks in the stripe.
2.8.7 RAID-5
The goal of the RAID-5 design was to provide a reliable, high-performance array of disks with the minimum amount of redundant hardware. RAID-5 is based on the use of parity protection across several drives in order to provide protection against disk failure. The RAID-5 configuration (see Figure 2.29) is essentially a striped configuration with an additional disk added to cater to the additional storage needed for the parity information.
With the data striped across the drives in this way, the read performance of RAID-5 is comparable to that of RAID-0. A five-disk RAID-5 configuration will perform practically the same as a four-disk RAID-0 configuration, because no performance is gained from the parity disk.
Parity protection uses the storage of a mathematically derived "check value" for each stripe of actual data. This block of calculated parity information is then stored on one of the disks in the array. The actual disk chosen to store the data is changed on a round-robin basis, so that one disk does not become the bottleneck in a write scenario (see below).
In Figure 2.29, blocks 0, 1, 2, and 3 are used to generate the parity information stored on disk 5. RAID-5 uses an exclusive-OR method of parity generation, which allows any value to be recalculated from the remaining parts. The exclusive-OR operation (XOR or ) is a Boolean operator that returns 1 when one and only one of the bits compared is a 1. What this means in real terms is that the following formulas are possible:

and

You can try this reversible operation on any scientific calculator to make sure that you are happy that this "just works." With this ability to recalculate any missing values, RAID-5 can sustain the loss of any single disk in the stripe with no loss of data. However, when this occurs, RAID-5 runs in a "degraded mode." This degradation is significantly worse than, for example, losing one side of the mirror in a RAID-1 configuration.
When a read is issued to a degraded RAID-5 array, the array must rebuild the missing data from the failed disk. To do this, it must read the corresponding blocks of data from every other disk in the array. In our example, if the middle disk were to fail, and a read of block 2 were issued, disks 1, 2, 4, and 5 would all need to be read in order to recalculate the missing data from the remaining data plus parity information.
When the disk is replaced, the disk must be rebuilt by using all of the other member drives to recompute the missing data. This can be an extremely lengthy process on a busy system.
RAID-5 writes, on the other hand, are almost legendary for their poor performance. The "RAID-5 write penalty" is well known in most storage circles. When a write occurs on a RAID-5 volume, the block that it replaces must first be read, in addition to the parity block for that stripe. The new parity can then be calculated from this as:

The new data and the new parity information must now be written as an atomic operation, because any partial writes will corrupt the data. Therefore, a single write within a single block on a single disk equates to a minimum of two reads and two writes, to keep the array up to date.
The necessary writes of the parity information are the reason that the parity information is spread across all the disks in the array. If all the parity information were stored on a single disk, that disk would ever be capable of executing one read or write concurrently. This implies that only one concurrent write could occur in the entire array of disks, because all writes involve a write to that one parity disk. By spreading the parity over all the disks, depending on the block accessed, the theoretical number of concurrent writes at any one time is the number of disks divided by 2.
It is most common for RAID-5 to be implemented in hardware for several performance-related reasons, the most significant of which is the sheer number of I/O requests that need to be performed in order to operate in RAID-5 mode. For example, one write takes at least four I/Os, assuming that the I/O does not span more than one drive. If the RAID were implemented in software, all of these I/Os would need to be set up by the operating system and sent across the bus. In comparison with the I/O overhead, the task of processing the XOR operations is insignificant.
The final reason that RAID-5 lends itself to hardware implementation is that it benefits greatly from caching the I/O requests, especially for the write requests. For anything that is sensitive to slow response time from the disk (such as redo logs), the use of a write cache is essential.
RAID-5 works well for applications that are mostly read-only, especially very large data warehouses in which the data must be protected with a minimum of redundancy. In large-scale transaction processing environments, a pure RAID-5 solution does not typically provide the performance that is required. If RAID-5 is the only possibility within the budget, a hybrid solution is recommended that stores redo logs, rollback segments, and temporary tablespaces on RAID-1, and the rest on RAID-5. This addresses the most significant problems of RAID-5 in an OLTP environment by keeping the heavy write activity away from RAID-5.
2.8.8 Cached Disk Arrays: EMC Symmetrix
A cached disk array, such as a member of the EMC Symmetrix family, provides several interesting variations on the standard disk technologies presented in this chapter. Although these units can be more expensive than other solutions, the increases in flexibility and performance can often justify the additional cost. The EMC units, in particular, extend beyond the capabilities of raw disk devices and justify this dedicated coverage.
The architecture of the EMC units (see Figure 2.30) is very similar to that of a multiuser computer, only this is a computer that communicates with SCSI commands rather than with keystrokes on a terminal. The "users" of the computer are the database server hosts.
The EMC Symmetrix disk system looks a little different from conventional disk arrays. The fundamental difference is that the cables from the host are never physically connected to the actual disk storage. All communication between the host and the storage goes through the cache. This decoupling of the physical connection allows several enhancements of the functionality of the disk array and provides a large cache to improve the performance of the system.
When a read request comes in, the operating system in the Symmetrix checks to see if the requested data is already in memory. If so, the request is immediately serviced without being subject to seek or rotational delays. If not, a read miss occurs on the cache and a physical I/O is issued to the disk. When the data is returned from the disk, it is loaded into the cache and returned to the host. The software in the system keeps track of these accesses and attempts to determine patterns in the access. For example, in a RAID-1 configuration, the system will try to determine whether it is optimal to read always from one side of the mirror, always round-robin, or a combination of the two based on minimization of head movement. Another pattern that is acted on is that of a sequential read. If a sequential read pattern is determined, the system will go ahead and prefetch several tracks, up to a finite maximum, in order to try to predict the next request.
When a write is issued from the host, the data is written to memory and immediately acknowledged back to the host. The external controllers do not have any concept of the physical storage and are concerned only with reading and writing the cache. The cache buffers that are now dirty are scanned periodically by the internal SCSI controllers, and any dirty blocks are flushed to disk, depending on the activity of the system. If the system is too busy, the writes may be deferred until later.
The EMC system guarantees that the write will eventually make it onto disk, because it has a built-in battery power supply that provides sufficient power to run for a few minutes and fully flush all dirty cache buffers to disk prior to power down. Any reads that occur before the data is written to disk will read the correct data from the cache, and so the data returned will always be valid.
The benefit of the cache will vary from application to application. However, two things are certain:
1. Writes will be significantly faster.
2. The Oracle buffer cache will already have taken care of a good deal of the locality gains in the data access. Don't expect cache hits ratios in the 90 percent range.
In a transaction processing environment, the ability to write out the redo log entries faster will directly benefit the response time of the system.
The EMC Symmetrix supports two levels of RAID: RAID-1 and RAID-S. RAID-1 we have already discussed, but RAID-S is a new concept. Essentially, RAID-S is an optimized implementation of RAID-3, which in turn is virtually identical to RAID-5 but with a dedicated parity disk. The enhancements provided by EMC include
· Faster writes than RAID-5
· Automatic reconfiguration into plain disk on disk failure
The writes have been enhanced by performing the XOR operations at the disk drive level. This eliminates one of the I/O operations, leaving a total of three operations per write. Although this improves the write performance, it is still substantially slower than a RAID-1 configuration. The big improvement in write performance on the EMC unit is the use of the cache to buffer the writes. If the write activity comes only in spikes, the cache will successfully hide the high latency of the actual write.
When a disk in a RAID-S rank fails, the EMC provides the missing data through XOR computation, in the same way as RAID-5. However, when the new value is calculated, it is stored on the parity drive to eliminate any future calculations for that data. Eventually, the entire rank gets converted back to plain (non-RAID) disk in this way.
Another big benefit of the decoupled storage in the EMC is its ability to present hypervolumes to the host. A hypervolume is a logical extension of the physical disks, in that one disk can be represented at the external interface as several physical disks, all with different physical identifiers on the bus. Likewise, the disks are not physically connected, and so the identifiers for the storage are configured in software, allowing strong naming conventions.
This flexibility in how the storage is presented to the server can also present problems if not carefully managed. For example, when you carefully split up your most active database files onto two different disks, make sure that the two disks are not in fact the same disk once you get inside the EMC box. Also, despite any assumed gains from the cache layer, it is important always to lay out your database carefully across the physical disks, ignoring the presence of the cache. Treat any caching benefits as a bonus, not something that you can rely on to bail you out of a poor disk layout.
Finally, the Symmetrix units provide some useful value in the form of extensions of the usual services provided by disk arrays. The first of these is Symmetrix Remote Data Facility (SRDF), which provides additional levels of mirroring between cabinets. This facility allows an additional mirror to be set up in a disk array that is remote from the primary, either by ESCON channel within campus or by a WAN link to a very remote unit. There are three different methods for operating the SRDF facility: synchronous, semisynchronous, and adaptive copy.
In synchronous mode, every write must be guaranteed to be written to the caches on both systems before the host receives an acknowledgment. This is the "safest" mode of operation but also the one that could impose the largest latency on the host.
In semisynchronous mode, the write is acknowledged by the host as soon as the local cache has received it. The remote cache is updated asynchronously to the actual write request, and so the performance is the same as that of a non-SRDF-based system. No further writes will be accepted for that particular piece of disk until the remote cache is consistent. The key to this working effectively is the introduction of a track table.
The EMC unit maintains a track table for every track of every disk in the system. This is simply a list in the cache of every track in the system, with a status against it. This status is used to determine whether the version of the track is the same on mirrored peers, whether local or remote. In the case of the semisynchronous writes, the track table stores the status of the track in comparison with the remote track and determines whether the write is complete.
The adaptive copy mode of SRDF is designed for one-time updates from the local volumes to the remote volumes. This mode compares the track tables on both systems to determine which tracks need to be copied to the remote system to bring it up to date. In this way, the amount of data to be copied is kept to a minimum.
An extension of the adaptive copy mode of SRDF is the Business Continuance Volume (BCV), created by the Timefinder product. This is essentially SRDF within the local cabinet and is designed to allow fast backups of the data to be taken.
In an Oracle environment, the database is put into hot backup mode, and a Timefinder copy is made from the primary database volumes to the BCV copy volumes. These are unmirrored volumes that are updated in a way similar to the adaptive copy mode of SRDF-via the track table. Using the track table, the copy can be substantially faster than if the entire data set were copied, because it is typical for only a portion of the datafiles to be written to during the business day.
Once the BCV has been synchronized with the primary volumes, the updates are halted once again (known as splitting the BCV), and the database is taken out of backup mode. During the split, it is likely that there will be some impact on the Oracle database. The reason for this is a small implementation difference between the SRDF function and the BCV function.
When SRDF copies a track from one cabinet, the net effect is that there is a clone of the track buffer on the remote cabinet. With this clone present, the SRDF volume can be split from the master volume without flushing anything out to physical disk. Unfortunately, BCV does not have clone buffers, because clone buffers are not applicable for copies within the cabinet. The effect of this is that all dirty buffers must be flushed to physical disk before the BCV volume can be dissociated from the shared, single-track buffer.
Once the split has occurred, the BCV volumes can be archived to tape without any impact on the database operation, provided the shared busses within the cabinet are not saturated. This backup can also be performed from another machine that imports the BCV volumes after the split is complete.
2.9 Chapter Summary
The purpose of this chapter has been to give an overview of the concepts used in modern hardware architectures. While some of this information may remain as background knowledge, it is important to have a good understanding of the concepts and architectures used in order to make educated decisions about purchasing and configuring the right hardware for the job at hand.
Above all, when selecting and configuring hardware for a large system, keep it simple. If you invest the time to lay things out in a logical and consistent manner, it will pay dividends over and over again during the operation of the system.
2.10 Further Reading
Lenoski, D. and W-D. Weber. 1995. Scalable Shared-Memory Multiprocessing. Burlington, MA: Morgan Kaufmann.
Patterson, D. and J. Hennessy. 1996. Computer Architecture: A Quantitative Approach, Second Edition. Burlington, MA: Morgan Kaufmann.
Schimmel, C. 1994. UNIX Systems for Modern Architectures. Reading, MA: Addison-Wesley.
Wong, B. 1997. Configuration and Capacity Planning for Solaris Servers. Upper Saddle River, NJ: Prentice Hall.
1
Gordon Moore, Chairman of Intel in the 1960s, stated that the transistor density (and therefore potential power) of a CPU would double roughly every 18 months.

2
Some busses allow the splitting of a bus transaction into a packet-based protocol to achieve greater bandwidth.

3
CPUs without test and set use other atomic instructions to achieve the same result. However, the test-and-set operation remains the best view of the concept.

4
Cache affinity is discussed in Section 2.3.

5
Although this statement is true in theory, it is likely that vendors choose to allow some visibility of the hardware changes to the O/S to allow for further software optimization.

6
Most vendors wrote their own DLMs for Oracle7, but some adopted a DLM written by Oracle, known internally as the UNIX DLM, instead. Notable among these were HP and IBM.

7
First-generation systems had a remote:local latency of around 10:1 (ten times slower for remote access), current systems are around half that (i.e., five times slower), and future systems are likely to be in the 2:1 range.

8
The structure is padded with unused memory allocations to ensure only one latch per cache line. On systems with very long cache lines, such as Alpha-based systems, this "wastes" a larger proportion of the usable cache size than systems with smaller line sizes.

9
Assuming that the failed component is deconfigured from the system after the initial crash.

10
It is typical for a mirror to be made up of two "sides"-a primary and a mirror. Most mirroring implementations also support maintenance of three or more sides. This is generally used to implement fast backup solutions, where the third mirror is detached and backed up offline from the online database.

11
This 100 percent gain assumes that no writes are occurring at the same time.



Scale Abilities Ltd
http://www.scaleabilities.co.uk
Voice: +44 1285 644533
info@scaleabilities.co.uk
TOC PREV NEXT INDEX