Hardware Architectures and I/O Subsystems

This chapter will provide an overview of UNIX server hardware architectures. In the interests of greater scalability and lower price/performance, the architectures available today are not as straightforward as they were even three years ago. It is important to understand the distinctions among these platforms and how to configure them for maximum performance.
2.1 Introduction to Hardware Architectures
Gone are the days of the uniprocessor. Going are the days of the shared-bus symmetric multiprocessor. Building more and more powerful systems at lower and lower cost is very challenging for the hardware architect. Moore's Law has been holding very true from a processor perspective, with virtually no impact on the retail cost of the hardware. The rest of the system, however, has been lagging behind the processor advances. For example, memory speeds are certainly not doubling every 18 months, and disk speeds are lagging even farther behind. The traditional method of connecting multiple processors and memory cards together over a shared bus started to hit the practical limits of known physics sometime around 1996 (more on the specifics in "Bus Architecture" in Section 2.1.2).
Mismatches in performance among system components, combined with the economic driving force toward production of ever cheaper hardware, have forced hardware architecture in certain directions. In order to walk our way through these new architectures, it makes sense to sort them by level of complexity, and so our path goes as follows:
1. Single-processor architectures
2. Symmetric multiprocessors, including shared-bus and crossbar switch systems
3. Clustered Symmetric Multiprocessors
4. Massively Parallel Architectures
5. Nonuniform Memory Architectures
For each of these architectures, we will look at the hardware and software modes that comprise it and at how Oracle is implemented on such a system.
Before we jump in, there are several concepts that you need to understand in order for the descriptions of each architecture to be the most useful.
2.1.1 System Interconnects
Even within the most fundamental modern computer there is some kind of system interconnect. What is an interconnect? It's a communication mechanism shared by several components, designed to allow any-point to any-point communication among the components. By definition, the system interconnect is very much a shared device, so it often becomes the subject of scaling conversations.
There are essentially two different reasons for system interconnects, and all systems have to have at least one of them:
1. Connecting components within the system
2. Connecting systems together
When we talk about interconnects connecting systems together, we do not include LAN connections. Interconnects provide more of an intelligent link between the systems than a LAN provides, typically at lower latency and greater bandwidth.
The first type of interconnect, one that connects the components within the system, is the one in which we are primarily interested in this book, although we will address the second type when we discuss clustered systems (Section 2.5) and massively parallel processor (MPP) systems (Section 2.6).
For many years, the principal method of connecting components together within a system has been the shared system bus. A bus is the simplest type of interconnect and is the one on which most high-end UNIX kernels have been engineered to work. Therefore, many of the operations found in modern interconnects are directly synonymous with their old shared-bus equivalents. For this reason, we will concentrate on a bus architecture for the concepts, diversifying where necessary to incorporate newer interconnect technology.
2.1.2 Bus Architecture
A bus must support several concurrently connected devices, such as a CPU, memory, and I/O controllers. In order to achieve this, and to allow the devices to communicate without interfering with each other, communication across the bus is handled by means of bus transactions. A bus transaction is very similar to a database transaction in that it is an atomic packet of work that is initiated and completed without interruption. All devices on the bus communicate across it in this way.
For example, if a CPU on the system needs to load a cache line from main memory, it will initiate a read across the bus. The bus logic will turn this into a bus transaction to perform the read and will send the request over the bus. At this stage, it's safe to think of the bus as being arranged as shown in Figure 2.1.
The CPU will then wait while the bus logic sends both the request to the memory controller and the data back to the CPU from the appropriate memory locations.
I have used the term "Magic Transport" in Figures 2.1 and 2.2 to allow these illustrations to represent bus-based interconnects or indeed any type of interconnect method. In the case of a bus-based interconnect, the magic is simply a backplane within the system that has all devices connected to it. In this way, any device is connected to all other devices.
This is where the "practical limits of physics" mentioned earlier come into play. A bus is essentially a load of wires, one for each bit of the n-bit width of the bus, plus some extras. These wires (or tracks on a circuit board) all run in straight lines, parallel to each other, with periodic "breakouts" to connectors that allow devices to be connected to the bus.
If we ignore all the other electrical challenges of building such a bus, a limitation still exists. That limitation is the speed of an electron. All electrical flow is made up of electrons passing through a conductor, and the electrons are capable of traveling only up to a fixed maximum speed. Within a bus, all messages (electrical signals) must arrive at all devices within one clock cycle of the bus. Therefore, the electron must be capable of traveling the full distance of the bus in the time window specified by the clock rate of the bus.
To highlight how limiting this is, we will go through an example that, instead of the speed of an electron, uses the fastest known speed: the speed of light (the speeds are similar anyway). Let's assume that we have a "light-based computer," which is identical to today's shared-bus computers but uses light instead of electrons to transport the signals. The light all travels in a vacuum, and there is no overhead introduced anywhere, so the speed of light within the machine is 2.99792458 × 10^10 cm/s, or around 300,000 km per second. If the bus is running at a rate of 200MHz (a mere fraction of today's CPU core speeds), each clock period lasts 5 ns, and so the maximum width of this bus would be 150 cm, or about 5 feet.
Clearly, we are some way from a light-based computer, and even if we had one, it doesn't look promising for the long-term future of a shared-bus architecture. Five feet is not very wide for an absolute maximum size backplane, especially once you start to reduce it due to the internal signal path length within each card.
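The arithmetic behind the 150 cm figure can be reproduced with a short sketch. The function name max_bus_length_cm is hypothetical, and the model assumes, as the example above does, that a signal must cross the whole bus within one clock period at the speed of light in a vacuum. The other clock rates in the loop are illustrative values only.

```python
# Speed of light in a vacuum, in centimetres per second (value quoted in the text).
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10


def max_bus_length_cm(clock_hz: float) -> float:
    """Distance light covers in one clock period of a bus running at clock_hz."""
    clock_period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_CM_PER_S * clock_period_s


# Illustrative clock rates; only 200 MHz comes from the text.
for mhz in (50, 100, 200, 400):
    length = max_bus_length_cm(mhz * 1e6)
    print(f"{mhz:>4} MHz bus: at most {length:6.1f} cm ({length / 30.48:4.1f} ft)")
```

Doubling the clock rate halves the permissible length, which is why faster buses have to become physically shorter.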
Now that we have a simplistic view of a bus from a point-to-point perspective, we can complicate the issue by adding more devices to the bus. The bus now becomes a many-to-many communication device and thus needs some way to manage multiple requests at once, in order to prevent the devices from corrupting each other.
This can be achieved in either of two ways. The first, simplest, cheapest, and least scalable way is to have a single bus master. A bus master is a device capable of initiating reads and writes across the bus. A CPU has to be a bus master, and so to limit ourselves to a single bus master is to limit ourselves to a single CPU, which is clearly not very scalable. In order to get around this restriction, the bus needs to support bus arbitration.
Bus arbitration allows multiple bus masters to exist on the bus but allows only one device to perform a transaction at any one time. The arbitration logic within the bus prevents concurrent transactions and shares out the access among the bus masters in a (randomized) fair-share manner. A good way to view the bus at this point is shown in Figure 2.2.
Now we have a concurrent, multiple bus master configuration. All access to the bus is now serialized, allowing atomic memory operations. These atomic memory operations are essential for implementing any kind of locking mechanisms in the software layer.
One such locking mechanism uses the test-and-set machine code instruction available on many CPUs. This is an atomic instruction that tests a memory location for a value and, depending on the result, changes the value to something else. This is used as "test to see if the lock is free, and if so, give it to me," and is an implementation detail that allows processes to "spin on a latch" during active waits.
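The idea of spinning on a latch with test-and-set can be sketched in a few lines. This is a simulation under stated assumptions, not real hardware behavior: Python has no test-and-set instruction, so the hypothetical SimulatedBus class uses a threading.Lock to stand in for the bus arbiter that serializes transactions, and Python threads stand in for CPUs. All names here are illustrative.

```python
import threading
import time


class SimulatedBus:
    """Models a bus whose arbiter lets only one transaction run at a time."""

    def __init__(self):
        self._arbiter = threading.Lock()   # stands in for the bus arbitration logic
        self.memory = {"latch": 0, "counter": 0}

    def test_and_set(self, address):
        """Atomically read the old value at address and set it to 1."""
        with self._arbiter:
            old = self.memory[address]
            self.memory[address] = 1
            return old

    def clear(self, address):
        """Release the latch by writing 0 in a single bus transaction."""
        with self._arbiter:
            self.memory[address] = 0


def worker(bus, increments):
    for _ in range(increments):
        # Spin on the latch: keep retrying until the old value was 0 (free).
        while bus.test_and_set("latch") == 1:
            time.sleep(0)                  # yield so the holder can make progress
        # Critical section: a read-modify-write that is not atomic by itself.
        bus.memory["counter"] = bus.memory["counter"] + 1
        bus.clear("latch")


def run(num_cpus=4, increments=200):
    bus = SimulatedBus()
    threads = [threading.Thread(target=worker, args=(bus, increments))
               for _ in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus.memory["counter"]


print(run())   # 4 simulated CPUs x 200 increments each
```

Because only the caller that sees the old value 0 enters the critical section, no increment is lost, and the final count equals the number of CPUs multiplied by the increments each performs.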
2.1.3 Direct Memory Access (DMA)
In order to increase the concurrency of the system and decrease the load on the CPUs, there is the concept of "intelligent" I/O controllers that do not require intervention from the CPU to do their work. All that the CPU is required to do is to set up the memory required to do the work and initiate a request to the controller to execute the work.
Let's use a SCSI (Small Computer System Interface) controller as an example of this. Figure 2.3 shows a non-DMA-based write operation from memory through the I/O controller.
All of the data needs to go through the CPU before being written to disk. The I/O controller is just a dumb controller that waits for data submissions and requests.
The DMA controller eliminates this requirement by building some intelligence into the controller itself. The card is configured with full bus mastering logic and maintains a memory map of the system memory. When the CPU needs to initiate an I/O request, it simply posts the DMA I/O controller with the location of the data and then resumes work on other tasks, as shown in Figure 2.4. Once the I/O is complete, the controller sends a hardware interrupt to the CPU in order to signal completion.
2.1.4 Cache Coherency
Cache coherency is an essential attribute in shared-memory multiprocessors. A system exhibits cache coherency when it is guaranteed that a read of any memory location, from any processor, will always return the most recent value of that location.
Cache coherency matters because of the way most modern processor caches operate: with multiple processors potentially reading and writing the same physical memory location at any one time, different processors can easily see different values for the same location unless steps are taken.
We have seen that caches form an essential part of all high-performance systems, because they mitigate the differences between the memory speed and the core speed of the CPU. However, when there are multiple caches in the system, each of these caches could potentially have different values for a given memory location. To compound this, caches are typically operated in write-back mode to gain the most benefit from the cache. In this mode, when a processor changes the value of a memory location, the cache does not necessarily write the new value back to memory at that time, in order to minimize the wait time on slow memory devices. The main memory is normally updated at a later time, typically when the cache line is evicted. So at any one time it could be possible for main memory, and all the caches, to have different values for the same memory location.
To demonstrate the effect of this problem, assume that we have a system with two processors, each with a write-back-based cache, but no measures taken to maintain cache coherency. Running on this system is a single program that increments a counter four times, with an initial value of zero for the counter. Assume that when the program starts, all caches are empty. I have also taken the liberty of making the situation as bad as possible by executing each counter increment on alternate CPUs each time. This is done purely to demonstrate the problem in an exaggerated way for clarity. The result is shown in Table 2.1.
Walking through the example, during the first increment, a cache miss occurs in CPU 1, and the value (0) is loaded from main memory and incremented to become the value 1. The next increment is executed on CPU 2, where a cache miss also occurs. Due to the use of write-back caching, the value in main memory is still 0, and so this is what is loaded into the cache of CPU 2 and incremented to become the value 1. Already the problem is evident: two increments of the counter have yielded a value of 1 in both the processor caches, and a value of zero in main memory. At this point, the system is in an inconsistent state.
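The walk-through above can be replayed with a small simulation. This is a minimal sketch, assuming a toy model in which each CPU has a private write-back cache represented by a dictionary and nothing keeps the caches coherent. The function name run_without_coherency is hypothetical. The first two steps match the description above. The last two follow from the same rules.

```python
def run_without_coherency(increments=4, num_cpus=2):
    """Increment one counter on alternating CPUs with private write-back caches."""
    main_memory = {"counter": 0}
    caches = [dict() for _ in range(num_cpus)]   # one private cache per CPU
    history = []

    for step in range(increments):
        cpu = step % num_cpus                    # alternate CPUs on every increment
        cache = caches[cpu]
        if "counter" not in cache:               # cache miss: load from main memory
            cache["counter"] = main_memory["counter"]
        cache["counter"] += 1                    # write-back: memory is not updated
        history.append((step + 1, cpu + 1,
                        [c.get("counter") for c in caches],
                        main_memory["counter"]))
    return history


for step, cpu, cached, memory in run_without_coherency():
    print(f"increment {step} on CPU {cpu}: caches={cached} memory={memory}")
```

After four increments, each cache holds the value 2 and main memory still holds 0, so no component of the system contains the correct value of 4.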
In order to prevent this from happening, the caches on the system must be kept coherent. To achieve this, and to support write-back caching, a fundamental rule must be put in place on the system: the value of a main memory location can be stale (old) in relation to caches, but caches cannot contain stale data. Once this rule is in place, memory and cache consistency can be maintained with some simple rules, normally implemented in hardware.
When a value is changed by a CPU, its cache contains the most recent value of the given memory location. Using the rule presented above, this means that any other caches in the system that contain an old version of this location must either
· Be sent the new value (known as write-update) in order to conform to the rule, or
· Have the cache line invalidated, enforcing a reload on the next use (write-invalidate)
A system designer will adopt one of these methods for exclusive use within the system.
Now that the caches are kept in sync, there is only the main memory that could potentially be out of step as a result of write-back caches.
If a dirty cache line has not yet been written back to main memory when another CPU needs the most recent value, the CPU cache containing the most recent value intervenes and initiates a cache-to-cache transfer of the most recent value from its cache, thus effecting the final piece in the cache consistency resolution.
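The effect of these rules on the four-increment counter example can be sketched as follows. This is a toy model of a write-invalidate scheme with cache-to-cache transfers for a single memory location. The CoherentSystem class and its fields are hypothetical names, and real hardware tracks state per cache line rather than with a single owner variable.

```python
class CoherentSystem:
    """Toy write-invalidate model with write-back caches for one memory location."""

    def __init__(self, num_cpus=2):
        self.memory = 0                       # main memory copy, allowed to be stale
        self.caches = [None] * num_cpus       # None means the line is absent or invalid
        self.dirty_owner = None               # CPU whose cache holds the newest value

    def read(self, cpu):
        if self.caches[cpu] is None:          # cache miss
            if self.dirty_owner is not None:
                # Cache-to-cache transfer from the cache holding the newest value.
                self.caches[cpu] = self.caches[self.dirty_owner]
            else:
                self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        # Write-invalidate: every other cached copy is discarded.
        for other in range(len(self.caches)):
            if other != cpu:
                self.caches[other] = None
        self.caches[cpu] = value
        self.dirty_owner = cpu                # main memory is now stale

    def increment(self, cpu):
        self.write(cpu, self.read(cpu) + 1)


system = CoherentSystem()
for step in range(4):
    system.increment(step % 2)                # alternate CPUs, as in the text's example
print(system.caches, system.memory)
```

Main memory is still stale at the end, which the rule permits, but the only valid cached copy holds the correct value of 4, and any CPU that reads the location receives it.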
The hardware can be designed in one of two different ways in order to be "aware" of the loads and stores in the system that will require write-updates, write-invalidates, or cache-to-cache transfers. The two methods of achieving this are called the snoopy bus protocol and the directory protocol.
The snoopy bus protocol (see Figure 2.5) is a broadcast-based method. This is the method used in virtually all bus-based symmetric multiprocessor (SMP) systems and represents the most straightforward technique. It uses the fact that all bus traffic is broadcast-based: all traffic is visible to all components. This being the case, logic is built into each component to "snoop" on the addresses of the memory loads and updates/invalidates occurring across the bus.
When an address is picked up on a snoop, it is compared with those addresses in cache and appropriate action is taken. In the case of a write-back cache picking up a load request for an address that is still dirty (unwritten to main memory) in its cache, it initiates a cache-to-cache transfer of the most recent value from its cache. In the case of a cache line invalidate, the corresponding cache line is marked invalid in the snooper's cache. As every message is broadcast, the entire bus is unusable during the operation, thus limiting the scalability of the bus as more processors are added.
The directory-based protocol (see Figure 2.6) is normally associated with distributed memory systems. Each "node" maintains a separate directory of which caches contain lines from the memory on that "node." Within this directory, a bit vector is kept for each line of main memory, reflecting which processors in the system have that line in their cache. The major reason for implementing such a scheme is to remove the need for a broadcast-based protocol, because broadcast-based protocols have been found to be inherently unscalable.
Using the directory protocol, the directory for a node (and thus its memory) is used to determine which caches need to be made consistent (through invalidate or update) in the event of a write. At this point the individual caches can be made consistent, using some kind of point-to-point communication rather than a broadcast, thus allowing other parts of the interconnect to be used concurrently. This is much more efficient than the broadcast protocol, where all traffic exists on all parts of the bus at all times; this is the reason that snoopy bus protocols do not scale very effectively.
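The bit vector kept by a directory can be illustrated with a short sketch. The Directory class, its method names, and the line address used below are hypothetical, and the model covers only the bookkeeping that decides which caches receive a point-to-point invalidate on a write.

```python
class Directory:
    """Per-node directory: one bit vector per memory line, one bit per processor."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.sharers = {}                     # line address -> integer bit vector

    def record_load(self, line, cpu):
        """Note that cpu now holds line in its cache."""
        self.sharers[line] = self.sharers.get(line, 0) | (1 << cpu)

    def record_write(self, line, writer):
        """Return the CPUs that must be invalidated, then leave only the writer set."""
        vector = self.sharers.get(line, 0)
        targets = [cpu for cpu in range(self.num_cpus)
                   if vector & (1 << cpu) and cpu != writer]
        self.sharers[line] = 1 << writer
        return targets


directory = Directory(num_cpus=8)
for cpu in (0, 3, 5):
    directory.record_load(line=0x40, cpu=cpu)
print(directory.record_write(line=0x40, writer=3))   # point-to-point targets only
```

In this example only CPUs 0 and 5 are contacted, while the other five processors and their links see no traffic at all, which is the property that a broadcast scheme cannot offer.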
2.2 Single Processor Architectures (Uniprocessors)
A uniprocessor machine (see Figure 2.7) has the simplest architecture of all UNIX servers. It includes a single processor plus cache, connected to some memory and I/O controllers through some kind of shared bus. In many implementations, there is a separate bus on the system for connection of I/O controllers and peripheral devices. Devices on this bus communicate with the CPU and memory bus by means of a bridge between the two busses.
As there is only one CPU in the system, there are no inter-CPU cache coherency issues to be handled by the operating system. This is not to say that there is only one bus master in the system; the system still needs to be concerned about the cache coherency issues arising as a result of DMA I/O controllers. This can be handled solely by a bus snooping device to invalidate the CPU cache lines where appropriate when DMA writes take place in main memory. This is much simpler to manage than having multiple processors all writing to shared memory areas and invalidating each other's caches.
In addition, no operating system locking is required and so there is no contention for kernel resources, because the operating system can guarantee that no other processors are executing kernel code. For example, only the single processor in the system could be writing to the process table at any one time.
However, it is still possible within a uniprocessor architecture for a kernel data structure to be corrupted through the processing of interrupts. When this happens, it is the same CPU that is corrupting the data structure, only operating in a different context. For example, if the CPU is performing a complex memory update in kernel mode, and a hardware interrupt is received (from an Ethernet card, for example), the CPU must stop processing and switch to interrupt handler code and process it. Within this interrupt handler could be a memory update to the same region that is partially completed, and therefore data corruption could be produced.
This eventuality is prevented through the use of interrupt priority levels, which represent the only type of short-term exclusion required on a uniprocessor system. This works by setting the current priority level of the processor to the highest level available on the system, effectively temporarily disabling interrupts. Once the critical update is complete, the interrupts can be reenabled by resetting the interrupt priority level to the prior value. This kind of temporary blocking of interrupts is adequate for kernel data protection in a uniprocessor system.
In reality, many vendors run the same version of the operating system on both uniprocessors and multiprocessors, therefore slightly invalidating this advantage. This also applies only to kernel mode data structures. Oracle must still provide its own user mode locking, because any process could be preempted from the processor before Oracle memory updates are complete.
In addition to the locking that is required for safe concurrent operation in a uniprocessor system, there is locking that is highly desirable for increased concurrency. Communicating with peripheral devices, for example, takes a comparatively long time. It is clearly not an option simply to ignore interrupts and be nonpreemptable for what could be several milliseconds, and so another method is required to allow the processor to be safely used by another process while the response from the peripheral device is outstanding. This is achieved through the use of long-term locks, which protect the data structures from corruption by subsequent processes but allow the kernel to be safely preempted by another process. This allows the overall throughput of the system to be unaffected by access to slower components.
So, in summary, the uniprocessor potentially requires only two types of locks: the interrupt-disabled, short-term exclusion method and the long-term, preemptable method. Both of these methods are considerably simpler to implement than the locking required in a scalable multiprocessor environment.
2.2.2 Oracle on Uniprocessors
Oracle is fundamentally not suited to a uniprocessor architecture. By its very nature as a multiple-process software architecture (DBWR, PMON, shadow processes, etc.), serious compromises in efficiency are made when these multiple processes are run in a uniprocessor environment. This is attributable to the required context switching (see Section 2.2.3) among all the processes in the Oracle RDBMS, even if the work is being submitted by a single process. There have been many different twists in the Oracle architecture that could optimize running on a uniprocessor system, most notably the single-task connection architecture designed to reduce the number of processes for the user connection, and the "single-process Oracle" architecture that combines all the functions into a single process (actually targeted for the MS-DOS platform). However, single-processor UNIX systems simply do not make very good database systems, regardless of what you do to them.
Finally, at the end of the day, a uniprocessor system can perform only one task at a time, however good it is at pretending otherwise. Therefore, only one person can parse an SQL statement, read a block from buffer, execute a PL/SQL package, and so on, at any one time. This inherent lack of concurrency in the uniprocessor makes it less than ideal in a multiuser environment.
2.2.3 Other Disadvantages
Nobody in their right mind would suggest building a large-scale Oracle database server on a uniprocessor server. There are numerous reasons why the day of the uniprocessor in the data center is very much over.
With only one processing engine in the system, heavy context switching ties up a larger percentage of available processor time in wait state when compared with an equal workload on a multiprocessor system. For example, if ten processes are executing on a uniprocessor system, each of the ten processes must be switched onto the processor, allowed to execute for the quantum period, and then switched off the processor once more. On a five-way multiprocessor system, each processor would perform exactly one-fifth as many context switches as the uniprocessor for the same ten processes, because there are only two processes competing for any one processor. Each of the context switches takes a finite amount of time, during which the processor cannot be executing any useful work. The time will be spent preserving and restoring the CPU registers, the stack pointer, the program counter, and the virtual memory mappings.
The increased context switching creates another problem. As all processes run on the same processor, there is, by implication, only one cache. This means that there are no cache affinity choices available to the operating system. Therefore, all processes are competing directly for the same cache lines, and, for the same number of executing processes, the cache warmth will always be less than that achievable on a multiprocessor system.
With only one processor in the system, it is obviously not scalable beyond the speed of a single CPU. Although the single processor could be upgraded, the system remains only as scalable as the single processor within it. Even with Moore's Law being achieved, this means that the system could only double in performance every 18 months, unlike a multiprocessor system, which may have the option of doubling its capacity every week (for a few weeks, at least).
Finally, the thing that absolutely kills the uniprocessor in the datacenter is that a single CPU failure impacts the entire business: the machine is unavailable until the hardware is physically replaced. That is, if a processor board expires, the system cannot be restarted until a hardware engineer has been called, has traveled to the site, and has installed the board. This takes around two hours even under the most comprehensive service contract. It is rare for a business to find this acceptable, especially because the failures always seem to happen during the peak period.
It can be seen that despite the potential advantages gained through reduced locking administration (which are very slight in comparison with modern lightweight, multithreaded SMP kernels), the disadvantages are severe. This makes a uniprocessor fundamentally unsuitable for high-end enterprise computing. However, there will always be a place for these machines in the low end of the market, in addition to their use as development platforms.
2.3 Symmetric Multiprocessors (SMPs)
A symmetric multiprocessor extends the capability of the uniprocessor by increasing the capacity of the system bus and allowing the addition of multiple processors and memory subsystems. All of the memory in the system is directly accessible by all processors as local memory, a configuration known as a tightly coupled configuration.
Examples: Sequent Symmetry, HP T-class, Siemens E600, Sun E6500, SGI Challenge
The most common SMP architecture is that shown in Figure 2.8: the shared global bus architecture, where all components exist on the same, shared bus. While currently the most common, this architecture is increasingly being replaced by either a crossbar-switch-type bus (discussed in Section 2.4) or the NUMA architecture (discussed in Section 2.7). However, we will start off with the shared-bus model, because it represents the most simplistic view.
The SMP architecture allows additional scalability to be gained through the addition of processor cards to the system. Each of the processors added can be used by the kernel for the processing of presented work. In this way, the effective bandwidth of the system is increased each time a processor is added.
The ability to add processing capacity to a system at will allows significant flexibility in operating an SMP system. It also allows far greater performance gains to be made than by purely upgrading the processor in the system to a faster one.
One thing needs to be noted here. I have just stated that SMP allows "performance gains to be made." It is important to clarify this point before we get any further into the book. Performance gains will not be achieved simply by adding more processors to the system. Rather, the processing bandwidth of the system is increased, allowing performance gains to be made.
For example, if a system with a single processor can run a given task in one hour, this task would take exactly the same time to complete on a ten-way SMP system with the same processors. The reason for this is that the task is serialized on one processor and the addition of extra processors will not make the serial stream of instructions complete any faster. In fact, it is likely to make it slightly worse, depending on the scalability of the SMP platform.
However, if the same task is split into ten pieces that can be run concurrently, the ten-way SMP system could provide close to ten times the "speed" of the single-processor system, if the system scaled efficiently. Likewise, if the single-processor system were running 1,000 processes at any one time, the addition of nine more processors would add significant computational bandwidth to the system and individual processes would wait less time before being scheduled to run. In fact, the run queue for the system would be only 10 percent of the depth of the uniprocessor system, and all of the processes would appear to "run faster," although they are simply being allowed to run more often.
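The distinction between a serialized task and one split into concurrent pieces can be put into numbers. The following is a minimal sketch that assumes perfect scaling, with no bus contention, locking, or scheduling overhead, which no real SMP achieves. The function names are hypothetical.

```python
import math


def elapsed_hours(total_work_hours, pieces, processors):
    """Elapsed time when work is split into equal pieces run on identical processors.

    Assumes perfect scaling: no bus contention, locking, or scheduling overhead.
    """
    piece_hours = total_work_hours / pieces
    rounds = math.ceil(pieces / processors)   # pieces each processor runs in turn
    return piece_hours * rounds


def run_queue_depth_per_processor(runnable, processors):
    """Average number of runnable processes waiting on each processor."""
    return runnable / processors


print(elapsed_hours(1.0, pieces=1, processors=1))    # serial task, one CPU
print(elapsed_hours(1.0, pieces=1, processors=10))   # serial task, ten CPUs
print(elapsed_hours(1.0, pieces=10, processors=10))  # ten pieces, ten CPUs
print(run_queue_depth_per_processor(1000, 10) / run_queue_depth_per_processor(1000, 1))
```

The serial task takes one hour on either machine, the ten-piece version finishes in a tenth of that time on the ten-way system, and the per-processor run queue for 1,000 processes shrinks to 10 percent of the uniprocessor depth.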
The final advantage of an SMP architecture (whether it has a single shared bus or otherwise) is that the programming model does not change from the user (non-UNIX/Oracle kernel programmer) perspective. The system is described as having a single system image or a single instance of the operating system. This is a very important point, in that it allowed the SMP architecture to be introduced in a virtually transparent way, because all the existing user code ran without change. It also means that the programming model remains very simplistic and allows easy programming on a multiprocessor system. This contrasts with the massively parallel processor (MPP) model (see Section 2.6), where the programmer needs to be personally aware of the hardware configuration and is required to assign processing tasks to the processors individually.
In order to make an SMP system scale effectively, the system designer faces several challenges. Ultimately, a large number of these challenges need to be met by the operating system engineer.
The presence of multiple processors in the system automatically gives us multiple bus masters, as discussed at the beginning of this chapter. It also gives us multiple caches that need to be maintained, as discussed in Section 2.1.4. Finally, it opens up a whole bunch of load balancing and affinity options to the process scheduler within the UNIX kernel.
We have already discussed the first two of these issues (multiple bus masters and caches), but what about the last ones (load balancing and affinity)? Really, these two things are fairly closely related. When a process becomes runnable on the system, the kernel must associate the process with a processor and enable it to run. With a uniprocessor system, there is only one choice of where this should go, because there is only one processor. On an SMP system, however, the kernel needs to make a choice as to which processor (or engine) is optimum for executing that particular process.
The most important thing that the kernel must do is to ensure that the work on the system is well distributed across the engines. For example, if there is enough work to keep all of the processors busy, then all of the processors, rather than just a subset of them, should be kept busy. This is the load-balancing part of the problem.
Next, the kernel must decide whether it is worth attempting to get the process assigned to any particular processor. This part of the problem may seem a little strange at first, but there is good reason to assign a process back to the last processor on which it ran, provided that there is some benefit to be gained from the residual cache warmth from the last time the process ran. The kernel can make educated guesses as to whether this is the case, based on the number of processes that have executed on that CPU since this one last ran. This kind of processor preference is known as cache affinity, because the process has an affinity for one particular cache on the system.
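The interplay of load balancing and cache affinity might be sketched as a simple placement function. This is purely illustrative and does not represent any vendor's scheduler: the function choose_processor, its parameters, and the warmth_limit threshold are all hypothetical, and the "educated guess" about cache warmth is reduced to counting the processes that have run on the CPU since this one left it.

```python
def choose_processor(last_cpu, run_queue_lengths, ran_since, warmth_limit=3):
    """Pick a CPU for a newly runnable process.

    last_cpu          -- CPU the process last ran on, or None if it never ran
    run_queue_lengths -- current queue length per CPU (load-balancing input)
    ran_since         -- processes that ran on last_cpu since this one left it
    warmth_limit      -- hypothetical threshold beyond which the cache is assumed cold
    """
    least_loaded = min(range(len(run_queue_lengths)),
                       key=run_queue_lengths.__getitem__)
    if last_cpu is None or ran_since > warmth_limit:
        return least_loaded                  # no useful cache warmth: balance load
    # Honour affinity only if it does not leave another CPU idle.
    if run_queue_lengths[last_cpu] > 0 and run_queue_lengths[least_loaded] == 0:
        return least_loaded
    return last_cpu


print(choose_processor(2, [1, 1, 1, 1], ran_since=1))    # warm cache, even load
print(choose_processor(2, [1, 0, 1, 1], ran_since=1))    # an idle CPU wins
print(choose_processor(2, [3, 0, 0, 3], ran_since=10))   # cache assumed cold
```

In this sketch load balancing takes priority over affinity whenever a processor would otherwise sit idle, reflecting the ordering of priorities described above.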
Modern UNIX kernels are very good at managing the administration burden of multiple processors. The ability of the kernel to handle the additional administration directly affects the scalability of the system, and some vendors are better than others at achieving this.
2.3.3 Oracle on SMP Architectures
The Oracle architecture is probably most suited to an SMP-like hardware architecture. This is no accident, because SMP has made Oracle a datacenter reality for large-scale systems. The multiprocess architecture fits perfectly into a system with multiple processors to run these processes on.
The fast, uniform access to main memory found in SMP architectures allows Oracle to use bus primitives for implementing very fast latch operations (i.e., test and set). This is vital for operation of very large single-instance implementations of Oracle, because all operations share common latches. While multi-instance implementations of Oracle (using Oracle Parallel Server) are not subject to the same intense contention for the same latches, they are subject to contention on significantly slower methods of synchronization than those used in a single-instance configuration.
However, for reasons that are separate from any bandwidth issues associated with the bus, SMP still has inherent limitations for scaling Oracle infinitely. For the same reason that SMP architectures are good Oracle platforms, they also have a very definite upper boundary: the speed of latch operations.
For a latch operation to occur, a CPU must use test and set or a similar instruction to attempt to gain control over the latch (a value in shared memory). Only when the control is gained can the processor perform the real work that it is required to do. If the memory location is in the cache of other CPUs, then the processor and/or operating system must ensure that the caches are kept coherent on those other processors.
If a large number of user sessions require a common latch, it becomes more and more likely that all processing engines in the system will be executing test and set against the memory location. Every time the latch value changes, all of these operations need to be serialized by the bus arbiter, and so there is a finite number of latch operations possible on any SMP system. Therein lies the inherent limitation. At the end of the day, it really doesn't matter if the bus has another 10GB/s of bandwidth left unused if the system spends all of its time serializing down the bus on required latch operations.
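The serialization effect can be illustrated with a small simulation. The Python sketch below is only a model: real latches rely on a hardware test-and-set instruction, which is imitated here with a `threading.Lock` standing in for the atomicity that the bus provides. The class and function names are hypothetical and are not Oracle's.

```python
import threading
import time


class Latch:
    """Spin latch built on a simulated atomic test-and-set."""

    def __init__(self):
        self._value = 0                  # 0 = free, 1 = held (the word in shared memory)
        self._atomic = threading.Lock()  # stands in for the hardware's atomicity guarantee
        self.attempts = 0                # every test-and-set issued, successful or not

    def _test_and_set(self):
        # Atomically return the old value and set the word to 1
        with self._atomic:
            self.attempts += 1
            old = self._value
            self._value = 1
            return old

    def get(self):
        # Spin until the old value was 0, meaning this caller now holds the latch
        while self._test_and_set() == 1:
            time.sleep(0)  # yield so the simulation does not hog the interpreter

    def free(self):
        with self._atomic:
            self._value = 0


def run_workers(n_threads=4, iterations=200):
    """Have several threads contend for one latch; return (work done, test-and-sets)."""
    latch = Latch()
    counter = {"n": 0}

    def worker():
        for _ in range(iterations):
            latch.get()
            counter["n"] += 1  # the "real work" that the latch protects
            latch.free()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["n"], latch.attempts
```

The number of test-and-set attempts is never lower than the amount of useful work done, and every failed attempt is still an operation that must take its turn. That is the finite budget referred to above.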
All this being said, being bound by the speed of a latch is not as big a problem as it used to be, because Oracle has spent considerable time breaking nearly all of the single latches out into multiple latches, thus reducing the actual contention (it's the contention that hurts most). As more and more vendors produce nonuniform memory access (NUMA) architectures (see Section 2.7), Oracle is likely to start making more and more private latch allocations, in addition to implementing algorithmic alternatives to taking out the lock in the first place.
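The benefit of breaking one latch into several can be shown with a few lines of Python. This is a sketch under stated assumptions: the mapping of a resource to a child latch by hashing is hypothetical and is not a description of how Oracle actually assigns its latches.

```python
import zlib
from collections import Counter


def latch_for(resource_id, n_latches):
    """Map a resource to one of n_latches child latches (hypothetical hashing scheme)."""
    return zlib.crc32(str(resource_id).encode()) % n_latches


def busiest_share(resource_ids, n_latches):
    """Fraction of all latch requests that land on the most contended latch."""
    ids = list(resource_ids)
    hits = Counter(latch_for(r, n_latches) for r in ids)
    return max(hits.values()) / len(ids)
```

With a single latch, `busiest_share(range(1000), 1)` is 1.0: every request fights for the same memory location. With eight child latches the busiest one sees only a fraction of the requests, which is where the reduction in contention comes from.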
2.3.4 Shared-Bus Limitations
Due to all the system traffic going across a single shared bus, the system is limited by the capacity of the bus, in addition to the rate of arbitration and the capabilities of any bus snooping logic. As discussed earlier, the bus itself is limited by physics, and the upper limit on the number of processors that can be usefully attached is something of a religious war in the SMP industry. Many SMP designers believe that building an SMP of more than eight processors is a waste of time. Others can demonstrate reasonable scalability of SMPs of 16 to 20 processors. The real answer is very dependent on the way the bus is designed, the speed of the individual processors, the effectiveness of the cache, and so on, and there is no simple answer.
The one thing that is certain, however, is that the upper bounds for simplistic SMP scaling using a shared bus are pretty much at an end as of today. This is affirmed by the hardware vendors, who are all converting to some alternative type of SMP-like technology, using either NUMA, which we cover in Section 2.7, or some kind of intelligent, point-to-point interconnect fabric to replace the global shared bus.
The shared-bus SMP architecture is the architecture that really changed the viability of UNIX-based Oracle database servers to deliver world-class, high-performance solutions. Although a decreasing number of hardware manufacturers still produce vanilla SMP architectures of this design, virtually all of the current products present that same, or a very similar, programming model.
One of the approaches taken to extend the useful life of a pure SMP architecture is to provide greater bandwidth and more interconnect concurrency through the use of some kind of point-to-point interconnect between components (see Figure 2.9).
Examples: Sun E10000 (a.k.a. Starfire), HP V2500
The actual architecture of the interconnect fabric varies slightly among systems, but the common goal of the systems is to allow point-to-point communication among all of the processors and memory in the system. This allows for full connection bandwidth between any two points at any time, in addition to increasing the concurrency of the "bus" as a result of multiple connects being active at any one time (remember that the shared global bus allows only one at a time).
The typical sustainable bandwidth for such systems is in the 16GB/s range, with each connection capable of around 2GB/s.
Figure 2.9 shows a schematic view of such a system, with all of the components connected to one another. To my knowledge, there isn't actually a real system out there that looks precisely like the one in this example, but I took interesting features from all of them and made a more useful example.
The first thing to note is that my system is constructed from 16 system boards, all interconnected to one another. The actual method of interconnection demonstrated here is direct point-to-point connection, which was chosen for reasons of path demonstration rather than to show any particular technology. Most likely, the components would be interconnected by some kind of crossbar or other switch fabric (see Figure 2.10).
On each of these boards resides a single processor, a processor cache, cache coherency logic, and either a portion of system memory or a bus converter and I/O controllers. The trade-off between the memory controller and the I/O controllers was done on purpose to make the hypothetical system have the same kind of trade-offs as those of a real system.
For the sake of clarity in the illustration, none of the I/O controllers has been connected to any disk. Just assume that half of these controllers are connected to different disks in the disk cabinet and that the other half are providing alternate I/O paths to the same disks for the sake of redundancy. None of the I/O controllers is "private" to the board that it resides on: all I/O controllers are shared among all processors, as is the memory.
The important omission from the illustration is that of some kind of cache coherency mechanism. With point-to-point communication within the system, the standard SMP bus snooping method no longer works, because there is no shared bus. There are two main approaches to providing the solution to this requirement: a broadcast-based address-only bus with a snoopy bus protocol, and a directory-based coherency protocol designed for point-to-point communication.
By building a global address-only bus, systems such as the Sun E10000 benefit from the simplicity of the snoopy bus protocol. Greater scalability is achieved when compared with a full shared (address-plus-data) bus, because the system does not need to know the details of the activity (the data itself), just the locations. This allows the most appropriate component, be it a memory controller or the cache of another CPU, to service requests for memory, and allows cache lines to be invalidated accordingly when memory locations are changed by another CPU. It is also possible for a system to have multiple address busses in addition to multiple data connections, because an address bus is very readily partitioned by address ranges into any number of subaddress busses.
All access to other components, whether or not they are resident on the same system board, follows the same logic path. This ensures that all the access times to any particular component from anywhere else in the system are the same (uniform). This kind of fair sharing of access is critical in the design of these systems, because software changes need to be made as soon as the memory access becomes nonuniform. It is easy to see how a CPU with faster access to a certain piece of memory could "starve" the other CPUs on the system from accessing that component through unfair speed of access, a bit like having a football player on the field who can run ten times faster than all of the other players. This subject will be covered in more detail in Section 2.7, where it will become necessary to worry about such things.
As far as the operating system and Oracle are concerned, a crossbar-type system is a shared-bus system. The changes in operation all occur in the hardware itself and are therefore transparent. Therefore, the introduction of such a system does not impose as great a risk to stability and performance as does a more radical one such as MPP or NUMA.
The increase in bandwidth of the "bus" allows for immediate performance improvements in non-latch-intensive workloads. The increase in concurrency, and the ability to have multiple address busses and therefore arbitration bandwidth, allows for greater performance in latch-intensive workloads.
This type of SMP system has bought the SMP architecture some time. The introduction of multiple interconnects has significantly increased the performance capability of the SMP software model, and this model probably has some way to go before it is exhausted.
However, scaling of SMP beyond the capabilities of crossbar SMP is a difficult task. For example, if we already have point-to-point data connections that are running as fast as physics will allow, how do we increase the throughput of the system? At some stage, the open system world needs to break out of the SMP mold once and for all, but until the software support is there, this cannot happen.
Clustered SMP configurations extend the capabilities of SMP further by clustering multiple SMP nodes together, as shown in Figure 2.11.

A cluster is also known as a loosely coupled configuration, in that the systems that are clustered have no direct access to each other's memory or CPUs. All the systems run a discrete copy of the operating system and maintain a memory map containing only local memory addresses. The systems are coupled by a low-latency interconnect between all the systems and share only message information, which is why the coupling is termed "loose."
With this coupling, the systems can perform intersystem coordination but not much else because of the limited amount of information that they share. However, when connections are made from each system to a shared disk array, the usefulness of the cluster becomes far greater. Now the systems can share common data and coordinate the sharing through the use of messages passed over the interconnect.
In the UNIX world, a cluster is currently useful for one or both of the following:
· Provision of high availability
· Oracle Parallel Server (OPS)
A high-availability cluster is one in which the cluster interconnect is used solely as a "heartbeat" monitor between the systems. When the heartbeat from one of the systems ceases as a result of system failure, another system takes appropriate action to support the workload of the deceased system. This could include mounting of the dead system's file systems (thus the need for a shared disk), a change of IP address to make the system look like its peer, and the start-up of software subsystems.
There is no concurrent sharing of data in a high-availability solution, only sharing of mortality information.
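The heartbeat logic can be sketched in a few lines of Python. This is an illustration only and does not reflect any particular vendor's clustering product: the timeout, the function names, and the layout of the `resources` dictionary are all hypothetical.

```python
def failed_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose most recent heartbeat is older than timeout seconds.

    last_heartbeat -- dict mapping node name to the time its last heartbeat arrived
    """
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def takeover_steps(resources):
    """List the actions a surviving node takes on behalf of a dead peer.

    resources -- hypothetical description of what the dead node was providing
    """
    # The file systems live on the shared disk, so the survivor can mount them
    steps = ["mount " + fs for fs in resources["file_systems"]]
    # Taking over the address makes the survivor look like its peer
    steps.append("take over IP address " + resources["ip_address"])
    steps += ["start " + name for name in resources["subsystems"]]
    return steps
```

Note that nothing in this sketch shares data between live systems; the only information exchanged is whether each peer is still alive.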
The Oracle Parallel Server (OPS) cluster is more complex. In this type of cluster, full access is granted to the shared disk from all systems via the database. Note that it is still only the database that is shared, not file systems or any other UNIX resources. Any system in the cluster can bring up an Oracle instance and use it as required.
Clearly, some kind of synchronization is required to prevent all the Oracle instances from writing all over the same pieces of disk and to maintain the same consistent view of the data. This synchronization is provided by the Distributed Lock Manager, or DLM.
The DLM is a piece of software that was, for Oracle7, written by the hardware vendor to provide the interface between Oracle and the hardware communications. For Oracle8, the DLM is incorporated into the Oracle product itself, and the vendor-specific lock manager has been retired.
The specifics of the DLM and how it integrates into Oracle will be covered in Chapter 6, but in order to get the basics right here, we can assume that whenever Oracle needs to read a block from disk, it checks in with the DLM. The DLM will send a request to any other Oracle instances that have the block and will ensure that no outstanding writes are pending by flushing the block from cache to disk (this is known as a ping). The requesting system will then be notified that the block is safe to read from disk, and normal operations will continue. This is a dramatic simplification of a complex issue, but should suffice to demonstrate the basic operation of an OPS cluster.
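The simplified read path just described can be modeled in Python. This toy model follows only the simplification given in the text and is in no way the real DLM interface: the `Instance` and `ToyDLM` classes and their attributes are hypothetical names invented for the illustration.

```python
class Instance:
    """One Oracle instance in the cluster, reduced to its block cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # block number -> (contents, dirty flag)


class ToyDLM:
    """Toy lock manager that coordinates reads of blocks on a shared disk."""

    def __init__(self, disk):
        self.disk = disk      # shared disk: block number -> contents
        self.instances = []   # every instance in the cluster
        self.pings = 0        # number of forced cache-to-disk flushes

    def read_block(self, requester, block):
        # Ask every other instance holding the block to flush any pending write
        for inst in self.instances:
            if inst is requester:
                continue
            entry = inst.cache.get(block)
            if entry is not None and entry[1]:
                self.disk[block] = entry[0]            # the ping: flush to disk
                inst.cache[block] = (entry[0], False)  # the copy is now clean
                self.pings += 1
        # Only now is the block safe to read from disk
        contents = self.disk[block]
        requester.cache[block] = (contents, False)
        return contents
```

Every ping in this model costs a disk write on one node and a disk read on another, which is why block-level contention between nodes is so expensive compared with a latch operation inside a single instance.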
The DLM lock management is not transparent to the application, whatever people may try to tell you. Actually, it's fair to say that enough people have been severely burned by that claim that it has become a rare and uninformed statement.
A good deal of work is required to alter the design of the application to operate in this type of configuration. Specifically, anything that could cause communication with other nodes in the cluster by means of the DLM should be minimized wherever possible within the application. Examples of this are
· "Hot block" data access between the nodes, causing excessive pinging
Specifically, a successful SMP cluster implementation will demonstrate good partitioning within the application. This could mean different areas of the application getting processed on different nodes or, even better, "private" data access per node through the use of a transaction processing (TP) monitor.
The implementation of a clustered system is a special requirement. This is not to say that you shouldn't do it: if you are building a very large scalable system, a clustered system of one of the types described above becomes a necessity. It's important, however, that you choose the most suitable method of clustering for your needs.
When deciding which type of clustering to use, you should keep your choice as simple as possible. Unless it is absolutely necessary, an OPS cluster should not be considered. The difference between the two methods of clustering is acute: high-availability clustering is reasonably simple whereas OPS clustering is very complex. This should be prime in your mind when making the decision.
One particularly bad reason for using OPS is for scalability, when the other option is to buy a larger machine. My advice on this is simple: although buying a second, smaller machine may be the most attractive proposition from a cost perspective, just remember that cost comes in many different forms. For example, don't forget to factor in extensive application modification (perhaps a total redesign in extreme cases), a large team of very experienced DBAs, and potentially more unplanned downtime owing to complex, multinode problems.
That having been said, the very best reason for going the OPS route is when you need scalability, and I really do mean need. When there is no system big enough to house the entire database processing requirement, OPS is the only way to go. Just remember that it will be an expensive ride. Do your homework, and never, ever, underestimate the undertaking.
2.6 Massively Parallel Processors (MPPs)
A massively parallel processor (MPP) is known as a shared-nothing configuration (see Figure 2.12).
Examples: Siemens RM1000, nCube, IBM SP/2
Despite the grand title, there is nothing very complex about the architecture of an MPP system. Traditionally, the building block of an MPP system has been a uniprocessor node, comprising a single CPU, some memory (16 to 1024MB), a proprietary interconnect, and a finite number of private disks.
Note that all of the disk is private; this is the reason that MPP systems are called "shared-nothing" configurations. Not only does each node run its own instance of the OS, but it does not even physically share any of its disk with other nodes. Think of an MPP system as a bunch of uniprocessors in a network, and you are pretty close to the truth of things.
The sole distinguishing feature of an MPP system (and the only one that makes it more than just "a bunch of uniprocessors in a network") is the proprietary interconnect between the nodes and the software layers that use it.
Each node runs a private copy of the OS, and thus MPP systems are comparatively simple to produce, because there are no cache coherency issues to deal with. Each processor has its own memory, its own disk controllers, and its own bus.
Typically, MPP is considered when high system bandwidth and good scalability up to potentially hundreds of nodes are required. The reason that MPP is well suited to such applications is that the nodes do not interfere with each other's resources and thus can scale very effectively, depending on the type of workload.
Workload is the key: MPP is suited to workloads that can be partitioned among the nodes in a dedicated and static fashion, with only messages (as opposed to data) traveling across the interconnect among the nodes. In this way, almost 100 percent scalability is possible because there is no data sharing among nodes. Examples of workloads that have been scaled effectively on MPP systems include
· Digital video streams for video on demand
· Complex mathematical problems
· Some decision support/data warehouse applications
Only some decision support workloads scale effectively, because they are often prone to data skew, thus breaking the static partitioning rule. For example, if a parallel query is to scale effectively, it is important that all the processors in the system are processing the query. In a shared-nothing environment, this would mean that the pieces of database that are to be scanned should be evenly distributed across all the nodes in the MPP system. As soon as the physical distribution of the data changes, the system is suffering from data skew.
As the data is skewed in this way, the query can be serviced only by a subset of the nodes in the system, because in a 100 percent shared-nothing environment a node can process only data that it is physically storing.
As an example, let's assume that there is a table of historical information that is populated by daily feeds and purged by date range, with the oldest first. When it was first loaded up, there was an even spread of last name values in the table, and so it was decided that the data would be loaded onto the system in "last name" partitions. Each range of names would be stored locally on the respective nodes, as shown in Figure 2.13.
When the first queries are run (full table scans, in parallel), all nodes are busy. However, as time goes on, old data is deleted and new data is added. After a few weeks, the data distribution looks like that shown in Table 2.2.
If the same query were to be run against the new data distribution, node 4 would finish 40 times faster than node 2 and then would sit idle while the other nodes continued processing.
In fact, this particular problem is easy to fix, because the partitioning criterion (last name) was clearly incorrect even from a logical standpoint. However, unlike this very simplistic example, a real data warehouse has many hundreds of tables, or many tables of many billion rows each. This is a very serious problem, and one that is difficult to resolve without significant periodic downtime.
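The cost of skew in a shared-nothing scan is easy to quantify. In the Python sketch below, the row counts and scan rate are invented for illustration and are not the figures from Table 2.2; they are chosen only so that node 2 holds 40 times as much data as node 4, matching the ratio in the example.

```python
def scan_times(rows_per_node, rows_per_second):
    """Seconds each node needs to scan its own local partition."""
    return [rows / rows_per_second for rows in rows_per_node]


def parallel_efficiency(rows_per_node):
    """Average node utilisation until the slowest node finishes.

    The query ends only when the most heavily loaded node is done, so the
    elapsed time is set by the largest partition.
    """
    return sum(rows_per_node) / (len(rows_per_node) * max(rows_per_node))


# Hypothetical distribution after a few weeks of loads and purges (nodes 1 to 4)
skewed = [1_000_000, 4_000_000, 1_500_000, 100_000]
times = scan_times(skewed, rows_per_second=1_000)
```

With these assumed numbers, node 4 finishes in 100 seconds and then idles while node 2 works for 4,000 seconds, and `parallel_efficiency(skewed)` comes out at about 0.41. An evenly spread table would score 1.0.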
Examples of workloads that definitely do not scale effectively across MPP systems (no matter what anybody tries to tell you) are
Hopefully this is clear. Transactional workloads do not scale well on an MPP architecture, because it is very unusual to be able to partition the workload beyond a handful of nodes. For the same reasons that make it difficult to write scalable applications for clustered SMP systems running Oracle Parallel Server, writing them for MPP systems is harder still. When there are tens or hundreds of nodes out there, it is virtually impossible to make a logical partition in the application that corresponds to the physical partitioning of the system.
2.6.2 Oracle on MPP Systems
You might have been wondering how, on a system that consists of physically private disk, Oracle can be run as a single database. This is a valid question, because one of the clear assumptions that the Oracle architecture makes is that it resides on a system with the entire database accessible through standard I/O system calls. Oracle is often termed a shared-disk database for this reason.
The answer is that Oracle can't be run as a single database. Or rather it can, but not without help from the operating system. Without this special help, Oracle would not be able to open many instances across all the nodes, because each node would be able to see only a tiny portion of the database. Most nodes would not even be able to open the SYSTEM tablespace.
The special support that the operating system provides is a Virtual Disk Layer within the kernel. This layer of software traps all I/O destined for remote disk and performs the I/O using remote calls to the node that is physically connected to the required disk. The data is then passed over the interconnect and returned to Oracle as if it had just been read from a local disk. At this point, the system looks like a very large cluster, with shared disk on all nodes.
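A virtual disk layer of this kind can be modeled very simply. The Python sketch below is an assumption-laden illustration rather than any vendor's kernel code: the class names, the `owner_of` map, and the byte counter are all hypothetical.

```python
class Node:
    """One MPP node with its physically private disk."""

    def __init__(self, node_id, local_blocks):
        self.node_id = node_id
        self.local_blocks = local_blocks  # block number -> contents on this node's disk


class VirtualDiskLayer:
    """Make every block readable from every node, as if the disk were shared."""

    def __init__(self, nodes, owner_of):
        self.nodes = {n.node_id: n for n in nodes}
        self.owner_of = owner_of           # block number -> id of the node that stores it
        self.bytes_over_interconnect = 0   # data (not just messages) shipped between nodes

    def read(self, node_id, block):
        owner = self.owner_of[block]
        data = self.nodes[owner].local_blocks[block]
        if owner != node_id:
            # Remote read: the owning node performs the I/O and ships the data back
            self.bytes_over_interconnect += len(data)
        return data
```

The caller cannot tell a local read from a remote one, which is exactly the point. The only visible difference is the growing `bytes_over_interconnect` figure.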
You might be thinking that this is a terrible, nonscalable thing to do, and you would be correct. Passing data over the interconnect breaks the fundamental rule of MPP scaling: pass only messages over the interconnect. In this way, the scenario described above, with the data skew, changes form a little on an Oracle database hosted on an MPP system. Instead of nodes becoming idle in the event of a data skew, Oracle does a fairly good job of keeping them all busy by reassigning the query slaves when they become idle. However, when this is done, it is likely that the "local disk affinity" aspect is then destroyed. The local affinity aspect is where Oracle attempts to execute the query slaves on the nodes with the most preferential disk transfer rates: the local nodes. When Oracle can no longer maintain this affinity because the local nodes are still busy, the data needs to be read over the interconnect. So, the system is kept busy, but the system is potentially more limited than an SMP system, because the interconnect does not have as much bandwidth as an SMP interconnect.
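The trade-off between keeping nodes busy and preserving local disk affinity can be sketched as follows. This is not Oracle's actual slave allocation algorithm, which the text does not describe: the function, its arguments, and the "help the most loaded node" rule are assumptions chosen to illustrate the idea.

```python
def assign_slaves(partitions_left, idle_nodes):
    """Give each idle node one partition to scan, preferring locally stored data.

    partitions_left -- dict: node id -> number of unscanned partitions on its disk
    idle_nodes      -- node ids that currently have a free query slave
    Returns a list of (scanning node, node storing the data, "local" or "remote").
    """
    remaining = dict(partitions_left)
    assignments = []
    for node in idle_nodes:
        if remaining.get(node, 0) > 0:
            owner = node  # local disk affinity preserved
        else:
            candidates = [n for n, left in remaining.items() if left > 0]
            if not candidates:
                break     # nothing left to scan anywhere
            # Assumed rule: help whichever node has the most work outstanding
            owner = max(candidates, key=lambda n: remaining[n])
        remaining[owner] -= 1
        kind = "local" if owner == node else "remote"
        assignments.append((node, owner, kind))
    return assignments
```

For instance, `assign_slaves({1: 0, 2: 3, 3: 1}, [1, 2, 3])` keeps all three nodes busy, but node 1 has to read node 2's data across the interconnect, so one of the three scans is marked "remote".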
The final requirement for running Oracle on an MPP system is a DLM. The DLM provides exactly the same services as on a clustered SMP system, because the system is effectively a large cluster. However, the cluster software needs to be more robust than ever, because the number of nodes to manage (32 to 1,024) is dramatically larger than in a clustered SMP system (typically a maximum of eight).
The use of MPP architecture has always been a niche piece of even the high-end marketplace. Although many customers have been sold down the river by their salespersons, the word is now getting out that MPP is definitely not the way to go if
· You need fewer than 32 processors of total computing capacity
· You do not have sufficient resources to manage an n-node system, where a separate instance of the operating system and a separate instance of Oracle need to be maintained on each node
· You have a transactional system
With the large leaps in SMP bandwidth and, potentially more significantly, with the introduction of NUMA, the MPP is forced even more into a corner. It is my opinion that MPP had a single shot at the DSS goal, as the bus-based SMPs ran out of steam, but failed to make a big impression owing to the management overhead of implementing such a system.
2.7 Cache Coherent Nonuniform Memory Access (ccNUMA)
ccNUMA (NUMA from now on) could be described as a cross between an SMP and an MPP system, in that it looks like an SMP to the user and to the nonkernel programmer, and yet is constructed of building blocks like an MPP system. In Figure 2.14, the building block comprises four processors, similar to the Sequent NUMA-Q 2000 system.
Examples: Sequent NUMA-Q 2000, Silicon Graphics Origin, Data General ccNUMA
These building blocks are not nodes, at least not by my definition. My definition of a node is an atomic entity that executes an operating system, and this does not apply to the building blocks of a NUMA system. Different manufacturers use different names for their building blocks (see Table 2.3).
For ease of explanation, I will adopt the Sequent NUMA-Q 2000 system as the primary example for the description, with a breakout later on to describe the key differences between the Sequent approach and the SGI (formerly Silicon Graphics, Inc.) approach found on the Origin platform.