
The most fundamental level of UNIX is the kernel. User programs execute under the protection of the kernel and use its services. The kernel provides a standard interface to the system hardware and provides standard services over and above that to ease the process of developing and executing software on the hardware platform.
I must apologize for Figure 7.1. This kind of representation never helped me when I went through the familiarization process myself. That having been said, it does provide a concise view of the structure of a UNIX system, and hopefully a little verbiage will make it clear.
This diagram is supposed to represent the various layers in a UNIX system. Going from bottom to top, the software engineering effort becomes less complex and less shared. For example, every program in the system uses the virtual memory subsystem, although this is a very complex part of the system. It is a sharable part of the system, but each of its users is protected from the inherent complexity by the layers below it. This is the overriding mission of the kernel: to protect users from each other, and from the complexity of the system.
Starting at the lowest level, we have the platform itself, or the hardware domain. Included in this domain are all the physical attributes that need to be taken care of in order to execute software on the system. Some of these things are partially taken care of by the hardware but still require explicit action from the software on the system to instruct the hardware how it should, for example, ensure that the CPU cache reflects valid data for the next process to use.
Moving up the stack, we get to the core kernel services. This layer ensures that all layers above it are taken care of, in addition to providing standard interfaces to the hardware domain. For example, one of the overriding concepts of the UNIX model is that everything is a file. The core kernel services take care of this interface, providing the upper layers with a way of viewing (nearly) all hardware objects as a linear file.
This layer also provides the process abstraction. This is essentially how executing programs are handled by the system, providing each process with the illusion that it is operating independently on private hardware. It also takes care of isolation between these processes, ensuring that one process cannot corrupt the execution environment of another in any way.
In addition, the kernel also provides value-added services that can be used by the user programs. These are software modules that are not essential for the operation of the system but provide a more usable interface for the user. A prime example of this is the availability of many filesystems. A filesystem does not contribute directly to the running of the system but is more than just a simple service. It must have access to kernel memory for the sharing of the filesystem buffer cache and must have fast access to the hardware itself.
Everything else on the stack is user code. That is, it is not directly associated with the kernel. The C libraries are reusable software modules that are used to aid the rapid construction of user software. Note that Oracle is associated with the user processes at the very top of the model.
This is probably the highest-level view of what the kernel does; now it is time to take a look at how this is achieved from an implementation perspective.
The UNIX operating system operates in two distinct modes: kernel and user. The kernel does not do anything mysterious; it is just software like the rest of the system:
It is executed on the same processors as the user software. However, the kernel has special duties and normally operates with a privileged status from a hardware perspective, able to access and modify addresses and registers that user mode cannot. This mode of execution is called kernel mode.
· System housekeeping occurs, such as time-slicing and servicing interrupts (implicit kernel processing).
Starting with the user process, let's have a look at UNIX from a user perspective in order to see where the kernel fits in.
A user connects to the UNIX system by some mechanism, such as a login process for a shell user or an SQL*Net connection for a database user. Whichever way the user connects, the net result is a process on the system. This process is the executing context of a user program, and it is supported by the kernel with an operating environment. The environment for a process consists of
The private memory map is a set of memory pages over which the user has exclusive rights. This memory map is actually a virtual memory map, which will be discussed in Section 7.3.2. For now, it is enough to think of this memory as a contiguous block of memory that is private to the user process. The block of memory is composed as shown in Figure 7.2.
The lower portion of the memory map is accessible to the process, and the upper half of the map is a shared mapping of the kernel memory. All user processes have the kernel memory in their address spaces but cannot access it. In 32-bit UNIX implementations, it is common for the lower 2GB of the 32-bit address range to be used for user memory and the upper 2GB for kernel memory.
The "program memory" in Figure 7.2 refers to all memory used to execute the user program, excluding the stack. This memory is further divided into as many as five types:
The program stack is typically started at the top of the user address space, growing downward, while the program memory starts at the base and grows upward.
If this process is executing a program that operates only within that private memory map, that is, one that never requests more memory (program memory or stack), never causes an exception, and never needs to do any I/O, then the process will never explicitly switch into kernel mode.
Considering that any kind of output, including screen output, is I/O, this hypothetical program would not be very useful to anybody. In order to perform I/O, however, the process needs to access the hardware of the system, which it cannot do in user mode. In this case, the process will execute a system call, which invokes a few actions on the system and results in the process running in kernel mode. This mode of execution is considered as process context kernel mode.
In modern systems, kernel mode is a hardware state invoked by a hardware trap on the processor. This puts the processor into kernel mode and allows the executing process to perform the required privileged operations. In order to continue, the kernel must first save the processing context of the user process: the program counter, general-purpose registers, memory management information, and stack pointer. It's worth noting that a switch to kernel mode usually does not need to perform any processor cache management, because the memory map does not change. Kernel mode and user mode both exist within the same process context, so no cache lines need to be invalidated on a switch into kernel mode. In fact, it is likely that the kernel processing will require some of the user pages, particularly the u-area (see Section 7.2.1), and any buffers required to process system calls. Once the transition to kernel mode is complete, the kernel can start to process the system call request.
While a process is in kernel mode, it has access to both kernel memory and user memory. However, no user code is executed while the process is in kernel mode. Instead, the user request is validated and, if permitted, performed on behalf of the user mode side (user context) of the process. Kernel execution occurs in the same process context as the user program, so all processor cycles consumed during this process are logged against the process under SYS mode (as viewed with sar -u).
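The split between user-mode and kernel-mode time can be observed from within a process. The following is a minimal sketch, assuming a UNIX-like platform with Python available; the helper name `cpu_time_for_writes` and the use of the null device are illustrative choices, not anything prescribed by the system. It issues a number of write() system calls and reports how much CPU time was charged to the process in user mode and in system mode.

```python
import os

def cpu_time_for_writes(count):
    """Issue `count` write() system calls and return the (user, system)
    CPU seconds charged to this process while doing so."""
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        before = os.times()
        for _ in range(count):
            os.write(fd, b"x")  # each call switches into kernel mode
        after = os.times()
    finally:
        os.close(fd)
    return after.user - before.user, after.system - before.system

if __name__ == "__main__":
    user_s, sys_s = cpu_time_for_writes(100_000)
    print(f"user={user_s:.3f}s system={sys_s:.3f}s")
```

The exact figures depend on the platform and on the clock resolution, so the numbers are best read as relative proportions rather than precise measurements.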
It was mentioned earlier that a process could cause an exception. An exception is caused by a process attempting to do something that is either impossible or prohibited. Examples of this include segmentation violations (a process attempting to access memory that has not been allocated to it) and divide-by-zero exceptions. In this case, the kernel processes the relevant exception handler for that event, such as initiating a core dump, within the context of the process.
The second way that kernel code executes is when certain events occur in the system. These events basically consist of software and hardware interrupts and occur for various reasons. By processing these interrupts with interrupt handlers, the kernel ensures that
When a processor receives interrupts, it stops executing the current process and executes the interrupt handler for that interrupt. This happens regardless of whether the process was operating in user mode or in kernel mode, although the kernel has the ability to block interrupts selectively.
Interrupts are asynchronous with respect to the currently executing process and are caused by events that are not directly initiated by the process itself. Therefore, the handler must execute outside the context of the current process and neither logs its execution cycles against that process nor accesses its address space. This type of execution is considered as system context kernel mode.
The kinds of things that generate interrupts are I/O devices returning a completion status (including network traffic) and the hardware clock.
There are situations in the execution of the kernel code in which an interrupt handler cannot be serviced. One prime example of this would be if the kernel were already executing kernel code for a prior interrupt of that type. In this case, the kernel could corrupt its own address space by having incomplete memory updates from the first interrupt when the second is received. This situation is prevented through the use of Interrupt Priority Levels (IPLs).
Using IPLs, the kernel can instruct the processor to ignore interrupts that it receives that have a lower priority than the specified level. Therefore, the kernel can set the interrupt level at a suitable level prior to processing a critical (i.e., protected) section of code. Any interrupts that have a lower priority than the one set are ignored by the processor, and the critical section can complete safely.
Interrupt levels are hardware-dependent entities and thus vary among processor architectures. Typically, there are several hardware interrupts, and a smaller number of software interrupts that can be programmed. The highest interrupt level is always reserved for machine exceptions: If there is a fatal problem in the hardware or the operating system, all processing must cease immediately to protect against widespread data corruption resulting from unknown states in the system. When a fault of this nature is detected, the highest-level interrupt is generated in order to generate a "panic" of the system.
The next level down is usually reserved for the hardware clock, followed by various peripheral interrupts (disk, network, etc.), and software interrupts have the lowest priority level. Below all of this comes level zero, which is the level at which user mode processing occurs. This allows anything to interrupt user mode processing.
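The masking behavior of IPLs can be illustrated with a small model. This is only a sketch: the level numbers and names are hypothetical (real values are hardware dependent), and it assumes that an interrupt at the same level as the current IPL is masked along with those below it, which is what protects a handler from a second interrupt of its own type.

```python
# Hypothetical level numbers, ordered as described in the text.
IPL_USER = 0
IPL_SOFTWARE = 1
IPL_PERIPHERAL = 2
IPL_CLOCK = 3
IPL_MACHINE_EXCEPTION = 4

class Processor:
    """Toy model of a processor that filters interrupts by priority level."""

    def __init__(self):
        self.ipl = IPL_USER   # user mode runs at level zero
        self.serviced = []
        self.ignored = []

    def set_ipl(self, level):
        """Set a new level and return the previous one so it can be restored."""
        previous, self.ipl = self.ipl, level
        return previous

    def interrupt(self, level, name):
        """Service the interrupt only if it outranks the current level
        (assumption: equal levels are masked too)."""
        if level > self.ipl:
            self.serviced.append(name)
            return True
        self.ignored.append(name)
        return False

cpu = Processor()
saved = cpu.set_ipl(IPL_PERIPHERAL)      # enter a critical section
cpu.interrupt(IPL_PERIPHERAL, "disk")    # masked by the raised level
cpu.interrupt(IPL_CLOCK, "clock")        # higher priority, serviced
cpu.set_ipl(saved)                       # leave the critical section
cpu.interrupt(IPL_SOFTWARE, "softint")   # anything can interrupt level zero
```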
The hardware clock is a critical component of a UNIX system. It not only determines the rate at which the processor is driven (this is why it is known as clock speed) but also generates interrupts that the kernel uses to implement timeslicing.
When the system boots, the kernel programs the hardware clock to interrupt the processor at defined intervals. Each time this interval arrives, a high-priority interrupt is received that is processed by the kernel's clock interrupt handler. This handler runs in system context, because it has nothing to do with any user process.
The interval between clock interrupts is one tick. The tick rate is frequently set at 100 Hz, giving a tick length of 10 milliseconds. Every 10 milliseconds, the interrupt handler is fired and must perform several critical functions, such as incrementing the system time, updating the system and user time statistics for running processes, and posting alarm() signals to processes that have requested them. The clock handler also ensures that the illusion of concurrent execution is maintained on the system by providing the mechanism for implementing time sharing.
Every process in the system is assigned a quantum, which is the maximum amount of time the process can execute on the processor, before other processes must be considered for execution. This period varies between platforms and between scheduler classes, but is typically 100 milliseconds, or ten ticks of the hardware clock.
When the quantum is used up, the kernel checks the run queue, and determines whether the process must be suspended from execution and put back onto the run queue. If there are no processes on the run queue of sufficient priority, the process is allowed to execute for another quantum. If the process is taken off the processor, it remains dormant until it is scheduled back onto the processor when it reaches the head of the queue. This period is typically very short and is weighted by the priority of processes on the run queue.
In this way, the kernel can provide the illusion to all the processes in the system that they are concurrently running on the processor. This, combined with the private memory map, provides a private environment for each process. More detail is provided on process scheduling in the next section.
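The relationship between ticks and quanta described above can be sketched as follows. The figures are the typical ones quoted in the text (100 Hz, 100 milliseconds); the function names and the three action labels are hypothetical and serve only to make the decision at quantum expiry explicit.

```python
TICK_HZ = 100      # clock interrupts per second (typical figure from the text)
QUANTUM_MS = 100   # typical quantum (figure from the text)

def tick_length_ms(hz):
    """Length of one tick in milliseconds."""
    return 1000 / hz

def quantum_ticks(quantum_ms, hz):
    """Number of whole ticks that make up one quantum."""
    return quantum_ms * hz // 1000

def on_clock_tick(ticks_used, ticks_allowed, competitor_waiting):
    """Decide what happens to the running process on a clock tick.

    Returns (action, new_ticks_used). `competitor_waiting` stands for
    'a process of sufficient priority is on the run queue'."""
    ticks_used += 1
    if ticks_used < ticks_allowed:
        return "continue", ticks_used   # quantum not yet used up
    if competitor_waiting:
        return "switch", 0              # back onto the run queue
    return "new_quantum", 0             # nobody else eligible: run on
```

With these figures, `tick_length_ms(TICK_HZ)` is 10 milliseconds and `quantum_ticks(QUANTUM_MS, TICK_HZ)` is ten ticks, matching the numbers quoted above.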
In the preceding section, a level of detail was reached with regard to what a process is from an execution standpoint. From a high-level perspective, a process is simply the kernel's abstraction of all the attributes of a running program, in order to control its execution.
The process table is the master reference for the process. Although the detail of the process table is not directly useful in implementing large Oracle systems, an understanding of the concepts is useful and helps when developing monitoring hooks into the system.
In physical terms, the process table exists as two structures for each process: a proc structure and a u structure (or u-area). Historically, the proc structure was a table of fixed size that was set at kernel link time and was not changeable at runtime. This is the reason for the quotation marks around the word "table" in the head: it really isn't a "table" any more.
The proc structure contains all the information that the kernel could require when the process is not running, and so exists in kernel memory that is always mapped. By contrast, the u-area contains information that usually is needed only while the process is actually running; the u-area is stored in the private memory map of the process.
The reason for using two structures to maintain the process control information is that some of the information does not need to be accessible by the kernel when the process is not running. It does not make sense to tie up global kernel memory for information that is typically not required when the process is not running.
Depending on the specific implementation, the u-area can contain the following information for each process:
· Process control block (execution context of the process, including general-purpose registers, stack pointer, MMU registers, etc.)
Though the list above covers the major items, many other pieces of information can be stored in the u-area.
All of the items listed above, with the exception of the PID, can be written to by other processes in the system or the kernel, and therefore need to be globally accessible. The list pointers will be covered in more detail in Section 7.2.2.
When the process information is divided among such structures, it can complicate the retrieval of this information. UNIX commands, such as ps and fuser, use information from both of these structures and therefore require access to all of this information, even when the process in question is not active and could even be swapped out to disk. When the process is swapped out to disk, this includes the u-area, and so special access methods need to be available for gaining access to this information from the swap area. This can be achieved using special system calls or through the use of a custom API. The actual method used is implementation-dependent.
The subject of process scheduling is one that can have a direct bearing on the performance of your Oracle database. Although this has been touched on already in this chapter, it merits some more detail.
As previously mentioned, UNIX is a time-sharing system at heart. This means that the kernel will provide the illusion to all the processes on the system that they all have their own CPU to run on. On a large database server with 4,000 processes, it is not practical (or scalable) to provide a processor for each process, and so it is the time sharing that allows a far smaller number of processors to provide this illusion.
The key to providing this illusion is the hardware clock, as previously discussed, and context switches. Context switching is best covered in Section 7.3, and so for now it is safe to view context switching from the high level of just switching processes on and off processors. Also, to make this discussion simple, assume that we are referring to a uniprocessor UNIX platform.
When a process has used its time quantum (i.e., when the hardware clock interrupt handler has fired) and another process is waiting to execute (the other process is termed runnable), it is switched off the processor and the new process is switched onto the processor. The switching of processes, and the entire clock interrupt handler, must be very efficient in order to minimize the overhead of processing small time slices. If this operation consumes a noticeable percentage of the quantum itself, that percentage of the system's processing capacity is "wasted" on the task of time sharing.
This simplistic view of time sharing is useful in gaining an initial understanding of the concepts. However, the actual implementation of time sharing is a good deal more complex, having to deal with such things as different priority processes, balancing throughput against response time, managing CPU cache warmth on multiprocessor machines, and so on. The handling of this complex time-sharing requirement is called process scheduling.
When a process becomes runnable, it is placed on a run queue, that is, a queue for the processor in order to be run. This is achieved by adding the proc structure of the process onto a linked list that makes up the run queue.
The architecture of the run queues is very platform-specific, especially when the platform supports complex hardware arrangements such as NUMA. In order to keep this discussion simple, we will refer to the Berkeley software distribution (BSD) run queue architecture.
BSD maintains several run queues, each of which queues processes of a specific priority. When a process has used its quantum on the processor, the scheduler scans the run queues in order, from highest priority to lowest priority. If a process is found on a run queue with a priority greater than or equal to that of the currently executing process, the old process is switched off the processor and the queued process is switched on. If there are no processes on queues with priority greater than or equal to that of the currently executing process, the process is permitted to continue running for up to another quantum before the queues are checked once again.
If a process of higher priority becomes runnable, the current process is preempted off the processor even if it has not used its entire quantum.
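The scan performed at quantum expiry can be sketched as follows. This is a simplified model rather than BSD code: the class and method names are hypothetical, processes are represented by their PIDs, and it assumes that a numerically lower queue index means a higher priority.

```python
from collections import deque

class RunQueues:
    """Toy model of per-priority run queues (index 0 = highest priority)."""

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, pid, priority):
        self.queues[priority].append(pid)

    def pick_at_quantum_expiry(self, current_pid, current_priority):
        """Scan from highest priority down to the priority of the running
        process. If a queued process is found, requeue the current process
        and return the new one; otherwise the current process continues."""
        for priority in range(current_priority + 1):
            if self.queues[priority]:
                chosen = self.queues[priority].popleft()
                self.queues[current_priority].append(current_pid)
                return chosen
        return current_pid

rq = RunQueues(levels=4)
rq.enqueue(10, priority=2)                # lower priority than the runner
first = rq.pick_at_quantum_expiry(1, 1)   # nothing eligible: PID 1 runs on
rq.enqueue(11, priority=1)                # equal priority to the runner
second = rq.pick_at_quantum_expiry(1, 1)  # PID 11 takes over, PID 1 requeued
```

Preemption by a newly runnable, higher-priority process would be a separate check made at enqueue time; it is left out here to keep the sketch focused on the quantum-expiry scan.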
The priority of a process is governed by two factors and will be constantly adjusted by the kernel during the lifetime of the process based on the system load average. These factors are the estimated recent CPU usage by this process and the "nice" value of the process.
The nice value of a process is specified by the user at process start-up using the nice command. This value ranges from -20 to +19, with the default being 0. Superuser privileges are required to decrease the nice value, because this increases the priority of the process above the normal priority for user processes. A user can elect to put a larger nice value on the process in order to run at a lower priority and therefore be "nice" to the other users of the system.
The recent CPU usage of the process is calculated using several algorithms. It is not necessary to go into the specifics of these algorithms, because they are covered comprehensively in various other books (see Section 7.9 at the end of this chapter). It is worthwhile discussing the basics, however.
If a process is using a good deal of CPU, this will be reflected in the recent CPU counter, which in turn is used as a negative weighting factor in the algorithm that determines the priority of the process. Therefore, CPU-intensive processes cannot dominate a system unless there are only lower-priority processes in the run queue or the system is idle. Likewise, if there are two processes trying to use as much CPU as possible, they will end up on the same low-priority run queue and will compete against each other.
The recent CPU counter is incremented for every tick the process executes. This value is then weighted once a second, using the system load average and the process nice value in order to give the counter a decay over time. The load average is used as the amnesia factor in keeping track of the used CPU; if the system is very heavily loaded, the CPU counter will take a long time to forget previous CPU usage, and the priority of the process will be proportionately lower. If the load average is small, the used CPU will be forgotten relatively quickly, and the process will gain a higher priority.
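One widely documented form of this calculation is the decay filter used by 4.3BSD-style schedulers. The sketch below assumes that form; the names `p_cpu` and `PUSER` and the constants come from that tradition and are not necessarily what any given platform uses. In this scheme a numerically larger priority value means a lower scheduling priority.

```python
PUSER = 50  # base user-mode priority in the assumed 4.3BSD-style scheme

def decay_cpu(p_cpu, load_average, nice):
    """Once-a-second adjustment of the recent CPU counter.

    The decay factor approaches 1 as the load average grows, so a busy
    system forgets past CPU usage more slowly."""
    decay = (2.0 * load_average) / (2.0 * load_average + 1.0)
    return decay * p_cpu + nice

def user_priority(p_cpu, nice):
    """Priority value derived from recent CPU usage and the nice value
    (larger value = lower priority)."""
    return PUSER + p_cpu / 4 + 2 * nice

light = decay_cpu(90, load_average=1.0, nice=0)    # decay factor 2/3
heavy = decay_cpu(90, load_average=10.0, nice=0)   # decay factor 20/21
```

Here `heavy` retains more of the original counter than `light`, which is the "amnesia factor" behavior described above: under heavy load, past CPU usage counts against the process for longer.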
If a process needs to block while it is running, such as waiting for an I/O operation to complete, it is taken off the processor and placed on a different type of queue: one of the sleep queues. The process stores a wait channel in its proc structure and puts itself onto the correct sleep queue for that resource. The wait channel is typically the address of the structure on which the process is waiting, which is hashed to obtain the correct sleep queue. When a resource becomes available, its address is hashed by the kernel to find the correct queue, and all processes waiting on it are woken and put back on a run queue.
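The hashing of wait channels onto sleep queues can be sketched as follows. The number of queues, the modulo hash, and the class name are all hypothetical simplifications; the point is that several wait channels may share a queue, so a wakeup must match on the channel itself and not just on the queue.

```python
NUM_SLEEP_QUEUES = 8  # hypothetical; real kernels choose their own size

class SleepQueues:
    """Toy model of hashed sleep queues keyed by wait channel address."""

    def __init__(self):
        self.queues = [[] for _ in range(NUM_SLEEP_QUEUES)]

    def _queue_for(self, wait_channel):
        # Assumed hash: a simple modulo of the address.
        return self.queues[wait_channel % NUM_SLEEP_QUEUES]

    def sleep(self, pid, wait_channel):
        """Record that `pid` is blocked on `wait_channel`."""
        self._queue_for(wait_channel).append((pid, wait_channel))

    def wakeup(self, wait_channel):
        """Remove and return every PID waiting on `wait_channel`; the
        caller would then place them back on a run queue."""
        queue = self._queue_for(wait_channel)
        woken = [pid for pid, chan in queue if chan == wait_channel]
        queue[:] = [(pid, chan) for pid, chan in queue if chan != wait_channel]
        return woken

sq = SleepQueues()
sq.sleep(101, 0x1000)
sq.sleep(102, 0x1000)
sq.sleep(103, 0x1008)         # hashes to the same queue as 0x1000
woken = sq.wakeup(0x1000)     # only the two waiters on 0x1000 are woken
```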
If the process was blocked waiting for the completion of a system call, such as a read(), the process is put on a high-priority run queue until the process completes the switch back into user mode. At this point, the process priority is reduced, typically resulting in the process being switched off the processor in order to service processes of higher priority.
When processes are executing on a processor, they are neither on a run queue nor on a sleep queue. At this point, the process is active and exists on another list. In SVR4, this list is known as practive and is a linked list of all the processes currently on the processor, linked against the "active processor link" item in each proc structure.
When multiple processors are present in a system, the scheduling algorithms become more complex. In the case of NUMA systems, this is an attempt to keep active processes close to their resident memory. In SMP machines, it is implementation-dependent: Some platforms do no extra scheduling work, whereas others do.
With multiple processors (sometimes referred to as engines by kernel engineers) and a relatively low number of runnable processes at any one time, some major optimizations can be made by maintaining cache warmth in the CPU caches. This is known as cache affinity.
An algorithm can be implemented, using the number of clock ticks since the process last ran, which determines whether the process is likely to have cache warmth on the processor on which it last ran. Waiting for the correct engine to become available may incur a delay in the execution of the process, and so this algorithm needs to be well tested with real applications.
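A minimal sketch of such a heuristic follows. The threshold value, the function name, and the policy of waiting for the warm engine are all assumptions made for illustration; a real scheduler would tune these against real workloads, as noted above.

```python
AFFINITY_TICKS = 20  # hypothetical threshold, to be tuned by testing

def choose_engine(last_engine, ticks_since_ran, idle_engines,
                  threshold=AFFINITY_TICKS):
    """Return the engine the process should run on, or None to keep it
    queued until a suitable engine becomes available."""
    if last_engine in idle_engines:
        return last_engine              # warm or not, it is free: use it
    cache_probably_warm = ticks_since_ran <= threshold
    if cache_probably_warm or not idle_engines:
        return None                     # wait rather than lose cache warmth
    return min(idle_engines)            # cache has gone cold: run anywhere
```

For example, `choose_engine(2, 5, {3})` returns `None` because the process ran on engine 2 only five ticks ago, whereas `choose_engine(2, 50, {1, 3})` returns engine 1 because any cache warmth is assumed to have been lost.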
In testing with and without affinity, OLTP Oracle workloads have been shown to exhibit as much as 15 percent more throughput when affinity is enabled.
In NUMA configurations, it becomes more critical that the scheduler place processes on the correct engine. Although some NUMA configurations support dynamic page relocation between NUMA nodes, this is expensive at best, and the majority of memory for a given process will always reside on one node or another anyway. Therefore, it is a fair assumption that the scheduler should always attempt to schedule processes on engines on the same node as the resident set of the process.
This results in several per-node run queues, in order to ensure the locality bias in selecting processes to run. It's worth noting that the scheduler cannot factor in the location of any mapped shared memory, because this is not part of the process's private memory. In the case of all the processes in the system all running the same binary (such as oracle), the operating system may elect to store a single shared text segment local to each NUMA node in order to maximize the locality of reference.
Signals are the mechanism used to notify processes of asynchronous events. Even novice users are familiar with the kill command, which is used to send these signals to processes.
Most users are aware of the signal mechanism only from the user perspective, that of killing a process in order to terminate its execution. In fact, signals are very much more flexible than a simple termination mechanism, and most can be trapped and controlled in a programmable manner by the process.
Signals are passed to the process by the kernel. They can be initiated by other processes in the system or by the target process itself. The kernel passes the signal to the process by setting a bit that corresponds to the signal in the proc structure for the process, as shown in Figure 7.3.
This bitmask is called p_sig in SVR4 and p_siglist in BSD-based implementations. The bitfield is used to store pending signals for the process.
Before the pending signal bitmask is updated, however, the kernel checks that the signal has not been explicitly ignored by the process. If the signal has been ignored, the kernel will not even set the bit in the bitmask. Once the bit is set in the proc structure, the kernel will attempt to wake up the process in order to receive the signal (if it is not already running). After this, the processing of the signal is entirely the responsibility of the process itself.
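The posting of a signal into the pending bitmask can be modeled as follows. This is a sketch of the bookkeeping only, not kernel code: the `Proc` class and function names are hypothetical, the field is called `p_sig` after the SVR4 name, and the choice of bit position (signal number minus one) is an assumption.

```python
import signal

class Proc:
    """Toy stand-in for the signal-related part of a proc structure."""

    def __init__(self):
        self.p_sig = 0        # pending-signal bitmask
        self.ignored = set()  # signals the process has chosen to ignore

def post_signal(proc, signo):
    """Mark `signo` as pending unless the process ignores it.
    Returns True if the bit was set."""
    if signo in proc.ignored:
        return False          # ignored signals never reach the bitmask
    proc.p_sig |= 1 << (signo - 1)
    return True

def pending_signals(proc):
    """List the signal numbers currently pending, lowest first."""
    return [bit + 1 for bit in range(proc.p_sig.bit_length())
            if proc.p_sig & (1 << bit)]

proc = Proc()
proc.ignored.add(signal.SIGINT)
posted_int = post_signal(proc, signal.SIGINT)    # dropped: ignored
posted_term = post_signal(proc, signal.SIGTERM)  # recorded as pending
```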
The trend here is for the bitmask to be checked prior to returning to user mode from kernel mode. The reason for this is that some signals (namely SIGKILL and SIGSTOP) cannot be processed in user mode, because their handlers are not programmable. Once the bitmask is checked, the signals that have been received are processed by the process in kernel mode. The first action the kernel takes is to check if the signal is SIGKILL or SIGSTOP. If so, the actions are taken in kernel mode without returning to user mode. These actions terminate the process, or suspend the process and put it to sleep, respectively.
If the signal is not one of these two, the kernel checks the u-area for the process to find out if it has a handler declared for this signal. If there is no handler, the default action is taken, as shown in Table 7.1.
Any signal without a default action of "ignore" will not switch back into user mode. Instead, the process will exit, core dump, or be suspended, all without leaving kernel mode. If there is a handler defined for the signal, the kernel will switch into user mode and execute the signal handler.
It's worth noting that a process may not act on a signal for a comparatively long time. If the process is blocking on a noninterruptible kernel resource or is swapped out to disk, the process may remain unchanged in the system for a while before the signal is acted on.
To finish up with processes, several of the concepts in this section are summarized in Figure 7.4, which shows an example mini-application in which a network listener process listens for connections and creates a slave process in order to process requests as they come in.
In simplistic, single-tasking computer systems, programs can be compiled to locate themselves, and run, at specific memory addresses within the available physical memory. The flow of control is passed from task to task on completion, with potentially only one program in memory at any one time.
A very simple operating system, for example, has an operating system "kernel" compiled to run at a specific address and all user programs compiled to run at another specific address. The control of the system starts with the operating system, from which other programs can be executed. The other programs all locate themselves at addresses separate from that of the O/S; the O/S can remain in memory while this happens. Once the program is complete, the operating system gains control of the system once more.
· Little or no memory protection between programs: All programs can read or write all memory locations.
· Programs must execute serially and, unless compiled for physically separate memory ranges, must overlay the previously executing program.
These implications go directly against two of the design goals of UNIX, and so this arrangement is a nonstarter. Another scheme must therefore be adopted: virtual memory and its associated memory management.
UNIX systems implement virtual memory. Virtual memory separates the address space seen by a process from real physical memory addresses. This is achieved using memory address translation as an operating system function.
Address translation allows all processes in the system to address the same locations as if they were private, with the system ensuring that the physical address used is distinct from other processes.
In Figure 7.5, there are two processes, each using half of the physical memory in the machine (for simplicity of example). Each of the two processes has an identical address space, or memory map, that is the same size as the physical address space. Although these processes are using physically different memory locations, both are under the illusion that they start at address zero and have contiguous memory allocated beyond that (plus stack).
Each of the cells in Figure 7.5 represents the smallest unit of granularity that the UNIX system considers from a memory management perspective: a page. The concept of a page is used in order to reduce the space and resource overhead of managing a large amount of memory on the system. It is common in modern systems for the page size to be set at 4KB, although some systems now support variable page sizes in order to increase the efficiency of very large memory (VLM) applications.
The glue that holds this process together is the address translation tables. These tables hold the mapping of virtual addresses (those that the process uses) to physical addresses (actual memory locations). In UNIX systems, these tables normally come in the form of page tables and, optionally, translation lookaside buffers (TLBs) in the processor memory management unit (MMU).
The page table consists of several page table entries (PTEs) for each process, arranged as an array, as shown in Figure 7.6.
For a given virtual page frame (00 to 07 in Figure 7.6) of a process, there is a corresponding PTE, located at the same offset in the array that makes up the page table. So, the first page that makes up the address space of the process in this example is mapped to the page starting at physical address 0x000E1000. The actual offset within the page remains the same as the offset within the virtual page.
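The lookup just described can be sketched in a few lines. The page size is the 4KB figure used in the text, and the first PTE matches the example address 0x000E1000; the remaining entries are made-up values for illustration, and a real PTE would also carry protection and status bits that are omitted here.

```python
PAGE_SIZE = 4096  # 4KB pages, as in the text

# Physical page base address for each virtual page frame 00 to 07.
# Only the first entry comes from the text; the rest are hypothetical.
page_table = [
    0x000E1000, 0x0004A000, 0x00130000, 0x000E2000,
    0x0007C000, 0x00011000, 0x000F9000, 0x00025000,
]

def translate(virtual_address, table):
    """Map a virtual address to a physical address via the page table."""
    # The page number indexes the table; the offset is carried over unchanged.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    return table[page_number] + offset
```

For instance, `translate(0x00000123, page_table)` yields 0x000E1123: the page number 0 selects the first PTE, and the offset 0x123 is preserved.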
It was mentioned earlier that some systems now support variable-size pages. This is a result of the overhead now imposed in managing the huge quantities of memory found in very large systems. For example, if an Oracle database server has a 4GB SGA, this typically means that each connection to the database needs to have a page table large enough to cater for every page in the SGA in addition to the process memory itself. Each PTE is typically 32 bits, and so a 4KB page size would yield a 4MB (1 million times 4 bytes) page table for each process. Not only does this size of page table mean the kernel is spending a good deal of time managing page tables, but this memory is also located in kernel memory, not user memory. If 5,000 users are due to connect to this system, this means the kernel memory needs to be greater than 20GB. When variable page sizes are used, the SGA can be assigned, say, 4MB pages, thus reducing the size of the page table for the mapping.
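The arithmetic in this example can be checked with a short calculation. The function name is hypothetical, and the sketch counts only the PTEs needed to map the SGA itself, ignoring the private memory of each process.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def page_table_bytes(mapped_bytes, page_size, pte_bytes=4):
    """Size of the page table needed to map `mapped_bytes` of memory,
    with one PTE of `pte_bytes` bytes per page."""
    return (mapped_bytes // page_size) * pte_bytes

small_pages = page_table_bytes(4 * GB, 4 * KB)   # 1M PTEs of 4 bytes each
large_pages = page_table_bytes(4 * GB, 4 * MB)   # 1,024 PTEs of 4 bytes each
total_small = 5000 * small_pages                 # all 5,000 connections
total_large = 5000 * large_pages

print(f"4KB pages: {small_pages // MB}MB per process, "
      f"{total_small / GB:.1f}GB for 5,000 processes")
print(f"4MB pages: {large_pages // KB}KB per process, "
      f"{total_large / MB:.1f}MB for 5,000 processes")
```

With 4MB pages, the per-process table for the SGA shrinks from 4MB to 4KB, which is where the saving in kernel memory comes from.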
The actual structure of each entry in the page table is defined by the hardware architecture, specifically the MMU of the processor. Each processor family has its own structure for defining virtual to physical mappings in the MMU, and some support more functionality in the hardware than others, specifically in the area of memory protection and the presence of a referenced bit.
Although the hardware dictates the structure of the page table entries, it is the UNIX kernel that is responsible for all the manipulation of the entries in the table and for ensuring that the MMU is using the correct PTEs for the running process. At this stage, it is worth mentioning the effects of the two major variants of CPU cache architecture: physically mapped and virtually mapped.
A physically mapped architecture is laid out as shown in Figure 7.7. This is the traditional approach to caching, because the operating system does not need to be aware of the operation of the cache. Whenever a request for data is made by the CPU, the address (which remains as a virtual address) is passed through to the MMU, which converts the virtual address to a physical address. It does this using the TLB, which is a fully associative cache: it can search all of its lines concurrently to determine the physical address to search for in the cache.
The TLB contains PTEs specified by the kernel, and this is the reason why the structure of the PTE is dictated by the CPU architecture. The kernel is responsible for loading the registers of the TLB with the correct PTE information for the active process. For any given process, the kernel mappings will not change, because all processes have the kernel mapped in their address space. On the right of the MMU in Figure 7.7, all addresses are physical.
It is clear that having to precede even the cache access with a lookup on the TLB can impose a significant overhead. In the case that the TLB does not contain the PTE required for the operation, a reference to main memory needs to be made in order to prime the TLB. Luckily, this normally means that the page has not been accessed for a comparatively long time, and so the impact is not felt very frequently.
In some modern systems, a different approach has been taken with the cache. Instead of using the physical address to determine the correct line and tag within the cache, the virtual address is used (see Figure 7.8).
The effect of this is that the CPU can request data from the cache directly by virtual address. The MMU needs to be used only when a cache miss occurs and a physical memory access is required. The requirement for a TLB is less in this arrangement, because the MMU is now positioned in a slower part of the system than it is in a physically mapped architecture. This having been said, many architectures implement virtual caches with TLBs for enhanced performance.
The downside of a virtual cache is that the kernel is now required to manage much of the cache coherency, because the hardware is unaware of the differences among the many identical virtual addresses that refer to different physical memory locations. Address 0x1000 for process A is not the same memory location as address 0x1000 for process B.
To alleviate the ambiguity in the cache, the system typically uses the process ID of the relevant process as part of the tagging infrastructure in the cache. The hardware has no concept of process IDs, and so the coherency of the cache must involve the operating system: At a minimum, the kernel must inform the processor of the identity of the current process in order for the correct tags to be checked in the cache.
Cache coherency across DMA operations is also more complex, because the majority of I/O devices have no concept of virtual addresses; they transfer to and from physical memory addresses.
When the operating system switches one process off the processor and puts a different one on, several actions must be taken to
These actions constitute the formal definition of a context switch. The context of a process typically includes the following:
The first three attributes are referred to as the process control block, or PCB, and relate directly to saving the execution context of the process. The last attribute is that of the PTEs, and their presence in the TLB.
When a new process is switched onto a processor, the MMU must be informed of the new address mappings for that process. This applies whether the cache is virtual or physical. If a TLB is present, the kernel must also ensure that any mappings that are not relevant for the new process are invalidated, making the MMU reload the correct mapping from memory on access. This is necessary because the MMU does not have any concept of the process executing on the CPU, and merely changing address mappings does not mean that the process has just changed. The classic example of this would be a process issuing an exec() call, which results in all prior mappings changing to support the new program.
The kernel will load up the MMU's registers with the location of the new page table for the process, followed by a flush of the irrelevant TLB entries for the new process. The new process will keep all of the kernel TLB entries, because these addresses will continue to have the same mapping for the new process.
The operation of the MMU and associated TLB is somewhat complicated by the following aspects of modern UNIX systems:
The MMU determines whether or not a page can be read from or written to. This includes mapped pages with memory protection bits set in the PTE and unmapped pages that the process cannot use. In either case, the MMU raises a trap for the kernel to deal with.
The presence of the swap area is discussed in Section 7.4, where we complicate the virtual memory system further by using more memory than we physically have.
In multiprocessor systems, the presence of the TLB introduces further cache coherency considerations that are unrelated to the CPU cache coherency. For example, if the virtual-to-physical mapping or memory protection for a kernel page changes, TLBs in all MMUs must be updated to reflect this change. Another example would be shared memory accessed by several processes. This includes explicit shared memory segments and also copy-on-write data segments for a program that has executed fork() with no exec(). Copy on write will be discussed in Section 7.4.3.
In this case, the kernel must initiate a "TLB shootdown" in order to ensure that all other TLBs are current with the new information. This is an explicit operation, for which the kernel typically maintains a set of data structures that map which PTEs are located in the various TLBs on the MP system. Using this map, the kernel can explicitly invalidate the changed PTEs in all the TLBs.
Any changes in the PTEs are typically not loaded into the various TLBs, because doing so would be a very expensive default operation. Instead, the entries are simply invalidated, forcing the MMU to reload the entry on the next reference. In many cases, the reload will not be required before the process is switched off the CPU, and so the work is prevented altogether.
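The invalidate-rather-than-reload policy can be modeled with a toy multiprocessor. This is a sketch under simplifying assumptions: each TLB is a dictionary, the kernel checks every TLB instead of consulting a map of which PTEs live where, and the names Cpu and change_mapping are hypothetical.

```python
class Cpu:
    """One processor with a private TLB caching entries from a shared page table."""

    def __init__(self, page_table):
        self.page_table = page_table   # shared, kernel-maintained mapping
        self.tlb = {}                  # virtual page number -> physical page base
        self.reloads = 0               # how many times the MMU went to memory

    def translate(self, vpn):
        if vpn not in self.tlb:        # TLB miss: reload the PTE from memory
            self.tlb[vpn] = self.page_table[vpn]
            self.reloads += 1
        return self.tlb[vpn]


def change_mapping(page_table, cpus, vpn, new_base):
    """Update a PTE, then shoot down any stale copy in every TLB."""
    page_table[vpn] = new_base
    for cpu in cpus:
        cpu.tlb.pop(vpn, None)         # invalidate only; do not preload


page_table = {0: 0x1000}
cpu_a, cpu_b = Cpu(page_table), Cpu(page_table)
cpu_a.translate(0)
cpu_b.translate(0)
change_mapping(page_table, [cpu_a, cpu_b], 0, 0x8000)
```

After the shootdown, a processor pays for a reload only if it actually touches the page again; one that never references it does no extra work.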
The preceding section concentrated on the generic attributes of virtual memory (VM), where a finite amount of memory is available on the system and is the maximum amount of memory that can be used among all the processes.
In reality, many of the pages of a given process do not need to be resident in memory all the time, and making them so would be wasteful of the valuable memory resource. A good example of this is Oracle Forms, where the size of the physical process may be 16MB or more. In a running system, experience has shown that only around 8MB of these pages need to be in memory at any one time in order to execute the Forms application with performance quite comparable to that of a fully resident image.
In order to support this optimization of real memory, the memory hierarchy needs to be extended to physical disk, the next step down in the memory hierarchy (see Figure 7.9).
The use of physical disk spindles is a natural extension of the familiar memory hierarchy between CPU cache memory and main system memory. The next step after main system memory is physical disk, which is much cheaper and slower than system memory.
Using disk to extend memory capacity was once a larger issue than it is today. In the past, memory chips were scarce and expensive, and the system architectures were not able to address large amounts of real memory. Therefore, disk-based memory hierarchies were essential to providing the capability for many concurrent users on a UNIX system.
In modern systems, memory is a far less important issue than it used to be, in terms of both cost and addressability. In fact, in building a high-performance Oracle database server, it is preferable always to work within the confines of the physical memory on the system, and not to rely on physical disk to provide additional memory capacity.
Unlike the case of Oracle Forms cited above, Oracle Server processes share critical resources between them, meaning that one process can directly slow down all other processes in the system by holding one of the resources for an extended period. If the process needs memory that has been paged out to disk in order to complete the processing under a latch, all other processes will wait for this page-in operation before the latch is released.
The use of disk in the memory hierarchy is not limited to the sharing of physical memory, however. It also allows the rapid execution of program executables, by breaking the reliance on the entire image being in memory prior to execution. This allows the program to execute as soon as the first page is resident in memory.
Although the full detail of the hierarchy varies across implementations, all modern virtual memory systems implement a form of demand paging. Demand paging extends the concepts presented in the preceding section by specifying that any given page in the system can be resident or nonresident (on disk) at any one time. This does not affect the operation on that piece of memory, because the kernel intervenes if the page is nonresident and loads the page back into memory before the operation is carried out.
The implementation of demand paging is based on the operation of the MMU and PTEs. When a page is requested, the request is serviced by the MMU. If the MMU cannot resolve the request, because the page is either not resident or not yet allocated, a page fault is generated. This is serviced by the kernel, and the appropriate action is taken, based on why the page reference could not be resolved by the MMU. If the page has never been allocated, the kernel will generate an exception for the process and send the appropriate signal.
If the page has been allocated but is located on disk, the kernel will load the page back into memory and create a new map for that virtual address, pointing it to the new physical address. The operation will then be allowed to continue.
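The two outcomes of a page fault can be sketched as follows. The model is deliberately crude: dictionaries stand in for the disk and for physical memory, and the class and exception names are assumptions for the example rather than anything a real kernel exposes.

```python
class SegmentationViolation(Exception):
    """Stands in for the signal sent when a page was never allocated."""


class AddressSpace:
    def __init__(self, on_disk):
        self.on_disk = dict(on_disk)   # allocated pages: page number -> contents
        self.resident = {}             # pages currently in physical memory
        self.faults = 0

    def read(self, vpn):
        if vpn not in self.resident:   # the MMU cannot resolve the reference
            self._page_fault(vpn)
        return self.resident[vpn]      # the operation then continues normally

    def _page_fault(self, vpn):
        self.faults += 1
        if vpn not in self.on_disk:    # never allocated: raise an exception
            raise SegmentationViolation("page %d is not mapped" % vpn)
        self.resident[vpn] = self.on_disk[vpn]   # page in and map it


space = AddressSpace({0: "text page", 1: "data page"})
first = space.read(0)    # faults and loads the page
second = space.read(0)   # already resident, no fault
```

The caller of read() cannot tell whether a fault occurred, other than by the time taken, which is the point of demand paging.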
Virtual memory can be considered a cache of disk storage. This does not mean that it is the same as the buffer cache, but rather a more complex affair. However, it is safe to view the primary execution object as the on-disk copy of the memory itself. In the same way that a CPU cache mirrors the contents of physical memory in order to speed access, so the physical memory of the system is mirroring the contents of the disk in order to speed access to it.
In the case of program executables, this is a very straightforward concept to grasp: The program is on disk and, on each access to the pages that make up the file, the page is loaded into some available physical memory location in order to speed the access to the instructions in that page. The analogy to CPU caches here is the loading of cache lines (analogous to pages) from physical memory (analogous to disk blocks) before execution.
In the case of other memory objects (namely, anonymous objects such as data segments, user stacks, and so on) there is initially no on-disk representation. In order to deal with this, an area of disk is allocated from the swap area, a special partition or series of partitions set up for this task. This then becomes the on-disk representation of the pages allocated for this process.
When a page is first allocated to an anonymous piece of memory, nothing happens; the kernel simply ensures that the disk area is available. On first access, the page is allocated a physical memory page, which would be zero-filled by the kernel on first use. The process can now use this page as required, operating as a write-back cache (see Section 2.1.4). The disk version of this block becomes stale at this point.
This allocation is just the same as a CPU writing to a cache line for the first time; as soon as the CPU has written to it, the line in physical memory is stale, and the line in cache represents the current version.
The implementation of the actual paging interface varies among different VM systems, so the method presented here represents one of the more common approaches: that of an object-based implementation.
The first concept of an object-based pager interface was introduced above; there are two different types of memory consumer, both of which are used by a process:
When segments of these types are created by the kernel, the kernel instantiates an object specific for that type of memory allocation. Both of the classes that define this object implement a standard interface that the kernel expects to be present (see Figure 7.10).
For those not familiar with object-oriented concepts, don't worry; it is not important that you fully understand these concepts at this stage. Basically, an abstract class is one that simply defines what is required from a class that subclasses it (it adopts the abstract class as its parent specification). A class is code that is fully inclusive from data and function perspectives. It contains all the data declarations and all the functions that operate on that data. When a class is invoked, an object is created, which is basically a physical "version" of the class, one that exists in executable form in memory.
In the case of memory objects, the VM system is no longer concerned about the implementation or special requirements of a particular type of pager. As long as the pager subclasses the abstract class (i.e., specification) that the VM system expects, the VM system does not need to be cognizant of any of the detail.
The example in Figure 7.10 uses a vnode pager object. A vnode is an important part of the operation of modern UNIX kernels. It is the kernel's view of an active file. The "v" in vnode stands for virtual; that is, the vnode is an abstraction of any particular filesystem's unique file identifier. It exists in order to break the prior hard link between the kernel and a filesystem, and allows any filesystem with a vnode interface to be mounted in the same way as any other filesystem.
At the end of the vnode pointer is an actual node pointer to the physical file. Therefore, a vnode uniquely identifies a file on the system, and this is the reason that the pager object that deals with files is based on the vnode of the file.
One pager object is instantiated for every mapped file. If more than one user has the file mapped, they will share the same pager object. Whenever an operation must be performed on a page owned by this object, the VM system will call one of the standard functions on the object in order for the object to perform the operation accordingly. This includes freeing the page for another process, or reallocating physical memory for a page that has already been paged out to disk.
The problem of handling memory mapped files is different from that of anonymous memory allocations, which is why there are different pager routines for each.
The pager routine for anonymous objects does not have a source file in a filesystem to pass pages to and from. Instead, these objects are transient objects that exist only for the lifetime of the process that is using them. There is no on-disk representation of these objects before they exist in physical memory. When memory pages used by these objects need to be freed, the pager writes the page out to the swap area, which is a dedicated set of disk partitions for this task that is shared by all processes. For this reason, this pager routine is frequently termed the swap pager.
Although this pager is not the vnode pager, both classes present the same interface to the kernel, using vnode and offset pairs to identify the actual page. The vnode in the case of anonymous objects is that of the swap area.
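For readers who find code easier to follow than class diagrams, the arrangement might look something like this. It is a sketch, not any vendor's implementation: the method names pagein and pageout follow the routine names this chapter uses for pagers, while the class names, the dictionaries standing in for disk, and the tiny four-byte "page" are assumptions made for brevity.

```python
from abc import ABC, abstractmethod


class Pager(ABC):
    """The specification the VM system expects every pager to satisfy."""

    @abstractmethod
    def pagein(self, offset):
        """Return the contents of the page at this offset."""

    @abstractmethod
    def pageout(self, offset, data):
        """Save the page at this offset so its memory can be freed."""


class VnodePager(Pager):
    """Backs pages with a file; a dict stands in for the file on disk."""

    def __init__(self, file_pages):
        self.file_pages = file_pages

    def pagein(self, offset):
        return self.file_pages[offset]

    def pageout(self, offset, data):
        self.file_pages[offset] = data


class SwapPager(Pager):
    """Backs anonymous pages with swap; untouched pages read as zeros."""

    def __init__(self):
        self.swap = {}

    def pagein(self, offset):
        return self.swap.get(offset, bytes(4))   # zero-filled on first use

    def pageout(self, offset, data):
        self.swap[offset] = data


def service_fault(pager, offset):
    """The VM system calls the same routine whatever the pager type."""
    return pager.pagein(offset)
```

Because service_fault depends only on the abstract interface, a new kind of pager can be added without touching the VM system itself.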
A process on the system uses both types of objects and looks a little like the arrangement shown in Figure 7.11.
Figure 7.11 illustrates the pager interfaces used by a simple executable program, such as the Korn Shell (/bin/ksh). When the program starts, a vnode pager object is created to map the executable file. The program starts to execute but cannot do so until the first page of the executable is in memory. The vnode pager is called on at this point to allocate a page in memory and bring in the first page of the executable.
Almost immediately after this, the first page of the data segment needs to be available, and so a pager object needs to be made. The object would be the swap pager, because the contents of this memory will be modified during use, and this should not be reflected back into any file in the filesystem. The first page of the data segment is created by the swap pager, and execution continues.
As soon as anything needs to be placed on the stack, the kernel creates the first page of the stack segment and assigns the swap pager to it. Likewise, as soon as the program needs to use any memory on the heap, another object is created to store the heap segment, which also is a swap pager object.
In Figure 7.11, the vnode pager is a unidirectional arrow, because the mapped file is an executable (executables never get written to by the VM system). If the process were to map another file using mmap(), the vnode pager would also be responsible for ensuring that all dirty pages were first written back to the file before freeing the page.
Now that the actual page handlers are known, we can forget about them once again and concentrate on the VM system itself and the mechanics of its operation.
The VM system varies to a large degree across implementations, but they all have the same fundamental goal: to keep as many active (i.e., currently used) pages as possible in physical memory at any one time. They also typically all have the same basic theory for achieving this goal.
Figure 7.12 shows the four parameters used by the SVR4 UNIX memory management algorithms. Essentially, these four parameters define how aggressively the VM system should work toward freeing memory in the system. The first parameter, lotsfree, declares that if the free memory is above this point, no action needs to be taken by the VM system to free memory, because there is lots free. When free memory falls below this threshold, the VM system starts to find pages that can be written out to disk and subsequently free the physical memory page. This pageout routine is run four times per second at this point. In Figure 7.12, the rate of memory decline is reduced because the VM system starts to free infrequently used pages.
The next threshold is desfree. If memory falls below this point, the VM system starts desperation paging, because memory is getting very low. The pageout process is executed on every clock() cycle at this point, and the VM system works hard to free up memory. In the chart, the free memory level starts to become more erratic, because the requests for memory continue and the VM system is almost managing to keep the number of free pages constant. Desfree defines the amount of memory the O/S must attempt to keep free at all times.
The next threshold is minfree. Serious memory starvation occurs at this point, and the system becomes far more aggressive over the selection of pages to page out. If memory still cannot be freed, and the free memory count falls below gpgslo, the system has to admit defeat in using the paging algorithms.
At gpgslo, the VM system changes gear and goes into swapping mode. It is deemed that memory is so scarce at this point that VM operations will dominate the CPU on the system and start thrashing. Thrashing is where the system spends all of its time paging in and out in an attempt to keep active pages in memory. Swapping involves paging out entire processes to the swap area, rather than trying to work out which pages should go out.
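The four thresholds amount to a simple ladder of decisions. The sketch below captures only that ladder; the threshold values are invented page counts, and treating "at the threshold" the same as "above it" is an assumption, since the text does not say which way the boundary falls.

```python
# Hypothetical values, in pages; real systems derive these from memory size.
THRESHOLDS = dict(lotsfree=1024, desfree=512, minfree=256, gpgslo=64)


def vm_action(free, lotsfree, desfree, minfree, gpgslo):
    """Return the VM system's posture for a given number of free pages."""
    if free >= lotsfree:
        return "idle"                                  # lots free, nothing to do
    if free >= desfree:
        return "pageout four times per second"
    if free >= minfree:
        return "desperation paging on every clock tick"
    if free >= gpgslo:
        return "aggressive page selection"
    return "swapping"                                  # paging has been abandoned


for free in (2000, 800, 300, 100, 10):
    print(free, "->", vm_action(free, **THRESHOLDS))
```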
The swapping starts with the processes that are oldest. In fact, very old, inactive processes may have been swapped out much earlier in the process, just as a housekeeping action. Then sleeping processes get swapped out, followed by processes that are at the back of the run queue. At this stage, the system is essentially unusable.
In order to implement this memory management, the VM system adopts several algorithms for identifying pages to page out. One of the common ones is the not recently used algorithm, also known as the two-handed clock (see Figure 7.13).
This algorithm approximates an LRU algorithm but requires significantly less management overhead. The pages of memory are arranged as a circular linked list, where the "last" element points to the first element in order to make a full loop. There are then two "hands" that are applied to searching of this list. The front hand goes through the list and turns off the referenced bit of the PTE. If the page is accessed after this, the bit is turned back on by the reference. If the bit remains off when the back hand inspects the buffer, then it is not recently used and is eligible to be freed.
The handspread is the number of buffers that separate the scans of the two hands. A small gap implies that only very frequently referenced pages will be ineligible for freeing. A large gap allows less frequently accessed pages to remain in memory. Alternatively, the speed at which the hands go around has the same effect, and this is the variable that the kernel changes depending on the amount of free memory.
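A small simulation makes the interplay of the two hands concrete. This is a sketch of the general technique, assuming a fixed-size ring of pages that all start out referenced; the class name, the touch method (standing in for a memory reference), and the choice to advance both hands one page per step are illustrative assumptions.

```python
class TwoHandedClock:
    def __init__(self, npages, handspread):
        if not 0 < handspread < npages:
            raise ValueError("handspread must be between 1 and npages - 1")
        self.referenced = [True] * npages           # one referenced bit per page
        self.front = 0                              # front hand clears bits
        self.back = (0 - handspread) % npages       # back hand trails behind

    def touch(self, page):
        """A reference to the page turns its bit back on."""
        self.referenced[page] = True

    def step(self):
        """Advance both hands one page; return a page to free, or None."""
        n = len(self.referenced)
        self.referenced[self.front] = False         # front hand clears the bit
        victim = None
        if not self.referenced[self.back]:          # still off: not recently used
            victim = self.back
        self.front = (self.front + 1) % n
        self.back = (self.back + 1) % n
        return victim


clock = TwoHandedClock(npages=8, handspread=3)
warmup = [clock.step() for _ in range(3)]   # back hand has not reached cleared pages
clock.touch(1)                              # page 1 is referenced between the hands
after = [clock.step() for _ in range(3)]
```

In this run page 1 survives because it was touched after the front hand passed, while pages 0 and 2 are reported as candidates for freeing.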
When a page is to be freed, the VM system can pull the vnode and offset for the page out of the kernel structure and use this to call the correct object pager in order to carry out the procedure. The routine in the pager is called pageout(). If the page is dirty (written to but not flushed to disk), it must be written out before being freed. It is the responsibility of the relevant pager object to ensure that this occurs.
Likewise, when a page fault results in a trap, the vnode and offset for the required page are passed to the relevant pagein() routine in the pager object. This could happen for any of the following reasons:
There is a separate routine for each of these cases. In the first case, the pager simply needs to allocate a free page and fill it with the contents of the corresponding vnode and offset on disk. In the second case, the pager performs a similar operation using the swap area. The final case requires allocation of a new page, and filling of the new page with zeros by the kernel. This is defined behavior for new pages, and some programs expect such behavior.
This kind of single-page demand paging is good in theory but not very efficient in practice. In practice, just getting the required page does not make the best use of statistical facts in the way that memory is accessed. The statistic in question, of course, is the locality of reference: If a certain page is requested, it is likely that the adjacent pages will also be required in the short term. The cost of a single page fetch from disk is so huge in comparison with the time the CPU can process that page that it makes sense to get a few adjacent pages at the same time. The additional disk overhead is tiny if the next pages are also adjacent on the physical disk, and so greater efficiency can be achieved in this way.
In addition to the standard threshold-based memory management method, some implementations adopt further optimizations. Regardless of the amount of memory left free in the system, it is also desirable to proactively prevent any process from growing beyond a reasonable size. This reduces the VM work required when the thresholds are crossed. While most VM systems keep track of the resident set size (RSS) of each process, not all allow specific action to be carried out on the basis of that number. Others, such as Sequent DYNIX/ptx, compare the current RSS of the process against a tunable parameter called maxrs. If the process grows beyond the defined value for maxrs, the VM system starts to page out older pages of the process's resident set until its size comes back down to the defined limit.
The fork() system call is used to create new processes. The kernel creates an exact copy of the process that calls fork() and gives the copy its own thread of execution. The new process has its own PID and records the calling process as its parent. At this point, both the parent and the child continue to execute the same application code, from exactly the same point: the return of the fork() system call.
The complexity of fork() and the reason it is present in the section on memory management have to do with the way that it is typically used. It may already be evident that the kernel will do anything to avoid work that is not necessary; this is the secret of keeping operating system overhead to a minimum. In the case of fork(), it is actually unusual for the child process to continue executing the same code as the parent. Instead, the child typically calls the exec() system call.
The exec() call invokes another program within the process that calls it. In order to do this, the address space for the process has to be redefined, making the previous set of memory mappings irrelevant for the child process. If the kernel has just gone to all the trouble of copying all the pages that comprise the parent to make this new process, that work would be wasted.
Therefore, when the kernel receives a fork() call, it typically makes an optimized copy of the parent process. This copy involves only copying of the PTEs for the parent process, thereby making the child an exact copy of the parent, including the physical memory that it uses. Of course, if the parent or the child were to write to one of the pages, it could corrupt the other process, and so another procedure is required to support this optimized copy.
This procedure is called copy on write. The kernel makes a copy of the parent's address space and sets up protection bits on the PTEs. If the parent or the child attempts to write to one of these pages, a protection exception is raised, which the kernel processes. Only at this stage does the kernel make a copy of the page and set the master and the copy to read/write. Therefore, if the child writes to only a few pages before issuing an exec() call, only the required amount of memory copying occurs.
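Copy on write can be modeled in user space to show how little copying actually takes place. This is a toy model, not kernel code: a reference count on each frame stands in for the protection bits, and the names Frame and Process, along with their methods, are assumptions for the example.

```python
class Frame:
    """A physical page plus a count of the PTEs that point at it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class Process:
    def __init__(self, frames):
        self.ptes = list(frames)       # the "page table": one frame per page
        self.copies = 0                # pages this process has had to copy

    def fork(self):
        child = Process(self.ptes)     # copy only the PTEs, not the pages
        for frame in self.ptes:
            frame.refs += 1            # shared, so writes must now fault
        return child

    def write(self, vpn, index, value):
        frame = self.ptes[vpn]
        if frame.refs > 1:             # protection fault on a shared page
            frame.refs -= 1
            frame = Frame(frame.data)  # private copy for the writer
            self.ptes[vpn] = frame
            self.copies += 1
        frame.data[index] = value

    def read(self, vpn, index):
        return self.ptes[vpn].data[index]


parent = Process([Frame(b"aa"), Frame(b"bb")])
child = parent.fork()
child.write(0, 0, ord("z"))            # only page 0 is copied
```

After the write, page 1 is still shared between parent and child, and the parent's copy of page 0 has reverted to being an ordinary writable page.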
BSD offers an alternative to copy on write in the form of a vfork() system call. Using this call, the child "borrows" the entire address space of the parent, and the parent blocks until the child issues an exec(). This is specifically designed for use in programs that know that an exec() always follows the fork().
The VM system is a fundamental part of the UNIX kernel. For large Oracle database servers, the VM system can be required to perform a good deal of work, even though the system should be configured to keep all the processes and SGA in real memory at all times.
The reason for this is that there are still a huge number of distinct processes on the system, and Oracle relies heavily on the operating system to manage the address spaces for these processes. Each of these processes can be very large, and certainly a great deal of PTE manipulation is required.
In the case of very large SGAs, the operating system must implement changes above and beyond the provision of the standard 4KB page size. One of these optimizations, variable page sizes, was discussed in this section. Another good optimization for the reduction of PTEs in Oracle-like systems is shared PTEs. Just as the operating system already shares the PTEs for text segments, it is also possible to create just one set of PTEs for the shared memory segment to which all processes refer. This reduces the memory consumption of the page tables enormously.
Although the details of the virtual memory system are not essential for building a high-performance Oracle database system, a good understanding of VM can be helpful in comprehending system operation under load and in correlating system statistics.
One of the fundamental design concepts of the UNIX system was to abstract all I/O operations as linear file operations. This allows the kernel to take the complexity of physical hardware devices away from programmers, and to deal with it internally instead. A good example of this would be a tape drive, which has the concept of records. A linear file has no concept of records but only of a continuous stream of bytes. Therefore, the kernel takes care of the record-based communication with the tape drive, allowing the programmer to read from and write to the tape device using standard read() and write() system calls.
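The practical benefit is that one routine can move data between quite different objects. The sketch below uses Python's os module, whose read() and write() functions are thin wrappers over the system calls of the same name; the copy_bytes helper and the choice of a pipe and a temporary file as the two endpoints are assumptions for the example.

```python
import os
import tempfile


def copy_bytes(src_fd, dst_fd, chunk=4096):
    """Copy a byte stream between descriptors using only read() and write()."""
    total = 0
    while True:
        data = os.read(src_fd, chunk)
        if not data:                       # zero bytes means end of stream
            break
        total += len(data)
        while data:                        # write() may accept only part
            written = os.write(dst_fd, data)
            data = data[written:]
    return total


# Source is a pipe, destination is a regular file; the routine cannot tell.
reader, writer = os.pipe()
os.write(writer, b"hello, unix\n")
os.close(writer)

fd, path = tempfile.mkstemp()
copied = copy_bytes(reader, fd)
os.close(reader)

os.lseek(fd, 0, os.SEEK_SET)
contents = os.read(fd, 100)
os.close(fd)
os.remove(path)
```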
This concept has mostly held true to this day, although the number of ioctl() and other custom call types available has sometimes made it appear not to be so.
The rule, however, is that pretty much everything can be treated as a file. This makes I/O programming on UNIX systems very portable and straightforward as a result of the kernel (specifically the device drivers) taking the complexity away from the programmer.
Something to bear in mind when looking at system statistics is that the number of read() and write() system calls being executed does not correlate with disk I/O. Rather, these calls include logical I/O in and out of the filesystem buffer cache as well as all network and terminal I/O.
The filesystem is a user-friendly interface to the raw disk. It presents a hierarchical tree of directories and files (see Figure 7.14) and allows concurrent access to files and directories by all users with permission to do so.
All UNIX filesystems share this common interface to the user, regardless of the implementation details of the filesystem or of whether it is local or remote.
Although the interface to the filesystem is very different from the SQL interface used by Oracle, the requirement for the filesystem is virtually identical to the Oracle database. Both are tasked with optimizing the reading and writing of data, in an organized fashion, to and from physical disk.
As with Oracle, it would be very inefficient to make all of the I/O requests physical, requiring disk I/O for every request. Therefore, filesystems perform all I/O through a cache layer known as the buffer cache (see Figure 7.15).
Using the layers shown in Figure 7.15, the filesystem presents a familiar interface to the user, the hierarchical view, with two types of available objects: files and directories. The access to the objects is all done by means of the buffer cache, apart from special "trusted" operations. Although the logical perspective of the file is that of a contiguous file, this is not physically true. All mappings to the blocks that make up a file are maintained by the filesystem itself.
The filesystem implementation has changed a good deal over the life of the UNIX operating system. The simplest filesystem is the System V filesystem, known as the s5fs, and we will use this as an introduction to filesystem implementations.
The s5 Filesystem maintains three different areas on disk, as shown in Figure 7.16. The superblock is the equivalent of a segment header in Oracle, in that it stores the freelist for inodes and data blocks and general information about the filesystem. It is always stored as the first block on disk and is of fixed length. After the superblock come the inodes.
Inodes are the entry points for all files. There is one inode for each file in the filesystem. Contained in the inode are pointers to actual data blocks, access information, permissions, and file owner information. The inode is read when ls -l is executed in a directory, in order to read the information above.
The inode is of fixed length so that each inode in the table can be addressed by a number*size formula. Included in this size is sufficient information to store up to ten block addresses for the actual file. For very small files (up to 10KB with a 1,024-byte block size), this is sufficient. For larger files, the inode stores three levels of indirect block addresses. The first level is a pointer to a single data block that contains only further lists of block addresses. This allows for an additional 256 blocks (256KB) to be addressed. The second level of indirect block is a pointer to a block that contains only indirect block pointers. Again, 256 pointers can be stored in a single 1KB block, and so the maximum file size catered to by going to the double indirect block is 64MB (256*256*1024 bytes). Finally, there is a third level of indirection, a pointer to a block of pointers to blocks of pointers. This yields a maximum file size of 16GB (256*256*256*1024 bytes), although a 32-bit address limitation allows for only a 2GB maximum in practice, because we have to be able to lseek() forward and backward the full size of the file. Still, this is not too bad for an old, retired filesystem.
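The arithmetic above can be checked with a short sketch. It assumes the 1,024-byte block size and the 256 addresses per indirect block given in the text (which implies 4-byte block addresses, an assumption the text does not state); the function names are hypothetical and exist only for this illustration.

```python
BLOCK_SIZE = 1024      # bytes per block, as in the text
DIRECT_SLOTS = 10      # direct block addresses held in the inode
PTRS_PER_BLOCK = 256   # addresses per indirect block (assumes 4-byte addresses)


def capacity_bytes(level):
    """Bytes addressable at one level (0 = the direct slots in the inode)."""
    if level == 0:
        return DIRECT_SLOTS * BLOCK_SIZE
    # Each extra level of indirection multiplies the reachable blocks by 256.
    return (PTRS_PER_BLOCK ** level) * BLOCK_SIZE


def level_for_offset(offset):
    """Return the indirection level (0 to 3) needed to reach a byte offset."""
    remaining = offset
    for level in range(4):
        if remaining < capacity_bytes(level):
            return level
        remaining -= capacity_bytes(level)
    raise ValueError("offset beyond the triple-indirect range")


if __name__ == "__main__":
    for level, name in enumerate(["direct", "single", "double", "triple"]):
        print(name, capacity_bytes(level))
```

A byte at offset 10,240 is the first one that needs the single indirect block, which is why files just over 10KB cost one extra block read per lookup.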
The first inode in a filesystem is reserved for the root directory of that filesystem. This is not necessarily the same as the system root directory (/); the filesystem's root directory assumes the name of the directory at which it is mounted. For example, if a filesystem were mounted as /opt/oracle, the first inode would refer to the "oracle" directory for as long as the filesystem was mounted there.
A directory is simply another type of file. It is a file that has a format known to the filesystem, containing a list of names and inode numbers for the files within it. When ls is called without any options, only the directory is read. When a file is referred to by name, the directory is checked first to get the inode number for the file. With the inode number, it is just a case of going to the corresponding offset in the inode table and retrieving the access information for the file.
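The name-to-inode lookup described above can be sketched as follows. This is a toy model, not s5fs code: the directory is represented as a list of (name, inode number) pairs, the 64-byte inode size is an assumed value, and the offset uses the plain number*size formula from the text without any adjustment for how a real implementation numbers its inodes.

```python
INODE_SIZE = 64  # bytes per on-disk inode; an assumed value for illustration


def inode_offset(inode_number, table_start=0):
    """Byte offset of an inode within the inode table (number * size)."""
    return table_start + inode_number * INODE_SIZE


def lookup(directory, name):
    """Linear scan of a directory 'file': a list of (name, inode number) pairs."""
    for entry_name, inode_number in directory:
        if entry_name == name:
            return inode_number
    raise FileNotFoundError(name)


if __name__ == "__main__":
    # Hypothetical directory contents.
    directory = [(".", 2), ("..", 2), ("init.ora", 17)]
    number = lookup(directory, "init.ora")
    print(number, inode_offset(number))
```

The linear scan in lookup() is the reason a directory with very many entries becomes slow to search.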
If a directory gets too large (too many entries), its performance can be degraded. This is the equivalent of doing full table scans in Oracle; the filesystem code must go all the way through the directory "file" in order to satisfy certain requests, such as a vanilla ls. Likewise, ls -l requires a lookup of the actual inode for each entry; this is something to bear in mind when building filesystems.
The s5fs did a reasonable job for its life span but suffered from several drawbacks, including performance. The Berkeley fast filesystem (FFS) improved on some of these drawbacks by optimizing the organization of the filesystem on disk and allowing long filenames (s5 allowed only 14 characters). However, even FFS is not suitable for commercial application, particularly if a database is to be built on the filesystem.
Traditional filesystems do not cope well with system crashes. Any dirty (unwritten) blocks in the filesystem buffer cache at the time of the crash will be lost, and there is no formal approach to recovering the filesystem. Instead, the filesystem must be laboriously scanned with the fsck utility in order to validate its integrity. Often the utility is not able to rationalize what it finds, and intervention is required by an operator, possibly resulting in file loss or corruption. This is bad enough in itself, but it also takes a good deal of time to get to this stage, possibly many minutes. It is essential for a large production system to recover more quickly than this, regardless of whether or not there is a hot standby system.
In order to get around the problem of unwritten dirty blocks, traditional filesystems have to support synchronous writes. This is the only way the database writer, for example, can guarantee that a block is actually going onto disk. This means that the database writer must wait for the completion of this write, potentially taking place at several locations on the physical disk, involving several lengthy seek operations. This results in poor performance for the filesystem.
The traditional filesystems are block-based in their addressing. This is not acceptable for large database systems, because there is an enormous administration and performance overhead from this scheme when very large files are used.
Modern filesystems have been engineered to overcome these problems and to support database systems more effectively. One of the most successful of these is the journaling filesystem, notably the Veritas VxFS. These filesystems operate even more like database servers than do the traditional ones.
Essentially, these filesystems rely on redo logs within the filesystem in order to improve both performance and recovery. When a write occurs, it is written to the buffer cache and also to the redo log. The redo log occupies a sequential portion of disk, and so writes to it are fast. The write to the redo log on disk occurs only at the commit point for the write, which is the end of the write. Until this point, the changes go into a redo log buffer, in much the same way as in Oracle. The redo log buffer scheme also supports group commits of multiple transactions.
An immediate write to the data area of the filesystem is not required, because the redo log is capable of rebuilding the data area in the event of a crash. Because crash recovery is based on the redo log, the filesystem check at startup needs only to roll forward the changes that have occurred since the last checkpoint of the data area. This makes recovery very reliable and fast.
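The interplay of log buffer, commit, checkpoint, and roll-forward can be illustrated with a toy in-memory model. This is only a sketch of the general idea and does not reflect how VxFS or any other product is implemented; the class and method names are hypothetical.

```python
class ToyJournalFS:
    """In-memory model of a journaling filesystem's write path."""

    def __init__(self):
        self.data_area = {}    # stands in for the on-disk data blocks
        self.log_on_disk = []  # stands in for the sequential on-disk redo log
        self.log_buffer = []   # in-memory redo log buffer (lost on a crash)
        self.cache = {}        # buffer cache (lost on a crash)

    def write(self, block, value):
        # A write goes to the buffer cache and the redo log buffer only.
        self.cache[block] = value
        self.log_buffer.append((block, value))

    def commit(self):
        # Group commit: everything buffered is flushed in one sequential write.
        self.log_on_disk.extend(self.log_buffer)
        self.log_buffer.clear()

    def checkpoint(self):
        # Apply logged changes to the data area, after which the log is not needed.
        for block, value in self.log_on_disk:
            self.data_area[block] = value
        self.log_on_disk.clear()

    def crash(self):
        # Everything held only in memory disappears.
        self.cache.clear()
        self.log_buffer.clear()

    def recover(self):
        # Roll forward the committed changes made since the last checkpoint.
        self.checkpoint()
```

In this model, a change that was committed before crash() survives recover(), while an uncommitted one is simply absent; no scan of the whole data area is needed.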
The filesystem is designed for very large files, and so both the filesystem and the files within it can be very large. Instead of using single block references to the data area of the filesystem, the journaling filesystem uses extents (contiguous groups of blocks), in the same way as Oracle does. A new extent is allocated for the file, and the extent is referenced instead of a single block. This allows large files to be supported with less overhead than in the traditional block-based approach and provides considerably greater performance as a result of the contiguous grouping of data blocks.
UNIX allows direct access to raw disk. The interface to the disk is through a UNIX special file, which is opened using standard system calls and accessed using normal read, write, and seek calls. To an application using the raw disk, it just looks like one large file.
· Cannot easily determine how much space is free within a group of raw device files, or how large the files actually are.
These advantages are very important. Essentially, using raw disk removes the operating system from the operation of the Oracle I/O process, with the exception of the device drivers that perform the actual I/O. Any number of processes can read from and write to the partition concurrently, if the partition resides on multiple physical disks (see Section 7.5.5). Also, raw disk is currently the only way of mounting the same Oracle database on multiple UNIX hosts for Oracle Parallel Server.[15]
In order to gain these advantages, however, it is important to work through the list of disadvantages and to produce procedures and techniques that make them operationally acceptable.
When raw disk is used, raw partitions are supplied to Oracle as a datafile. This means that the entire file, and not just a part of it, must be allocated to Oracle at any one time. In reality, this does not present much of a problem. The physical disk can be sliced into several raw partitions of different sizes.
The size and number of these partitions will vary with database requirements, but the concept maps to all databases. It makes sense to keep the range of partition sizes fairly small in order to keep the management simple. For example, use allocation sizes of 4GB, 2GB, 1GB, 500MB, and 100MB. This allows quite large tablespaces to be built from a relatively small number of files, and the small ones to be built from small files.
Often the hardest attribute of raw disk to get used to is the inability to use the standard UNIX file utilities such as cp and mv. In reality, a well-configured database system does not require a great deal of file manipulation once it is laid out. If a disk is found to be "hot," a file can still be physically moved by using the dd command to suck the contents out of one partition and put them into a new partition.
When raw devices are used, it also makes sense to work around the naming of typical raw devices. In fact, the naming of the files is one of the major objections most people have to using them. What does /dev/rdsk/c10t5d3s4 mean to anyone, as far as the size and location of the raw device are concerned? Logical volume managers help a little here by providing more user-definable naming conventions, but there are even better ways around this.
The first thing to do after creating all the devices is to make a directory of symbolic links (ln -s). These links should be placed in a common location away from the /dev directory, somewhere near the Oracle codeset. A good convention is to have a root from which all these things happen, such as /opt/oracle. Under this root you can locate the codeset (say, /opt/oracle/oracle815) and the directory containing the symbolic links to the actual raw devices.
Each symbolic link should have a descriptive name, such as "db00_1000M_10_000," which means database set 00, 1000MB slice, RAID 1+0, slice number 000. With a directory called /opt/oracle/SPARE containing all these files, it is very easy to determine the types and numbers of available datafile slices and their RAID characteristics.
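A naming convention like this is easy to process mechanically. The sketch below parses names of the form shown above and counts spare slices by size and RAID code. The regular expression and function names are hypothetical, and the RAID field is kept as the literal string from the name (for example "10" for RAID 1+0) rather than interpreted.

```python
import re
from collections import Counter

# db<set>_<size in MB>M_<RAID code>_<slice number>, e.g. db00_1000M_10_000
LINK_NAME = re.compile(
    r"^db(?P<dbset>\d+)_(?P<size_mb>\d+)M_(?P<raid>\d+)_(?P<slice>\d+)$"
)


def parse_link_name(name):
    """Split a symbolic link name into its descriptive fields."""
    match = LINK_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised slice name: " + name)
    return {
        "dbset": match["dbset"],
        "size_mb": int(match["size_mb"]),
        "raid": match["raid"],
        "slice": match["slice"],
    }


def spare_inventory(names):
    """Count available slices, keyed by (size in MB, RAID code)."""
    counts = Counter()
    for name in names:
        info = parse_link_name(name)
        counts[(info["size_mb"], info["raid"])] += 1
    return dict(counts)
```

Passing the result of os.listdir() on the SPARE directory to spare_inventory() would give a quick summary of what is still available for allocation.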
At database creation time, make another directory under /opt/oracle to put the used symbolic links in, such as /opt/oracle/PRD1. Creating the database would then go a little like this (assuming that the init.ora has been created and resides in /tmp for now):
Control files are ideally located in small pieces of "leftover" disk allocation. These are pieces of disk that represent the 0.1GB of a 9.1GB drive, for example, or the remainder left after rounding the allocations down. These pieces must be on raw disk if OPS is used. Otherwise, it is possible to put the control files in the UNIX filesystem, although it is best to keep all the datafiles in one form or the other (raw or FS).
Backup does not present a great problem in building very large database systems. The reason for this is that you can no longer adequately manage the backup of the database using cpio or tar anyway. For a large database, it is necessary to use a fully featured backup product in order to get software indexed backups, allowing more control over tape identification and faster recovery times. In addition, the bundled archivers do not support features such as parallel tape streaming, which allows much greater backup throughput. This becomes important as the database size becomes very large.
All of the third-party backup products provide the facility to back up raw disk partitions as easily as filesystems. However, you must ensure that the symbolic link directory also gets backed up on a regular basis, because this is the tie between the location of the files that Oracle expects and the actual location in /dev.
When the procedures laid out above are used, space management does not present any more problems than a filesystem presents. It is very clear how much space is left for allocation to the database from looking in the SPARE directory. The only thing that is not possible with raw devices is the automatic extension of datafiles, using Oracle.
The debate over the use of filesystems versus the use of raw disk has been raging for years. There are good points and bad points on each side, and it is not clear where the balance lies. Despite this, each has several clear advantages that factor into the decision process.
There are two fundamental problems associated with the use of filesystems. First, the filesystem buffer cache causes slow writes. Because writes are all that occur on redo logs during normal operation, this is a serious drawback. Second, the single-writer lock imposed on a per-file basis can limit write performance to datafiles.
On the plus side, the files in a filesystem are very visible and easy to interpret and administrate. Nevertheless, when you are dealing with very large datafiles, it is undesirable to be moving 4GB or 8GB files around, so the significance of this advantage is debatable.
The other advantage of using a filesystem is application-specific: the additional benefit gained from the filesystem buffer cache. In cache theory terms, the filesystem buffer cache can be viewed as the L2 cache, with the Oracle cache being the L1. It can also be viewed from the perspective of double buffering; there is no difference in speed between the two caches, and so there is no benefit for the frequently used blocks and a fairly significant overhead in managing the same blocks in two caches. With 64-bit Oracle, this advantage is difficult to justify any more; all the memory that would have been in the filesystem cache can now be given to Oracle across all three buffer pools.
Raw disk essentially puts Oracle in the driver's seat as far as I/O is concerned. With the operating system responsible only for communication with the hardware, there is no additional CPU overhead incurred from managing filesystem caches. Raw disk is not subject to any kind of operating system locking and can be used in parallel server configurations.
From a logical perspective, there is no reason why a raw disk database system should ever be slower than a file-system-based system. The only condition under which this could occur would be a deficiency in the Oracle caching algorithms. With a file-system-based database, every miss in the Oracle buffer cache results in a read() system call. This bears a finite cost, as does the subsequent searching and cache management in the filesystem buffer cache. If there still is a miss, the physical read must occur anyway, making the true overhead for the miss twice as bad as it would be in a raw disk system (ignoring the constant cost of the disk read).
If the memory used for the filesystem buffer cache were given to the Oracle buffer cache instead, it is likely that there would be a hit in the Oracle cache layer, eliminating the additional system call for the filesystem logical read.
Advanced filesystem implementations provide more hooks to allow Oracle to circumvent the filesystem layer itself and go straight to raw disk. This leaves the filesystem performing only an offline administration function and not participating in the actual operation of the database. This may be the best of both worlds, providing all the advantages of the filesystem (visibility, file extension capability) without the disadvantages.
In Section 2.8, we introduced I/O and RAID levels. In Chapter 2, these levels were mostly hardware concepts focused on the disks and their performance. In addition to hardware RAID, most vendors offer some kind of logical volume manager (LVM).
An LVM is an addition to the kernel that provides RAID functionality from the host. No special hardware is required to do this, only standard disk drives.
There are several reasons why this is a good thing. First, if multiple controllers are used, a stripe can be set up across all the controllers. This maximizes the throughput available from the controllers. If this approach is taken, however, care must be taken that controller redundancy is preserved, because if one of the many controllers that make up the stripe were to fail, the entire stripe array would be unavailable. EMC disk arrays, for example, do not provide a striping function. Therefore, some kind of software stripe is required in order to perform striping on this device. Although the array provides disk redundancy internally, controller redundancy is still required.
Second, using a software volume manager allows disk devices to be given more user-friendly names than standard disk slicing offers. This will become apparent in this rapid introduction to the operation of volume managers.
There are several fairly basic concepts used in all software volume managers. The first of these concepts is the volume group (see Figure 7.17). A volume group is the largest grouping used within the volume manager and is composed of several disk objects.
When the disks (or disk slices) have been added to the volume group, this volume group has a capacity equal to the combined capacities of the disk devices within it.
The next stage is to create logical volumes from the volume group. These logical volumes can be called anything at all, therefore allowing meaningful naming conventions to be implemented. From the volume group in Figure 7.17, we could create a single volume four times larger than a physical disk. This would create a very large logical volume, but it would be practically useless because it is likely that the I/O would be concentrated within a small portion of the volume's total size.
In order to create a more useful volume, we would instruct the volume manager to create it (or, more likely, a smaller one) but to use a stripe (or chunk) width somewhat smaller than that of a physical disk. If we use a 128KB stripe width, the volume manager will distribute the volume across the disks 128KB at a time, in round-robin fashion. This means that a read of 512KB in the example above would include all drives in the volume group.
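The round-robin mapping can be made concrete with a small sketch. It assumes a four-disk volume group (as implied by the volume "four times larger than a physical disk") and a 128KB stripe width; the function names are hypothetical, and a real volume manager would also account for volume start offsets and metadata.

```python
STRIPE_WIDTH = 128 * 1024  # bytes per stripe chunk, as in the text
NDISKS = 4                 # assumed number of disks in the volume group


def locate(offset, stripe_width=STRIPE_WIDTH, ndisks=NDISKS):
    """Map a logical volume offset to (disk index, offset on that disk)."""
    chunk = offset // stripe_width              # which chunk of the volume
    disk = chunk % ndisks                       # chunks are dealt out round-robin
    disk_offset = (chunk // ndisks) * stripe_width + offset % stripe_width
    return disk, disk_offset


def disks_touched(offset, length, stripe_width=STRIPE_WIDTH, ndisks=NDISKS):
    """Return the sorted list of disks involved in one contiguous request."""
    first = offset // stripe_width
    last = (offset + length - 1) // stripe_width
    return sorted({chunk % ndisks for chunk in range(first, last + 1)})


if __name__ == "__main__":
    print(disks_touched(0, 512 * 1024))  # a 512KB read spans every disk
    print(disks_touched(0, 64 * 1024))   # a small read stays on one disk
```

Small requests stay on a single disk, while requests of one full stripe (width multiplied by the number of disks) or more engage every disk at once.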
In Figure 7.18, six logical volumes have been created from the "bucket" of space in the volume group. All of these volumes are striped across all the disks within the volume group.
At this point, the operation of software RAID is the same as that of hardware RAID but is calculated and initiated from the host. This has two implications on the host processing. First, the administration of the RAID subsystem is now a task that the operating system must be concerned with, increasing the cost of any given I/O operation initiated on those devices. While this is a fairly small increase, it is something that should not be overlooked.
The second implication is, for the most part, valid only when LVM-based mirroring is used. When mirroring from the host, the host must make write requests to two controllers for every write request. This increases the bus utilization on write-intensive applications. In addition, when the system crashes, the host comes up trusting only one side of the mirrored pair. The other side of the mirror is marked STALE and is in need of resilvering. Although this is a process that Oracle is now involved in,[16] the traditional approach to it was to perform a full disk-to-disk copy from the active side of the mirror to the stale side of the mirror. While this copy is in progress, all new writes must also be written to both sides, whereas all reads come only from the active side. This can have an enormous impact on the I/O throughput of the system, and an even bigger impact on the available CPU capacity on the system, because this operation is very CPU-intensive.
As UNIX systems became more powerful and complex, it became apparent that there was a need for some way to allow processes to communicate in order to coordinate processing of common workloads. All types of such communication are known as interprocess communication (IPC). The first and most basic form of IPC comes in the form of the pipe.
A pipe is initiated by a process prior to forking a child process. The pipe() system call opens two file descriptors. One of these descriptors is open for reading, and the other is open for writing. In this preliminary state, anything written to the write descriptor will be available for reading from the other file descriptor. This is not very useful until a fork() call is made.
After the fork() call has been made, both processes retain all the open files that the parent had before the call. The parent closes the read file descriptor, and the child closes the write descriptor, and now there is a useful way to communicate between the processes. At this stage the communication is unidirectional only; if two pipe() calls had been made, bidirectional communication could be set up.
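The sequence can be sketched with Python's os module, which wraps the same pipe() and fork() system calls. This example sets up two pipes so that the exchange is bidirectional: the parent sends a message down one pipe, and the child replies on the other with the text in upper case. The function names are hypothetical, and the sketch assumes messages small enough to fit in the kernel's pipe buffer.

```python
import os


def _read_all(fd):
    """Read from a descriptor until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def echo_upper_via_child(message):
    """Send bytes to a forked child over one pipe; read its reply over another."""
    down_r, down_w = os.pipe()  # parent -> child
    up_r, up_w = os.pipe()      # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: close the ends it does not use, then read, transform, reply.
        try:
            os.close(down_w)
            os.close(up_r)
            os.write(up_w, _read_all(down_r).upper())
        finally:
            os._exit(0)
    # Parent: close the ends that belong to the child.
    os.close(down_r)
    os.close(up_w)
    os.write(down_w, message)
    os.close(down_w)            # the child sees end-of-file on its read end
    reply = _read_all(up_r)
    os.close(up_r)
    os.waitpid(pid, 0)
    return reply
```

Closing the unused ends matters: the child only sees end-of-file once every copy of the write descriptor has been closed.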
When the shell runs a pipeline in which the output of who is piped into grep, it first forks once in order to continue processing with processes other than itself. This parent shell process then simply blocks (suspends) with a wait() call, waiting for the child processes to terminate execution.
The first child then initiates the pipe() sequence as above, but then takes it one step further. After the fork() of the second process, the shell (it is still just two clones of the shell; no exec() has been called yet) uses the dup() system call to reallocate the file descriptors. For the process destined to become who, the stdout descriptor is closed, and the writeable end of the pipe is associated with the usual stdout descriptor using dup(). This means that this process will now send all normal output to the pipe instead of to the screen.
The process destined to become grep does the opposite: it closes the stdin file descriptor and uses dup() to make the read end of the pipe become the standard input for the process. At this stage, the two processes issue the exec() system call to become the requested processes, and the output of who is passed to grep.
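A simplified version of this plumbing is sketched below. It departs from the description in three labeled ways: the calling process forks both children itself rather than through an intermediate child, it uses dup2() instead of close() followed by dup() (Python's os.dup() returns a descriptor that is not inherited across exec), and it adds a second pipe, which a real shell would not have, purely so the caller can capture the consumer's output. The function names are hypothetical.

```python
import os


def _read_all(fd):
    """Read from a descriptor until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def _become(argv, stdin_fd, stdout_fd, all_fds):
    """In a forked child: rewire stdin/stdout, close spare ends, then exec."""
    try:
        if stdin_fd is not None:
            os.dup2(stdin_fd, 0)    # the pipe's read end becomes stdin
        if stdout_fd is not None:
            os.dup2(stdout_fd, 1)   # the pipe's write end becomes stdout
        for fd in all_fds:
            if fd > 2:
                os.close(fd)
        os.execvp(argv[0], argv)    # only returns (by raising) on failure
    finally:
        os._exit(127)


def run_pipeline(producer_argv, consumer_argv):
    """Run 'producer | consumer' and return the consumer's output as bytes."""
    pipe_r, pipe_w = os.pipe()  # links producer to consumer
    out_r, out_w = os.pipe()    # capture pipe, added for this sketch only
    fds = (pipe_r, pipe_w, out_r, out_w)

    producer = os.fork()
    if producer == 0:
        _become(producer_argv, None, pipe_w, fds)
    consumer = os.fork()
    if consumer == 0:
        _become(consumer_argv, pipe_r, out_w, fds)

    # Parent: keep only the read end of the capture pipe.
    for fd in (pipe_r, pipe_w, out_w):
        os.close(fd)
    output = _read_all(out_r)
    os.close(out_r)
    os.waitpid(producer, 0)
    os.waitpid(consumer, 0)
    return output
```

On a system where both commands exist, run_pipeline(["who"], ["grep", "oracle"]) would return the lines of who output that mention oracle; the pattern here is only an example.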
This is the mechanism used by the Oracle PIPE driver for client/server communication within a common host.
Shared memory is a simple concept: Create a piece of memory that many processes can map into their address spaces. Changes made by one user are immediately visible to the other processes, without the need for any kind of system call. Owing to the lack of required system calls once the shared memory segment has been created, this is a very fast form of IPC.
The Oracle SGA is created within a shared memory segment for providing IPC among all the processes that are connected to the Oracle database. The system call to create a shared memory segment is shmget(), which creates the segment for a given key and returns an identifier for it. Any number of processes can now attach the segment to their address space using the shmat() system call.
A shared memory segment, once created, is an entity in its own right. Even if no processes are attached to it, it still remains in memory. It could be created one day and first attached a year later. The only thing that will clear a shared memory segment, apart from a reboot, is an explicit call to shmctl() with the remove flag set. This can be done from the command line using the ipcrm -m <id> command.
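The visibility of updates across processes can be demonstrated without System V calls, which the Python standard library does not expose. The sketch below substitutes an anonymous shared mapping created with mmap before a fork(); this is a stand-in for shmget() and shmat(), not the same mechanism, and unlike a System V segment it disappears when the processes exit. The function name is hypothetical.

```python
import mmap
import os
import struct


def shared_counter_demo(increments=5):
    """Let a child update a counter in shared memory; return what the parent sees."""
    # On UNIX, an anonymous mapping (fileno -1) is MAP_SHARED by default,
    # so it remains shared between parent and child after fork().
    region = mmap.mmap(-1, mmap.PAGESIZE)
    region[:8] = struct.pack("q", 0)

    pid = os.fork()
    if pid == 0:
        try:
            for _ in range(increments):
                (value,) = struct.unpack("q", region[:8])
                # A plain memory store; no system call is needed to publish it.
                region[:8] = struct.pack("q", value + 1)
        finally:
            os._exit(0)

    os.waitpid(pid, 0)
    (value,) = struct.unpack("q", region[:8])
    region.close()
    return value
```

Only one process writes here, so no locking is needed; with concurrent writers, the read-modify-write in the loop would need the kind of synchronization discussed next.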
As mentioned previously, it is preferable for the operating system to provide some way of sharing page tables for shared memory regions. Otherwise, a large Oracle SGA can use a very large amount of memory in page table allocations.
Although shared memory provides no formal locking techniques, synchronization of updates can be achieved either by using the atomic operations supported in the hardware or by using semaphores. Semaphores are too slow to be used as latch mechanisms because of the required system calls, and so Oracle uses atomic memory updates with test and set to manage the synchronization of the SGA.
Semaphores are used for coordinating processes. They are kernel-controlled counters that support only two operations: increment and decrement. Rather confusing, for English speakers at least, is the fact that these operations are called V ops and P ops, respectively. They are referred to in this way because they were named by a Dutchman called Dijkstra, who is credited with the invention of the semaphore.
A semaphore can have any positive value, or a value of zero. Negative values are not permitted. The idea is that waiting processes attempt to decrement the semaphore, and processes that have finished running increment the semaphore. If decrementing the semaphore by the requested amount would result in a negative value, the process is blocked in the kernel until the semaphore has been incremented (by other processes) such that the operation would not result in a negative value.
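The blocking rule can be modeled in user space. The sketch below is not the kernel mechanism (that is reached through the semop() system call); it is a thread-based model of the same semantics, with hypothetical class and method names, in which p() blocks while the decrement would make the value negative.

```python
import threading


class CountingSemaphore:
    """User-space model of a counting semaphore with P (decrement) and V (increment)."""

    def __init__(self, value=0):
        if value < 0:
            raise ValueError("semaphore value cannot be negative")
        self._value = value
        self._cond = threading.Condition()

    def p(self, amount=1):
        """Decrement by amount, blocking while the result would be negative."""
        with self._cond:
            while self._value - amount < 0:
                self._cond.wait()
            self._value -= amount

    def v(self, amount=1):
        """Increment by amount and wake any waiters so they can recheck."""
        with self._cond:
            self._value += amount
            self._cond.notify_all()

    @property
    def value(self):
        with self._cond:
            return self._value
```

A waiter that calls p(2) on a semaphore whose value is 1 stays blocked until another thread's v() raises the value to at least 2.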
Oracle uses semaphores (in non-OPS configurations) to implement slow waits such as latch sleeps (not latch gets) and enqueues. OPS configurations have to hand this task off to the DLM in order to manage enqueues between instances. In most configurations, Oracle uses semaphores to provide a communication mechanism between Oracle processes. For example, when a commit is issued, the message is sent to the log writer by way of a semaphore on which the log writer sleeps. Whenever required to perform work, the log writer is posted by incrementing the semaphore on which the log writer waits (seen in V$SESSION_WAIT as "rdbms ipc message"). The alternative mechanism for waking up processes to perform work is the post/wait driver. This is typically a UNIX kernel addition that allows more specific communication between the processes, through a direct posting mechanism.
Message queues are essentially multireader, multiwriter pipes that are implemented in the kernel. Processes can place messages on a message queue, and other processes can read these messages off the queue in a first-in, first-out order.
The reading of messages is subject to the messages meeting the criterion specified in the msgrcv() call. This criterion is a message type, specified as an integer. If no message meeting the criterion is found, the receiver will block (suspend) until such a message is sent to the message queue.
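The selection rule can be shown with a toy in-process model. It covers only two cases of the type argument: zero, which takes the first message of any type, and a positive type, which takes the first message of exactly that type. Where the kernel would block the receiver, this sketch returns None instead. The class and method names are hypothetical.

```python
from collections import deque


class ToyMessageQueue:
    """In-process model of a typed, first-in first-out message queue."""

    def __init__(self):
        self._messages = deque()

    def send(self, mtype, body):
        """Append a message; types must be positive integers."""
        if mtype <= 0:
            raise ValueError("message type must be positive")
        self._messages.append((mtype, body))

    def receive(self, mtype=0):
        """Remove and return the oldest message matching mtype (0 = any type)."""
        for index, (entry_type, body) in enumerate(self._messages):
            if mtype == 0 or entry_type == mtype:
                del self._messages[index]
                return entry_type, body
        return None  # a real receiver would block here until a match arrives
```

Messages of other types are left in place, so readers interested in different types can share one queue without consuming each other's messages.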
Message queues are not a very efficient means of communication, because a write and a subsequent read require two system calls and two copies to and from kernel memory. If the messages are large, this copying becomes prohibitively expensive in processing terms.
Message queues are sometimes useful in implementing instructions to running processes, where the message is small. Oracle does not utilize message queues.
It is worthwhile to become familiar with several of the system calls available on a modern UNIX system, particularly the ones that are frequently used by Oracle. These calls can be viewed on a running process by using the O/S system call trace utility, or on a global basis using system monitoring tools.
The system calls used by a process can be analyzed in real time using a vendor-supplied utility. This utility is frequently called truss on machines based on System V, but other names are used on other platforms. The output looks like this:
This output is from the command "ls smpool.c" on a Sequent platform. The format is common among all utilities of this type: system call, parameters, return code.
At the head of the trace is the exec() call to execute the ls command, followed by an mmap() of /dev/zero to create a zero-filled anonymous 4,096-byte page of memory in the address space. Moving on a few lines, the first open() represents the first stage of loading of a shared library (libseq.so), continuing through to the mprotect(). The brk() calls show the process allocating memory on the heap, probably using one of the function calls from the malloc() family. The actual requested task is carried out using the lstat64() system call, to get information about the file. Interestingly, if no file is passed to ls, it opens the directory "." and uses getdents() to retrieve the directory entries, because no clues are given. Issuing "ls *" performs an lstat64() call for every file, because all the names are supplied implicitly by the shell wildcard expansion.
After the lstat64() returns, the process writes the output to the terminal using the write() system call, and then exits.
These utilities can be invaluable in determining strange behavior in running processes and can identify a runaway process very quickly.
It is not necessary to be familiar with all of the available system calls; that's what the man pages are for. After all, HP-UX has 324 system calls, and Sequent has 202, so the list is fairly large.
Table 7.2 identifies some of the common system calls used by Oracle. Where not specified, the return code is zero for success or -1 for failure.
This chapter may have been very tough going. It has been very deep on occasion, and you may have found yourself wondering what this has to do with Oracle. The simple answer is that Oracle relies heavily on nearly all aspects of the UNIX operating system, and understanding how these functions work is an important part of total system comprehension.
When problems occur on the system, such as memory exhaustion, it is sometimes not immediately apparent what is causing the problem. An understanding of how the operating system handles its resources can dramatically speed up determination of the root cause (pun intended).
McKusick, M. K., K. Bostic, and M. Karels. The Design and Implementation of the 4.4BSD Operating System. Reading, MA: Addison-Wesley, 1996.
Goodheart, B., and J. Cox. The Magic Garden Explained: The Internals of UNIX System V Release 4: An Open Systems Design. Upper Saddle River, NJ: Prentice Hall, 1994.
[1] Some implementations log these CPU cycles on the process that was executing on the processor before the interrupt was received.
[2] Unless the interrupt is directly related to the process, for example, a context switch or the return from an outstanding I/O.
[4] Some modern implementations have migrated much of the u-area to the proc structure in order to increase flexibility.
[5] The system "load average" is a set of numbers, calculated over 1-minute, 5-minute, and 15-minute intervals, using several types of scheduling statistics such as run queue length, priority waitlist length, and so on. The actual algorithm varies among implementations.
[6] Most implementations reserve address zero in order to catch program errors. When a program dereferences a null pointer, a segmentation violation error occurs because this page is not part of the process address space.
[8] With the exception of a hardware register populated by the operating system to reflect the current context/process.
[15] There now exists a clustered filesystem, based on the journaling filesystem, that works a little like a parallel server. The filesystem can be mounted on more than one node concurrently, with the coherency managed through the system cluster software (lock manager). This filesystem will support the creation and use of OPS database files.
[16] Recently, Oracle has gotten involved in this process. The logic is simple: If we just had a crash, why don't we assume that both sides of the mirror are out of date and reapply the redo information since the last checkpoint?