The most fundamental level of UNIX is the kernel. User programs execute under the protection of the kernel and use its services. The kernel provides a standard interface to the system hardware and provides standard services over and above that to ease the process of developing and executing software on the hardware platform.

I must apologize for Figure 7.1. This kind of representation never helped me when I went through the familiarization process myself. That having been said, it does provide a concise view of the structure of a UNIX system, and hopefully a little verbiage will make it clear.
This diagram is supposed to represent the various layers in a UNIX system. Going from bottom to top, the software engineering effort becomes less complex and less shared. For example, every program in the system uses the virtual memory subsystem, although this is a very complex part of the system. It is a sharable part of the system, but each of its users is protected from the inherent complexity by the layers below it. This is the overriding mission of the kernel: to protect users from each other, and from the complexity of the system.
Starting at the lowest level, we have the platform itself, or the hardware domain. Included in this domain are all the physical attributes that need to be taken care of in order to execute software on the system. Some of these things are partially taken care of by the hardware but still require explicit action from the software on the system to instruct the hardware how it should, for example, ensure that the CPU cache reflects valid data for the next process to use.
Moving up the stack, we get to the core kernel services. This layer ensures that all layers above it are taken care of, in addition to providing standard interfaces to the hardware domain. For example, one of the overriding concepts of the UNIX model is that everything is a file. The core kernel services take care of this interface, providing the upper layers with a way of viewing (nearly) all hardware objects as a linear file.
This layer also provides the process abstraction. This is essentially how executing programs are handled by the system, providing each process with the illusion that it is operating independently on private hardware. It also takes care of isolation between these processes, ensuring that one process cannot corrupt the execution environment of another in any way.
In addition, the kernel also provides value-added services that can be used by the user programs. These are software modules that are not essential for the operation of the system but provide a more usable interface for the user. A prime example of this is the availability of many filesystems. A filesystem does not contribute directly to the running of the system but is more than just a simple service. It must have access to kernel memory for the sharing of the filesystem buffer cache and must have fast access to the hardware itself.
Everything else on the stack is user code. That is, it is not directly associated with the kernel. The C libraries are reusable software modules that are used to aid the rapid construction of user software. Note that Oracle is associated with the user processes at the very top of the model.
This is probably the highest-level view of what the kernel does; now it is time to take a look at how this is achieved from an implementation perspective.
The UNIX operating system operates in two distinct modes: kernel and user. The kernel does not do anything mysterious; it is just software like the rest of the system:
unix: ELF 32-bit LSB executable 80386 Version 1
/bin/grep: ELF 32-bit LSB executable 80386 Version 1
vmunix: ELF-64 executable object file - PA-RISC 2.0 (LP64)
iotest: ELF-64 executable object file - PA-RISC 2.0 (LP64)
It is executed on the same processors as the user software. However, the kernel has special duties and normally operates with a privileged status from a hardware perspective, able to access and modify addresses and registers that user mode cannot. This mode of execution is called kernel mode.
There are two distinct situations in which kernel mode is entered, excluding the boot process:
· A user process executes a system call or raises an exception (explicit kernel processing).
· System housekeeping occurs, such as time-slicing and servicing interrupts (implicit kernel processing).
7.1.3 Explicit Kernel Processing (Process Context Kernel Mode)
Starting with the user process, let's have a look at UNIX from a user perspective in order to see where the kernel fits in.
A user connects to the UNIX system by some mechanism, such as a login process for a shell user or an SQL*Net connection for a database user. Whichever way the user connects, the net result is a process on the system. This process is the executing context of a user program, and it is supported by the kernel with an operating environment. The environment for a process consists of
· A private memory map, including private stack and program counter
· A share of available processor capacity, up to 100 percent of one CPU if required and available
· The system call interface
The private memory map is a set of memory pages over which the user has exclusive rights. This memory map is actually a virtual memory map, which will be discussed in Section 7.3.2. For now, it is enough to think of this memory as a contiguous block of memory that is private to the user process. The block of memory is composed as shown in Figure 7.2.
The lower portion of the memory map is accessible to the process, and the upper half of the map is a shared mapping of the kernel memory. All user processes have the kernel memory in their address spaces but cannot access it. In 32-bit UNIX implementations, it is common for the lower 2GB of the 32-bit address range to be used for user memory and the upper 2GB for kernel memory.
The "program memory" in Figure 7.2 refers to all memory used to execute the user program, excluding the stack. This memory is further divided into as many as five types:
1. Text (the program itself)
2. Data (initialized variables)
3. BSS (uninitialized variables), including heap memory allocated by the malloc() family
4. Shared libraries (can include all items above for each library)
5. Shared locations (shared memory, memory mapped files)
The program stack is typically started at the top of the user address space, growing downward, while the program memory starts at the base and grows upward.
If this process is executing a program that operates only within that private memory map (that is, it never requests more memory for program memory or stack, never causes an exception, and never needs to do any I/O), this process will never explicitly switch into kernel mode.
Considering that any kind of output, including screen output, is I/O, this hypothetical program would not be very useful to anybody. In order to perform I/O, however, the process needs to access the hardware of the system, which it cannot do in user mode. In this case, the process will execute a system call, which invokes a few actions on the system and results in the process running in kernel mode. This mode of execution is referred to as process context kernel mode.
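As a minimal sketch of this transition, the Python fragment below uses `os.pipe`, `os.write`, and `os.read`, which are thin wrappers around the corresponding system calls, to move a few bytes through a kernel-managed pipe. The pipe and the helper name `send_through_kernel` are illustrative assumptions rather than anything from the discussion above; the point is only that the data moves because the process asks the kernel to move it.

```python
import os


def send_through_kernel(payload):
    """Push bytes through a kernel-managed pipe using raw system calls."""
    read_fd, write_fd = os.pipe()              # pipe(): the kernel creates the buffer
    try:
        written = os.write(write_fd, payload)  # write(): switch into kernel mode
        return os.read(read_fd, written)       # read(): switch into kernel mode again
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(send_through_kernel(b"hello from user mode"))
```

Each of the three calls enters the kernel on behalf of the calling process, so the work is done in process context.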
In modern systems, kernel mode is a hardware state invoked by a hardware trap on the processor. This puts the processor into kernel mode and allows the executing process to perform the required privileged operations. In order to continue, the kernel must first save the processing context of the user process: that is, the program counter, general-purpose registers, memory management information, and stack pointer. It's worth noting that a switch to kernel mode usually does not need to perform any processor cache management, because the memory map does not change. Kernel mode and user mode both exist within the same process context, so no cache lines need to be invalidated on a switch into kernel mode. In fact, it is likely that the kernel processing will require some of the user pages, particularly the u-area (see Section 7.2.1), and any buffers required to process system calls. Once the transition to kernel mode is complete, the kernel can start to process the system call request.
While a process is in kernel mode, it has access to both kernel memory and user memory. However, no user code is executed while the process is in kernel mode. Instead, the user request is validated and, if permitted, performed on behalf of the user mode side (user context) of the process. Kernel execution occurs in the same process context as the user program, so all processor cycles consumed during this process are logged against the process under SYS mode (as viewed with sar -u).
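The same user/system split can be observed for a single process. The sketch below is an assumption-laden illustration, not a substitute for sar: it uses Python's `os.times()`, which reports the user and system CPU seconds charged to the calling process, and the helper names `cpu_split` and `syscall_heavy` are made up for the example.

```python
import os


def cpu_split(work):
    """Return (user_seconds, system_seconds) charged to this process by work()."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def syscall_heavy():
    # Each os.stat() call is a system call, so most of the cost of this
    # loop is expected to be accounted as system (kernel mode) time.
    for _ in range(20000):
        os.stat(".")


if __name__ == "__main__":
    user, system = cpu_split(syscall_heavy)
    print("user=%.3fs system=%.3fs" % (user, system))
```

The exact figures depend on the platform and the clock resolution, so no particular output should be expected.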
It was mentioned earlier that a process could cause an exception. An exception is caused by a process attempting to do something that is either impossible or prohibited. Examples of this include segmentation violations (a process attempting to access memory that has not been allocated to it) and divide-by-zero exceptions. In this case, the kernel processes the relevant exception handler for that event, such as initiating a core dump, within the context of the process.
7.1.4 Implicit Kernel Processing (System Context Kernel Mode)
The second way that kernel code executes is when certain events occur in the system. These events basically consist of software and hardware interrupts and occur for various reasons. By processing these interrupts with interrupt handlers, the kernel ensures that
· All runnable processes receive a "fair share" of the CPU.
· All hardware interfacing is performed.
When a processor receives an interrupt, it stops executing the current process and executes the interrupt handler for that interrupt. This happens regardless of whether the process was operating in user mode or in kernel mode, although the kernel has the ability to block interrupts selectively.
Interrupts are asynchronous with respect to the currently executing process and are caused by events that are not directly initiated by the process itself. Therefore, the handler must execute outside the context of the current process and neither logs its execution cycles against that process nor accesses its address space. This type of execution is referred to as system context kernel mode.
The kinds of things that generate interrupts are I/O devices returning a completion status (including network traffic) and the hardware clock.
There are situations in the execution of the kernel code in which an interrupt handler cannot be serviced. One prime example of this would be if the kernel were already executing kernel code for a prior interrupt of that type. In this case, the kernel could corrupt its own address space by having incomplete memory updates from the first interrupt when the second is received. This situation is prevented through the use of Interrupt Priority Levels (IPLs).
Using IPLs, the kernel can instruct the processor to ignore interrupts that it receives that have a lower priority than the specified level. Therefore, the kernel can set the interrupt level at a suitable level prior to processing a critical (i.e., protected) section of code. Any interrupts that have a lower priority than the one set are ignored by the processor, and the critical section can complete safely.
Interrupt levels are hardware-dependent entities and thus vary among processor architectures. Typically, there are several hardware interrupts, and a smaller number of software interrupts that can be programmed. The highest interrupt level is always reserved for machine exceptions: If there is a fatal problem in the hardware or the operating system, all processing must cease immediately to protect against widespread data corruption resulting from unknown states in the system. When a fault of this nature is detected, the highest-level interrupt is generated in order to generate a "panic" of the system.
The next level down is usually reserved for the hardware clock, followed by various peripheral interrupts (disk, network, etc.), and software interrupts have the lowest priority level. Below all of this comes level zero, which is the level at which user mode processing occurs. This allows anything to interrupt user mode processing.
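The masking behavior can be sketched with a toy model. Everything here is hypothetical: the level names and numbers in `IPL`, the `Processor` class, and the choice to mask interrupts at or below the current level (an assumption made so that a second interrupt of the same type is held off, as in the example given earlier). Real hardware differs by architecture, and many systems defer a masked interrupt rather than discard it.

```python
from enum import IntEnum


class IPL(IntEnum):
    USER = 0         # level zero: user mode, anything may interrupt
    SOFTWARE = 1     # software interrupts, lowest interrupt priority
    PERIPHERAL = 2   # disk, network, and so on
    CLOCK = 3        # hardware clock
    MACHINE = 4      # machine exceptions, highest priority


class Processor:
    def __init__(self):
        self.level = IPL.USER
        self.serviced = []

    def raise_level(self, level):
        """Enter a critical section; return the previous level for later restore."""
        previous = self.level
        self.level = level
        return previous

    def restore_level(self, previous):
        self.level = previous

    def interrupt(self, source, level):
        """Service the interrupt only if it outranks the current level."""
        if level <= self.level:
            return False            # masked while the critical section runs
        self.serviced.append(source)
        return True


if __name__ == "__main__":
    cpu = Processor()
    saved = cpu.raise_level(IPL.CLOCK)
    print(cpu.interrupt("disk", IPL.PERIPHERAL))   # masked
    print(cpu.interrupt("panic", IPL.MACHINE))     # still gets through
    cpu.restore_level(saved)
```

In this model a machine exception always gets through as long as the kernel never raises its own level to `MACHINE`.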
The hardware clock is a critical component of a UNIX system. It not only determines the rate at which the processor is driven (this is why it is known as clock speed) but also generates interrupts that the kernel uses to implement timeslicing.
When the system boots, the kernel programs the hardware clock to interrupt the processor at defined intervals. Each time this interval arrives, a high-priority interrupt is received that is processed by the kernel's clock interrupt handler. This handler runs in system context, because it has nothing to do with any user process.
The interval between clock interrupts is one tick. The clock is frequently set to interrupt at 100 Hz, giving a tick of 10 milliseconds. Every 10 milliseconds, the interrupt handler is fired and must perform several critical functions, such as incrementing the system time, updating the system and user time statistics for running processes, and posting alarm() signals to processes that have requested them. The clock handler also ensures that the illusion of concurrent execution is maintained on the system by providing the mechanism for implementing time sharing.
Every process in the system is assigned a quantum, which is the maximum amount of time the process can execute on the processor before other processes must be considered for execution. This period varies between platforms and between scheduler classes, but is typically 100 milliseconds, or ten ticks of the hardware clock.
When the quantum is used up, the kernel checks the run queue, and determines whether the process must be suspended from execution and put back onto the run queue. If there are no processes on the run queue of sufficient priority, the process is allowed to execute for another quantum. If the process is taken off the processor, it remains dormant until it is scheduled back onto the processor when it reaches the head of the queue. This period is typically very short and is weighted by the priority of processes on the run queue.
In this way, the kernel can provide the illusion to all the processes in the system that they are concurrently running on the processor. This, combined with the private memory map, provides a private environment for each process. More detail is provided on process scheduling in the next section.
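The quantum mechanism described above can be sketched as a small simulation. This is a deliberately simplified model under stated assumptions: all processes have equal priority, nothing ever blocks, and the names `QUANTUM_TICKS` and `run_time_sharing` are invented for the example.

```python
from collections import deque

TICK_MS = 10          # one clock tick, as in the text
QUANTUM_TICKS = 10    # a 100 ms quantum, as in the text


def run_time_sharing(cpu_needs_ticks):
    """Simulate equal-priority time sharing on one processor.

    cpu_needs_ticks maps a process name to the ticks of CPU it still needs.
    Returns the (name, ticks_run) slices in the order they executed.
    """
    run_queue = deque(cpu_needs_ticks.items())
    slices = []
    while run_queue:
        name, remaining = run_queue.popleft()
        ran = min(QUANTUM_TICKS, remaining)   # run until the quantum expires
        slices.append((name, ran))
        if remaining > ran:
            # Quantum used up with work left: back onto the run queue.
            run_queue.append((name, remaining - ran))
    return slices


if __name__ == "__main__":
    for name, ticks in run_time_sharing({"a": 25, "b": 10}):
        print(name, ticks * TICK_MS, "ms")
```

When only one process remains, it is simply picked again, which corresponds to being allowed to run for another quantum.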
In the preceding section, a level of detail was reached with regard to what a process is from an execution standpoint. From a high-level perspective, a process is simply the kernel's abstraction of all the attributes of a running program, in order to control its execution.
All user processes in the system have an entry in the process table, with the following attributes:
· A private, virtual memory map
· A runtime context (program counter, registers, stack pointer, etc.)
7.2.1 The Process "Table"
The process table is the master reference for the process. Although the detail of the process table is not directly useful in implementing large Oracle systems, an understanding of the concepts is useful and helps when developing monitoring hooks into the system.
In physical terms, the process table exists as two structures for each process: a proc structure and a u structure (or u-area). Historically, the proc structure was a table of fixed size that was set at kernel link time and was not changeable at runtime. This is the reason for the quotation marks around the word "table" in the heading: it really isn't a "table" any more.
The proc structure contains all the information that the kernel could require when the process is not running, and so exists in kernel memory that is always mapped. By contrast, the u-area contains information that usually is needed only while the process is actually running; the u-area is stored in the private memory map of the process.
The reason for using two structures to maintain the process control information is that some of the information does not need to be accessible by the kernel when the process is not running. It does not make sense to tie up global kernel memory for information that is typically not required when the process is not running.
Depending on the specific implementation, the u-area can contain the following information for each process:
· Process control block (execution context of the process, including general-purpose registers, stack pointer, MMU registers, etc.)
· Per-process kernel stack
· Pointer to corresponding proc structure
Though the list above covers the major items, many other pieces of information can be stored in the u-area.
The proc structure focuses on the more globally accessed attributes of a process:
· Queue list pointers for run queue
· Indirect pointer to u-area
All of the items listed above, with the exception of the PID, can be written to by other processes in the system or the kernel, and therefore need to be globally accessible. The list pointers will be covered in more detail in Section 7.2.2.
When the process information is divided among such structures, it can complicate the retrieval of this information. UNIX commands, such as ps and fuser, use information from both of these structures and therefore require access to all of this information, even when the process in question is not active and could even be swapped out to disk. When the process is swapped out to disk, this includes the u-area, and so special access methods need to be available for gaining access to this information from the swap area. This can be achieved using special system calls or through the use of a custom API. The actual method used is implementation-dependent.
The subject of process scheduling is one that can have a direct bearing on the performance of your Oracle database. Although this has been touched on already in this chapter, it merits some more detail.
As previously mentioned, UNIX is a time-sharing system at heart. This means that the kernel will provide the illusion to all the processes on the system that they all have their own CPU to run on. On a large database server with 4,000 processes, it is not practical (or scalable) to provide a processor for each process, and so it is the time sharing that allows a far smaller number of processors to provide this illusion.
The key to providing this illusion is the hardware clock, as previously discussed, and context switches. Context switching is best covered in Section 7.3, and so for now it is safe to view context switching from the high level of just switching processes on and off processors. Also, to make this discussion simple, assume that we are referring to a uniprocessor UNIX platform.
Process Scheduling Primer: Uniprocessor Platforms
When a process has used its time quantum (i.e., when the hardware clock interrupt handler has fired) and another process is waiting to execute (the other process is termed runnable), it is switched off the processor and the new process is switched onto the processor. The switching of processes, and the entire clock interrupt handler, must be very efficient in order to minimize the overhead of processing small time slices. If this operation takes a noticeable percentage of the quantum itself, then this is the percentage of system capacity that is "wasted" on the task of time sharing.
This simplistic view of time sharing is useful in gaining an initial understanding of the concepts. However, the actual implementation of time sharing is a good deal more complex, having to deal with such things as different priority processes, balancing throughput against response time, managing CPU cache warmth on multiprocessor machines, and so on. The handling of this complex time-sharing requirement is called process scheduling.
There are several variables that affect which process will be placed on an available processor:
· Recent CPU usage for process
When a process becomes runnable, it is placed on a run queue (that is, a queue for the processor in order to be run). This is achieved by adding the proc structure of the process onto a linked list that makes up the run queue.
The architecture of the run queues is very platform-specific, especially when the platform supports complex hardware arrangements such as NUMA. In order to keep this discussion simple, we will refer to the Berkeley software distribution (BSD) run queue architecture.
BSD maintains several run queues, each of which queues processes of a specific priority. When a process has used its quantum on the processor, the scheduler scans the run queues in order, from highest priority to lowest priority. If a process is found on a run queue with a priority greater than or equal to that of the currently executing process, the old process is switched off the processor and the queued process is switched on. If there are no processes on queues with priority greater than or equal to that of the currently executing process, the process is permitted to continue running for up to another quantum before the queues are checked once again.
If a process of higher priority becomes runnable, the current process is preempted off the processor even if it has not used its entire quantum.
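The queue scan and the preemption rule can be sketched as follows. This is an illustrative model, not BSD source: the class `RunQueues`, the queue count `NQUEUES`, and the convention that a smaller number means a higher priority are all assumptions made for the example.

```python
from collections import deque

NQUEUES = 32  # assumed number of priority levels; 0 is the highest priority


class RunQueues:
    def __init__(self):
        self.queues = [deque() for _ in range(NQUEUES)]

    def make_runnable(self, name, priority):
        self.queues[priority].append(name)

    def pick_at_quantum_end(self, current_priority):
        """Scan from highest priority down to the running process's priority.

        Returns the process to switch onto the processor, or None if the
        current process may keep running for another quantum.
        """
        for priority in range(0, current_priority + 1):
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return None

    def should_preempt(self, current_priority):
        """True if a strictly higher-priority process is waiting right now."""
        return any(self.queues[p] for p in range(0, current_priority))


if __name__ == "__main__":
    rq = RunQueues()
    rq.make_runnable("batch", 20)
    print(rq.pick_at_quantum_end(10))   # None: nothing at priority 10 or better
    rq.make_runnable("peer", 10)
    print(rq.pick_at_quantum_end(10))   # peer
```

Note the asymmetry: at the end of a quantum an equal-priority process is enough to cause a switch, while mid-quantum preemption requires a strictly higher priority.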
The priority of a process is governed by two factors and will be constantly adjusted by the kernel during the lifetime of the process based on the system load average. These factors are the estimated recent CPU usage by this process and the "nice" value of the process.
The nice value of a process is specified by the user at process start-up using the nice command. This value ranges from -20 to +19, with the default being 0. Superuser privileges are required to decrease the nice value, because this increases the priority of the process above the normal priority for user processes. A user can elect to put a larger nice value on the process in order to run at a lower priority and therefore be "nice" to the other users of the system.
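A running process can also inspect or raise its own nice value. The short sketch below uses Python's `os.nice`, which adds an increment to the nice value of the calling process and returns the new value; the helper names are invented for the example.

```python
import os


def current_nice():
    # Adding zero leaves the nice value unchanged and returns it.
    return os.nice(0)


def be_nicer(increment=5):
    """Lower this process's priority; a positive increment needs no privileges."""
    return os.nice(increment)


if __name__ == "__main__":
    print("nice value:", current_nice())
```

Calling `be_nicer` with a negative increment is expected to fail with a permission error unless the process has superuser privileges, which matches the rule described above.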
The recent CPU usage of the process is calculated using several algorithms. It is not necessary to go into the specifics of these algorithms, because they are covered comprehensively in various other books (see Section 7.9 at the end of this chapter). It is worthwhile discussing the basics, however.
If a process is using a good deal of CPU, this will be reflected in the recent CPU counter, which in turn is used as a negative weighting factor in the algorithm that determines the priority of the process. Therefore, CPU-intensive processes cannot dominate a system unless there are only lower-priority processes in the run queue or the system is idle. Likewise, if there are two processes trying to use as much CPU as possible, they will end up on the same low-priority run queue and will compete against each other.
The recent CPU counter is incremented for every tick the process executes. This value is then weighted once a second, using the system load average and the process nice value in order to give the counter a decay over time. The load average is used as the amnesia factor in keeping track of the used CPU; if the system is very heavily loaded, the CPU counter will take a long time to forget previous CPU usage, and the priority of the process will be proportionately lower. If the load average is small, the used CPU will be forgotten relatively quickly, and the process will gain a higher priority.
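The amnesia effect can be illustrated with the decay filter commonly described for the classic BSD scheduler, in which the counter is multiplied once a second by 2·load / (2·load + 1) and the nice value is added. Treat this as a sketch under that assumption: the chapter does not give the formula, and individual platforms use their own variants and constants.

```python
def decay_cpu(recent_cpu, load_average, nice=0):
    """Apply one second of decay to the recent CPU counter.

    Assumed filter: recent_cpu * (2 * load) / (2 * load + 1) + nice.
    """
    decay = (2.0 * load_average) / (2.0 * load_average + 1.0)
    return decay * recent_cpu + nice


def seconds_to_forget(initial_cpu, load_average, threshold=1.0):
    """Count the seconds until an idle process's counter falls to the threshold."""
    seconds = 0
    cpu = float(initial_cpu)
    while cpu > threshold:
        cpu = decay_cpu(cpu, load_average)
        seconds += 1
    return seconds


if __name__ == "__main__":
    for load in (1, 5, 10):
        print("load", load, "->", seconds_to_forget(100, load), "seconds")
```

With a load average of 1 the counter loses a third of its value each second; at a load average of 10 it loses less than 5 percent, so past CPU usage is remembered for much longer.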
If a process needs to block while it is running, such as waiting for an I/O operation to complete, it is taken off the processor and placed on a different type of queue: one of the sleep queues. The process stores a wait channel in its proc structure and puts itself onto the correct sleep queue for that resource. The wait channel is typically the address of the structure on which the process is waiting, which is hashed to obtain the correct sleep queue. When a resource becomes available, its address is hashed by the kernel to find the correct queue, and all processes waiting on it are woken and put back on a run queue.
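A minimal sketch of hashed sleep queues follows. The queue count `NSLEEPQ`, the hash function, and the class itself are hypothetical; the sketch only shows why the wait channel must be stored alongside each sleeper, since unrelated channels can hash to the same queue.

```python
NSLEEPQ = 64  # assumed number of sleep queues


class SleepQueues:
    def __init__(self):
        self.buckets = [[] for _ in range(NSLEEPQ)]

    @staticmethod
    def _hash(wait_channel):
        # Hypothetical hash: drop the low address bits, then wrap.
        return (wait_channel >> 8) % NSLEEPQ

    def sleep(self, process, wait_channel):
        """Record the wait channel and queue the process on the hashed queue."""
        self.buckets[self._hash(wait_channel)].append((wait_channel, process))

    def wakeup(self, wait_channel):
        """Wake every process sleeping on this channel; return them in order."""
        bucket = self.buckets[self._hash(wait_channel)]
        woken = [proc for chan, proc in bucket if chan == wait_channel]
        bucket[:] = [(chan, proc) for chan, proc in bucket if chan != wait_channel]
        return woken


if __name__ == "__main__":
    sq = SleepQueues()
    sq.sleep("reader-1", 0x1000)
    sq.sleep("reader-2", 0x1000)
    print(sq.wakeup(0x1000))
```

In a real kernel the woken processes would be moved to a run queue; here they are simply returned to the caller.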
If the process was blocked waiting for the completion of a system call, such as a read(), the process is put on a high-priority run queue until the process completes the switch back into user mode. At this point, the process priority is reduced, typically resulting in the process being switched off the processor in order to service processes of higher priority.
When processes are executing on a processor, they are neither on a run queue nor on a sleep queue. At this point, the process is active and exists on another list. In SVR4, this list is known as practive and is a linked list of all the processes currently on the processor, linked against the "active processor link" item in each proc structure.
Advanced Scheduling: Multiprocessor Process Scheduling
When multiple processors are present in a system, the scheduling algorithms become more complex. In the case of NUMA systems, this is an attempt to keep active processes close to their resident memory. In SMP machines, it is implementation-dependent: Some platforms do no extra scheduling work, whereas others do.
With multiple processors (sometimes referred to as engines by kernel engineers) and a relatively low number of runnable processes at any one time, some major optimizations can be made by maintaining cache warmth in the CPU caches. This is known as cache affinity.
An algorithm can be implemented, using the number of clock ticks since the process last ran, which determines whether the process is likely to have cache warmth on the processor on which it last ran. Waiting for the correct engine to become available may incur a delay in the execution of the process, and so this algorithm needs to be well tested with real applications.
In testing with and without affinity, OLTP Oracle workloads have been shown to exhibit as much as 15 percent more throughput when affinity is enabled.
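One way such an affinity decision might look is sketched below. The threshold `AFFINITY_TICKS`, the function name, and the fallback of choosing the lowest-numbered idle engine are all assumptions for illustration; real schedulers tune these choices against measured workloads.

```python
AFFINITY_TICKS = 20  # hypothetical: beyond this, the cache is assumed cold


def choose_engine(last_engine, ticks_since_run, idle_engines):
    """Pick an engine for a runnable process, or None to keep it waiting.

    idle_engines is the set of engine numbers that are currently free.
    """
    if last_engine in idle_engines:
        return last_engine              # best case: the warm engine is free
    if ticks_since_run <= AFFINITY_TICKS:
        return None                     # cache probably warm: wait for it
    if idle_engines:
        return min(idle_engines)        # cache cold anyway: take any engine
    return None


if __name__ == "__main__":
    print(choose_engine(2, 5, {1, 2}))    # 2
    print(choose_engine(2, 5, {1}))       # None
    print(choose_engine(2, 50, {1, 3}))   # 1
```

The second case is where the delay mentioned above comes from: the process stays queued even though another engine is idle.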
In NUMA configurations, it becomes more critical that the scheduler place processes on the correct engine. Although some NUMA configurations support dynamic page relocation between NUMA nodes, this is expensive at best, and the majority of memory for a given process will always reside on one node or another anyway. Therefore, it is a fair assumption that the scheduler should always attempt to schedule processes on engines on the same node as the resident set of the process.
This results in several per-node run queues, in order to ensure the locality bias in selecting processes to run. It's worth noting that the scheduler cannot factor in the location of any mapped shared memory, because this is not part of the process's private memory. In the case of all the processes in the system all running the same binary (such as oracle), the operating system may elect to store a single shared text segment local to each NUMA node in order to maximize the locality of reference.
Signals are the mechanism used to notify processes of asynchronous events. Even novice users are familiar with the kill command, which is used to send these signals to processes.
Most users are aware of the signal mechanism only from the user perspective, that of killing a process in order to terminate its execution. In fact, signals are very much more flexible than a simple termination mechanism, and most can be trapped and controlled in a programmable manner by the process.
Signals are passed to the process by the kernel. They can be initiated by other processes in the system or by the target process itself. The kernel passes the signal to the process by setting a bit that corresponds to the signal in the proc structure for the process, as shown in Figure 7.3.
This bitmask is called p_sig in SVR4 and p_siglist in BSD-based implementations. The bitfield is used to store pending signals for the process.
Before the pending signal bitmask is updated, however, the kernel checks that the signal has not been explicitly ignored by the process. If the signal has been ignored, the kernel will not even set the bit in the bitmask. Once the bit is set in the proc structure, the kernel will attempt to wake up the process in order to receive the signal (if it is not already running). After this, the processing of the signal is entirely the responsibility of the process itself.
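The pending-signal bookkeeping can be sketched with two integers used as bitmasks. The class `ProcSignals` and its field names are hypothetical stand-ins for the fields in the proc structure, and the sketch leaves out the special rules for signals that cannot be ignored.

```python
import signal


class ProcSignals:
    """Toy model of the per-process pending and ignored signal bitmasks."""

    def __init__(self):
        self.pending = 0
        self.ignored = 0

    @staticmethod
    def _bit(signo):
        return 1 << (signo - 1)     # signal numbers start at 1

    def ignore(self, signo):
        self.ignored |= self._bit(signo)

    def post(self, signo):
        """Kernel side: record the signal unless the process ignores it."""
        if self.ignored & self._bit(signo):
            return False            # the bit is never set for ignored signals
        self.pending |= self._bit(signo)
        return True

    def take_pending(self):
        """Process side: return and clear the lowest-numbered pending signal."""
        if not self.pending:
            return None
        signo = (self.pending & -self.pending).bit_length()
        self.pending &= ~self._bit(signo)
        return signo


if __name__ == "__main__":
    proc = ProcSignals()
    proc.ignore(signal.SIGHUP)
    proc.post(signal.SIGHUP)
    proc.post(signal.SIGTERM)
    print(proc.take_pending() == signal.SIGTERM)
```

Because each signal has a single bit, posting the same signal twice before it is handled leaves only one pending instance in this model.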
The signal bitmask is checked
· Every time the process returns to user mode from kernel mode
· Before going to sleep in kernel mode
· On waking from an interruptible sleep
The trend here is for the bitmask to be checked prior to returning to user mode from kernel mode. The reason for this is that some signals (namely SIGKILL and SIGSTOP) cannot be processed in user mode, because their handlers are not programmable. Once the bitmask is checked, the signals that have been received are processed by the process in kernel mode. The first action the kernel takes is to check if the signal is SIGKILL or SIGSTOP. If so, the actions are taken in kernel mode without returning to user mode. These actions terminate the process, or suspend the process and put it to sleep, respectively.
If the signal is not one of these two, the kernel checks the u-area for the process to find out if it has a handler declared for this signal. If there is no handler, the default action is taken, as shown in Table 7.1.
Any signal without a default action of "ignore" will not switch back into user mode. Instead, the process will exit, core dump, or be suspended, all without leaving kernel mode. If there is a handler defined for the signal, the kernel will switch into user mode and execute the signal handler.
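From the user side, declaring a handler looks like the sketch below, which uses Python's `signal` and `os` modules. The handler name `on_usr1` and the choice of SIGUSR1 are assumptions for the example. Without the handler, the default action for SIGUSR1 would terminate the process; with it, control returns to user code. The sketch assumes it runs in the main thread, since Python only allows handlers to be installed there.

```python
import os
import signal

received = []


def on_usr1(signo, frame):
    # Runs in user mode once the kernel finds a declared handler.
    received.append(signo)


def deliver_to_self():
    """Declare a handler for SIGUSR1, then send that signal to this process."""
    signal.signal(signal.SIGUSR1, on_usr1)
    os.kill(os.getpid(), signal.SIGUSR1)
    return received


if __name__ == "__main__":
    print(deliver_to_self())
```

Attempting the same with SIGKILL or SIGSTOP fails, because no handler can be declared for those two signals.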
It's worth noting that a process may not act on a signal for a comparatively long time. If the process is blocking on a noninterruptible kernel resource or is swapped out to disk, the process may remain unchanged in the system for a while before the signal is acted on.
To finish up with processes, several of the concepts in this section are summarized in Figure 7.4, which shows an example mini-application in which a network listener process listens for connections and creates a slave process in order to process requests as they come in.
7.3 Memory Management: The Virtual Memory System
In simplistic, single-tasking computer systems, programs can be compiled to locate themselves, and run, at specific memory addresses within the available physical memory. The flow of control is passed from task to task on completion, with potentially only one program in memory at any one time.
A very simple operating system, for example, has an operating system "kernel" compiled to run at a specific address and all user programs compiled to run at another specific address. The control of the system starts with the operating system, from which other programs can be executed. The other programs all locate themselves at addresses separate from that of the O/S; the O/S can remain in memory while this happens. Once the program is complete, the operating system gains control of the system once more.
There are several problems with this arrangement:
· Little or no memory protection between programs: All programs can read or write all memory locations.
· Programs are limited to an unknown, finite amount of physical memory on the system.
· Programs must execute serially and, unless compiled for physically separate memory ranges, must overlay the previously executing program.
The last problem has two serious implications:
1. The system cannot support multitasking.
2. Software development is complex and machine-dependent.
These implications go directly against two of the design goals of UNIX, and so this arrangement is a nonstarter. Another scheme must therefore be adopted: virtual memory and its associated memory management.
7.3.2 Virtual Memory Introduction
UNIX systems implement virtual memory. Virtual memory separates the address space seen by a process from real physical memory addresses. This is achieved using memory address translation as an operating system function.
Address translation allows all processes in the system to address the same locations as if they were private, with the system ensuring that the physical address used is distinct from other processes.
In Figure 7.5, there are two processes, each using half of the physical memory in the machine (for simplicity of example). Each of the two processes has an identical address space, or memory map, that is the same size as the physical address space. Although these processes are using physically different memory locations, both are under the illusion that they start at address zero and have contiguous memory allocated beyond that (plus stack).
Each of the cells in Figure 7.5 represents the smallest unit of granularity that the UNIX system considers from a memory management perspective: a page. The concept of a page is used in order to reduce the space and resource overhead of managing a large amount of memory on the system. It is common in modern systems for the page size to be set at 4KB, although some systems now support variable page sizes in order to increase the efficiency of very large memory (VLM) applications.
The glue that holds this process together is the address translation tables. These tables hold the mapping of virtual addresses (those that the process uses) to physical addresses (actual memory locations). In UNIX systems, these tables normally come in the form of page tables and, optionally, translation lookaside buffers (TLBs) in the processor memory management unit (MMU).
The page table consists of several page table entries (PTEs) for each process, arranged as an array, as shown in Figure 7.6.
For a given virtual page frame (00 to 07 in Figure 7.6) of a process, there is a corresponding PTE, located at the same offset in the array that makes up the page table. So, the first page that makes up the address space of the process in this example is mapped to the page starting at physical address 0x000E1000. The actual offset within the page remains the same as the offset within the virtual page.
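The lookup just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 4KB page size, a flat array indexed by virtual page number, and made-up physical addresses for every entry except the first (0x000E1000, taken from the example above). The names page_table and translate are hypothetical.

```python
PAGE_SIZE = 4096   # 4KB pages, the common size mentioned in the text
PAGE_SHIFT = 12    # log2(PAGE_SIZE)

# Hypothetical page table for one process. The index is the virtual page
# number; the value is the physical address at which that page starts.
# Only the first entry comes from the text; the others are invented.
page_table = [0x000E1000, 0x0003A000, 0x00107000, 0x00052000]

def translate(vaddr, table):
    """Map a virtual address to a physical address via an array page table."""
    vpn = vaddr >> PAGE_SHIFT           # virtual page number selects the PTE
    offset = vaddr & (PAGE_SIZE - 1)    # offset within the page is unchanged
    if vpn >= len(table):
        raise LookupError(f"no mapping for virtual page {vpn}")
    return table[vpn] | offset

print(hex(translate(0x00000010, page_table)))  # 0xe1010
```

Virtual address 0x10 lies in the first virtual page, so it lands 0x10 bytes into the physical page at 0x000E1000.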
It was mentioned earlier that some systems now support variable-size pages. This is a result of the overhead now imposed in managing the huge quantities of memory found in very large systems. For example, if an Oracle database server has a 4GB SGA, this typically means that each connection to the database needs to have a page table large enough to cater for every page in the SGA in addition to the process memory itself. Each PTE is typically 32 bits, and so a 4KB page size would yield a 4MB (1 million times 4 bytes) page table for each process. Not only does this size of page table mean the kernel is spending a good deal of time managing page tables, but this memory is also located in kernel memory, not user memory. If 5,000 users are due to connect to this system, this means the kernel memory needs to be greater than 20GB. When variable page sizes are used, the SGA can be assigned, say, 4MB pages, thus reducing the size of the page table for the mapping.
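The arithmetic in this example is easy to check. The sketch below assumes a flat, single-level page table with 4-byte PTEs, as the text does, and uses binary units; the function name page_table_bytes is hypothetical.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def page_table_bytes(mapped_bytes, page_size, pte_bytes=4):
    """Size of a flat page table that maps mapped_bytes of address space."""
    entries = mapped_bytes // page_size
    return entries * pte_bytes

sga = 4 * GB
small = page_table_bytes(sga, 4 * KB)   # table size with 4KB pages
large = page_table_bytes(sga, 4 * MB)   # table size with 4MB pages
total_small = 5000 * small              # 5,000 connected processes

print(small // MB, "MB per process with 4KB pages")
print(round(total_small / GB, 1), "GB across 5,000 processes")
print(large // KB, "KB per process with 4MB pages")
```

In binary units the 5,000-process total comes to a little under 20GB; the round figure in the text treats the page table as one million entries of 4 bytes each. With 4MB pages the same SGA needs only 1,024 entries per process.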
The actual structure of each entry in the page table is defined by the hardware architecture, specifically the MMU of the processor. Each processor family has its own structure for defining virtual to physical mappings in the MMU, and some support more functionality in the hardware than others, specifically in the area of memory protection and the presence of a referenced bit.
Although the hardware dictates the structure of the page table entries, it is the UNIX kernel that is responsible for all the manipulation of the entries in the table and for ensuring that the MMU is using the correct PTEs for the running process. At this stage, it is worth mentioning the effects of the two major variants of CPU cache architecture: physically mapped and virtually mapped.
A physically mapped architecture is laid out as shown in Figure 7.7.

This is the traditional approach to caching, because the operating system does not need to be aware of the operation of the cache. Whenever a request for data is made by the CPU, the address (which remains as a virtual address) is passed through to the MMU, which does a conversion of the virtual address to a physical address. It does this using the TLB, which is a fully associative cache. This means that it has the ability to search all the lines (fully associative access) in the TLB concurrently, to determine the physical address to search for in the cache.
The TLB contains PTEs specified by the kernel, and this is the reason why the structure of the PTE is dictated by the CPU architecture. The kernel is responsible for loading the registers of the TLB with the correct PTE information for the active process. For any given process, the kernel mappings will not change, because all processes have the kernel mapped in their address space. On the right of the MMU in Figure 7.7, all addresses are physical.
It is clear that having to precede even the cache access with a lookup on the TLB can impose a significant overhead. In the case that the TLB does not contain the PTE required for the operation, a reference to main memory needs to be made in order to prime the TLB. Luckily, this normally means that the page has not been accessed for a comparatively long time, and so the impact is not felt very frequently.
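A toy model may help make the hit and miss paths concrete. Everything here is an assumption for illustration: the TLB class, its capacity of four entries, and the least-recently-used replacement policy are not taken from any particular processor, and the dictionary lookup merely stands in for the concurrent search that real fully associative hardware performs.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4KB pages

class TLB:
    """Toy TLB: caches vpn -> physical page base, with LRU replacement."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, vaddr, page_table):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as recently used
        else:
            self.misses += 1
            self.entries[vpn] = page_table[vpn]   # slow path: main memory
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] | offset

# Hypothetical two-page mapping; only 0x000E1000 appears in the text.
page_table = {0: 0x000E1000, 1: 0x0003A000}
tlb = TLB()
for addr in (0x0010, 0x0020, 0x1000, 0x0030):
    tlb.lookup(addr, page_table)
print(tlb.hits, tlb.misses)  # 2 2
```

Only the first touch of each page pays for the page table reference; repeated accesses to a recently used page are served from the TLB.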
In some modern systems, a different approach has been taken with the cache. Instead of using the physical address to determine the correct line and tag within the cache, the virtual address is used (see Figure 7.8).
The effect of this is that the CPU can request data from the cache directly by virtual address. The MMU needs to be used only when a cache miss occurs and a physical memory access is required. The requirement for a TLB is less in this arrangement, because the MMU is now positioned in a slower part of the system than it is in a physically mapped architecture. This having been said, many architectures implement virtual caches with TLBs for enhanced performance.
The downside of a virtual cache is that the kernel is now required to manage much of the cache coherency, because the hardware is unaware of the differences among the many identical virtual addresses that refer to different physical memory locations. Address 0x1000 for process A is not the same memory location as address 0x1000 for process B.
To alleviate the ambiguity in the cache, the system typically uses the process ID of the relevant process as part of the tagging infrastructure in the cache. The hardware has no concept of process IDs, and so the coherency of the cache must involve the operating system: At a minimum, the kernel must inform the processor of the identity of the current process in order for the correct tags to be checked in the cache.
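The tagging idea can be illustrated with a small model. This is a sketch only: the VirtualCache class and its methods are hypothetical, the "cache" is a plain dictionary rather than a set of lines and tags, and the context identifier is whatever value the kernel chooses to hand to the hardware.

```python
class VirtualCache:
    """Toy virtually addressed cache whose tags include a context identifier."""

    def __init__(self):
        self.lines = {}              # (context_id, vaddr) -> data
        self.current_context = None

    def set_context(self, context_id):
        # The kernel's part: tell the hardware which process is running.
        self.current_context = context_id

    def store(self, vaddr, data):
        self.lines[(self.current_context, vaddr)] = data

    def load(self, vaddr):
        # None models a miss; a real system would then go through the MMU.
        return self.lines.get((self.current_context, vaddr))

cache = VirtualCache()
cache.set_context("A")
cache.store(0x1000, "data for A")
cache.set_context("B")
print(cache.load(0x1000))   # None: B's 0x1000 is not A's 0x1000
cache.store(0x1000, "data for B")
cache.set_context("A")
print(cache.load(0x1000))   # data for A
```

Without the set_context step, the second process would have read the first process's data at the same virtual address.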
Cache coherency across DMA operations is also more complex, because the majority of I/O devices have no concept of virtual addresses; they transfer to and from physical memory addresses.
When the operating system switches one process off the processor and puts a different one on, several actions must be taken to
· Preserve the context of the prior process.
· Restore the context of the activated process.
These actions constitute the formal definition of a context switch. The context of a process typically includes the following:
· Other general-use registers in CPU
· Virtual address mapping
The first three attributes are referred to as the process control block, or PCB, and relate directly to saving the execution context of the process. The last attribute is that of the PTEs, and their presence in the TLB.
When a new process is switched onto a processor, the MMU must be informed of the new address mappings for that process. This applies whether the cache is virtual or physical. If a TLB is present, the kernel must also ensure that any mappings that are not relevant for the new process are invalidated, making the MMU reload the correct mapping from memory on access. This is necessary because the MMU does not have any concept of the process executing on the CPU, and merely changing address mappings does not mean that the process has just changed. The classic example of this would be a process issuing an exec() call, which results in all prior mappings changing to support the new program.
The kernel will load up the MMU's registers with the location of the new page table for the process, followed by a flush of the irrelevant TLB entries for the new process. The new process will keep all of the kernel TLB entries, because these addresses will continue to have the same mapping for the new process.
Once this process has been completed, the new process is allowed to run.
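The MMU side of a context switch, as described above, might be modeled as follows. This is a simplified sketch: the MMU class, the page_table_base attribute, and the per-entry is_kernel flag are assumptions made for illustration, and real hardware differs in how it distinguishes kernel mappings.

```python
class MMU:
    """Toy MMU holding a page table base register and a small TLB."""

    def __init__(self):
        self.page_table_base = None
        self.tlb = {}   # vpn -> (physical_page, is_kernel)

    def context_switch(self, new_page_table_base):
        # Step 1: point the MMU at the incoming process's page table.
        self.page_table_base = new_page_table_base
        # Step 2: flush user entries, which belong to the outgoing process.
        # Kernel entries stay, since every process maps the kernel the same way.
        self.tlb = {vpn: entry for vpn, entry in self.tlb.items() if entry[1]}

mmu = MMU()
mmu.tlb = {0x000: (0x0E1, False),    # user page of the outgoing process
           0x800: (0x010, True)}     # kernel page
mmu.context_switch(0x5000)           # hypothetical page table location
print(sorted(mmu.tlb))               # only the kernel entry remains
```

The flushed user entries are not reloaded eagerly; they are refilled from the new page table as the incoming process touches each page.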
Further MMU Considerations
The operation of the MMU and associated TLB are somewhat complicated by the following aspects of modern UNIX systems:
· Multiprocessor (MP) architectures
The MMU determines whether or not a page can be read from or written to. This includes mapped pages with memory protection bits set in the PTE and unmapped pages that the process cannot use. In either case, the MMU raises a trap for the kernel to deal with.
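The two failure cases, a protected page and an unmapped page, can be sketched as a single check. The READ and WRITE bit values, the MMUTrap exception, and the check_access function are hypothetical; an actual PTE layout is dictated by the processor family.

```python
READ, WRITE = 0x1, 0x2   # hypothetical protection bits in a PTE

class MMUTrap(Exception):
    """Stands in for the hardware trap delivered to the kernel."""

def check_access(ptes, vpn, access):
    """Raise MMUTrap unless virtual page vpn is mapped and permits access."""
    prot = ptes.get(vpn)
    if prot is None:
        raise MMUTrap(f"page {vpn} is not mapped")
    if not prot & access:
        raise MMUTrap(f"page {vpn} does not permit this access")

ptes = {0: READ, 1: READ | WRITE}
check_access(ptes, 1, WRITE)       # permitted, returns quietly
try:
    check_access(ptes, 0, WRITE)   # page 0 is read-only
except MMUTrap as trap:
    print("trap:", trap)
```

In both cases it is the kernel, not the MMU, that decides what the trap means for the process.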
The presence of the swap area is discussed in Section 7.4, where we complicate the virtual memory system further by using more memory than we physically have.
In multiprocessor systems, the presence of the TLB introduces further cache coherency considerations that are unrelated to the CPU cache coherency. For example, if the virtual-to-physical mapping or memory protection for a kernel page changes, TLBs in all MMUs must be updated to reflect this change. Another example would be shared memory accessed by several processes. This includes explicit shared memory segments and also copy-on-write data segments for a program that has executed fork() with no exec(). Copy on write will be discussed in Section 7.4.3.
In this case, the kernel must initiate a "TLB shootdown" in order to ensure that all other TLBs are current with the new information. This is an explicit operation, for which the kernel typically maintains a set of data structures that map which PTEs are located in the various TLBs on the MP system. Using this map, the kernel can explicitly invalidate the changed PTEs in all the TLBs.
Any changes in the PTEs are typically not loaded into the various TLBs, because doing so would be a very expensive default operation. Instead, the entries are simply invalidated, forcing the MMU to reload the entry on the next reference. In many cases, the reload will not be required before the process is switched off the CPU, and so the work is avoided altogether.
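The bookkeeping behind a TLB shootdown can be sketched as follows. The ShootdownMap class and its methods are hypothetical, and the model ignores the interprocessor signalling a real kernel needs; it shows only the map of which CPUs hold which entries and the invalidate-rather-than-reload policy.

```python
from collections import defaultdict

class ShootdownMap:
    """Toy record of which CPUs' TLBs hold an entry for each virtual page."""

    def __init__(self, n_cpus):
        self.tlbs = [dict() for _ in range(n_cpus)]   # per CPU: vpn -> ppn
        self.holders = defaultdict(set)               # vpn -> CPUs caching it

    def tlb_fill(self, cpu, vpn, ppn):
        self.tlbs[cpu][vpn] = ppn
        self.holders[vpn].add(cpu)

    def shootdown(self, vpn):
        """Invalidate vpn on every CPU that caches it; return those CPUs."""
        cpus = self.holders.pop(vpn, set())
        for cpu in cpus:
            self.tlbs[cpu].pop(vpn, None)   # invalidate only, no reload
        return cpus

smp = ShootdownMap(n_cpus=4)
smp.tlb_fill(0, vpn=5, ppn=0xE1)
smp.tlb_fill(2, vpn=5, ppn=0xE1)
smp.tlb_fill(1, vpn=7, ppn=0x3A)
print(sorted(smp.shootdown(5)))   # [0, 2]
```

CPUs that never cached the changed entry are left alone, and the affected CPUs refill the entry only if they reference the page again.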
7.4 Virtual Memory Hierarchy
The preceding section concentrated on the generic attributes of virtual memory (VM), where a finite amount of memory is available on the system and is the maximum amount of memory that can be used among all the processes.
In reality, many of the pages of a given process do not need to be resident in memory all the time, and making them so would be wasteful of the valuable memory resource. A good example of this is Oracle Forms, where the size of the physical process may be 16MB or more. In a running system, experience has shown that only around 8MB of these pages need to be in memory at any one time in order to execute the Forms application with performance quite comparable to that of a fully resident image.
In order to support this optimization of real memory, the memory hierarchy needs to be extended to physical disk, the next step down in the memory hierarchy (see Figure 7.9).
7.4.1 The Memory/Disk Hierarchy
The use of physical disk spindles is a natural extension of the familiar memory hierarchy between CPU cache memory and main system memory. The next step after main system memory is physical disk, which is much cheaper and slower than system memory.
Using disk to extend memory capacity was once a larger issue than it is today. In the past, memory chips were scarce and expensive, and the system architectures were not able to address large amounts of real memory. Therefore, disk-based memory hierarchies were essential to providing the capability for many concurrent users on a UNIX system.
In modern systems, memory is a far less important issue than it used to be, in terms of both cost and addressability. In fact, in building a high-performance Oracle database server, it is preferable always to work within the confines of the physical memory on the system, and not to rely on physical disk to provide additional memory capacity.
Unlike the case of Oracle Forms cited above, Oracle Server processes share critical resources between them, meaning that one process can directly slow down all other processes in the system by holding one of the resources for an extended period. If the process needs memory that has been paged out to disk in order to complete the processing under a latch, all other processes will wait for this page-in operation before the latch is released.
The use of disk in the memory hierarch