TOC PREV NEXT INDEX

Scale Abilities Ltd Logo


Chapter 8
Oracle User's
Guide to UNIX
8.1 Coengineering the Oracle Kernel
The Oracle RDBMS is engineered as a platform-independent product. This means that there are many rules and procedures involved in the development cycle at Oracle, in order to ensure that the source can be built on a huge array of totally different platforms. The array of platforms that Oracle runs on includes IBM VM mainframes, Microsoft NT, VMS, and Netware, not to mention the huge array of UNIX variants. The differences among these platforms are more complex than those among different UNIX platforms; the fundamental operation of the system is totally different, including data representation formats, scheduling algorithms, interprocess communication, and "file" architectures. 
In order to allow the Oracle product to run on such an array of different architectures, several steps are taken, including:
· Strict coding and naming standards in the base code
· Complete separation between host "natives" and base product
The first step involves the adherence to rules that allow the source to be built on different platforms. Examples of this would be naming conventions to cater to the lowest common denominator in linker capability, and restrictions in the use of globally scoped variables.
The second step is where Oracle adopts the virtual operating system (VOS) in order to provide separation of function.
8.1.1 VOS Responsibilities
In order to implement separation between the base code and the operating system specifics, Oracle divides the product source code into two distinct regions: generic (base) and operating system dependent (OSD). The base code portion contains the common base code for the product, whereas the OSD portion provides the OS-specific code.
This segregation has been a part of the Oracle product from the very beginning and is how Oracle has been able to provide the product on so many platforms.
However, until release 8.0, the interface was not 100 percent pure. This meant that OSD changes often required engineering changes further up in the base code (see Figure 8.1).
For Oracle8, an effort was undertaken to clean this up, and the VOS layer was the result (see Figure 8.2).
The VOS layer is a formalized interface to the OSD code. It allows complete separation of the generic code from the OSD code, thus making the platform-specific engineering effort and the base development more defined.
The base/generic code itself is divided further into named layers, but all of these layers reside within core development from an organizational standpoint and interface with the hardware through the generic VOS interface.
The platform-specific developers build VOS-compliant OSD code for their platforms. This is sometimes a group within Oracle, a group within the platform vendor, or a combination of both. Regardless of where the work is done, it is not done by the same people who do the base coding of the Oracle kernel.
When an OSD developer feels that some change in the generic code would benefit the platform, this no longer results in OSD implants in the base code. Instead, the change (assuming it is a good one) is adopted in the base kernel, with a VOS interface to the OSD code. In this way, the base code stays clean, and other platforms can benefit from the change by writing their own OSD code. An example of this is covered in Section 8.1.2.
The VOS/OSD software is responsible for a variety of interfaces to the operating system, including
· Datatype specification and manipulation
· I/O interfaces (including asynchronous I/O where applicable)
· Process management and threads
· IPC
· Memory management
· Stack management facilities/other utilities
Datatype Specification
With each operating system and architecture, there may be a different specification for, say, a long. On one platform it may be 32 bits wide, whereas on another it may be 64 bits wide. This kind of ambiguity of data specification is avoided by Oracle by leaving it to the VOS to define the datatypes for the base code. The base code avoids the use of int, long, double, and so on, adopting instead an internal naming convention for data typing. Anybody who has used OCI will be familiar with the types ub4, sb2, text *, and so on. These names explicitly define the width and usage of the datatype; in this example, the types would be Unsigned-Byte-4, Signed-Byte-2, and a pointer to text.
In addition to the typing of data, the VOS is responsible for the handling of these datatypes in operations that are sensitive to byte ordering. The base code is not concerned with the endian (byte-ordering) orientation of the system.
I/O Interfaces
As the I/O interface can also vary a great deal among platforms, it is ultimately the responsibility of the VOS/OSD to perform the actual manipulation of files on the system. The base code calls VOS functions with common parameters in order to perform the I/O on the system. This includes the opening, closing, seeking, reading, writing, and creation of files. In addition, the VOS code is responsible for the implementation of asynchronous I/O on the platform, with the system calls and semantics of using the async I/O varying from platform to platform.
Process Management and Threads
On a UNIX system, new processes are created with the fork() system call. In other operating systems, this is almost certainly not the method for creating processes, if processes are indeed the method used in order to have multiple execution streams on that platform. Therefore, the VOS is responsible for all process creation and management from an Oracle perspective.
On the WindowsNT platform, the database architecture is implemented using threads instead of processes: All background and foreground sessions exist as threads within a single process on the system. The base code is simply calling the same VOS functions that create multiple processes on the UNIX platform, but the NT OSD implementation manipulates threads instead of processes.
IPC
Different platforms have different methods for IPC. In fact, it may be that the operating system in question does not support a process model but rather a threaded model. Once again, the base code is not concerned with this-it is only aware of requesting common services from the VOS.
Memory Management
Memory management in this case includes two unrelated items: the SGA and the process space. Both of these items are platform-specific. On UNIX platforms, the semantics are at least the same-shared memory and some kind of malloc() library. In other operating systems, the method for getting the required result may be very different. Likewise, this area is of increasing importance on NUMA-based UNIX servers, where the allocation memory needs to be sensitive to location.
Stack Management Facilities/Other Utilities
Oracle relies on a variety of other miscellaneous services from the VOS, including the provision of a full feature stack unwind library for the creation of trace files when failures occur. Without this facility, Oracle would be unable to support the product and therefore unable to ship it.
In addition to the stack tools, the VOS is called on for other OS interfacing, such as extraction of operating statistics, interfacing with the process scheduler, and other important aspects of operation on the UNIX platform.
8.1.2 Vendor Coengineering
Although the VOS provides the Oracle base developer with the required separation from the OS, there are also situations in which the product requires changes in both the OSD and the base code to support enhanced operation on certain platforms. This is especially true when the platform has physical attributes that directly impact the operation of Oracle, such as a NUMA system.
Case Study: Sequent NUMA-Q 2000
A good case study of Oracle coengineering is the work done by Sequent and Oracle to enhance Oracle operation on the NUMA-Q platform.1 Although the NUMA-Q platform will run a standard Oracle port, the nature of the NUMA architecture allows greater scalability if certain changes are made in the way Oracle uses the platform. Several of these changes have been incorporated by Sequent and Oracle, with each release adding more sophistication. In Oracle8i, the NUMA-Q port of Oracle adds the following features:
· Extended buffer cache
· SGA memory placement
· Rapid SGA creation
· Quad-Local Buffer Preference (QLBP)
Extended Buffer Cache
The provision of an extended buffer cache capability is not a NUMA optimization but rather a 32-bit operating system optimization. Normally, a 32-bit OS can address a maximum of only 4GB per process, including the kernel memory. In order to provide very large memory (VLM) support for the Oracle buffer cache, an extended cache facility has been engineered.
The extended cache code builds logic into the Oracle processes that allows the processes selectively to map and unmap pieces of the SGA into their 32-bit address space. When the process requires a portion of the buffer cache that is not in its address space, it can make calls to the operating system to map the required buffers to a region within the 32-bit address range. Oracle can then access the buffer as usual.
SGA Memory Placement
Owing to the distributed nature of the physical memory in a NUMA system, the location of the SGA can make a big difference in the performance of the system. By default, the operating system distributes the shared memory segment across all the quads in equal portions. Unfortunately, this typically leaves a large proportion of the fixed and variable portions of the SGA on the first quad in the system. Therefore, the first quad is the only one that gets local memory access to this portion of the SGA.
In the first of three optimizations, the portions of the SGA are treated differently when creating the shared memory segment. The fixed and variable portions of the SGA are striped across all the quads in the system, at a 4KB page granularity. This yields a probability of 1 divided by the number of quads that the reference will be local, with the remainder of the access distributed evenly across all of the remote cache components. The log buffer is also treated this way by default. The buffer cache is divided equally across the quads and distributed in contiguous pieces. It is not striped, because the block size could be greater than the 4KB stripe size, which would result in two or more memory references being required to retrieve a single block.
The memory striping is a good example of the coengineering between Oracle and, in this case, Sequent. Sequent provides the support in the OS for explicit placement of shared memory on the physical quads. Oracle then uses this facility when creating the shared memory in order to place the memory segments on the correct quads.
The second optimization is a restriction of the number of quads across which the SGA is spread. This allows quads that are not involved in Oracle work to be kept out of the picture, or allows separate databases to be created within their own set of quads. If a quad is not hosting active Oracle processes for a database instance, it does not make sense for any of the SGA to be on that quad-this only increases the number of remote memory references.
In the final optimization of SGA placement, the log buffer is allowed to be created on one specific quad. The idea here is that the most critical latency in the system is that of the writing of the log buffer to disk. All commits wait for this event to complete, thus ultimately gating the throughput of the system. By forcing the entire log buffer to be located on the same quad that hosts the log writer, all memory references by the log writer are kept local. When this is done, it is important that Oracle be configured so that it always copies into the redo buffer using redo copy latches, not the redo allocation latch. This is important because there is a high likelihood that any given process will have to perform the entire copy operation into remote memory; using the allocation latch will mean that the redo allocation is out of action for a longer period while the copy occurs.
Rapid SGA Creation
With all this talk of extended cache sizes and spreading the SGA across multiple quads, there comes a side effect. The initial behavior of the shmgetq()2 system call was simply to allocate the memory on the correct quads as a NUMA-aware version of the standard shmget() call. The processor that executes the system call is then responsible for the creation of the page tables and validation of the pages for the entire segment across all the quads. This is not a big problem for small SGAs but rapidly becomes unscalable as the SGA size increases, because most of the memory references required to create the segment are remote.
In order to improve this situation, Sequent created a "kernel daemon" within the operating system. One of these daemons is started by the OS on every quad, and the shmgetq() call just makes calls to all the daemons to do the work locally. This means that the work is all performed locally, resulting in a speedup of approximately nine times compared with remote allocation.
This optimization has not required any changes in the Oracle code, including the VOS portion. However, it does demonstrate why it is very important for the operating system engineers to be clued in on the operation of Oracle.
In Oracle8i, the VOS is changed so that it can use a new shmgetv() system call, which also allows all the allocations to be performed in parallel, resulting in linear scalability of SGA creation time as quads are added.
Quad-Local Buffer Preference (QLBP)
This enhancement is another coengineering product. It allows the assignment of LRUs across the quads, with each LRU list comprising buffers that are local to that quad. When a buffer is required by Oracle, the requesting process determines the quad that it runs on and tries to find a free buffer on one of the local LRUs for that quad. Only if these LRUs contain no free buffers will the process try a remote LRU.
This optimization affects the following operations:
· Reading of new blocks into the buffer cache
· Creation of CR blocks
· Disk sorts
· Multiple database writers
In the case of reading in a new block, the LRU read is local, the target buffer is local, and the disk read is local. This does not provide a huge performance benefit in itself, because the latency of this operation is dominated by the disk access. However, it does reduce the overall work required to load the block into the buffer and reduces the load on the remote cache component.
In a heavy OLTP system, the creation of CR blocks in the buffer cache can represent a reasonable proportion of the system workload because of many queries and updates occurring on the same data. One of the overriding properties of a CR buffer is that it is rarely useful for any session other than the session that creates it. Therefore, when a CR block is created, it will be the local LRU that is scanned and a local buffer that it is created in, and it will probably be used only for the query executing on that quad. This provides a tremendous performance boost in CR block creation.
Large sorts-those that end up using disk storage as a secondary storage during the sort-require buffers from the buffer cache. All of these buffers would be allocated from the local LRU also.
One of the big benefits of QLBP is in its interaction with the true multiple database writers found in Oracle8. Rather than the slave writer processes of Oracle7, which performed only synchronous I/O, the database writers of Oracle8 can all perform the work associated with the old master database writer. This means they scan the LRU chains, build write batches, and issue I/O requests to write out the dirty buffers. Typically, the overhead of finding and building the write batches accounts for up to 50 percent of the database writer CPU time.
By allowing the multiple writers to be aware of QLBP, one or more database writer processes can be made resident on each of the quads that participate in the operation of the database. Each of these processes then has a preferred set of LRUs on which it will operate-the set of local LRU lists for that quad. This allows all the I/O and buffer scanning to be performed on quad-local memory and fully in parallel with the other database writers that operate on different quads and different LRUs. This enhancement allows significantly more throughput to be gained from the database writer component, and has demonstrated up to six times more throughput than the maximum amount achieved using the old writer slaves.
Further benefits are gained from QLBP and Oracle8i, because the heating of buffers in the LRU chain no longer occurs on each access and so the number of remote references is significantly lower than in prior Oracle releases.
8.2 UNIX Tools
In addition to the services provided by the kernel, several of the UNIX tools are invaluable in configuring and administering the database. A selection of these tools is presented here, with some pointers on their use in a large configuration. Some of these tools require root privileges to execute them, and this is sometimes a sticking point with the system administrator. The truth of the matter is that Oracle is the only thing that executes on the database server, and the database engineer is therefore effectively administering the system. It's important to assert some kind of easy access to root, even if it is done by means of a logged mechanism such as the sudo facility.
8.2.1 top
The top command is a shareware utility, sometimes shipped with the actual operating system as "contributed software." Written by William LeFebvre at Northwestern University, top shows real-time snapshots of the "top" processes in the system. The following screen shows the output of top on an HP platform.
System: vulcan Fri Apr 9 18:54:21 1999
Load averages: 2.81, 2.43, 2.10
3011 processes: 2987 sleeping, 24 running
Cpu states: (avg)
LOAD USER NICE SYS IDLE BLOCK SWAIT INTR SSYS
2.81 67.5% 0.0% 11.8% 20.7% 0.0% 0.0% 0.0% 0.0%
Memory: 4477088K (1510640K) real, 4712176K (1584728K) virtual, 9543984K free Page# 1/216
CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU COMMAND
3 rroot 5283 oracle 154 20 19036K 1036K sleep 31:59 29.17 29.12 oraclePRD
10 rroot 4905 oracle 241 20 19276K 1284K run 221:40 23.31 23.27 oraclePRD
4 rroot 5763 oracle 154 20 19276K 1284K sleep 234:02 23.10 23.06 oraclePRD
8 rroot 5128 oracle 241 20 19644K 1644K run 228:06 22.74 22.71 oraclePRD
12 rroot 22202 oracle 154 20 19260K 1268K sleep 187:33 22.10 22.06 oraclePRD
2 rroot 5282 oracle 154 20 19276K 1284K sleep 225:26 21.70 21.66 oraclePRD
3 rroot 4523 oracle 241 20 19276K 1284K run 224:57 21.41 21.37 oraclePRD
10 rroot 22048 oracle 154 20 19276K 1284K sleep 188:32 21.07 21.03 oraclePRD
9 rroot 5965 root 50 20 9968K 9564K sleep 575:27 20.81 20.77 midaemon
0 rroot 5523 oracle 241 20 19276K 1284K run 224:54 20.55 20.52 oraclePRD
3 rroot 5382 oracle 241 20 19388K 1400K run 226:29 20.19 20.16 oraclePRD
11 rroot 4676 oracle 241 20 19276K 1284K run 227:39 19.95 19.91 oraclePRD
11 rroot 22153 oracle 154 20 19372K 1384K sleep 183:21 19.95 19.91 oraclePRD

The main part of the screen is an ordered list of active processes on the system, including the percentage of a CPU that the process is currently using. It can also be seen that all the processes are Oracle foreground processes (oraclePRD)-this part of the display is not very useful on Oracle systems. In order to use top effectively, the PID should be used as the basis for an Oracle query to determine what that session is actually doing to be the most active in the system. This ordered list is the real value of top, in that it is immediately apparent where the CPU is being used most on the system.
This HP-enhanced version includes better support for multiple processors, including per-processor loading statistics, but most uses of top do not call on this kind of granularity. The rest of the display is virtually identical to the standard public domain version, showing load average and memory statistics for the system.
The latest version of top can be found at
ftp://ftp.groupsys.com/pub/top

8.2.2 Tcl and expect
Tcl has already been extensively introduced in this book. Its power as a general scripting language makes it an essential tool in building a large system. In addition to custom extensions such as dbaman, the popular expect extensions by Don Libes are a very powerful way to reduce the amount of time spent performing laborious tasks. Using expect, the Tcl interpreter can be used to automate previously manual tasks that require human input with a keyboard.
For example, the database start-up process often gets a good deal more complex than simply starting the database with Server Manager. It might include subsequent procedures to be run after the database start-up, such as "pinning" packages and sequences, starting daemons that connect to the database, restarting system monitoring software, and so on. Using expect, all of these things can be performed in a single operation, with exception handling all built in.
expect is an extension of Tcl that allows the script to spawn processes that have their input and output connected to the Tcl interpreter. This allows the output to be captured by the interpreter, processed, and acted on by sending the relevant action back out to the process. This sounds very complex until you consider using telnet as the process that is spawned:
spawn telnet prodbox
expect ogin:
send "tigger\n"
expect word:
send "pooh\n"
expect "\\$"
send "export ORACLE_SID=PRD\n"
expect "\\$"
send "svrmgrl\n"
expect "SVRMGR>"
send "connect internal\n"
expect "SVRMGR>"
send "startup\n"
expect "Database opened."
...

In this example script, a telnet session is connected to the production machine, and the database is started. The extensions provide a good deal of control over the response handling, beyond what is shown in this small example. In fact, there is an entire book about the expect extensions, called Exploring Expect by Don Libes, published by O'Reilly & Associates, Inc.
To download the extension, go to Don's home page at
http://expect.nist.gov

For all other available Tcl extensions, visit the Tcl consortium at
http://www.Tclconsortium.com

8.2.3 Korn Shell
One of the most important tools to be familiar with is the Korn shell (/bin/ksh). The Korn shell is available on all modern UNIX variants and so provides a powerful common factor among all the operating systems. Wherever it is easily possible, scripts should be written using the Korn shell, to increase portability.
The Korn shell is based on the Bourne shell but is significantly enhanced. It is beyond the scope of this book to cover all the points of the Bourne and Korn shells, so some "select highlights" will be covered.
When Oracle is used on a UNIX system, it rapidly becomes important to be able to perform tasks that cross the boundaries of both UNIX and Oracle-for example, extracting lists of used datafiles from the database to pass to a backup utility. This kind of activity is best performed using a "here-document" in the shell:
#!/bin/ksh -
# Get list of file from database and put into specified file ($1)
outfile=$1
sqlplus -s / <<-ENDSQL >${outfile}
set echo off head off feed off pages 0
select name from v\$datafile;
ENDSQL

In this script, sqlplus is invoked in silent mode (-s), which prevents it from displaying the usual banners. The <<-ENDSQL construct is what makes up the here-document. This construct tells the shell to take all the input for this command (sqlplus) from the following lines in the script, instead of from standard input. This will continue until the word "ENDSQL" is encountered, at which point the command is terminated and inputs return to normal. The dash (-) in front of ENDSQL specifies that the word could be anywhere on the line, not just at column 0. The word ENDSQL is for demonstration purposes only; any word can be used here. The final oddity in this example is the use of a backslash in front of the $ sign in v$datafile, which is necessary to prevent the shell from expanding $datafile as a shell variable.
Taking this concept further, the shell can be used to drive command-line utilities programmatically, by remote control. This provides a primitive version of the expect functionality shown earlier:
#!/bin/ksh -
sqlplus -s / |&
print -p "set pages 0 head off feed off"
print -p "select VALUE
from v\$nls_parameters
where PARAMETER='NLS_CHARACTERSET';"
read -p charset
if [ "$charset" = "US7ASCII" ]
then
print -p "select name from v\$database;"
read -p dbname
echo "Database $dbname is US7ASCII."
fi

In this example, the script will return the string "Database <name> is US7ASCII." if this is the case, or nothing if this is not the case. The second query is executed only if the database is found to be US7ASCII. The start of this process is where the sqlplus command is appended with "|&", instructing the shell to create a coprocess for this command. A coprocess is a process that is started in the background, with a two-way pipe between it and the shell. This process can then be read from and written to using the -p argument to the read and print commands, respectively.
This type of process provides a very powerful way to interface with the Oracle tools from the shell level. The output from the SQL sessions can then be processed very effectively with the UNIX text processing tools, the most powerful of which is awk.
8.2.4 awk
The awk command is actually another language. It comes in several flavors, from old awk, to nawk (new awk), to the Free Software Foundation's gawk. Of all of these, I would have to recommend gawk as the most powerful, because it removes some of the restrictions imposed by the standard awk facilities, such as the 3,000-bytes-per-line maximum. From this point on, the term "awk" will refer to any of the variants and will be discussed generically.
The awk language is designed to process flat text files based on patterns within the file. It has all the constructs normally associated with a procedural language, such as for loops, while loops, if tests, functions, and array variables. In fact, the syntax is very similar to the C language, and so C programmers find they can pick it up very quickly.
The format of an awk program is a series of "test { action }" pairs, where the action is performed only if the defined test returns a positive result. The most common form of test performed is a pattern match against the current line of the input file. If the pattern matches the current line, the action is performed.
Each of the tests is performed once for every line in the file, with the exception of two special "tests" called BEGIN and END. The actions associated with these tags are executed only at the beginning and the end of the file, respectively.
The awk language is a very powerful way to process text files, whether large or small. With sufficient logic in the awk script, it is possible to write parsers with awk, which is how awk was used to create the dbaman scripts from Oracle tracefiles in Chapter 2. That script is a good example of many of awk's features.
In addition to processing files, awk can be run with only a BEGIN tag specified. In this case, no input file needs to be given-the action for the BEGIN tag is just executed and then awk exits. This often provides a faster way to generate data than, say, using the shell:
awk 'BEGIN { for(i=0;i<100;i++) print i}'

This example prints a stream of numbers from 0 to 99 without requiring an input file to drive from. More complex uses of BEGIN-only awk scripts can be written, including entire scripts, using procedures if required.
Further reading on awk is highly recommended in order to gain a working competence with the language. Having a good understanding of its capability will allow you to determine where and when the tool should be used. A good alternative to awk, although not typically part of a standard operating system installation, is perl. In terms of capability, perl is a tremendous language, allowing many more interactions with the outside world than awk was designed to do.
8.3 Further Reading
Aho, A. V., B. W. Kernighan, and P. J. Weinberger. The AWK Programming Language. Reading, MA: Addison-Wesley, 1988.
1
Thanks to Kevin Closson (Sequent Advanced Oracle Engineering) for his cooperation on this subject.

2
This is the Sequent-specific NUMA-aware version of shmget() call used by the VOS code to create the segments on specific quads, implementing the SGA placement optimizations.



Scale Abilities Ltd
http://www.scaleabilities.co.uk
Voice: +44 1285 644533
info@scaleabilities.co.uk
TOC PREV NEXT INDEX