This chapter will cover the other side of the Oracle database-the process and memory architecture. These are the aspects of Oracle that make it a scalable database as opposed to a static filestore, and a good understanding of these concepts is essential for building a very scalable Oracle system.
The combination of these operational aspects of Oracle form the
Oracle kernel.
The Oracle kernel is not a kernel in the same sense that an operating system kernel is. All execution on the host operating system occurs in user mode, with standard system calls made to access system functionality. Oracle does not have any special privileges with regard to memory visibility or system function.
Instead, the Oracle kernel is a set of processes and shared a memory segment upon which they operate. The processes and shared memory combined form an
instance of Oracle. The physical side of Oracle on which the instance works is an Oracle
database. Although these two entities are very much intertwined, Oracle Parallel Server provides ways to allow multiple instances to operate on the same database. This will be covered in
Section 6.9.
Having declared that Oracle is not like an operating system kernel, it is time to contradict myself in some respects. The Oracle kernel is very much like an operating system kernel in that it performs the following functions on behalf of user connections:
· File I/O (through the operating system)
· Runtime memory management (especially when using multithreaded server)
However, Oracle does not provide the following operating system kernel functions:
· Cache coherent multiprocessor support
· Timesliced process execution
· Direct hardware interfacing
These functions are all provided by the operating system and are fundamental to the operation of Oracle. They are so fundamental, in fact, that most hardware vendors provide a variety of special facilities in their operating system kernels for Oracle to use. These facilities include
· Asynchronous I/O to take the I/O scheduling burden away from Oracle
· Physically locked virtual memory that is not subject to any further virtual memory management, such as paging
· The ability to stop processes from being preempted by the operating system (OS) kernel for short periods while holding latches
In this chapter, we will look in detail at some of the services provided by the Oracle kernel.
This is the most straightforward part of the Oracle system to understand. An Oracle instance consists of two different types of processes:
· Shadow (or server) processes
All processes in an Oracle instance are considered to be part of the Oracle kernel. In fact, all background and shadow processes are invoked from the same binary image-that of
$ORACLE_HOME/bin/oracle. In this way, Oracle can rely on the fact that all the connected processes will follow the same rules about accessing objects in the SGA, and will not corrupt other sessions as a result of invalid writes to the shared areas.
The shadow processes constitute the server-side portion of a user
TWO_TASK connection and therefore are also known as "server processes." There are two different architectures provided in Oracle8 for database connections: the dedicated server and multithreaded server (MTS). Whichever option is adopted, the shadow process is the process that performs the actual request servicing within the database.
Each process attaches to the Shared Global Area (SGA) for read and write use of the global caches. The process waits for requests to come over the network either directly (dedicated server) or through a dispatcher (MTS). When a request comes in from the client application, the server process interprets the request through the Net8 software that forms part of the process.
When the Net8 work has been completed, the process acts directly on that request on behalf of the user. Although the shadow process has Oracle kernel privileges and can do anything at all to the database or SGA, the process will not allow this unless the user connection is authorized to do so. However, when the user session has permission, the shadow process will do all the I/O, sorting, package execution, and whatever else the user session has requested.
In this way, the shadow process can be considered a proxy worker for the client application. The client application never accesses the SGA or the database itself directly, but only through the shadow process. It is the responsibility of the shadow process to enforce the security of the database and to service all work on behalf of the user.
Dedicated server processes are private to the user. That is, there is one process for each connected user, and that process is dedicated to the processing of that user's requests only.
The
oracle executable is renamed on execution to become something like
or
oracleSID (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))
|
for remote and local connections, respectively. Note that the actual
oracle executable on disk is not renamed; only the name of the
running process is changed.
The dedicated server process maintains its own private working space and caches in a private piece of memory called the Process Global Area (PGA). The working memory is also known as session memory and is the portion of memory that stores the context of the session for the user session.
The dedicated server process listens on the direct network connection that was setup when the user connected to the database. The network connection is private between the dedicated server shadow process and the user application client-no other processes can use this network connection.
With one physical process for each user connection, large dedicated server-based systems can put a large burden on the host operating system in scheduling the processes, in addition to an increased overhead in memory management.
Multithreaded Server (MTS)
The multithreaded server removes dedicated shadow processes from the system by switching user connections in and out of a pool of shared servers.
In
Figure 6.1, three users are connected to the database using both dedicated server and the multi-threaded server. Where the dedicated server maintains a one-to-one relationship with its corresponding client process, the MTS clients are sharing just two servers from a common pool.
The key component to consider in an MTS configuration is the dispatcher process. The dispatcher process acts as a software switch, listening for incoming requests from the client and connecting the client to a free server when required. The client is said to have a
virtual circuit to the server process. During the time that the server is processing the request for the client, it is unavailable for any other session. Really, shared servers are not multithreaded at all, but serially reusable.
In order to support this server reuse by multiple client sessions, Oracle moves the session information, which was formerly stored in the PGA of the server process, into the SGA. Specifically, the memory is allocated from the large pool, or from the shared pool if no large pool is defined. This memory allocation is not insignificant and can increase the size of the SGA by many hundreds of megabytes in a large system. In reality, the memory savings offered by MTS are not great, if any.
The increase in required SGA memory can have other consequences on some systems. Many 32-bit platforms have quite low restrictions on the maximum size of the SGA, for example. On some platforms, this can mean that the maximum size of the SGA is less than 1GB, which somewhat limits the maximum number of connections possible with MTS.
In addition, the enlarged SGA can cause problems with the virtual memory management of the host operating system. If the host does not support the locking of the SGA into real memory, each server process that attaches to the SGA must maintain page table entries (PTEs) for each of the (typically) 4KB pages that make up the SGA. With a 4GB SGA, this equates to around one million additional page table entries to manage for each process. If the SGA can be locked into memory, all processes can share a common set of page table entries and so this is less of a problem.
The MTS shared servers are classified as background processes by Oracle, as are the dispatchers. However, unlike the other "real" background processes, Oracle will not crash if the processes are killed from the operating system. The impact of killing shared servers is minimal; the termination will be detected, and a replacement server will be started. All sessions will be unaffected by this operation
unless they were active in the shared server at that time. Killing a dispatcher will terminate all sessions connected through it.
Net8, formerly known as SQL*Net, is the software used on both sides of the client/server connection in order to provide a platform-independent transport layer between client applications and the database server. It has become increasingly complex over the last few years, but the principles remain the same from a tuning perspective.
We are not going to cover any of the Net8 functionality here; this is well covered in the Oracle documentation. Instead, there are a few basics that need to be understood, particularly if your application is to be rolled out over a large geographic area.
The fundamental thing to bear in mind with Net8 is that it operates over a network. A network, like any communication medium, has two primary measures of "speed": latency and bandwidth. It is frequently the latency aspect that is overlooked, resulting in poor application response times for the user.
If an application is written in SQL*Forms, and each user runs the actual form locally on their PC, the Net8 communication is between the actual user PC and the database server. If the two are separated by a wide area network (WAN), problems frequently arise as a result of the latency of the underlying network rather than any particular Oracle problem.
As an extreme example, imagine locating a worldwide database server in North America and user PCs as far away as Asia. This wide area network could be implemented using satellite technology, providing a high-bandwidth network at relatively low cost. Unfortunately, the latency of a satellite-based network is approximately 0.5 second for each hop between the client and the server. This means that a round trip takes about 1 second and forms the
minimum response time for any database access.
Database access includes parsing, every fetch, and so on. This means that a query on
sys.dual will take about 3 seconds using this network. Increasing the bandwidth of the network will not improve this problem at all, because it is the latency of the network that causes the problem.
Therefore, the first rule of Net8 should be to keep network latencies between the client and the server to a minimum, preferably on the same physical local area network (LAN) in the datacenter.
The second rule of Net8 also relates to the underlying network. There is a tunable for Net8, specified in the
tnsnames.ora, called the SDU (session data unit). This is the size of the data that Net8 will send to the native network transport for sending across the network.
There are occasions when the default setting of the SDU is not appropriate and the value must be increased. Typical reasons for increasing this value include
· Client performing large array fetches from the server
· Large number of requests not fitting in one physical network frame
The second reason is of particular importance where the network has a higher latency than desired. One symptom of a small SDU is when sessions are frequently found to be in "
SQL*Net more data to/from client" wait states.
The SGA is the area of shared memory that Oracle creates on start-up. All Oracle processes (background and shadow processes) connect to this shared memory when they are started. The various areas of the SGA are provided to meet the following needs of the database system:
On its own, the SGA does nothing at all-it is just a piece of shared memory. It is the use of the SGA by the Oracle kernel that provides all of the above-listed attributes of the database.
The SGA, shown in Figure 6.2, is composed of four distinct components:
All of these areas, with the exception of the redo buffer, are comprised of many other subareas.
The fixed region is very small, typically on the order of 40-60KB. Within this region Oracle stores singleton latches, such as the redo allocation latch, in addition to the parent latches for variable lists of latches. The contents of this region are defined by a platform-specific assembly language source file called
ksms.s. This file used to be resident in
$ORACLE_HOME/rdbms/lib prior to release 7.3, but is no longer shipped. It was used as the mechanism for relocating the SGA, but this is now handled through
init.ora parameters, and the assembly language file is no longer needed.
This is a shame, because the file was essentially a symbolic map of the fixed region of the SGA. For example, there was an entry showing
.set kcrfal_,sgabeg+19072
|
This shows that the structure for
kcrfal (redo allocation) is stored in the fixed region, 19,072 bytes from the beginning of the SGA. Part of this structure is the actual latch itself.
The names of structures relate to the names of the software modules that contain them and therefore are considered Oracle Proprietary (and therefore cannot be published). Anyway, it's immaterial now, because
ksms.s is no longer shipped.
Essentially, the fixed region contains the structures that relate to the contents of
V$LATCH: If you subtract the lowest
ADDR from
v$latch from the highest
ADDR, you will get an approximate size of the fixed region. It is called the fixed region because this component is sized at compile time. No amount of change in the
init.ora will change the size and structure of the fixed region, only its location in memory.
Now is a good time to get into some specifics of how Oracle implements latches.
Anatomy of an Oracle Latch
In the example above, we state that the
kcrfal structure starts at an offset of 19,072 bytes from the beginning of the SGA. This structure might look a little like that shown in Figure 6.3.
The latch is physically stored at the head of the structure. Due to the serialization provided by the system bus, we already know (from Chapter 2) that we can perform atomic writes to memory locations of certain sizes. The limit on the size of the location that can be atomically updated is dependent on the platform (processor) architecture.
In this example, we assume that a 32-bit atomic memory write is available. Therefore, the mechanism for acquiring the latch is simply a test-and-set operation on the memory location at 19,072 bytes into the SGA memory region. If the test-and-set operation is successful, you have acquired the latch and can continue to modify any other part of the structure, copy data into buffers, or modify several elements in a linked list of which this structure is the head. When the work is complete, the memory location is reset back to the "unset" value, and other sessions can allocate the latch.
You may have noticed that the latch location has been "padded" to 32 bytes in length. This, again, is just an example and assumes that the length of a single CPU cache line is 32 bytes on this platform. The reason for the padding is to guarantee that the latch itself is isolated in its own cache line and is not coexisting with another, unrelated latch. If this were not the case, it would result in false contention on the cache line, making the allocation of a little-used latch as difficult as that of a highly-contended-for latch.
The variable region contains the following elements:
· Variable lists of latches (i.e., all the
cache buffers chains latches)
· All lists (cache buffers chains, LRU lists, etc.)
In turn, the shared pool contents include the following elements:
· Session and process context
· Session memory (if running MTS and no
LARGE_POOL is defined)
· Session memory (if running MTS)
· Oracle backup work areas
The buffer cache is an area of the SGA that is set aside for the caching of data blocks. From release Oracle8 onwards, it can contain up to three distinct "buffer pools," which are separately managed caches for different types of data usage:
The default buffer pool is the buffer cache that is created by default and has all objects assigned to it by default. This can be directly compared to the single buffer cache found in Oracle7. Optionally, two other buffer caches (pools) can be defined, which can be tuned to behave more suitably for specific situations. These two additional caches are the
KEEP and
RECYCLE caches (see Figure 6.4).
All three of the buffer pools have separate allocations of buffers and LRU lists that manage the buffers. The
RECYCLE and
KEEP pools are explicitly defined in the
init.ora, whereas the
DEFAULT pool allocates the remainder of the buffers and LRU lists for itself.
There are subtle differences between the three buffer pools, and some of these are highlighted next. One of the major differences, however, is that we now have the ability to individually tune the pools individually, dependent on their use.
The idea of the
RECYCLE buffer pool is to use it more as a work area than a cache; this pool is used to store blocks that are virtually never reused after the initial read. Even though the blocks will rarely be reused, Oracle still needs to run all database blocks through the buffer cache in order to maintain consistent read views, and thus a buffer cache is still necessary for these blocks.
However, in a single buffer cache configuration, these single-use blocks could quickly dominate the buffer cache, implicitly flushing out other blocks that
will be reused. Blocks that are frequently referenced are not affected by this cache pollution, because they will always be at the most recently used (MRU) end of the LRU lists. It is blocks that are reusable, but are not necessarily used very often, that are affected by this kind of cache pollution.
One operation that can pollute a cache is random single-row lookups on a table that is much larger than the cache itself. Each time one of these rows is read from disk, it must occupy an entire buffer in the cache. If these reads are part of an index-driven scan across the table, all of these buffers will need to be allocated from the cache, quickly dominating the space in the cache. Other blocks will need to be flushed from the cache to make space for these incoming blocks.
In this case, the table that is being accessed in this way can be moved to the
RECYCLE pool. The idea of this pool is that we just give Oracle a work area for the blocks but don't really expect any kind of reusability. This having been said, the standard LRU mechanism still applies to this buffer pool but is managed in isolation from the other buffer pools. Therefore, if a few of the blocks in this table are reused from time to time (such as an index root block, as opposed to the leaf blocks), they will still remain in the cache if the access frequency merits the retention of the block.
When allocating a
RECYCLE pool, we typically want it to be just large enough to allow contention-free allocation of the buffers. If the buffer is created too small, there will be too much demand for the number of buffers available, and potentially reusable blocks will be eliminated from the cache before they are requested again. However, this cache is typically not directly aiding system performance a great deal and so should not be wasting memory that could be used by the other caches. The sizing of the
RECYCLE pool can require a little trial and error, and Oracle recommends an initial setting of one-quarter the number of buffers that the object occupies in the
DEFAULT pool prior to being moved.
The
KEEP pool is for the allocation of buffers for objects that are accessed with medium frequency, or those for which a consistent response time is desirable. Blocks that are accessed very often will remain in cache anyway, even in the
DEFAULT or
RECYCLE pools. The
KEEP pool is for the blocks that are not accessed quite enough to remain in cache, or those with a mixture of hot blocks and warm blocks.
Using the
KEEP pool, it is possible to assign just a few small objects to the cache and thus have the peace of mind of knowing that the buffers will never be aged out by unrelated blocks over time. This allows consistent response times for queries that access that object.
In order to support this a little more than the
DEFAULT pool, some subtle changes are made in the algorithms used for the
KEEP pool. First, to minimize cache pollution by excess CR versions of buffers (see
Section 6.5.5), these buffers are not created in the
KEEP pool. As they are assumed to be short-lived, CR versions of buffers in the
KEEP pool are created in the
DEFAULT pool in order to keep the content of the
KEEP pool as static as possible.
In addition, the LRU algorithm used in the
KEEP pool is slightly different. This will be discussed on
Section 6.5.3.
6.5 Operation of the Buffer Cache
The buffer cache is a portion of the SGA that is set aside for the caching of database blocks. As memory access is approximately 100,000 times faster than disk reads, it is highly desirable to service as many data requests from buffers in the cache as possible, without going to disk.
This is the ultimate goal of the buffer cache and is often the only information needed about the buffer cache in order to run a system fairly effectively. Simply keep an eye on your cache hit ratio:
Hit ratio = 1 - [physical reads/(db block gets - consistent gets)]
1
However, when the buffer cache is pushed very hard, it is useful to understand the workings of the buffer cache.
The buffer cache is a set of buffers in memory, each sized as one Oracle block. When a query needs a block of data from disk, Oracle first checks the buffer cache for the presence of that block in memory. This search is performed using
hash chains.
The hash chains are fixed-depth lists of pointers to the actual buffers in the cache. A conceptual view of the list elements is shown in Figure 6.5.
When Oracle has determined which block it requires, it can then hash this dba
2 to gain the ID of the hash chain that it should search. The buffer pool that the object is assigned to does
not form part of this hashing algorithm; the entire set of hash chains is shared among all the buffer pools, with the chain ID determined by dba only. This means that an object that is cached in the
RECYCLE pool could easily share the same hash chain as an object in the
KEEP pool by virtue of dbas that hash to the same value.
Oracle maintains many hash chains, and they can be explicitly set using the
_db_block_hash_buckets parameter in the
init.ora. The maximum total number of elements in all of the hash chains is the same as
db_block_buffers, because the hash chains are a list of the contents of the actual buffer cache.
Let's start an example that highlights this concept and the others that follow. In this example, we assume that we are using a single buffer pool, as in Oracle7.
A user submits a query to the database. It is determined that dba 15
3 is required as one of the blocks that is needed to execute this query. Oracle then hashes this dba, using the total number of hash chains as one of the hashing parameters. This yields a result of SEVEN, and so Oracle has determined that this dba will be listed in hash chain number 7 if it is present in the cache.
Oracle then starts at the top of hash chain 7 and checks the actual dbas of all the entries until it either hits the end of the chain or finds a match on the requested dba.
If the dba is matched in the hash chain, Oracle then checks to see if this buffer is OK to use. The information required to determine this is also likely to be stored within the hash chain element itself. One reason that the buffer may not be OK to use could be that the buffer is for an incompatible CR version of the block. If the buffer is not available for use, the hash chain is traversed further to determine whether more buffers store the same dba. If none of these buffers are available, or no further matching buffers are found for that dba, Oracle does not have a buffer in the cache that can satisfy the requirement and must go to the LRU list to create a new one (see
Section 6.5.3).
If the buffer is found and deemed to be usable, Oracle can
heat the buffer on the LRU list (again, see
Section 6.5.3) and pin the buffer while it is in use. This pinning operation is a little like a latch but operates on a single element in the chain.
At this point, the buffer request has been serviced for the query, and Oracle can proceed to read the buffer and to read the actual contents of the buffer to act on the query. After use, the buffer is unpinned and is available for other sessions.
Twice in this section we have mentioned the LRU list, so let's go into some more detail on this.
In order to have a cache that is based on frequency of use, a list that shows how often the blocks have been accessed needs to be maintained. In the Oracle kernel, this is implemented using an ordered LRU list, where the blocks with entries at one end of the list have been accessed most recently and the blocks entered at the other end of the list have been accessed least recently. This is the basis of a simple LRU chain.
In Chapter 1, we saw an example of an LRU chain in operation (see
Section 1.5.4). This is the way Oracle managed its LRU chain right up until version 7.2. In this release, Oracle split the single LRU chain into multiple LRU chains in order to improve the scalability of the system.
We have two scenarios passed down from the section on hash chains:
1. Heating the buffer when it is located successfully in the hash chains
2. Finding a buffer to use when a block is not present in the cache, or when one is needed to create a CR version of the block
A simple view of the heating procedure is covered in Chapter 1 (see
Section 1.5.4). Each time we access a block, it becomes the most recently used (MRU) buffer by implication. Therefore, the buffer is heated to the MRU of the LRU list. This heating process ensures that blocks that are not used as often as others will get implicitly aged down the LRU list by the heating process of the other blocks. When they get to the LRU end of the list, they become targets for replacement.
This is where the buffers come from in the second scenario. Oracle will go to the relevant LRU list when it needs a buffer and will take the first available buffer slot from the list.
Note: The buffers at the LRU end of the list may not necessarily be available for use, because they could be dirty buffers (modified, but not written to disk).
The buffer is then heated to the top of the LRU list and is moved from its prior hash chain to the correct hash chain for the dba of the new block.
With the introduction of multiple LRU lists, sessions can no longer assume the location of a single LRU list. Instead, each session will round-robin between the multiple LRU lists, based on the buffer pool in which the object is cached. If an LRU is locked, the session will move onto the next LRU without waiting.
This is the basis of a simplified linear LRU mechanism, applied to the Oracle buffer cache. However, Oracle implements LRU slightly differently in order to gain better performance from the cache.
Oracle 8.1 Buffer Cache LRU
This description applies to Oracle 8.1 only. Prior (and probably future) releases of Oracle do not (will not) necessarily operate in this way.
When Oracle reads a block into the default buffer cache, it is
not heated to the MRU end of the list immediately. The new block is generally inserted into the middle of the LRU list, because it is not known at this stage whether the block will be accessed enough to deserve a slot at the MRU end.
There then follows a short period of time during which the buffer will not be heated, even if it is accessed again. This timeout period is designed to be used in situations in which a block could be read and updated in quick succession and then never used again. One example of this would be a single row update: Oracle reads in the block in order to determine whether it meets the criteria for update and then modifies the block. This kind of activity could quickly dominate the MRU end of the cache in an OLTP system if the timeout period were not imposed. If the block continues to be accessed outside of this period, it will be heated to the MRU end of the list.
In the
KEEP pool, the blocks are all assumed to be useful, especially because the kernel does not need to cater to CR buffers in its LRU algorithm. Therefore, all new blocks going into the
KEEP pool are heated on first access.
Oracle 8.1 adds further optimizations for the LRU algorithm. Using the simple "read and heat" algorithm, every access to a block requires an LRU operation in order to heat the buffer, imposing a significant overhead in buffer access.
4 In order to improve on this, 8.1 implements an LRU algorithm that keeps a reference count for each buffer. No heating occurs at all, unless the buffer hits the end of the LRU list and has a high enough reference count. This way, the required LRU manipulations are very much reduced.
The buffer cache is covered by a large number of latches, typically more than all other parts of the Oracle kernel combined. This is vital, because the buffer cache needs to be completely free of corruption and yet retain a high degree of concurrency in access.
First, every hash chain is protected by a latch. These latches are the "
cache buffers chains" latches and, like all child latches, are reported in
v$latch_children. This is one of the important aspects of a system to watch, especially when the application is very heavy on the buffer cache.
Viewing
v$latch_children, the number of gets and sleeps can be determined for each child latch. It is possible and likely that a small number of these child latches are requested much more often than the others. This is sometimes attributed to a single block being heavy, but more often than not it is the result of several blocks hashing to the same chain. When this happens, a large proportion of accesses to the buffer cache become serialized through this latch. The following query will report on the hottest hash chains:
SELECT a.HLADDR,a.DBARFIL,a.DBABLK,b.GETS,b.SLEEPS
AND a.state != 0 /* ie, currently in use*/
AND b.sleeps > ( SELECT 100*avg(SLEEPS) from v$latch_children
WHERE name='cache buffers chains' )
|
The output from this query might look a little like this:
HLADDR DBARFIL DBABLK GETS SLEEPS
-------- ---------- ---------- ---------- ----------
AA416498 158 3139 538830353 1588779
AA416498 208 374 538830353 1588779
AA416498 480 3343 538830353 1588779
AA41CED8 480 3543 1974346642 950985
AA41CED8 158 3339 1974346642 950985
AA33A8C8 157 12803 555346192 112729
AA3F5268 307 1283 148471828 93581
AA340B98 85 71701 263159467 81874
AA3C1498 480 783 554143110 78639
AA3E48C8 72 58417 358435709 46844
|
This shows several things about the operation of the hash chains in Oracle. First, there are multiple completely different blocks in the database on the same hash chain (
HLADDR). This shows the hashing process sharing the chain among a diverse range of dbas. Second, it can be seen that there is still heavy contention for the first hash chain. There are 50 percent more sleeps occurring on the latch protecting this chain than on the next hottest chain, even though it has been acquired only one-quarter as often. This is likely to be caused by the duration of each latch operation that protects the chain, or by a series of "burst" allocation requests.
In this example, a good next step is to create more hash chains using the
_db_block_hash_buckets parameter. The top two hash chains have two file/block combinations that could be the objects contended for: File 480, blocks 3343 and 3543; or File 158, blocks 3139 and 3339. The only way we can determine which of the objects is getting requested so much is to create more hash buckets and hopefully split up the dba ranges into different hash chains: The guilty object would then be clearly visible.
The second set of latches over the buffer cache come from every LRU chain being protected by a latch, shown as "
cache buffer lru chains" in
v$latch_children. There are many times fewer LRU chains than there are cache buffers chains, because there is a minimum ratio of buffers to LRUs enforced at start-up time of 50 buffers per chain.
CR (consistent read) block creation was first discussed in
"Block Header" in
Section 5.5, in the anatomy of an Oracle block. In order to support the CR model provided by Oracle, the buffer cache must be employed to supply buffers that contain versions of a given block that differ from its on-disk representation. There may be several different versions of the same block in cache at any one time. This is a CR block.
When a CR version of a block is needed by a query, it will first of all determine if there is any version of the block in cache at that point. If so, it will choose the version that has a version number closest to, but no lower than, the desired one. If the version number were lower than the desired number, the block would need to be rolled forward from redo log; CR is supported from undo information only.
If a buffer is found that can be used to construct the CR block, it is cloned onto a free buffer in the cache and assigned to the same hash chain as the original. If no version can be found, the block is read off disk.
Once the buffer is available for constructing the CR block, the last relevant piece of undo information is applied to the block, as determined by the ITL entries in the block. This rolls back both the contents of the block and the ITL entries themselves. This process repeats until the change number of the block matches that of the start of the query. At this stage, we have a CR version of the actual block.
As previously discussed, this block is
never written to disk. It exists only in the cache, as a reflection of a previous version of a block. Writing the block to disk would mean corrupting the database.
In our discussion of the LRU list, we referred to a dirty buffer that could be found at the tail end of the LRU list. The list of dirty buffers is also maintained within the LRU structures. The reason for this is simple: There is no point in writing out dirty buffers if they are about to get updated in the very near future.
However, it can be seen that without a formal way to clean (write out) these buffers, the cache would quickly become full of dirty buffers. This never happens in reality, because a given session will search only a certain percentage of the LRU list for a clean, reusable buffer. If it does not find one before hitting the threshold, the session will post
5 the database writer and wait on "
free buffer waits" until buffers become available for reuse.
Every time the database writer is woken up to write out some dirty buffers, it will start at the cold end of each LRU and scan toward the hot end until a complete
write batch full of dirty buffers is found. The size of this write batch is determined by the write batch size. This used to be a parameter in the
init.ora but is now derived according to the algorithm in the
Oracle8 Tuning Guide ("Internal Write Batch Size").
Another way in which the database writer is woken is by the checkpoint process. The checkpoint process doesn't actually do the writing associated with the checkpoint; it simply posts the database writer to write dirty blocks that are less than a given change number out to disk. This is an incremental checkpoint, because not all the dirty buffers need to be written out at each checkpoint. Each time a checkpoint occurs, this determines where rollforward recovery must start in the event of a failure prior to the next checkpoint.
Multiple database writers can operate on a given buffer cache/instance. Each database writer will work on LRUs in a round-robin fashion, ensuring that they do not contend with themselves for the latch. This implementation of
true (i.e., independent) database writers is an improvement on Oracle7, where multiple database writers were implemented as a master/slave arrangement.
6.5.7 Flowchart Summary of Buffer Cache Operation
The flowchart presented in Figure 6.6 should serve as a useful reference for the operation of the buffer cache. It is not a complete view of the operations but provides a simplified version for ease of use. In particular, the two items marked with asterisks should be viewed in conjunction with the text in the relevant preceding sections.
6.6 Shared Pool Structure
The shared pool (see Figure 6.7) is a portion of the SGA, and itself contains several components.
These components compete for space within the space allocated to the shared pool. This means that if a poor cache hit ratio is being achieved in the library or dictionary cache, the other caches in the shared pool will be reduced in size proportionately, attempting to allow the badly controlled area to stabilize.
The "Other" portion of the shared pool is used to cache other structures needed for operation of the instance. The detail of the components in the shared pool (and the rest of the SGA) can be found by querying
v$sgastat:
----------- -------------------------- ----------
db_block_buffers 167772160
shared pool free memory 294079744
shared pool miscellaneous 5625880
shared pool transactions 3424000
shared pool db_files 264736
shared pool table columns 72928
shared pool SEQ S.O. 336000
shared pool KGK heap 24408
shared pool db_handles 1470000
shared pool KQLS heap 8121648
shared pool fixed allocation callback 3560
shared pool branches 960240
shared pool Checkpoint queue 347888
shared pool PLS non-lib hp 2104
shared pool ktlbk state objects 1712000
shared pool partitioning d 32024
shared pool db_block_hash_buckets 6062232
shared pool DML locks 1856000
shared pool State objects 2913312
shared pool KGFF heap 383656
shared pool trigger defini 18360
shared pool distributed_transactions- 392168
shared pool db_block_buffers 8683520
shared pool dictionary cache 37045784
shared pool state objects 886728
shared pool messages 504000
shared pool PL/SQL MPCODE 1899184
shared pool enqueue_resources 576000
shared pool library cache 201110312
shared pool table definiti 23448
shared pool sql area 115850968
shared pool processes 2632000
shared pool sessions 8384000
shared pool kxfp buffer su 14932792
shared pool event statistics per sess 9920000
shared pool PL/SQL DIANA 1616072
shared pool transaction_branches 736000
shared pool kxfp subheap 240616
|
6.7
Shared Pool: The Library Cache
The library cache was introduced in Oracle7 as a means of cutting down on the amount of processing required just to parse an SQL statement. Prior to Oracle7, all incoming SQL requests were individually parsed each time, regardless of whether or not the statement had been parsed before. The CPU cost of performing a parse is high, and it was determined that a method of reducing parsing was needed.
In Oracle7, the library cache was introduced as the mechanism of reducing the amount of parsing in the database.
All SQL statements are classed as cursors. Physically, a cursor consists of the following components:
· Client-side runtime memory (SQL runtime library)
· Server-side runtime memory (known as runtime memory)
· Server-side private SQL area (known as persistent memory)
· Server-side shared SQL area (the library cache)
When a cursor is opened and parsed, Oracle determines (as detailed next) whether or not the statement is identical to a previously submitted cursor. If it is, Oracle will reuse the parse information from the prior execution of the cursor.
6.7.3 The Parsing Mechanism
In order to convert a SQL request, which is essentially just a human readable text string, into an executable request, Oracle
parses the statement.
There are several steps involved in parsing a statement:
· Semantic and syntactic checking of the statement
· Dictionary validation that all the objects and columns exist
· Name translation of objects (i.e., synonym expansion to actual objects)
· Checking user privileges on the objects accessed in the cursor
· Production of an execution plan by means of the optimizer
· Loading of the statement into the library cache
These steps comprise the extreme of parsing-a
hard parse. At this stage, although the cursor exists in the shared pool, it is not currently actually
shared, other than by the session that created it. It exists as
sharable, but the final steps have not been performed to make it usable by another session. These steps occur during the first execution of the statement by another session.
Cursor Representation in the Library Cache
When a statement is loaded into the library cache, two different entities are created:
When the statement is first parsed, it creates one head and one body for the statement. The cursor head can be thought of as the master record for that statement, containing the following information:
The head is locked when the cursor is first opened and is unlocked only when the cursor is closed by all sessions. The head cannot get aged out of the library cache while it is locked. When the head is unlocked, it is eligible for aging out of the cache, at which point all the corresponding body entries are also flushed out.
The body contains all the specifics for the cursor, such as the actual execution plan and the bind variable information. A cursor may have many bodies for its single head, even though the text is the same for each. The number of bodies is reported in the
VERSION_COUNT column of
v$sqlarea, and each body's statistics are reported in
v$sql.
Identical executions of the same statement will reuse both the head and the body of the statement. New bodies are created under the following conditions:
· Bind thresholds differ greatly from prior executions.
· Optimizer plan changes as a result of bind variables supplied.
When a SQL statement is submitted using bind variables, Oracle builds a cursor body that can cater to bind variables of that specific length. The body will also cater to bind variables that are similar to that length, subject to different buckets of sizes. For example, if a query has a bind variable for
NAME, the first time it is executed the value bound to the
NAME variable is 6 bytes, and Oracle creates bind information to cope with up to 50 bytes in that bind position. The next time it is executed,
NAME is 200 bytes in length (a really long name), and Oracle cannot reuse the existing body-the runtime requirements of the statement go beyond the definition in the current body. At this point, Oracle will create a new body to cater for-say, 65 to 256 byte bind variables at that point. The previous body will remain for future use, subject to being aged out.
The optimizer plan can also change between executions, owing to the existence of histograms for the values for a column. If a value is supplied that can be approached in a more efficient way as a result of cardinality, then a new plan is created and put into a new body for the statement.
Cursor bodies are eligible for aging out at any time, regardless of whether the cursor is open or not. If a cursor is open and the body is aged out, the head (which is locked) contains sufficient information to reconstruct the body for when it is next used. This is classed as a reload, and the hit rate of executions to reloads can be determined by the following query on
V$LIBRARYCACHE:
SELECT 100*sum(reloads)/sum(pins) Reload_Ratio
|
Prior to the steps of a hard parse, Oracle hashes the statement to produce an identifier to look for in the library cache. The range of the hashing algorithm is very broad, and most of the time there will be only one statement for a given hash value. It is not guaranteed to be unique, however, and the
ADDRESS of the statement in the shared pool should always be used in conjunction with the hash value in order to gain uniqueness when querying the library cache. If this hash value corresponds with one already stored in the library cache, and further checks prove this to be the same statement as one that has already been parsed, then this statement only requires one of the three types of soft parse instead of the full hard parse.
The first time a session attempts to use a sharable cursor (i.e., one that resides in the shared pool but has never been executed by this session), two actions are taken by Oracle. First, a name translation occurs, just as in the hard parse. The reason for this is that this session is not the same user as the original, and it must be determined that this session is referring to the same objects.
Second, the user has to be authenticated on all the objects in the cursor. Once this has been completed, the user is put on the authentication list.
The second time a session attempts to use a cursor, it is now classed as shared because of the prior name translation and authentication steps. However, because grants may have changed since the last execution, the authentication step must still occur.
Soft Parse Type 3: Session Cached Cursors
When session cached cursors (
session_cached_cursor init.ora parameter) are used, it is the third parse call for a cursor that creates the entry in the session's cursor cache. Once in the session cursor cache, any CLOSE calls to the cursor are ignored, and the statement does not need to be reparsed, up to the number of cursors specified in the
init.ora.
Once the hash value for the statement has been determined, Oracle goes to the library cache and determines whether the statement is already cached. If so, Oracle uses that information to reduce the steps needed to execute the statement. This is known as a
soft parse, and there are up to three different severity levels for this type of parse.
The ratio of hard parses to soft parses should as close to zero as possible. In this case, good sharing of SQL is occurring, and the system is running efficiently from the library cache. The cache hit ratio for the library cache can be determined by the following query:
SELECT 100*(1-(gets-gethits)/gets) Ratio
WHERE namespace='SQL AREA'
|
This query reports the hit ratio as a percentage, and should be as close to 100 percent as possible.
The library cache is protected by latches, in the same way as are the other memory structures in Oracle. Prior to Oracle 7.2, this was performed using a single latch on the entire library cache, and scalability was a big problem on systems that had heavy parse overheads.
Since 7.2, Oracle has maintained multiple library cache latches, where the latch used is based on the hash value of the statement. The default number of latches is the nearest prime number greater than the number of CPUs on the system, but can be explicitly set using the
_kgl_latch_count parameter in the
init.ora. The inclusion of multiple latches protecting the library cache allows for far greater work to be achieved on the system.
Work occurring in the library cache must be carried out while holding the relevant library cache latch. For most requests, the library cache latch is acquired and released several times before the request is complete. The number of times the latch is required depends on the severity of the operation. For example, a simple
SELECT * FROM DUAL acquires the library cache latch 29 times to hard parse the statement, but only 19 times for the next execution, and only 10 times for executions after that. The amount of work performed in the library cache decreases each time, owing to the different levels of parsing.
A badly sized shared pool, or an application that does a poor job of sharing SQL, will quickly fill up the shared pool. In this eventuality, Oracle must start to age items out of the library cache using a "modified" LRU algorithm.
When space must be found, Oracle starts at the cold end of the LRU and flushes out cache entries until a contiguous area is available that is large enough for the minimum contiguous size required by the statement being parsed. The actual amount of memory freed up by this process could be far in excess of the actual required memory, because the entries that are aged out are unlikely to be contiguous themselves. This means that a fairly severe flush is required to gain the space needed for the new statement. Each time this aging process occurs, the demand on the library cache latch increases greatly as a result of the extra strain put on it by the aging process.
As time goes on, the shared pool becomes more and more fragmented, resulting in more and more pressure on the library cache latch. There comes a point where the miss rate on the library cache latch becomes unacceptably high, and the shared pool must be flushed explicitly to relieve the pressure on the latch. Even having to hard parse all the requests for the foreseeable future is not as bad as having a full and fragmented shared pool.
During the aging-out process, Oracle can age out only objects that are not currently being executed, pinned in memory, or locked (i.e., cursor heads). If an area of library cache needs to be found for the parsing of a large object, this can result in Oracle being unable to flush any further objects out of the cache and still not finding enough space for the object. In this case, an
ORA-04031 error is returned, and the request fails. The shared pool is dangerously undersized and/or badly treated by the application if this occurs.
The latch cost of a 4031 error can be estimated by a simple test. With a large shared pool, a connection to the database using SQL*Plus requires approximately 3,357 latch gets. The same test performed on the same instance with a small shared pool (to provoke 4031 errors) takes out the latch 4,596 times. These measurements reflect only acquisition counts, but it is also likely that the period during which the latch is held will increase when the management overhead is increased in this way. These two factors combined result in high latch contention on a busy system.
6.8 Other Memory Structures
The dictionary cache is a special cache designed expressly for the caching of certain objects in the SYS schema, as described in
Section 5.7.6. It is also known as the
rowcache, because it operates on a row-by-row basis rather than the block basis on which the buffer cache works. These two terms are used interchangeably in Oracle documentation.
The purpose of the dictionary cache is to speed up access to the tables that Oracle needs to operate. Included in this are all the table definitions, storage information, user information, optimizer histograms, constraints, and rollback segment information.
When an object is not found in the dictionary cache, Oracle must go to disk and retrieve the required dictionary information. The process of getting dictionary information from disk is called
recursive SQL. Often, a single SQL statement will require many recursive calls before it can be executed, if the information is not found in the dictionary cache. The recursive calls quickly dominate the cost of executing a single statement at this point.
Due to the presence of the dictionary cache, the tables that make up the data dictionary cannot be manipulated by DML. In physical terms they can be, and Oracle will allow you to do so without complaint. However, this can cause database corruption as a result of the dictionary cache being unaware of these changes.
If
SYS.FET$, for example, were to be modified by a user session, any updates would occur with a standard DML TX enqueue rather than the ST enqueue that is used when the kernel manipulates this part of the dictionary. In addition, when the update is committed, the dictionary cache would not be invalidated, therefore making the operational state of the database inconsistent with the state stored on disk.
The log buffer is the buffer that stages database changes prior to being written to the redo log. When a change is made to database blocks, they are written to both the buffer cache and the log buffer. If a commit occurs on the system, regardless of origin, all data in the log buffer is written to the redo log. This is the way Oracle guarantees that committed transactions are on disk; the dirty buffer cache blocks can be re-created from the redo log in the event of a crash.
The log buffer is a circular buffer, as shown in Figure 6.8.
This means that the log writer "chases the tail" of the new buffers being copied in. It also means that the new buffers can chase the tail of the log writer. To prevent the buffer from filling up, the log writer will start to write out the buffers when the log becomes one-third full, or every 3 seconds.
In reality, a busy system commits a good deal more often than every 3 seconds, and so this typically is not a problem (at least in a transactional system). The important thing to ensure is that the log buffer is large enough to allow the log writer to catch up before any burst writes to the buffer threaten to fill it up.
When a commit is issued, Oracle writes a commit record to the redo buffer for that transaction and flushes all of the buffer to disk. If any commits are issued while this is happening, they will all be handled as an atomic unit by the log writer when it returns. This is known as a
group commit and reduces the amount of time a session will wait for a commit confirmation.
The log buffer is ultimately protected by just one latch-the
redo allocation latch. This latch is serialized by design, because it is the only point in the design of the Oracle database that is inherently serial. All records must be written to the buffer in order, and without interfering with each other.
However, a single latch would not scale very effectively, as sessions must copy potentially large amounts of redo records into the log buffer. This could take a long time and would result in high contention for the latch as many sessions waited to copy in their data. To prevent this, Oracle provides
redo copy latches.
Redo copy latches are an abstraction of the redo allocation latch specifically geared toward reducing the amount of time that any one session holds the redo allocation latch. Using copy latches, the redo allocation latch is acquired for just long enough to reserve the amount of space needed for the copy. One of the redo copy latches is then acquired in order to perform the actual copy.
The use of redo copy latches is governed by two
init.ora parameters:
log_simultaneous_copies and
log_small_entry_max_size. The
log_simulataneous_copies parameter sets the number of copy latches to create, where
2*cpu_count is the default. The
log_small_entry_max_size parameter determines the threshold over which it is preferable to copy using copy latches rather than the allocation latch.
6.9 Oracle Parallel Server Concepts
Oracle Parallel Server (OPS) is typically surrounded in mystery, because not many people have experience in implementing or administering such a system. However, the basic concepts of OPS are not
very complex and should be understood by anyone involved in specifying or implementing very-large-scale systems.
Ultimately, the scalability of a single system has a limit at any given point in time. If your application looks to require three times more CPU than the largest server on the planet can provide, even after tuning, then a single-system solution is no longer an option. Without OPS, Oracle is limited to the scalability provided by the hardware vendor.
Figure 6.9 shows a high-level view of the operation of OPS. Starting at the bottom, we have a database that is accessible by all nodes that need to run an Oracle instance against this database. The access can be direct, through a shared disk array found in clustered systems, or indirect, through software layers commonly found in shared-nothing MPP systems. All nodes open the database in shared mode, which allows other instances to open the same database. Each node acquires a set (or thread) of redo logs for exclusive use during normal operation. The gray arrows are intended to demonstrate that the redo logs must be available to all other instances for recovery purposes.
6.9.2 Distributed Lock Manager (DLM)
On top of the physical database, each node runs an Oracle instance exactly as normal, with the addition of the distributed lock manager (DLM) to take care of communication between the instances.
In single-instance Oracle, the state of locks is held internally as a series of structures in the SGA. When a lock needs to be acquired or released, it is simply a case of taking a latch (the enqueues latch) and updating the structures in memory. When there is more than one instance accessing the same data, a common view of lock states must be maintained for all nodes; this is the function of the DLM. In addition to standard enqueues that we are familiar with, the DLM also controls access of a variety of other resources, most notably the buffer cache by means of parallel cache management (PCM) locks. This is covered in
Section 6.9.3 and is one of the fundamental differences between single-node Oracle and OPS.
In the Oracle7 days, the DLM was typically provided by the hardware vendor rather than by Oracle. This was both good and bad, depending on the implementation of the DLM.
The good thing about a vendor-supplied DLM was that it could be implemented as part of the operating system kernel and subsequently always run at kernel priority. It could also have direct access to the hardware layer in order to communicate with DLMs running on the other nodes.
The bad thing about a vendor-supplied DLM was that if problems occured (and they did, many times), the resolution of the problem lay across the boundaries of two companies. Frequently, these problems took a long time to be resolved and sometimes never got fixed because neither side believed it to be their issue. The other bad thing was that the Oracle software had to communicate with the DLM through a published API. This prevented Oracle from being able to change things from release to release and drastically reduced the speed at which improvements could be made in the DLM architecture. These problems have all gone away in Oracle8, due to the introduction of an integrated DLM in the OPS version of the RDBMS.
The DLM is implemented as a new set of background processes with special hardware support in the operating system for the best internode communication method for that platform. On SMP platforms, this is typically a private Ethernet LAN or some kind of proprietary interconnect. On MPP systems, the communication is performed over the MPP interconnect. It is the latency of the DLM communication medium that is one of the prime limiting factors in an OPS environment, because it directly affects the time taken to service each remote lock request.
The existence of the DLM allows multiple Oracle instances to synchronize their use of shared resources.
Not all Oracle resources are shared, even in an OPS environment. Examples of this would be the redo allocation latch and cache buffer LRU chains latches. These entities are private to the local instance and therefore do not need to be coordinated through the DLM.
Therefore, in an OPS configuration, Oracle has a total of five different types of resources to consider:
2. Global enqueues (non-PCM lock)
3. Global locks (non-PCM lock)
4. Local enqueues (non-PCM lock)
5. Local latches (non-PCM lock)
The first three of these resources are handled by the DLM, and the remainder are handled the same as in a single-instance configuration.
6.9.3 Parallel Cache Management
The concept of a parallel cache is fundamental to OPS. It is important to understand this concept thoroughly before attempting to recommend or implement an OPS solution.
What Is a Parallel Cache?
In a single-instance configuration of Oracle, a buffer cache is used to speed up access to frequently used pieces of data, to provide a fast commit mechanism, and to reduce the overall number of writes to disk. These attributes of the Oracle buffer cache are absolutely essential, and if OPS were to remove any of these advantages it would not be a viable option.
However, to retain these features in an OPS configuration, a cache on every instance is required. This is not an issue if the database is purely read-only, but in a deferred write environment such as Oracle, data corruption would occur as multiple instances all wrote differing data to common blocks on disk.
To eliminate this problem, and retain most of the advantages, OPS has the concept of a
parallel cache. This is simply a fancy name for multiple Oracle buffer caches glued together with a DLM. However, the DLM cannot coordinate the access to every buffer on every instance, because its latency would cause all buffer operations to slow to a crawl, not to mention use a great deal of resource. Instead, Oracle implements
parallel cache management (PCM) locks, over the datafiles in the database, and coordinates the use of these locks with the DLM.
Note: PCM locks are not related to transaction locks in any way. This will become clear as we go on.
Each PCM lock can cover one or more actual data blocks in the files, and they are used to group data blocks together into commonly managed resources. This dramatically reduces the number of resources managed by the DLM and also reduces the overall communication between the nodes, provided that the application and database implementation have been performed adequately.
When an instance wants to access a block covered by a PCM lock, it must ask the DLM for the PCM lock covering that block and must obtain it in a mode suitable for the intended operation. In simple terms, this means Shared Mode for read and Exclusive Mode for write, although more states than this are used by OPS.
If another instance has the PCM lock in any mode higher than Shared Mode, the DLM will not grant the lock to the requesting instance right away. The fact that another instance has a higher-level lock than Shared means that the other instance has modified the block. In this case, the DLM downgrades the Exclusive Mode lock of the other instance to a Shared Mode if the new request is for read or to NULL Mode (i.e., no lock) if the new request is for Exclusive Mode. This forces the other instance to flush the block from the buffer cache back onto disk and to report back when complete. When this synchronous operation is complete, the DLM grants the requested lock to the instance requesting it. This operation is known as a
ping (see Figure 6.10).
As stated earlier, a PCM lock is not the same as a transaction lock. A PCM lock has the granularity of a single database block at best, whereas a transaction lock has the granularity of a single row within a block. Therefore, the PCM operations cannot interfere with the critical ability to perform row-level locking in Oracle.
Oracle achieves this isolation by keeping transaction and PCM locks totally independent of each other. While a transaction will typically cause a buffer to be written to in the first place, it is not linked to the PCM operation in any way, or vice versa.
In the example above, an update of block "A" by Instance Two caused the buffer to be dirty in the first place. This update has not been committed at this point. When the request comes in from Instance One, the transaction has still not been committed, but because PCM and transactions are not related, the buffer can be written to disk and used by the other node. The other node is only trying to read the block and so is not affected by the fact that there is an outstanding TX lock on one of the rows (remember, writers don't block readers). When Instance Two is ready to make more changes in the block, it will request that its lock be upgraded to an Exclusive Mode once more and will continue work.
Of course, in this example, Instance One has just gained permission to read a block that is of no use to it. The block it read from disk has uncommitted data in it, and the query running on Instance One needs to see the version of the block before it is changed by Instance Two. Therefore, an additional ping is required of the transaction table and undo information in order to rebuild the old version of the block. This is known as reader/writer contention in the OPS world and is the reason that the CR server has been implemented for Oracle8.1. The CR server vastly improves this type of contention by constructing the correct version (that's where the CR part comes in) of the block using required undo (on Instance Two in this case), and ships the buffer directly over to the buffer cache of Instance One. No disk writes occur, because CR blocks are useless anyway once they are finished with by the requesting session; they should never be written to disk.
Further complexity now arises; it is impractical to have a PCM lock for every block in the database. Doing this would cause an enormous memory burden on each instance in the configuration, because each PCM lock requires memory to be allocated for the lifetime of the instance. The exact amount of memory varies among platforms and releases, but it is reasonable to estimate this memory burden at 100 bytes per lock. This would make 11GB of memory overhead per instance for a 450GB online database.
To alleviate this problem, Oracle provides two types of locks:
A fixed PCM lock is one that is created at start-up and not released until instance shutdown. Therefore, every instance has a copy of each lock, potentially in a NULL Mode if it has not been used. Fixed locks cover between one and
n blocks of the database for each lock, although fixed locks are typically used to cover multiple blocks. Therefore, the rest of this subsection will concentrate on hashed PCM locks-the type used to cover multiple blocks with a single PCM lock.
Hashed locks are specified on a per-file basis in the
init.ora and are evenly distributed across the file according to the blocking factor specified (used as the modulo in the hash), as shown in Figure 6.11.
In this example, a 20-block datafile is configured to have four locks covering it, with a blocking factor of 5. This means that every fifth block is protected by the same PCM lock.
Hashed locks have a serious problem when covering files with a comparatively small number of locks. As the number of locks covering a file decreases, the probability of lock collisions increases proportionately. A lock collision is where, say, one instance is updating block 6, and another instance wants to read block 18. Both of these blocks are covered by PCM lock "2," and so the DLM has to downgrade that lock on the other node's behalf, causing a ping. This situation is known as a
false ping, because it did not need to be performed in order to preserve coherency between the buffer caches. In reality, one false ping can potentially cause many, many disk writes, and so should be avoided wherever possible.
As previously mentioned, it is not really practical to allocate one fixed PCM lock per block in the database. Therefore, another option is needed for situations where block level PCM is needed.
A releasable lock is a PCM lock that is allocated from a pool defined in the
init.ora. If a lock has not been used, it does not exist on any instance, but only as a blank entry in the common pool. Although the coverage of releasable locks can be specified as hashed, covering many blocks, this rarely makes sense because of the additional overhead of using releasable locks for each lock operation. Therefore, releasable locks tend to be most useful for covering one block or a few blocks for each lock.
As their name suggests, releasable locks are released when they are no longer in use. That is, there is no concept of a NULL Mode lock with releasable locks-they are simply released. This allows a relatively modest pool of releasable locks to be used to provide block level locking across the entire database.
Unfortunately, nothing in life is free. There is about a 30 percent overhead imposed by using releasable locks, because the lock must be created and destroyed for each NULL-to-X or anything-to-NULL conversion pair. Therefore, greater performance can be achieved using fixed locks as long as the false pinging can be tuned to acceptable levels.
6.9.4 Design Considerations for OPS
The whole point of the preceding section is to give you a feel for operations that occur in an OPS configuration. Although this topic really needs an entire book in its own right, hopefully you will have realized that careful design and planning are required when implementing an OPS system.
The application design component of OPS is probably the single most critical component in obtaining good horizontal scalability (i.e., more nodes) from OPS. Primary in this is the concept of
partitioning the application.
Application partitioning has nothing to do with table or index partitions, although the partitioning strategy may elect to use those features. Rather, it involves the partitioning of function within the application to allow users on different instances of the OPS database to operate without incurring too much synchronization overhead from the DLM. To achieve this, the data access for users on a given node should be as independent of the data access on other nodes as possible.
Specifically, the ideal application would be one that shared only read-only data in the database, such as reference information. This is known as a hard partitioning of the application. However, this is rarely a practical proposition, particularly when many nodes become involved.
The flipside of a hard-partitioned application is a soft-partitioned application. This is where the application designer has quantified where read/write and write/write synchronization is required between nodes and provides functionality in those areas to allow measures to be taken in the database to reduce the synchronization.
In practice, a combination approach is usually required, using hard partitioning wherever possible and soft partitioning elsewhere.
Determine how much partitioning is required. A two- or three-node OPS system is significantly easier to design than a ten-, 50-, or 150-node system. In systems where the node count is significantly greater than the number of hard-partitionable functions in the application, most of the design attention needs to be put into the soft-partitioning effort.
As a rule of thumb, if there is a great deal of read/write and write/write sharing between functions, the hardware platform choice will become an increasingly dominant factor in the performance of the system. In this situation, considerably greater scalability will be obtained by minimizing the node count and maximizing the power in each node.
In read/read OPS applications, such as decision support, good scalability can be obtained by increasing the node count to quite high numbers. In these applications, the nodes quickly stabilize their locks into being Shared Mode, and the synchronization overhead is reduced significantly because lock operations remain local.
The next step should be an analysis of the tables that each function in the application requires. This is the opportunity to make the high-level hard-partitioning decisions. For example, in a financial application there may be a good split between Accounts Payable and Accounts Receivable. If this split continues to look good right down to the table level, then this will already allow the application to run efficiently on two nodes of OPS owing to minimal synchronization overhead.
If there is not a clean split between functions in the initial data model, are there any changes that can be made to change this situation? For example, if one part of the application updates a table frequently, and the other function reads only nonupdated columns in the table, can the model be amended to split this table into two? This would eliminate the pinging every time the read function needed to access a row.
Clearly, changes in the data model are very much more suited to this stage in the design process, and this is why the more successful OPS applications take care of this overhead at the right time.
Once the hard partitioning has been defined, how much further do we need to go (defined in Step 1)? In our example, we allowed the application to scale from one node to two nodes, just by finding a good hard partition in the application. Let's assume that we need twice as many nodes to support Accounts Receivable as to support Accounts Payable. In this case, we need to find a way to split Accounts Receivable. This is going to involve some support from the application to allow certain database operations to work most efficiently.
For example, in order to minimize contention on index leaf blocks for a table that is written to by both AR nodes, we may want to take either of two approaches:
1. Sequence number caching
Both of these approaches have implications for the application and therefore will need application support prior to implementation in the database.
Sequence number caching was one of the first techniques to be used in gaining index leaf block separation (and therefore minimizing concurrent updating of the same block by multiple OPS nodes). The theory is that if the primary key for an index is system-generated by a sequence number generator, then the sequentiality of the numbers will force the additional row inserts to update the same block in the primary key index. This is a result of the primary key being in exactly the same part of the index B-tree due to close values being used by all nodes.
The solution is to use a large sequence cache for this sequence number on each node. This would typically be on the order of 2,000, depending on insert frequency. This means that when the first node needs a sequence number, it will read the next value from disk and update the on-disk value to be 2,000 greater. It will then have exclusive use of a set of 2,000 numbers. The next node to use the sequence will start 2,000 higher and update the on-disk version of the sequence to be 2,000 higher still. In this way, each node gets a discrete set of numbers to use as primary key values.
The impact of this is that, if the cache is sized correctly with respect to the number of keys in a leaf block, each node will typically be updating a different leaf block to the other nodes.
Here's the downside: you should never assume that sequence numbers are sequential; they are guaranteed only to be unique. If there is an instance crash, or even a normal shutdown in OPS environments, Oracle will "lose" numbers from the defined range due to the cache. In the case of a normal shutdown, the value used at next start-up will be one greater than the highest number last used. This is the only value that Oracle can use.
In the extreme case of AR, we might be using this technique to provide index separation in the
INVOICES table, on the
INVOICE_NUMBER column. Two situations can arise here that financial people are rarely very pleased about.
First, if an invoice is created by an AR user on one node and then cancelled and replaced by a user on another AR node, it is very feasible that the replacement invoice will have an invoice number
less than the original.
Second, whenever there is a crash, or even a normal maintenance shutdown of the instances, numbers will get lost. Again, financial people get upset about this, because they have to account for all invoice numbers.
Both of these issues can be catered for with application reporting, user signoff, and so on, but they need to be taken care of up front.
Reverse key indexes are another database-level change that affects the application. Using reverse key indexes, a noncached sequence number can be used to generate the keys (although this requires greatly increased synchronization on the sequence generator itself), and the key will be inverted by Oracle before determining which leaf block it belongs to. For example, key 123456789 would become 987654321.
This ensures that the most significant digits of the index are the ones that change every time there is a sequential insert. When the key is read by Oracle, it is simply flipped back to the way the application expects it. However, all is not as transparent as it seems.
Now that the keys are inverted in the actual index, range scans can no longer be performed on the index. The only way this could occur would be for Oracle to perform either a Full Fast Scan of the index and check each key to see whether it was in the required range, or to perform potentially thousands of physical index range scans. Neither of these options bodes well for performance, and so the application designer needs to ensure that no range scans are required on this index.
Don't forget the batch cycle. Is there a defined batch window for the system? If batch needs to run concurrently with the users, is the user count a great deal smaller at that point? Can the batch processes be hard partitioned?
The reason these questions need to be answered is that batch typically runs while some users are still operational and traverses many areas of the database schema in order to produce the desired result. In this case, massive amounts of pinging can occur as a result of the large synchronization overhead.
In order to minimize this, the ideal solution would be to hard partition all the batch processes. Unfortunately, this is very rarely possible, simply because of the nature of batch processing. Instead, it often becomes an operational solution that is required.
During the batch cycle, the system can normally be operated with a single node because of reduced user load at night. This involves some kind of change in the login process during the batch window to ensure that users connect to the same system. Once all the users are on one node, batch can also be executed on that node.
This is a strange one to sell to management sometimes. They see that batch is running slow and that there are loads of unused CPUs on the idle instances. However, the real issue is the cost of synchronization. Even if the single batch node is at 100 percent CPU, it is unlikely that moving jobs to the other nodes will speed things up-usually the contrary.
Once the application design is OPS-friendly, all the attention has to be turned to the database. While the application design phase is likely to have included database changes to reduce data sharing, the physical database implementation requires a great deal of attention in order to scale the system effectively.
Removing Contention for Blocks
Contention for blocks is the single most gating factor in scaling OPS systems because of the required DLM synchronization between the nodes, which is many orders of magnitude slower than within a single instance. Therefore, contention for blocks and the resultant pinging must be minimized wherever possible. There are several options open to the DBA at this point, and some of them should be used by default in an OPS implementation.
The first of these "must haves" are
freelist groups and
process freelists. These were covered in some detail in
Section 5.5.2, where the usage of these entities was discussed. In summary, freelist groups remove segment header contention from the Parallel Server synchronization list, because each node now has its own freelist block to update. As the overhead of freelist groups is so minimal, it is almost not worth thinking about where they are required; they can be added to every table in the database. For example, in a five-node clustered OPS system with 1,500 tables (i.e., a complex application), the total overhead would be 7,500 database blocks, or approximately 30MB. This is a very small price to pay for the advantages it gives with little complexity.
With Oracle8, another "must have" became available. Partitioned objects allow physical separation of data, including freelists, high-water marks, and in many cases the indexes for the table. To use partitioned objects for OPS, some kind of node identifier needs to be used as the partition key, and this clearly has an impact on the application. Partitioned tables should be used wherever practical, not necessarily just on large objects.
If there are any tablespaces that are truly read-only, consider changing their status in Oracle to read-only. Not only does this reduce the amount of data that needs to be backed up regularly, but also removes the tablespace from the DLM's consideration.
Once all the physical measures have been taken, including the methods mentioned in the application design section, all that remains is the tuning of the PCM lock allocations with the
init.ora parameters, especially with a view toward reducing false pinging.
This is where a more in-depth analysis of the application is required, including knowledge of
SELECT,
UPDATE,
INSERT, and
DELETE frequencies and distribution to all of the tables in the application. This will allow the type and distribution of the PCM locks to be set accordingly. Consider Table 6.1. when planning for PCM lock distribution.
There are many shades of gray in the way a table is accessed, and other approaches are required accordingly. It is important to go through the kind of thought process presented in
Table 6.1 for each of the situations, in order to come up with the correct allocation and type of PCM locks.
Removing Contention for Non-PCM Locks
In addition to the PCM lock synchronization that occurs in OPS, there are several non-PCM locks that must be coordinated. The most straightforward of these is the TX or row-level lock and its associated TM table lock.
The careful use of TX locks is not really an OPS design issue, but rather an issue for any application. It is not good design to allow users to hold row-level locks for any extended period. The TM locks are created implicitly when a TX lock is created and are a preventative measure.
A TM lock is held in a non-NULL mode when a TX lock is created in the table, because this prevents the table from being dropped, or locked in exclusive mode (such as for building an index), while the transaction is active.
In OPS mode, it is preferable to remove this synchronization overhead by either setting
DML_LOCKS=0 in the
init.ora or using the
ALTER TABLE ... DISABLE TABLE LOCK directive. The implications of using either of these approaches are as follows:
· No TM locks are created.
· All attempts to lock the table are rejected.
When Oracle knows that nobody can lock the table, it does not need to synchronize the lock states between the instances, therefore cutting down on the DLM overhead.
One lock that needs very careful attention in an OPS environment is the ST (space transaction) lock. In the event of a great deal of work to be performed while holding ST, such as a large amount of free extents to coalesce, the additional overhead of passing the ST lock around the many nodes can quickly bring a system to its knees. This is especially true when anybody needs to create a new extent while the ST lock is already in demand from SIMON doing cleanup work.
One situation in which new extents are frequently created and discarded is that of temporary disk sort areas. For this reason, tablespaces of type
TEMPORARY and
sort segments should always be used in OPS environments, as discussed in
Section 5.7.3.
The goal of this section has not been to provide exhaustive OPS information. Rather, it has been to provide enough information for you to be aware of when OPS is a viable scaling option for a large application. Further reading is strongly encouraged before any attempt is made to implement OPS, as detailed below.
6.9.6 Further Reading for OPS
The best place to learn the details of OPS is the
Oracle8 Parallel Server Concepts and Administration Guide. This is a very comprehensive guide to OPS and is surprisingly readable. Included in this manual are comprehensive worksheets to aid the application design and analysis required for successful OPS deployment, in addition to background information and configuration syntax.
In this section we visit several of the more interesting
V$ views and present some examples of how they should be interrogated to extract information that can be used for operational, tuning, and capacity planning purposes.
The information in the
V$ views can be put into three high-level categories:
· Session-level information
· System-level information
· Miscellaneous information
In this section we will cover the first two of these categories in varying detail. Full details of all the
V$ views can be found in the
Oracle Server Reference Guide.
6.10.1 Session-Level Information
A great deal of information is available about individual sessions connected to the database. Connected sessions include everything that attaches to the SGA, including all the Oracle background processes. This information is centered around the
V$SESSION view, and we will use this view as the driving table as we go through the session information, as shown in Figure 6.12.
V$SESSION contains one row for every connection to the database. Connections to the database include any shadow processes (the server side of the two-task architecture) and all of the background processes, such as DBWR and any parallel query slaves. Table 6.2 is the master record of every session in the database and contains foreign key information for gaining greater detail in a variety

of areas.
The primary key for this table is the
SADDR column, but this is not the most useful column for using this table. It is far more useful to use the
SID,SERIAL# combination as the primary key, because this provides many more join options against other tables. The reason that the serial number is present is to define this session uniquely: If this session were to log off and somebody else logged on, they could very well pick up the same
SID. Therefore, a different
SERIAL# is assigned in order to distinguish among the session information for the respective sessions.
Continuing on through the view, skipping
AUDSID, the next column is
PADDR. This is a very useful column, because it is the foreign key used to join against the
V$PROCESS table on the
ADDR column

(see Table 6.3).
An important distinction needs to be made between the
SPID column of
V$PROCESS and the
PROCESS column of
V$SESSION; the
SPID is the Unix PID of the server-side shadow process, whereas the
PROCESS is the PID of the client process (
sqlplus,
f45runx, etc.). Before returning to
V$SESSION, we can go through the
V$PROCESS information.
Highlights of
V$PROCESS (apart from the server side
SPID) are the
LATCHWAIT and
LATCHSPIN columns. These columns show the address of the latch on which the process is waiting or spinning, respectively, and can be a good source of information when aggregated as a count.
The other columns in the view are not especially useful in a client/server architecture, because the program is always "
oracle," and the username is also "
oracle." In a multithreaded server (MTS) environment, there will be multiple session records that can map to a single entry in
V$PROCESS, because there are multiple sessions sharing the same physical server process. Latch waits always operate on a process level, and so this is the correct location for this information.
Back in the
V$SESSION view, we find the username information in both text form and numeric ID format. The next column is the
COMMAND column, which shows the command type of the last statement parsed. A full reference to all the command types is included in the
Server Reference Guide but typically will be 2 (INSERT), 3 (SELECT), 6 (UPDATE), 7 (DELETE), or 45 (ROLLBACK).
The
OWNERID column is used to identify the owner of a parallel query operation. All the parallel query slave processes started on this instance have an entry in
V$SESSION, showing the username to be that of the calling session. However, this is not sufficient to identify the session uniquely-particularly if the initiating session is actually connected to another instance of a parallel server database. In this case, the value of this column is a 4-byte value, with the low-order bytes identifying the session ID and the high-order bytes identifying the instance number.
The
TADDR column is a good one to remember. It is the foreign key reference through to the
V$TRANSACTION view (see Table 6.4)

and uniquely identifies the transaction that the session holds active.
Many of the columns in
V$TRANSACTION are of use only in very specific debugging situations. However, there are other columns that are invaluable in tracking down common problems and determining how long rollback operations are going to take. These columns are
The
XIDUSN column shows the (U)ndo (S)egment (N)umber for the transaction-that is, the rollback segment number. This is useful when tracking back users of a particular rollback segment, going back to the session information by means of the
SES_ADDR column. The
START_TIME column is a character representation of the time at which the transaction started and can be used to locate long-running updates.
The
USED_% columns show how much real undo space is being used by the transaction. Probably the most useful of this pair is the
UREC column, which shows the number of data records in the undo segment. This is an effective way to determine how long a rollback operation will take on the transaction, because it decreases in real-time during rollback operations.
The remaining four columns provide a nice summary of the block operations that have occurred during the transaction.
V$SESSION_WAIT (see Table 6.5)

is one of the primary views for determining the cause of poor response time. It provides a snapshot view of every connection to the database and what the session is currently waiting on. If the session is not currently waiting, it provides the last wait event for that session.
Initial queries to this table should aggregate the data to provide a readable view of the wait states in the system:
WHERE wait_time = 0 /* i.e. still waiting */
|
This query provides a rolled-up view of all the sessions in the system that are currently in a wait state. While this is useful information, it is often more useful to include all sessions in this query, not just sessions that are currently waiting. The reason for this is that it gives a more accurate feel for the trend of the system and often exaggerates problems enough for them to be easily identified. For example, the query above may return the following results:
---------------------------------------------------------------- ------------
SQL*Net message from client 988
db file sequential read 9
|
At first glance, this looks like a reasonably healthy system: Most users are not in a wait state other than waiting for a new request from the application. However, the "
db file sequential read" may not be as normal as it looks. If the query is executed without the
WHERE clause, the following is produced:
---------------------------------------------------------------- ------------
SQL*Net message from client 988
SQL*Net message to client 20
db file sequential read 198
|
Now a potential problem is evident. This clearly shows that several sessions have been waiting for single-block physical I/O from disk.
6 It is now time to drill down, using some of the other columns in the view.
We need to get more detail on the sequential read wait event, so we need to incorporate the general-purpose columns
p1,
p2, and
p3 into our query, and exclude the wait states that we are not interested in. The parameter columns (
p1,p2,p3) have a different meaning for nearly every different wait event, and so the
p?text columns should be used to find out what the values relate to. This can be done in two ways, either by
SELECT * FROM V$SESSION_WAIT
WHERE event='db file sequential read" AND ROWNUM=1;
|
or by
SELECT * FROM V$EVENT_NAME
WHERE event='db file sequential read';
|
V$EVENT_NAME is a simple view of
X$KSLED, which is one of the tables used by the
V$SESSION_WAIT view. From either of these two techniques, we find that the parameter columns for "
db file sequential read" are as shown in Table 6.6.
In this case, we do not really care about
P3, because it is likely to always be 1 anyway (single-block read). When we run the following query:
SELECT p1,FLOOR(p2/100),count(*)
WHERE event='db file sequential read'
GROUP BY p1,FLOOR(p2/100);
|
the following discovery is made:
P1 FLOOR(P2/100) COUNT(*)
---------- ------------- ----------
|
This shows that 196 of the 198 sessions that have waited for physical I/O have been reading from the same file and the same block ranges. This is likely to be a rogue query, and so the next step for this particular problem is to join
V$SESSION_WAIT and
V$SESSION in order to find common hash values for running SQL:
SELECT se.sql_hash_value,count(se.sql_hash_value)
AND sw.event='db file sequential read'
GROUP BY se.sql_hash_value;
|
This query joins the two tables by their common primary key, the SID of the session. This query produces output like this:
SQL_HASH_VALUE COUNT(SE.SQL_HASH_VALUE)
-------------- ------------------------
|
The guilty hash value is clearly 828483197, and this can now be extracted from the library cache using the
V$SQLTEXT view.
This is one example of how to interrogate the
V$SESSION_WAIT view in order to find the causes of problems in the system. Frequently, this type of approach is the only one necessary when reacting to performance problems in the system, because it provides an instant view of sessions that are spending time waiting rather than working.
One of the drawbacks of
V$SESSION_WAIT is that it provides only a transient view of the current wait state.
V$SESSION_EVENT (see Table 6.7)

is the cumulative view of wait states for all the sessions in the database. It reports the same wait events as
V$SESSION_WAIT and provides a history of wait states and durations for that session.
This view is useful for getting an idea of the severity of waits in the database. Take a look at the following example:
EVENT TOTAL_WAITS TOTAL_TIMEOUTS
---------------------------------------- ----------- --------------
SQL*Net message to client 16465777 0
SQL*Net message from client 16465776 0
db file sequential read 4938379 0
buffer busy waits 829199 125
SQL*Net more data to client 637883 0
SQL*Net more data from client 328281 0
free buffer waits 11314 1656
db file scattered read 811 0
|
In this example, the session has waited a large number of times on SQL*Net messages to and from the client. This is an indication of the number of requests this session has made of the server-in this case, the session has executed many short queries against the database.
The worrysome thing in this example is that a high number of timeouts have occurred waiting on the "latch-free" event. This event is a catchall for every latch in the system, but in this kind of situation it is typically one latch that is causing the majority of the timeouts. Multiple latch wait timeouts are normally an indicator that latch contention is occurring and should be investigated further.
The
V$SESSTAT (see Table 6.8) view provides approximately 200 different statistics related to the operation of each session in the database. The names of these statistics are recorded in the
V$STATNAME view (see Table 6.9), where they are also grouped according to the type of statistic reported.
These statistics are identical to the ones reported for the system level, and accordingly some will be more useful as a global total.
6.10.3 System-Level Information
Most of the remaining
V$ views refer to system-level information. It does not make much sense to go through all of the remaining views, because many of the views are very rarely used. Instead, we will take a look at some of the most useful views in common use.
V$SYSSTAT, V$SYSTEM_EVENT
These two views show the same statistics as their session-based counterparts.
V$SYSTEM_EVENT can give a fast overview of the system wait states, while
V$SYSSTAT can give detailed information on all of the actual activity in the database.
Joining this table to
V$DATAFILE yields statistics about physical I/O occurring against the datafile that comprise the database. Several conclusions can be made about the usage of certain datafiles based on the relationships of columns in this table. For example, if
PHYBLKRD/PHYRDS is greater than 1, full table scans or index fast scans are occurring on this file.
This view shows the component parts of the SGA and their respective sizes. The most common use of this information is to determine whether or not there is free space in the shared pool. Monitoring the free space in the shared pool is a vital step in maintaining a smoothly running system.
Following up on maintaining sufficient space in the shared pool using
V$SGASTAT,
V$LIBRARYCACHE will report on the efficiency of the cached objects in the library cache. For example, if the shared pool is correctly sized, and the application reuses SQL efficiently, the
GETHITRATIO for the
SQL AREA namespace will be close to 1 (100 percent hits).
An incorrectly sized shared pool will also result in an undersized dictionary cache (a.k.a. rowcache) as a result of the SQL area squeezing out other namespaces. If this is occurring, misses will start to occur on dictionary cache objects. The cache hit ratio is reported in this view as
GETHITS/(GETHITS+GETMISSES).
This view shows statistics about tables, views, indexes, sequences, packages, and even cursors. Historically, this table was once used for objects that had owners, not anonymous SQL statements. It was used to show whether an object was currently cached and whether or not it was "kept" in memory using the
DBMS_SHARED_POOL package. Now that this package supports the "keeping" of cursors in the shared pool, the cursors are also listed.
This view is reported in the Oracle documentation as a parallel server view. However, it is now created as part of the
catalog.sql script and therefore present in every Oracle instance. The BH in this view name stands for buffer hash and refers to the actual buffers in the buffer cache; there is one row in
V$BH for each buffer in the cache. Simple queries can be run against this view to determine such things as population of the buffer cache by a given object, and it is thus very useful for maximizing the potential of the cache by ensuring that it is fairly used.
Many of the latches in an Oracle instance consist of a parent latch and many child latches. One good example of this is the "cache buffers chains" latch. This view reports on all the child latches for those parent latches.
The columns in the
V$LATCH% views are worth discussing some more. When a latch acquisition fails, this is deemed to be a miss. A success is deemed a hit. When a miss occurs, Oracle then attempts to acquire the latch by spinning on it
spin_count times. If this is not successful, Oracle puts the process to sleep (
sleeps) and then reattempts the get. If this is unsuccessful, the process is put to sleep for an exponentially longer period each time until the acquisition is successful.
The columns from
SLEEP5 upward are no longer used by Oracle.
This view is an aggregate view of statistics for all the latches in the system. Statistics for any latches that are composed of many child latches are rolled up into a single row in
V$LATCH. This view also contains statistics for all the singleton latches, which are not reported in the
V$LATCH_CHILDREN view.
All locks are reported in this view, whether they are enqueue locks or otherwise. When sessions are found to be waiting on a lock, this view can be queried to find the holder of the lock. The columns
LMODE and
REQUEST are used for this;
LMODE shows the lock mode that the session currently has on the lock, and
REQUEST shows the mode in which the session is trying to obtain the lock.
Incidentally, there is an easy way to determine for which lock type a session is waiting on "
enqueue" in
V$SESSION_WAIT. The
P1 value in this instance is named "
name|mode," which means that the name of the enqueue is
OR'd together with the mode in the value for
P1. The net effect of this is that if you take the
P1RAW value (hex version of the value), and use the first two bytes (first pair of two characters) this is an ASCII representation of the enqueue name. The character values for these bytes are easily determined using the "
man ascii" command at the UNIX prompt.
When an Oracle session waits for a block to become available for use for some reason (such as another session having it pinned for exclusive use), Oracle increments a counter in
V$WAITSTAT for that type of block. In normal use, one would expect the "
data block" class to be the block type most frequently waited on. Problems are evident when other types of blocks are waited on frequently. For example, many waits on "
segment header" demonstrates segment header contention, probably due to heavy inserts into a table resulting in freelist contention.
When the old hot backup method (i.e., not Recovery Manager) is used, each file that is in backup mode will have a status of
ACTIVE in this view.
V$SORT_SEGMENT, V$SORT_USAGE
When a tablespace of type
TEMPORARY is created, it is used solely as storage for sort segments. These are segments that are used for on-disk sorting of data and are not automatically freed after use. Instead, they become eligible for reuse by other sessions under the control of the
sort extent pool latch instead of going through all the overhead of using the single ST lock.
The operation of Oracle is tightly linked to its physical side. As it all operates in memory structures, however, so this is not immediately apparent. Documentation of these aspects has historically been poor, and word of mouth was the only way to determine answers to some questions. Hopefully, this chapter has been enlightening and interesting in this respect.
Use of the
V$ and
X$ tables are the window to Oracle's soul, and it's important to stay on top of the new views/tables that appear in each release. Oracle is fairly unique in the amount of information it divulges, and this should be used to your advantage.
Various.
Oracle 8i Concepts, Oracle RDBMS Documentation.
Various.
Oracle8i Parallel Server Concepts and Administration, Oracle RDBMS Documentation.
Various.
Oracle 8i Tuning, Oracle RDBMS Documentation.
1
Many people are not sure about the definition of this formula.
db block gets are blocks retrieved without going through the CR mechanism, such as segment headers.
consistent gets are blocks retrieved through the CR mechanism. The two combined make the total number of blocks retrieved, whether as a cached read or as a physical read.
2
Remember that dba stands for data block address and is the way that Oracle identifies disk blocks.
3
This dba value is nothing like a real dba, of course. It is used for simplicity only.
4
Oracle has always had an algorithm that does not heat a buffer if it is close to the MRU end. This significantly cuts down on the number of LRU manipulations.
5
Posting is the term used by Oracle for telling a process to wake up.
6
The two statistics "db file sequential read" and "db file scattered read" are a little confusing in naming convention. A "sequential" read is a single block read, usually indicating the use of an index. A "scattered read" indicates a full table scan or index fast scan, and is so named (presumably) because of the use of multiblock reads.