TOC PREV NEXT INDEX

Scale Abilities Ltd Logo


Part III
How Oracle Works
Chapter 5
Physical Oracle
5.1 Introduction
Oracle can be viewed as having two different groups of aspects that work very closely together and are difficult to separate cleanly for explanation. These two groups of aspects are the physical aspects, such as files and blocks, and the operational aspects, such as the memory architecture and the Oracle kernel.
This chapter covers the physical aspects of Oracle. It is not intended to replace, or to be a subset or superset of, the Oracle Server Concepts Guide. Instead, it is a look at some of the more important aspects of Oracle operation-aspects that are most important in building a very large system using Oracle. None of the various Oracle cache or process architectures are covered in this chapter; these features are covered in the next chapter.
5.2 Overview
What is Oracle? This chapter details the common understanding of what Oracle is-a front end to a load of data on disk. I frequently like to think of Oracle in terms of the other aspect-a complex operating system that also stores a bunch of data-and this is the aspect presented in the next chapter. Anyway, in order to cover the physical structure, we will briefly discuss the higher level and then drill down to some finer detail throughout the chapter.
The Oracle database consists of several different types of files that support its operation (as shown in Figure 5.1):
· Control files
· Initialization files
· Datafiles
· Logfiles
5.3 Control Files
The control files within Oracle store information about the physical database. Within each control file is the following information:
· Datafile information (names, locations, status, etc.)
· Operating information (database name, number of threads, number of datafiles, etc.)
· Redo log thread information
· Redo log information
· Log history
· Backup information (if using Oracle Recovery Manager)
Although these files are usually quite small (typically less than 3MB if not using Recovery Manager), their importance should not be overlooked. Therefore, it is common practice to have two or more control files for each database, placed on different disk drives and controllers to increase availability in the event of a failure. Oracle automatically mirrors (i.e., keeps two or more copies in sync with each other) the control files once they have been configured into the system.
In previous releases of Oracle, the loss or corruption of a control file while in operation would result in a database crash. This appears to have been fixed in Oracle8, where certain failures can be sustained provided that at least one of the control files is still available. If the control file that is lost is also the primary control file (i.e., the first-named control file), certain operations requiring the control file (such as queries on v$datafile) will fail, but the instance will remain in operation.
5.3.1 Datafile Information
Within the control file, there is a directory of all datafiles in the database. The control file stores
· Full pathname
· Size of file in Oracle blocks
· Oracle block size
· File status
· "Stop" SCN (system change number)
The first three of these items are self-explanatory. The file status can be set to one of many values depending on whether the file is online, offline, needing recovery, and so on. The "stop" SCN is the change number up to which media recovery will continue for that file. This is set to infinity when the file is online but to the current change number when the tablespace it belongs to is taken offline or made READ ONLY. This allows the media recovery operation to recover the tablespace only to the point in time at which it was taken offline or made READ ONLY.
5.3.2 Operating Information
This portion of the control file includes how many redo threads are currently open, how many datafiles there are, and how many logfiles there are. In addition, the maximum numbers of datafiles, logfiles, and instances specified at database creation time, and whether or not the database is currently running in ARCHIVELOG mode, are recorded.
5.3.3 Redo Log Information
This part contains information about each of the redo groups-which thread it belongs to, whether the thread is public or private, and which log group is the currently active one. It also contains more detailed information about the individual logfiles, including the full pathname, the size, the low and high SCN numbers for that log, the log sequence number, and thread allocation information.
5.3.4 Log History
This is the source of the V$LOG_HISTORY view. It contains historical information about logs that have been archived, including low and high SCN numbers and timestamp. The maximum number of entries for historical information is controlled by the MAXLOGHISTORY section of the CREATE DATABASE command.
5.3.5 Backup Information
If the database backup is performed using Oracle Recovery Manager, the control file is also used to store information about backups and file copies that have taken place. This information is used on recovery of the datafiles if the Recovery Catalog is not available.
5.4 The INIT.ORA File
I make no apology for this section of the book, because it is not going to be a verbatim regurgitation of the Server Reference Guide as found in some "tuning" books. The purpose of this section is to give real advice on putting together a safe, usable initialization file to be applied to a large-scale database system. However, explanations of some of the more important init.ora parameters are included as an appendix, because explanations are more meaningful once the concepts are understood properly.
5.4.1 Rules for init.ora Creation
On the face of it, the init.ora looks to be a fairly straightforward entity-that is, it simply contains multiple parameter=value declarations. This is almost true, and for the most part, simplistic declarations are all you need to worry about. However, there are some lesser-known facts about the way Oracle parses the init.ora that you should understand, because your life will be a good deal simpler and less confusing if you do.
Value Overrides
First of all, beware of inadvertently assigning a value to a parameter twice, and thus overriding the original value. This should not happen if you follow strict layout procedures as detailed in "Layout Recommendations" but if it does it can cost you many lost hours chasing system performance problems. A good example of this would be the setting of our old friend db_block_buffers. If you are the proud owner of a file that looks like this
.
.
db_block_buffers = 10000
#changed by JM
#db_block_buffers = 100000
.
.

then you are asking for trouble. Using this example, it would be very easy for the file to be changed accidentally (say, during maintenance) to
.
.
db_block_buffers = 10000
#changed by JM
db_block_buffers = 100000
.
.

Once the maintenance was completed and the database started up, everything would probably look good. It is likely that you wouldn't even notice the extra zero on the size of the database buffers section of the SGA on instance start-up.
The rest of this story is painfully obvious, involving "missing" memory on the database server, poor performance, and a very well hidden problem in that the initial setting all looks OK. In general, you should avoid retaining multiple settings for the same parameter, except under very specific circumstances, as detailed in the next section.
Value Concatenation
There are some init.ora parameters that can be legitimately specified several times in the file without overrides occurring. A good example of this is the event specification, used to set system-wide database events (normally for debugging purposes). If there is more than one event to set in the init.ora file, this can be achieved without overriding the previous one by keeping all the event settings bunched together in the file. This means that no other parameters can be set in between, as shown below:
# Correct way to specify events
event = "1547 trace name errorstack"
event = "4031 trace name errorstack level 1"
.
.
#Incorrect way to specify events
event = "1547 trace name errorstack"
processes = 3500
event = "4031 trace name errorstack level 1"

In the first example, both of the events are successfully set for the instance. However, in the second example we would end up with the 4031 event being the only one set, because Oracle has legitimately overridden the 1547 event.
This rule also applies to all other multiple-value parameters, such as rollback_segments and control_files.
init.ora File Sizes/Limits
When working on init.ora files, it is important to be wary of the limits Oracle puts on init.ora files. There are limitations on both the size of any individual file and the length of the string assigned to a parameter.
These limitations have been lifted in Oracle8 to the extent that they are unlikely to become a problem, even with enormous numbers of comments. However, caution should be observed on the remaining Oracle7 databases in your company.
The limits in Oracle7 are port-specific, but the values given in Table 5.1 are very typical for Oracle 7.3 on a UNIX platform
Layout Recommendations
When implementing a large system, it is important to maintain a highly readable init.ora file and to incorporate strict change controls. This helps to prevent costly mistakes when making changes in the database configuration parameters.
The first rule to follow when putting together an init.ora is to create and maintain the file in sections. This makes it easier to find specific components of the file when editing it and creates some structure in the file.
In the Oracle Reference Guide, it is recommended that all parameters be grouped together alphabetically; I don't like this idea, because it puts unrelated parameters adjacent to each other, but the choice of grouping should ultimately be something that makes sense to all the people who could edit the file.
Each of the recommended groupings will be presented individually.
Change Control Header
The first section of an init.ora file should be a change control header. This can also be formalized by the use of a source code management system such as rcs or sccs, but such a system should be used in addition to the inclusion of a header at the top of the actual file. The header should include the following information:
· Name of system
· Name of file
· Owner of file
· Purpose of file
· Change history
This is a very formal approach to file headers but makes all aspects of the file absolutely clear. Don't forget, DBAs come and go, and you don't want to have to explain every minute detail of the system to each new recruit. Here's an example of a header for an init.ora file:
########################################################
# $Id: CH05.html,v 1.1 2004/07/25 20:00:10 morlej Exp $
########################################################
# Copyright (c) 1997 xyz Limited
########################################################
# System : Reservations Management System
# File : initRMS1.ora
# Owner : Production Database Administration
# Purpose : Instance RMS1 initialisation file #
#
# Change History
# --------------
#
# Who Date Change
#=========== ========= ============
# J.A.Morle 01-Jan-94 Initial Creation
#
# J.A.Morle 08-Jan-94 Increased
# shared_pool_size->200MB
# to alleviate ORA-04031
#
# A.Einstein 01-APR-95 Added event 1547
#
########################################################
# ...etc...

It is important to keep the change descriptions meaningful, because over the life of a large system, it is likely that the reasons for change will be forgotten.
Common, Database-Wide, and Parallel Server Parameters
########################
# Instance/Thread Definition
########################
thread = 1
instance_number = 1
########################
# Common Parameters
########################
ifile = /opt/oracle/RMS/initshared.ora

This section includes all Parallel Server settings, if used. These may appear to be strange things to include as the first settings, but there is a reason for this, which will become clear when the contents of the initshared.ora file are shown. This file is located on a shared-disk volume, in order to be accessible by all nodes in the OPS cluster. It is therefore readable by all instances, which will all specify the ifile instruction in the main init.ora for that instance.
########################################################
# $Id: CH05.html,v 1.1 2004/07/25 20:00:10 morlej Exp $
########################################################
# Copyright (c) 1997 xyz Limited
########################################################
# System : Reservations Management System
# File : initshared.ora
# Owner : Production Database Administration
# Purpose : Instance independent (Common)
#   initialisation parameters
#
# Change History
# --------------
#
# Who Date Change
#=========== ========= ============
# J.A.Morle 01-Jan-94 Initial Creation
########################################################
db_name = PRDRMS
control_files = ( /opt/oracle/RMS/ctrl1.dbf,
    /opt/oracle/RMS/ctrl2.dbf,
    /opt/oracle/RMS/ctrl3.dbf )
db_files = 256
dml_locks = 0
ccf_io_size = 4194304
optimizer_mode = RULE
..contd...

This is the first part of the ifile, used for setting parameters that are common to all instances but are not Parallel Server parameters themselves. Notice the change control header at the top of the file, describing the usage of the file.
If Parallel Server were not in operation on this system, this part of the file would be included (inlined) as part of the main init.ora file, because the logic is no longer there to maintain a separate file. This is the reason for including this section first in the breakdown-it is common practice among DBAs to include the db_name, control_files, and so on at the top of the main init.ora file when running a single-instance database.
The eagle-eyed will have spotted the dml_locks parameter and will be wondering why it declared as a common parameter between instances-surely there can be different settings on each instance, can't there? Yes, and no. There can be different settings unless any instance has it set to zero, which is why it must be a common setting. More information as to why one would set it to zero can be found in the Parallel Server section of this book.
.....contd..
########################
# Parallel Server
########################
gc_lck_procs = 6
gc_freelist_groups = 100
gc_rollback_segments = 164
gc_rollback_locks = 40
gc_save_rollback_locks = 64
gc_segments = 1800
gc_db_locks = 29000
gc_files_to_locks ="1,34=4000EACH"
gc_files_to_locks = "7-8,18-19,21,30-31=1000EACH"

This is the final part of the initshared.ora file, containing all the settings specific to Parallel Server operation for this database. The gc_files_to_locks parameter is another good example of using multiple lines to build up a complex parameter value, such as demonstrated with the event specification in the next section.
Event Specification
Returning to the main init.ora file, we maintain a separate section for the event specifications :
Again, note that both of the event specifications are on contiguous lines, with no other parameters between them.
Other Groupings
Rather than waste paper by listing the remainder of the example init.ora file, the list in Table 5.2 should prove useful. It suggests categories for each of an example set of parameters.
Further Tips
Although in Section 5.4.1 it is advised that value overrides should be avoided, there is one special case in which it actually can be safer to have them than not to have them. That special case is when a special maintenance ifile is created and referenced at the bottom of the main init.ora file. The contents of the file could be
sort_area_size = 10485760
dml_locks = 1000
db_block_buffers = 100000

and the tail end of the main init.ora file could look like this:
.
.
################
# Maintenance Parameter specification
################
#ifile = maint.ora

When maintenance parameters are set in this way, they are very easy to take in and out in order to perform maintenance-only a single comment character needs to be toggled in front of the ifile specification.
Setting Up a Maintenance Instance
Another excellent way to toggle in and out of maintenance mode is to set up an alternative instance. This method is very flexible and completely removes any risk of leaving parameters unsuitably set when you enter production mode once more. The method for doing this is simple.
First, create another init.ora file with a name totally different from that used for the production init.ora file-for example, initMNT.ora file to indicate the MaiNTenance mode of the database. Within this file, use the same db_name and control_file parameters as in the production init.ora file. This will allow the newly created instance definition to access the production database. Next, set up all of the required maintenance parameters in the initMNT.ora, such as db_block_buffers and sort_area_%. If space is available, rollback segments can permanently exist in the database. These can be included in the initMNT.ora in order to bring them online for use during maintenance.
The maintenance instance is now ready for use. In order to use the maintenance instance, the following procedure should be followed.
1. Shutdown the production instance cleanly.
2. Ensure that all processes related to the production instance are cleaned up. This can take a few seconds to complete even after the instance has closed cleanly. If some of the processes do not die off naturally, manually kill off the remaining processes.
3. Ensure that the shared memory segment used to accommodate the SGA has been removed. If it has not, remove it manually.
4. Reset ORACLE_SID to point to your new instance definition (i.e., ORACLE_SID=MNT).
5. Start Oracle, and perform maintenance tasks.
6. When complete, shut down the instance and switch back to the production instance. It is vital to take good care that all traces of the maintenance instance have been cleaned up, in the same way as the production instance was cleaned up in step 2.
Using this technique, a totally different environment can be safely and easily set up for database maintenance. It is also worth noting, from a safety standpoint, that the production instance cannot be started while the maintenance instance has the database mounted, and vice versa, because Oracle takes out an exclusive lock on the datafiles.
5.5 Data Storage
5.5.1 Overview: The Physical Picture
It is important to have a full understanding of tablespaces (see Figure 5.2), segments, objects, extents, and so on. If you do not have a comprehensive understanding of these concepts, it is important that you read the Oracle Server Concepts Guide. This section deals exclusively with the different types of Oracle blocks, where they are found, and what they are used for.
5.5.2 Blocks: An Introduction
Every datafile1 is comprised of blocks, as illustrated in Figure 5.3. The size of the block is defined in the init.ora (db_block_size) when the database is created and cannot be changed thereafter. All I/O operations within Oracle are performed in some multiple of the block size. Blocks are used in order to simplify the I/O, storage, and caching algorithms used within Oracle.
The block size chosen for any particular implementation affects the fundamental operation of the database and therefore needs to be set to an optimal value at database creation. This is actually not as daunting as it sounds, because there is generally one range of sizes that are optimal for transaction-based systems and another for read-intensive systems such as data warehousing. The setting of the block size is a matter of balancing the read and caching advantages of a large block against the caching advantages of a small block when write activity is high. In this way, the determination of the correct block size is very much dependent on the operation of the database and will be covered later in this book.
Several different block types are used in Oracle.
First, there is a data block. The term "data block" is used to describe a block containing table data, index data, clustered data, and so on. Basically, it is a block that contains data.
Next, there are blocks that are associated with internal Oracle operation, such as header blocks and freelist group blocks. These blocks do not contain any user data but contain important structures that the implementor should be aware of when building a large system.
Data Blocks
The structure of a data block (see Figure 5.4) is the same whether the block belongs to a table or to an index.
Block Header
Each block contains a block header. This portion contains the block's identity in terms of
· Address of block
· Type of block (index, table, rollback, etc.)
· Interested Transaction List (ITL)
Although this piece is of variable size, it remains on the order of 100 bytes. This should be considered when assessing the efficiency of using small Oracle blocks: A 2,048-byte block would lose 5 percent in overhead, whereas a 4,096-byte block would lose only 2.5 percent.
An important part of the block header from a read consistency standpoint is the ITL portion. This is a variable list of "interested transactions" for that block, and the number of entries is determined by the settings of INITRANS and MAXTRANS. For each of these entries, the undo (rollback) block address, the transaction ID, and the system change number are recorded. This information is used by Oracle to facilitate the read consistency model, for which a small explanation is necessary.
When a query starts, the SCN is recorded in order to determine the start time of the query. This SCN is then compared against the SCN of the actual block when it is read in order to determine whether the version of the block is consistent with the query start time. If the SCN is greater than the expected SCN, a new version of the block (called a constant read [CR] block) is created in the buffer cache out of the applicable undo entries to get the prior version of the block. These undo entries are determined by reading the ITL, and so the ITL is the entry point for constructing the CR block. Note: The CR block is a cache entity only-it is never written to disk.
Once the relevant undo records have been applied, the SCN of the block is checked once again. If the SCN is still greater than the required SCN, the ITLs are reviewed again-because the entire block had undo applied to it, including the block header, the ITL entries in the new CR block also reflect the state of the block prior to the last change. Therefore, this process can be continued until the desired version of the block is created, or the required undo records are no longer available, due to forced reuse. At this point, the query session receives a "snapshot too old" message, because the read-consistent version of the block can not be constructed. 
Row Directory
The row directory is the secret of how Oracle finds a specific row within the block. It is what its name suggests: a directory of the rows contained in the block. It contains the offset in the block for each row it contains. Therefore, the block does not need to be scanned beyond the row directory before the index to the row data required is found.
Take, for example, a single-row lookup for which Oracle has already determined the rowid of the required row from the index on the table. Part of the rowid is the row number within the actual block, and therefore Oracle can read the row directory out of the block header and proceed directly to the actual row data.
It is worth noting that the row directory is not subject to any shrinking when rows are deleted. The space allocated for the row directory is never returned to the block but is reused on subsequent inserts. This does not present a problem in practice, because the rows that are subsequently inserted are typically of the same size as the deleted rows, therefore yielding the same number of rows per block. Anyway, we're not talking about a large amount of space for the row directory, unless you have tables with a very small row size.
Free Space
Free space is the currently unoccupied portion of the block that is used when new rows are inserted or when existing rows are extended.
Row Data
At last! This is where the actual data is stored for the block. The row data (or index data) is stored in length:value pairs for each column. That is, if a column for a particular row contains the value "YES," this is recorded as 3:'YES'. When a NULL column is stored, it is stored as a length zero, with no data associated with it. If these NULL columns all occur at the tail end of the column list, nothing is stored at all, because the start of the next row implies that the row had trailing NULL columns.
It is worth bearing in mind that when ordering the columns in a table it makes sense to put all the NULLable columns at the tail end of the column list in order to reduce storage and the amount of data Oracle has to read to determine that there are NULL columns.
Data Block Dump
The following block dump of a table block will help to demonstrate some of these attributes:
Starting at the top, we have the rdba, or relative data block address, of the block. The rdba is an internal rowid format for every block and is used by Oracle for every operation that addresses a specific block.
The SCN for the block is listed next, as described in "File Headers" below. The "tail" field is the last 2 bytes of the SCN, combined with the "type" and the "seq" for the block. This value is the mechanism used by Oracle to determine whether the entire block is consistent with itself after a recovery operation. The "tail" value is physically stored at the tail end of the block and should always match the SCN, seq, and type stored at the head of the block. If not, recovery needs to be performed on this block to bring the entire block into sync with itself.
Next, we have the block header information. The "Object id on Block?" piece is a throwback to version 6-it should always be "Y" now, indicating that the "seg/obj" value below it is the actual object number stored in SYS.OBJ$ (main constituent of the DBA_OBJECTS view).
The "csc" is the cleanout system change, or a short extract of the SCN used to show when block cleanout was last performed on this block. Block cleanout is covered in Section 5.5.3.
The "itc" value is simply a count of the ITLs on the block, which are listed farther down. The "flg" is either "-" or "O', where "-" means that the block is not on the freelist (i.e., more than PCTUSED used in the block), and "O" means that the block is on the freelist. The remaining fields denote which ITL's freelist the block appears on (if any).
Next are the actual ITLs themselves, containing the information described at the beginning of this section, required in order to construct CR blocks out of committed or uncommitted transactions that have modified the block. This represents the end of the transactional portion of the block header.
Then comes some information about the space usage in the block and the row directory, used to locate specific row offsets within the block. Included in the general information are the following:
· tsiz is the actual usable space in the block on which PCTUSED and PCTFREE are calculated.
· ntab is the number of tables (if this is a cluster).
· nrow is the number of actual rows in the block.
· fseo is the start of row data above header overhead.
The section beginning "0xe" is the start of the row directory, which shows once again that there are 61 rows in this block. This is immediately followed by the row directory entries, which are in the following format:
<directory_address>:pri[rownum-1] offs=<offset_address_into block>
One interesting thing to note here is that the rows are inserted into the block starting at the end of the block. In other words, the row data grows toward the row header, not away from it. This is reflected in Figure 5.4, where the free space is shown as being adjacent to the row directory.
The last thing in the block dump is the actual data, formatted as described under "Row Data" above. Above each row is a header showing the row number in the block and a summary of the content:
· tl is the total number (in bytes) of the row plus the header.
· fb is a set of flags where, in this example, we have a "H"ead of a row, and we have the "F"irst piece and "L"ast piece of the row: this row is not chained.
· cc is the column count for this row.
The actual data follows, one line per column. In this dump, the first two columns are SYSDATE and NUMBER(10), respectively, stored in Oracle internal format. The following four columns are VARCHAR2 columns, and their contents can be viewed by turning the hexadecimal values into ASCII characters.
Header Blocks
There are several different types of header blocks.2 Each of these types of header blocks has a very different purpose and will be covered in turn.
File Headers
The first block of a datafile contains the file header. Stored in this block is information such as the file number, the database that it belongs to, and the last change number for that file. The overhead of this block is important to remember when creating datafiles within Oracle using raw disk devices.
When sizing the raw disk partitions, it is important to remember to add the overhead of the file header block. It is common to find raw disk based Oracle databases with the file size within Oracle defined as 99M, 199M, 499M, and so on. This is because the raw disk partitions have been sized to exactly 100M, 200M, 500M, and so on, and no provision has been made for the file header.
In this situation, Oracle will fail to create a datafile of size 100M, because a 100M+1 block is required. It is very good practice to slice the disk in order to allow for the additional Oracle block. This makes the space management tasks for the database far easier because more human numbers are used in calculating free/used space.
The change number information is used by Oracle to determine the version number of the file in order to perform recovery on the file if necessary. During a hot backup,3 this change number is frozen, and a hot backup "fuzzy bit" is also set in the file header to indicate that the file is in the process of being backed up hot. The change number is frozen in order to determine where recovery should start.
In Figure 5.5, the file (or rather the tablespace containing the file) is put into hot backup mode using the ALTER TABLESPACE .... BEGIN BACKUP syntax. At this point in time, the system change number (SCN) is 23, and all the blocks in the file are guaranteed to be at the same version number as a result of the implied checkpoint when the file is put into backup mode.
During the backup, changes continue to be applied to the datafile, contrary to many beliefs. Therefore, the image that is recorded on tape will look a lot like the middle picture; many of the blocks have the original change number (i.e., have not changed), but a great deal of them have changed during the actual saving of the file to tape.
In fact, two of the blocks are denoted as having two version numbers. This is known as a "fractured block" and is the result of having an external reader operating on the file. When the UNIX kernel issues a read from the file (in 512-byte increments), the block is not guaranteed to remain the same until it has read the entire block, because this could entail many physical read operations. Oracle could be in the process of writing this particular block at the same time, resulting in the tail end of the block having a newer version than the older header portion of the block shows. The physical mechanism for this will be shown in "Segment Headers" below.
This is the key difference between this standard hot backup approach and the Oracle Recovery Manager approach. The Recovery Manager uses the Oracle buffer cache to copy the blocks out of the file onto tape, therefore guaranteeing that a consistent, nonfractured copy of each block is made on tape.
For the standard hot backup approach, Oracle has to make a change in the way it writes redo information to the redo log for the duration of the backup. The first time Oracle writes a particular block to the redo log, it records a before-image of the entire block in addition to the changes to the block. Although the full block is recorded only on the first change to that block during backup mode, this can result in a significant increase in the amount of redo information recorded during the backup, especially if batch processing is running at the same time.
The end of the backup shows all the blocks having the same version number, because a checkpoint is performed following the backup. This checkpoint is recorded in the redo log and marks the minimum recovery point for this backup.
When this file is recovered onto disk, it will look like the middle picture once again. However, Oracle is aware that this file was copied out in backup mode (due to the fuzzy bit being set) and starts to apply redo information to the file beginning with the SCN stored in the file header. This process ensures that all blocks in the file are updated to consistent versions and that all fractured blocks are repaired accordingly.
Segment Headers
The segment header is an important header for us to understand when building databases for maximum scalability. Contained in the segment header are
· Segment identification information, such as object number
· Extent information, such as a map of extents allocated (start address, size), the high-water mark of used extents (used for ALTER TABLE xxxx DEALLOCATE UNUSED)
· Freelist information
· Segment-specific information (such as the transaction table in a rollback segment)
Freelist information is stored in the segment header of each segment. What is a freelist? It's exactly what its name suggests: a list of free things. In the case of a segment freelist, it is a list of the free space in the segment, as determined by the PCTUSED setting for the segment. If a segment had data stored as shown in Figure 5.6, the freelist would contain entries for the last two blocks only, because both of the blocks have greater than PCTUSED of the block free.
A segment header can contain several freelists, each of which is protected (locked) individually during use. This allows the number of concurrent inserts in the table to be increased, because multiple freelists are available.
In the case of Oracle Parallel Server, this does not mean multiple inserts from different instances, because the buffer for the segment header cannot be concurrently held in exclusive mode on multiple nodes. For Parallel Server installations, the concept of freelist groups is used to provide this facility, which assigns additional blocks for freelists. This gives each instance preferential access to that block, thereby allowing concurrent access by multiple nodes.
Segment Header Block Dump
A block dump of a table segment header looks like this:
buffer tsn: 1 rdba: 0x00805002 (2/20482)
scn:0x0000.00054c5d seq:0x0a flg:0x00 tail:0x4c5d100a
frmt:0x02 chkval:0x0000 type:0x10=DATA SEGMENT HEADER - UNLIMITED
Extent Control Header
-----------------------------------------------------------------
Extent Header:: spare1: 0 space2: 0 #extents: 14 #blocks: 2139
last map 0x00000000 #maps: 0 offset: 2080
Highwater:: 0x00805684 ext#: 13 blk#: 236 ext size: 710
#blocks in seg. hdr's freelists: 28
#blocks below: 1665
mapblk 0x00000000 offset: 13
Unlocked
Map Header:: next 0x00000000 #extents: 14 obj#: 2807 flag: 0x40000000
Extent Map
-----------------------------------------------------------------
0x00805003 length: 4
0x00805007 length: 5
0x0080500c length: 10
0x00805016 length: 15
0x00805025 length: 20
0x00805039 length: 30
0x00805057 length: 45
0x00805084 length: 65
0x008050c5 length: 95
0x00805124 length: 140
0x008051b0 length: 210
0x00805282 length: 315
0x008053bd length: 475
0x00805598 length: 710
nfl = 1, nfb = 1 typ = 1 nxf = 1
SEG LST:: flg: UNUSED lhd: 0x00000000 ltl: 0x00000000
XCT LST:: flg: USED lhd: 0x00805652 ltl: 0x00805062 xid: 0x0001.028.00000193

The initial portion is identical to the dump of the table data block above. After that comes the extent information for the block. Here we see that this segment has 14 extents and 2,139 blocks, and the high-water mark (i.e., any blocks above this point have never been used to store data) is at 0x00805684. This is 236 blocks into the 13th extent, which has 710 blocks allocated to it. We can see that there are 28 blocks reported in the segment's freelist(s).
It's worth noting that the high-water mark is not the same as the free-list. When Oracle goes to allocate space for a new row, it will use one or the other of these pieces of information to find space for it according to the following algorithm:
· If space is available in the segment freelist (also called the process freelist), it will be allocated from there. Each process is restricted to one segment free-list that it can search, allocated by the algorithm illustrated in Figure 5.7.
· If not, the process will search the "master freelist" for the space. The master is simply one of the segment freelists that is nominated as the master. If space is found in this list, it is allocated to the process.
· If no space is found there, the "transaction freelists" are searched. These are freelists that are created when a process gives space back to the segment during a transaction. There can be many of these freelists per segment, and they can be searched by any process. If space is found, it is allocated to the process.
· If space is not found in any of the above freelists, the high-water mark must be increased. When it is increased, all space gained from the increase is allocated to the segment freelist. This is done to allow subsequent inserts to be satisfied by the first freelist search in this sequence.
· If the high-water mark is already at the top of the final extent, a new extent must be allocated.
The respective freelists can be seen at the tail end of this block dump, showing (nfl=1, nfb=1) that we have one freelist per freelist block (a.k.a freelist group) and one freelist block. We have a currently unused segment freelist and a used transaction freelist, which stores the 28 blocks that are in the segment header's freelists.
In the extent map for the segment, we see that all the extents are five blocks in size except the first one-because the first extent stores the segment header, and if additional freelist groups were assigned to the segment (at creation time), the usable size of the initial extent would be reduced accordingly.
When multiple freelists and freelist groups are created in a segment header, they are assigned to processes using a simple modulo hash. The freelist group used is determined by <thread#>/<number of freelist groups>, and the freelist within the freelist group is determined by <process id>/<number of freelists>. For example, if a segment has three freelists and two freelist groups, it is used as shown in Figure 5.7.
The "+1" addition in the example is to prevent the algorithm from mapping to freelist group 0, which does not exist.
5.5.3 Block Cleanout
Block cleanout is an optimization made by Oracle to speed up the commit process. When a transaction is committed, Oracle will only perform the minimum amount of work in the block required to guarantee the transaction. In addition to this work, there is an amount of housekeeping that needs to be performed in order to keep the database running smoothly. Oracle does not perform this housekeeping at the commit phase in order to keep the commits as fast as possible.
Block cleanout is the name given to these housekeeping tasks, and it is performed by the next session to read the block. The idea of this is that many sessions could be doing the housekeeping in parallel, therefore keeping the overhead for any single session to a minimum.
During block cleanout, the following duties are performed, where applicable to the prior transaction:
· The SCN in the block header is updated to reflect the contents of the ITLs.
· The ITL entry is cleaned, including free space credit given back to the block, and the SCN is updated to reflect the current state of the block.
· The freespace in the block is updated in the row directory.
· The row directory is updated to show the deleted rows (gaining a tag of sfll instead of offs).
· The row data is actually purged instead of being flagged as deleted.
Block cleanout is a frequently forgotten part of the operation of Oracle, and it can affect the performance of the system significantly.
For example, data purge is mostly performed as part of the batch cycle in the evenings. Even if the deletes are kept separate from any batch access to the affected tables, the subsequent batch jobs will still be very much slower than the evenings in which the deletes do not occur. In fact, after a mass delete, it could take many times longer for a read-only batch job that follows it to complete. One good procedure for at least keeping things predictable in this scenario is to always perform full table scans on tables that have had large numbers of deletes performed against them.
The other side effect of block cleanout is that you might notice a sudden increase in redo writes that does not relate to the execution of a batch job (i.e., it has already finished). This is the logging of the block updates necessary as part of the block cleanout operation.
5.6 Redo Logfiles
The redo logfiles are responsible for the integrity of the database. All changes in the database, both committed and uncommitted, are written to the redo log.
Unlike writes to the data files4 in the database, the redo log does not operate through the buffer cache in the SGA. Instead, the redo log is accessed through the redo log buffer, which will be covered in the next chapter.
The redo logs are used to record all changes in the database by recording the changes made to the actual blocks. This includes changes made to rollback segments, and so the complete state of the database can be rebuilt by reapplying the redo information to a recovered database. When a commit is issued, all the session's redo records are written out to the redo log by the Log Writer (LGWR) process, in addition to a commit marker for that transaction. If several other sessions are to commit while this write activity takes place, all of the commits are serviced concurrently by the LGWR, in a process called group commit. In this way, the rate of writes to the redo log is not proportional to the rate of commits, and the system can scale more efficiently.
During a database recovery operation, the database is rolled forward to the specified point. During this rollforward from the redo log, all the rollback segments are rebuilt and any committed transactions are recommitted back to the database. Think of the rollforward as a more efficient replay of the work recorded in the redo logs. Once the recovery is complete, any uncommitted transactions left in the rollback segments are rolled back, in order to return the database to a consistent state.
Owing to the serial nature of the redo logs, it is important that the logs be optimized for write performance, because sessions issuing a commit will wait until the records are all synchronously (i.e., we wait for the operation to complete) written to disk. Therefore, locating the redo logs on a UNIX file system is almost always a bad thing to do unless special support is provided by the operating system for direct writes, because of the additional overhead of the file system buffer cache, where it provides no benefit to the write-only operation.
Often, the best way to optimize redo log placement is to build the logs on raw disk partitions and place the logs on their own disk, preferably with their own controller (although this gets less important when Ultra SCSI or fibre channel is used). In this way, the writing to disk is not interrupted by an external influence, such as another file operation or file system cache contention.
When the redo logs are placed on their own disks, the physical drive spindle is much more capable of sustaining a high write throughput, because very little (if any) seeking is required by the disk head. The optimal way to lay out redo logs on a large system is shown in Figure 5.8.
Figure 5.8 is more of a schematic than an I/O channel configuration. What it shows is that, at any one time, the LGWR process has a dedicated write channel through to the redo log. When Oracle switches the active redo log to the next defined log, it implicitly switches to another disk spindle, allowing the archiver process a free channel on which to archive the full log to the archive destination.
5.7 Key Database Objects
5.7.1 Rollback Segments
Rollback segments are also known as undo segments because of their role in database operation. During a transaction in the database, data manipulation language (DML) statements modify the actual data blocks that are affected by the operation. In order to have the ability to undo these operations, the before-image of the changed data is stored in the rollback segment. When the user issues a rollback command, or some kind of failure implies a rollback, this rollback information is applied back onto the data blocks, restoring them to their state before the transaction started.
In addition to undoing uncommitted transactions, the rollback segments are also used to construct a read-consistent view of the data blocks for all other users who access the blocks prior to the commit of the transaction.
The Oracle consistency model is covered in detail in the next chapter, because it uses many different attributes of the Oracle system, not the least of which is the buffer cache layer. In this chapter, we will concentrate on the physical types of rollback segments and their usage.
User Rollback Segments
User rollback segments are used to store the undo information for user transactions-that is, for nonrecursive SYS transactions. All rollback segments except the SYSTEM rollback segment are considered user rollback segments.
Full information on the physical usage of rollback segments can be determined from the Oracle Server Concepts Guide and won't be repeated here. However, some of the pertinent points will be restated.
If a long-running query is running against an object in the database, and other sessions are writing to that object at the same time, none of the undo information for the writing sessions can be released until the long-running query is complete. This occurs because the undo information for the writing sessions is required to reconstruct a read-consistent version of the modified blocks on behalf of the reading session. If the rollback segment needs to reuse these blocks that were retained for consistent read, it will do so and cause the reading session to fail with a "snapshot too old" error. The creation of consistent blocks will be covered in the next chapter.
A rollback segment's extents are reused in a circular fashion. If the tablespace that contains the rollback segment is full, or if maxextents is exceeded for the rollback segment, this affects how a rollback segment is deemed to be "full." As an extreme example, take the scenario where the first and last extents of the rollback segment contain active transactions. When a new transaction requires space in this rollback segment, it will attempt to reuse the first extent and will fail because active transactions reside in that block. This makes the rollback segment effectively "full" despite all the extents between the first and last remaining empty.
A common practice among DBAs is to use the OPTIMAL setting for rollback segments. This allows the segment to be automatically kept close to the specified size by Oracle. It does this by deallocating the oldest inactive extents in the segment until the segment is close to the requested size. This is a popular option, because it allows runaway jobs to operate in a common "pool" of rollback segments without leaving one or more of the segments dominating the space in the tablespace.
However, for a large-scale implementation, this is probably not an "optimal" approach, because it is working around a more fundamental problem. The problem is: What is the runaway job in the online pool of rollback segments? If it is a batch job, then it should use the SET TRANSACTION USE ROLLBACK SEGMENT call in order to assign it to some large, prebuilt rollback segments. If it is an online session, it probably shouldn't use so much rollback.
The online user session can only be using a lot of rollback for one of two reasons: either it is making a lot of changes in the database, or it is holding a cursor open for a long time, requiring a great deal of consistent read support from the rollback segment. In the first case, if the changes are UPDATE or DELETE, the session is committing a worse sin than growing a rollback segment, in that it is locking many, many records in the database and probably causing TX enqueue contention in the database. In the second case, it would be unusual to find a requirement that called for this kind of long-lasting, read-consistent view in an online session.
The other reason to avoid the OPTIMAL setting is that it can severely limit the capability to retain consistent reads, even for short periods, because extents that are maintained only for consistent reads are eligible for deallocation in order to shrink the segment back to its optimal size.
System Rollback Segment
The SYSTEM rollback segment resides in the SYSTEM tablespace. It is used in the same way as user rollback segments are used, but only for recursive transactions in the database, otherwise known as data definition language (DDL) statements. If no other rollback segments are created, this segment is also used for user transactions.
Deferred Rollback Segments
Deferred rollback segments are temporary rollback segments that are created in the SYSTEM tablespace automatically as needed. They are used to store undo information that cannot be applied because a tablespace is offline at the time of the rollback operation.
5.7.2 Read-Only Tablespaces
When a tablespace is defined READ ONLY, this tells Oracle that no writes can occur to the tablespace. If no writes can occur, there is no need to update the SCN for the file either, because nothing can change.
This tablespace declaration is most useful in data warehousing applications, where tablespaces can be marked READ ONLY once they are loaded. As the former restriction of a table residing in only one tablespace is now removed (when using partitioning), this becomes a very useable way of controlling the backup requirements for very large data warehouses.
Once a particular partition is loaded and is residing in its own tablespace, this entire tablespace can be made read only. After this is done, the files that make up the tablespace can be backed up to tape, possibly multiple times to protect against bad tapes, and then can be stored forever at an offsite location. As the files never change (unless they are taken out of READ ONLY mode), this backup is all that is required to effect a full restore of the tablespace.
5.7.3 Temporary Tablespaces and Temporary Segments
These two entities are closely related and overlap in function. A temporary segment can exist in any tablespace, but a tablespace of type TEMPORARY can store only temporary segments. Specifically, a TEMPORARY tablespace can contain only sort segments.
In release 7.3, Oracle introduced a tablespace of type TEMPORARY, and also introduced the principle of SORT_DIRECT_WRITES5 for all writes to temporary segments.
All users in Oracle have a tablespace assigned to them as their temporary tablespace. This tells Oracle where it should create sort segments when it needs to sort on disk. Complex join operations, GROUP BY operations, ORDER BY operations, and so on, can all result in a disk sort. In addition to temporary segments needed for the sort, temporary segments are created in the target tablespace of a CREATE INDEX operation, used for the actual construction of the index. Both of these types of temporary segments benefit from the direct write capability.
The direct writes functionality instructs Oracle not to use the buffer cache when writing sort segments-they can be written directly to the datafile by the active session.6 In addition to this, Oracle does not generate any redo information for the temporary blocks written. These two factors combined allow far greater performance for large queries than was possible before.
In addition to this change, the ability to create a tablespace of type TEMPORARY was introduced. This is a special type of tablespace solely for use in disk sorts and cannot be used to store permanent objects. In addition to gaining the direct write capability for all writes to this tablespace, the concept of a sort segment was introduced.
Sort segments were implemented to relieve the pressure exerted on the space transaction (ST) enqueue in the database. The ST enqueue is the lock that needs to be held before any space management can be done in recursive SQL. Whenever any new extents are created in or removed from the database, the ST lock must be acquired; if it is already in use, you must wait for it to become available. In a system where a great deal of this activity takes place, such as a busy system with a several disk sorts occurring at any one time, there can be fierce contention for the ST lock. This is massively compounded in a Parallel Server environment, where there is still only a single ST lock for all instances that are running.
The introduction of sort segment practically removes this type of ST contention. When a sort segment is first created, it requires the use of the ST lock, just as before. However, instead of acquiring the lock again to remove the segments once it becomes free, the segment is handed over to a pool of sort segments managed by instance-local latches instead of a database-wide enqueue. The next session that requires a sort segment can check the pool and reuse any free segments that the instance has allocated to it.
The net effect is that, once the instance has built enough sort segments using the ST lock, their management is then handled by the local sort segment management, leaving ST free for its other tasks.
5.7.4 Tables
The concept of a table is no longer as simple as it once was. In addition to the standard table and clusters that have existed since version 6, we now have partitioned tables, temporary tables, and index-organized tables. The index-organized table is covered as an index, because it resembles an index more closely than it does a table.
Standard Tables
A standard table is still the most common table in most Oracle databases. The structure of the standard table was covered under "Row Data" in Section 5.5.2.
Partitioned Tables
Partitioning is one of the big changes in Oracle8. This facility allows the administrator to break up large tables into several smaller partitions while accessing the table in the normal manner (see Figure 5.9).
The table is split into partitions based on key ranges on one or more columns, either by direct value or by hashing7 the value to gain the partition number. The choice of how the partitioning is performed is up to the administrator. In this example, the "bookings" table is broken up into four partitions by the first letter of the booker's name. Each of the partitions is stored in its own tablespace, although they could all be stored in the same tablespace if desired.
A good way to view partitions is as "subtables" of the global table. Oracle handles all the joining of the partitions to form a single view of the data, but also takes care of partition elimination (or pruning) using the cost-based optimizer. In partition elimination, the optimizer excludes certain partitions from the execution plan of a given piece of SQL based on the predicates in the query.
Partitioning brings three important features to the table:
1. Manageability
2. Performance
3. Scalability
The manageability comes from being able to perform maintenance at a partition level without affecting the other table partitions. Using partitions, we can lock an entire partition (for data maintenance, or to do local index builds8) while users continue to execute against the remaining partitions. Likewise, if a file goes offline as a result of hardware problems, only the partition that is within the file is affected; the rest of the table is accessible and does not require recovery.
We can also remove certain partitions from the table, either by turning these partitions into tables in their own right or by dropping them. If a table is partitioned by a range that also works well for the purge rules, the delete process can be as simple as dropping the oldest partition when its data becomes older than the required retention period.
In addition to purging of data, partitions allow far greater flexibility performing data loads. Data that needs to be loaded into the large partitioned table can first be loaded into an interim table that has the same definition. When all the data is loaded, the table can be merged into the partitioned table as an additional partition, making the data load comparatively transparent to the end user.
Performance gains in partitioned tables come from the fact that the optimizer can now physically exclude data from the search path. This is true for all dimensions: We now effectively have multiple tables where we previously had only one. Therefore, all operations on the table can be performed much faster, provided that the access paths are evenly distributed across the partitions.
In the case of INSERTs, we now have several segment headers from which to retrieve freelist information, and so the contention on the segment header is reduced accordingly. In the case of DELETEs and UPDATEs, a new facility is available in partitioned tables called parallel DML. Using parallel DML, these two operations can be executed in parallel from a single session, as long as the operations fall in different partitions.
The scalability advantages come from two dimensions that span the manageability and performance categories. As a table gets past a certain size,9 it simply becomes unmanageable in terms of indexing, backups-everything. A system cannot scale if the objects within it become unmanageable. With partitioning, a table can be processed in pieces, doing only what is required for one partition at a time.10
In a data warehousing environment, it makes sense to load the data by date range if possible. In this way, the data can be loaded into different partitions when the current one gets unmanageable to backup (this could mean a day, a week, a month, or any time period). When the loads start to go into a different partition, the previous one can be made read-only and can be archived off to tape as a one-time exercise. When local indexes are used (see Figure 5.15), the index building can be restricted to the comparatively small (relative to the whole table) partition.
From the performance perspective, we will cover less obvious reasons why partitions scale well later in the book. The more obvious reasons are the use of partition elimination in queries, which reduces the amount of unnecessary work that the database performs. Partition elimination is the process that the cost based optimizer (CBO) uses to determine whether or not certain partitions need to be read. Based on the values submitted for predicates using partition columns, the optimizer can elect to omit irrelevant partitions from the execution plan.
Hash Clusters
Hash clusters have been in the Oracle product for quite some time. Despite this, they remain relatively unused in the field, despite their performance and scalability advantages. The reason for this is that they also come with several limitations that make them unsuitable for most purposes.
The theory behind a hash cluster is that a column, or several columns, can be passed through a hash algorithm so as to produce the approximate location of the actual data in the database.
In the example shown in Figure 5.10, a user submits a query to the database with a single set of criteria: ID = 346782, Name = Jones, Flight = ZZ437. Oracle takes these values and performs a hash operation on them, deriving a hash value of 8465874852. Neither the user nor the administrator needs to know what this value is.
The content of the hash cluster is physically stored on disk according to the same hashing algorithm. Therefore, if the data exists in the cluster, it must exist under the same hash bucket that has just been calculated for the user's query. In this example, there are three rows that reside in this hash bucket, and so Oracle will check each one for equality against the supplied predicate and then return the data.
It is important to remember that a hash cluster will provide good performance only when
· There are a small number of records for a given hash value
· The data is accessed mostly through direct lookup as opposed to any type of scan
Keeping the number of records low for a given hash value is achieved by setting the HASHKEYS parameter appropriately on hash cluster creation. You will need to know a good deal about the current and future row counts for this table in order to set this parameter adequately, because the correct setting is likely to be different in six months as a result of table growth. Keeping the row count per hash low allows Oracle to go straight to the correct row based on the hash value, with no subsequent scanning of peer rows in that hash bucket.
Full table scans are very inefficient in hash clusters because of the relatively sparse block population resulting from a well-configured hash cluster. Likewise, range scans across the cluster key are not very efficient because of the number of blocks required to satisfy the request.
In summary, hash clusters work well where the data is relatively static and where the cluster is configured correctly for the number of hash keys. They scale well as a result of the small number of block visits (probably only one) needed to retrieve the single row requested. This eliminates the root and branch block contention encountered on very busy B-tree indexes. Hash clusters are ideal for very large reference tables, for all of these reasons.
5.7.5 Indexes
Gone are the days when there was only one type of index. We now have as many as five variations on the indexing theme. Although the old B-tree index continues to suit most purposes well, the performance requirement on indexing has increased over time, and new data access methods have been implemented to improve the capability of the database.
Structure of B-Tree
A B-tree index in its purest form is a binary tree structure that maps the path to the required leaf block by making yes/no answers. Figure 5.11 illustrates the structure of an Oracle B-tree.
Unless the index is very small (only one block required for all keys), it will have at least a root block and probably many branch blocks. There is no difference between a root block and a branch block except that the root block is always the entry point for the index and is therefore a kind of "super branch block."
When Oracle needs to find a row, and the query contains predicate information to allow Oracle to use an index, it will start by going to the root block of the appropriate index. The root block contains the information required to get to either the leaf block itself (in the case of small indexes) or the next branch block. The next branch block homes in further on the actual leaf block, based on the supplied data values.
The leaf block contains all the data values for the indexed columns and the rowid for the corresponding row in the table. If the query can be satisfied using only the columns in the index, the rowid is not used any further. However, if additional columns are required from the table, the rowid is used to access the row directly, without any scanning.
If the query contains a range of possible values for the key, Oracle can elect to perform a "range scan." Oracle will start at one end of the supplied range and use the links between the leaf blocks in order to go directly to the next applicable index block, without going back to the root and branch blocks.
Oracle maintains very shallow B-trees. This means that the number of root/branch blocks read before the actual leaf block is hit is very low-typically two (one root, one branch), with a maximum of four levels. This directly correlates to the size of the Oracle block, because this is the ultimate limit on how many pointers to the next level can be stored.
When an index block is filled, Oracle allocates another block from the freelist, as described previously. However, owing to the B-tree structure, Oracle normally cannot start to use the block immediately, because the inserted key value is likely to fall somewhere between existing keys (e.g., the inserted value is 1,000, and the leaf block contains a range from 500 to 4,000). In this case, Oracle performs an index leaf block split, which means that the upper half of the keys are transferred to the new block while the lower half remain in the old block. This way, the key can be applied to the correct leaf block in the B-tree. When the split is complete, the branch block above the blocks must be updated to reflect this new block's keys and location, in addition to updating the key range for the old block.
Reverse Key Indexes
Reverse key indexes are primarily an Oracle Parallel Server optimization. They are exactly what their name implies: The key is reversed before being inserted into the index. This allows a far more scalable spread of index keys when an ascending key is used for the leading edge of the index (see Figure 5.12).
This spread of keys is critical in a Parallel Server configuration where many instances can be inserting rows into a common, indexed table. One example of this would be invoice records, where each instance is creating invoices and by implication requiring insertion into the exact same leaf block. This can cause huge cross-instance contention, because the leaf blocks must be pinged between the nodes each time.
For insertion into a standard table, this is not a problem, because free-list groups allow each instance to maintain its own set of freespace for the inserts. This is particularly true when entire extents are explicitly allocated to certain freelist groups. In the case of a standard B-tree, this does not apply, because Oracle has to insert keys into specific blocks in order to preserve the structure of the B-tree.
When a reverse key index is used, the least significant portion of the key suddenly becomes the most significant, and the index keys "stripe" across the entire range of leaf blocks, reducing the likelihood of a hot leak block.
This benefit does not come without restrictions, however. As a result of the key now being reversed, a range scan is no longer an option, because Oracle may need to scan the entire set of index leaf blocks hundreds of times, or full scan the whole index, before the complete range is satisfied. Therefore, reverse key indexes are useful only where single-row fetches are used to retrieve the inserted rows. If this is not possible because of application design, alternative methods for reducing index leaf block contention must be considered in a Parallel Server environment.
Bitmapped Indexes
Bitmapped indexes were introduced in release 7.3 and were the first variation on the trusty B-tree index. Primarily designed for decision support queries rather than transaction processing, bitmapped indexes provide both excellent performance and comparatively small size.
They are referred to as bitmapped indexes because they provide Oracle with simple true/false answers as to whether the keys match certain criteria. In physical terms, the equivalents of branch blocks now contain the possible combinations of key values in the indexed column, equivalent to performing a SELECT DISTINCT column_name on the table.
The leaf blocks contain bitmaps for a range of rows in the table, declaring which combinations the row matches (i.e., TRUE) and which it does not (i.e., FALSE). These bitmaps are very small, and therefore the entire bitmapped index can be scanned very quickly.
For example, if we had a table called STUFF with a column called TYPE that contained the values GOOD, BAD, and INDIFFERENT, the result would be entries in the index leaf block of the form
col 0: GOOD
col 1: start dba
col 2: end dba
col 3: <bitmap>

This example shows the key to which this bitmap relates, followed by a range of dbas to which this bitmap corresponds. The bitmap consists of an array of bits set to 1 if the row it relates to has the value "GOOD," and to 0 if it does not. Every row is reported here, including NULL columns, which never "equal" anything. This makes a bitmapped index an acceptable method for counting all the rows in a table or where something IS [NOT] NULL.
Bitmaps are particularly useful in conjunction with other bitmaps, including themselves. To continue with the example, if a query were written as
SELECT count(*) FROM stuff WHERE TYPE in (`GOOD','INDIFFERENT');

there would be a bitmap "merge" of different portions of the same bitmapped index on STUFF. The "GOOD" and "INDIFFERENT" bitmaps would be ORed together to produce another bitmap that could be used to satisfy the query by counting all the set bits in it. This technique applies equally to different bitmaps, on different columns.
Bitmapped indexes are useful only if there are a small number of distinct values for the indexed column. If this is not the case, the overhead of building a bitmap for every combination can quickly make the bitmapped index unwieldy, and the advantages become liabilities.
Updates to a table covered by a bitmapped index are deferred until the entire DML operation is complete. This means that if one row or one million rows are changed, the bitmap will not be updated until the end of the operation. This does not affect readers, because the bitmap will always be updated before any other session needs it, because the update always completes before the updating session issues a commit.
Locking of the bitmap index is done on a dba range basis in order to maintain the whole bitmap portions. This is logical, considering the way that Oracle stores the bitmap fragments in the leaf blocks; managing multiple updates within a single bitmap would be very difficult. Despite being logical, however, this makes bitmaps tricky to use in a transactional environment, because the concurrency of updates is directly affected.
Partitioned Indexes
Just as tables can now be partitioned, so can indexes. However, in the case of indexes, there are more considerations to be made.
Indexes now have terminology associated with them that declares the scope of table coverage for the index. An index can be either global or local under this terminology. A global index covers all partitions of the underlying table, whereas a local index covers only the table partition to which it is assigned. A global index can be partitioned or nonpartitioned, and a local index is partitioned by implication that it covers a partition of a table.
When a global index is nonpartitioned, it is exactly the same as a standard B-tree index. It does not have to be built on a partitioned table.
The index illustrated in Figure 5.13 can be partitioned only in one of the two possible ways-that of a prefixed index and a nonprefixed index. A prefixed global index is partitioned on the leading edge column within the index. That is, if the index in Figure 5.13 were built on book_ref and book_time, the prefixed index would be split into several partitions based on ranges of book_ref or of book_ref and book_time combined. It could not be partitioned on book_time, because this would be a global nonprefixed index, which is not provided for by Oracle. The result is shown in Figure 5.14.
Global prefixed indexes are not as preferable as local prefixed indexes. However, sometimes it is not possible (or practical) to build local prefixed indexes, because the required indexes for any given table will often lead on columns other than the table partition key (as required for local prefixed indexes). Also, there are situations in which a partitioned index may be preferable for contention management reasons and partitioning of the table is not required at all.
The specific situation in which this may be desired is the case of heavy index root block contention. This kind of contention is possible for many reasons and can even occur when the index is simply being read a great deal. In this case, the object of contention is the latch protecting the piece of cache in which the root block is stored. In this situation, there is not much else to do (apart from checking for strange data pathologies) than to partition the index to form multiple root blocks. If the partition ranges are carefully chosen (by access rate, rather than by strict storage ranges), this kind of partitioning can result in multiple blocks in the cache, each protected by its own latch (assuming that the cache is well tuned-see Chapter 6).
Local indexes can be built only on partitioned tables, but can be either prefixed or nonprefixed. When prefixed, the index is partitioned on the same column and range as the table partition, shown in Figure 5.15.
With local prefixed indexes, each index partition directly maps to one corresponding table partition. The index can include additional columns but must lead on the same columns as the table partition.
With local nonprefixed indexes, the index columns do not have the same leading columns as the table partition. However, the index partitions contain only data related to the table partitions they cover: Leaf blocks in a local nonprefixed index point only to rows in the related table partition, not to any other partitions.
For example, assume that the bookings table were partitioned on book_name range, as shown above. A prefixed index on this partitioned table would be built using the same ranges of book_name. A local nonprefixed index could be built on book_time, and each local index partition would contain a mini B-tree of book_time keys, built locally on each table partition of book_name.
When one is determining which type of index to build, the following performance consideration must be taken into account. Prefixed indexes allow Oracle to determine which index partitions will be needed at execution time. When the index is nonprefixed, Oracle cannot know which partitions can be excluded from the search. Therefore, all index partitions must be examined at runtime in order to determine the locations of all the data, unless they can be eliminated on the basis of other predicates in the query.
Index-Organized Tables (IOTs)
Index-organized tables are also known as index-only tables. The best way to visualize an IOT is to view it as a normal B-tree index in which, instead of a rowid partnering the key columns, all the columns in the table are included in the actual leaf block (see Figure 5.16).
With all the columns included in the "index," the storage of an actual table is not necessary, and so the B-tree itself becomes an IOT, accompanied by a few changes in the way the table is accessed.
The major change is that an index does not have a rowid-only tables have rowids. The new unique row identifier is now the primary key on which the table is organized. In order to support the creation of additional indexes on the IOT, a new logical rowid is created in the new index in an attempt to allow direct access to the row without traversing the branch blocks of the actual IOT when a different index is used to access the table.
Unfortunately, the physical location of a given row in an IOT is not fixed, owing to the index block split phenomenon required to maintain the structure of the B-tree when inserting rows. For this reason, Oracle refers to the physical portion of the logical rowid as a "physical guess." This guess is stored in the logical rowid, along with the primary key of the row. When accessing the table through a secondary index, Oracle probes the guess point first to see if the row is still present. If it is not, Oracle must traverse the IOT, starting at the root block, in order to locate the row. In this instance, the physical guess is considered stale.
The only way that stale guesses can be rebuilt is by a physical rebuild of the secondary index. This includes a full scan of the IOT, because the physical locations cannot be determined in any other way.
Full table scans are still possible on IOTs, with the interesting side effect that the rows now come out in primary key order, regardless of the order in which they were inserted into the table. This occurs because the full table scan equivalent for an IOT is an index range scan.
Prior to Oracle 8.1, IOTs were not very usable, because several capabilities of a standard "heap" (normal) table were absent from IOTs. Included in this list were support for additional indexes, support for large object (LOB) columns, and the ability to partition the table. These restrictions have been removed in Oracle 8.1.
More importantly, Oracle has made a significant change in database operation that makes IOTs more attractive for widespread use.
Over time, tables become fragmented within the rows themselves, because rows are deleted from the table over time and updating of rows can result in row chaining. The only way to reorganize these tables, and to compact the data into the blocks, is to rebuild the table physically. In previous releases, these tables could not be rebuilt online, because a full table lock was the only way to ensure that the table was not updated while the rebuild was underway.
In release 8.1, this is still true. However, in this release, Oracle allows indexes to be rebuilt online, while DML continues against the underlying table. Because this also applies to IOTs, it therefore allows for full database reorganization online, including tables (IOTs only) and indexes.
5.7.6 Other Database Objects
Some objects do not fit under any of the prior headings, but are still worth reviewing in terms of their impact on the database.
Sequences
Sequences, or sequence number generators, are automatic number generators that are used mostly for the provision of generated primary keys for tables where this is preferred. These numbers can be cached within the SGA to improve the throughput for requests made against them. The implementation of the cache is simply the storage of a single number in the SGA, which can be incremented in memory without touching the on-disk representation.
The disk value is updated each time the cached "bundle" of numbers is depleted, and the value reflects the next value after the cache is all used up. This is a benefit for the DBA but a potential problem for application developers. Table 5.3 shows how caching relates to disk values in a two-node Parallel Server configuration.
This table clearly shows that the two instances can maintain very distinct ranges of numbers from each other through the use of caching.
The downside of this is that if instance 1 is shut down in the third step in Table 5.3, all 955 numbers in its cache will be "lost," because Oracle cannot write back the lower number after the other instance allocates its higher cache range. This will always result in lost sequence numbers, much to the frustration of application developers who were relying on sequentiality of numbering in order to provide, for example, invoice numbers.
Other ways in which numbers can get lost (in both non-OPS and OPS configurations) are (a) abnormal instance termination (crash) and (b) flushing of the shared pool when sequences are not pinned using the DBMS_SHARED_POOL.KEEP package.
The moral of this is that developers should be clearly told that sequence generators will guarantee only uniqueness. They do not, and will not, guarantee sequentiality.
The final implication of the cache model is that if the ORDER flag is specified on the sequence generator, this will implicitly defeat all caching and will go to disk every time the NEXTVAL is gotten. In frequently accessed sequences, this will result in contention in the SYSTEM tablespace and a large increase in recursive SQL in order to retrieve the next value from disk.
Packages, Procedures, and Functions
Packages are bundles of stored procedures and/or functions. They are provided to enable server-side processing using PL/SQL, aimed primarily at reducing the number of round trips to the application client over SQL*Net and improving code reuse across modules. Packages can become very large in complex applications and can impose large processing burdens on the database server.
As the database server is fundamentally the least scalable point in the network, it makes sense to review carefully the use of packages in the database and to have them perform realistic duties rather than being just convenient places to store large program modules. For example, use of packages as conduits to database server-based batch processing should be avoided.
Since release 7.3, packages have been able to use noncontiguous fragments of free memory in the shared pool when loading for the first time. This change made a huge difference in the impact of loading a large package into a fragmented shared pool. It is still very advisable, however, always to pin database packages into the shared pool at start-up, using the DBMS_SHARED_POOL.KEEP package. The method required for doing this is to reference the package once (for example, call a small function in the package) and then issue the KEEP package against it. The list of packages to keep should include the SYSTEM packages as well as the application packages.
A package is always executed with the identity and authorization levels of the owner of the actual package, not those of the user who executes it. The execution of the actual SQL within the package therefore shows up as recursive SQL calls, not user calls. Likewise, all statistics logged by the integral package contents are rolled up and also logged against the call to execute the package, in addition to the actual statements.
It's worth mentioning here that functions are very often abused by application programmers. It is commonly believed that a function is a very fast method of converting data and that this can be done on the server prior to sending the data back. However, it is impossible for a user-written function to perform anywhere near as fast as the built-in functions such as TRUNC(), because of both the language that the routines are written in and the code path required to get there.
Built-in functions are written in C and compiled into native machine instructions by the machine's optimizing compiler. They are called by branching to the function's memory location when required by the SQL engine. Conversely, a user-written function is written in PL/SQL an interpreted high-level language. It requires many orders of magnitude more machine clock cycles to process a given line of PL/SQL than to process a similar line of C. To execute it, Oracle must leap up through several functions, take out several latches, call the PL/SQL interpreter, and then go through the code. There is obviously no comparison between the impacts of these executions.
Try to limit the use of functions to the areas where no other solution is possible. When they are used in the select list, limit the row count to the number of rows that are actually required. This can be done using simple business logic (does the user really need a list of 3,000 reservations in a scrollable box?), or by some kind of arbitrary rownum restriction on the tail end of the query.
Triggers
Triggers provide a way of executing PL/SQL based on certain events occurring in the database. They can be fired BEFORE, AFTER, or INSTEAD OF the actual event that fired the trigger, and can operate on DML, DDL, and database events.
For example, a trigger can be set up to fire AFTER an UPDATE to a certain table. This is an application developer's favorite way of implementing audit requirements, because it requires no special considerations to take place on the part of the application. However, this is also an expensive way to implement what is already an expensive operation.
When the trigger is fired, it goes through the same kind of code path as demonstrated above in function execution. Typically, the code that processes updates will compare all the old values with the new values to determine which columns have actually been updated. This too is a very expensive operation, when the application already has this information.
Using triggers in this way is an attempt to gain some of the advantages of a multitier application without the work. Unfortunately, the work still has to be done, and it has to be done from scratch each time by the database server, with no context to start with.
The new INSTEAD OF trigger type is useful in allowing updatable views, because the trigger is fired instead of actually attempting the update or insert. Once again, however, be aware that this convenience is not without significant performance degradation when compared with accessing the underlying tables directly.
Perhaps one of the best uses for triggers is that of the database event triggers. These can be fired whenever there is a server error, log-on, log-off, or start-up and shut-down. Using the start-up trigger, for example, a package could be executed to take care of post-start-up tasks such as pinning packages.
Security Roles
Database roles are a facility in Oracle that allows transparent, role-based privileges based on login ID. These roles are designed to group together certain privileges in the database and to be able to assign them to users as single entities.
For example, a role could be created to allow users to connect to the database, to execute the ALTER SESSION command, and to access a set of specific tables in the database. This role is then assigned to a user, and that user can then perform any of the allowed functions.
A user can have several roles assigned, and can switch between roles based on location in the application and password authentication.
Although the use of roles is very flexible, allowing simple management of privileges within the database, it currently does not scale very well. In a very large system, where several thousand users are defined, in addition to more than 100 defined roles, the overhead of queries on the %_ENABLED_ROLES views, for example, can be very unwieldy. One example of this is on a large Oracle8 system, where a single user could generate 8 million buffer gets from a single roles query.
It does not take very complex use of the roles facility for this kind of burden to become evident. For example, Form 4.5 (in Developer/2000 1.X) uses database roles in order to implement menu security. This is not a very sophisticated use of the roles, but was the reality behind the 8 million buffer get example above. In this case, the problem was removed by
· Moving all the data from the database roles configuration to user tables
· Creating correct indexes on the new user tables
· Dropping the FRM45_ENABLED_ROLES view and recreating it so that it referenced the new user tables
· Changing all code that issued CREATE ROLE, ALTER ROLE, and so on, simply to INSERT, UPDATE, and DELETE the user table.
This made all accesses to this data go from 200 to 8,000,000 buffer gets to about 4, with subsequent improvements in both response time and system utilization on the server.
There can be an additional side effect that compounds the performance problems of using database roles. If the dictionary cache is squeezed to a small size in the shared pool, owing to an unruly set of SQL statements in the library cache, many of the calls to check database roles do not get hits in the dictionary cache. The impact of this is that many of the 8 million buffer gets end up going to disk, the recursive calls rate goes through the roof, and the system performance goes down.
5.8 The Data Dictionary
The data dictionary is a widely misunderstood entity in the database. It is comprised of two components:
1. The objects in the SYS schema
2. The dictionary cache (a.k.a. the rowcache)
It's true, the dictionary cache is the same thing as the rowcache. It is known as the rowcache because the unit of management is a row out of the SYS tables, rather than a block unit in the buffer cache for user tables. The dictionary cache itself will be covered in Section 6.8. For now, we will concentrate on the physical elements of the dictionary cache.
The data dictionary is physically stored in the database as objects within the SYS schema. For all intents and purposes, these tables are just the same as normal user tables in any other tablespace. However, the consistency model is completely different for these tables, and they should therefore never be written to directly by the user, even if the user has a good understanding of their usage.
Table 5.4 shows some of the more interesting tables in the SYS schema, along with their parent clusters. Many of the tables in the SYS schema are mostly joined with each other and therefore merit the use of a cluster for their storage.
The main time to be aware of these tables is when looking at the definitions of the DBA_% views and other SYS views. These views will break out into views of these underlying tables, and it is useful to be able to interpret the view definition, potentially to write a more efficient version of the view for specific reasons.
5.9 Chapter Summary
In this chapter, we have attempted to cover Oracle from a physical standpoint, without venturing into the operating architecture of Oracle. These two areas are finely interwoven, and some of the concepts in this chapter or the next may not make much sense in isolation from each other. For this reason, Part III should be read as a single unit.
Oracle is a complex product and has many more physical attributes than those covered in this chapter. It is for this reason that it is imperative to be well versed in all areas of physical Oracle, at least from a "Server Concepts" perspective, before attempting to implement anything on a large scale.
5.10 Further Reading
Various. "Oracle 8i Concepts," Oracle RDBBMS Documentation.
1
A "datafile" is a file containing information from a tablespace. Logfiles and controlfiles are not considered datafiles in the Oracle naming convention.

2
Header blocks should not be confused with block headers (see "Block Header" above).

3
Not using Oracle Recovery Manager.

4
Excluding temporary tablespaces, which neither generate any redo information nor go through the buffer cache for read or write activity.

5
This parameter is now obsolete. The functionality still exists, however, and is managed automatically by Oracle.

6
Apart from some sort directory information, which is still maintained in the buffer cache.

7
In Oracle 8.1 and upward.

8
See "Partitioned Indexes" in Section 5.7.5.

9
The meaning of "a certain size" keeps changing: as I/O and processing get faster, the "line in the sand" gets farther away. However, we are normally pushing way beyond wherever the line is drawn anyway.

10
Global indexes are the exception to this rule.



Scale Abilities Ltd
http://www.scaleabilities.co.uk
Voice: +44 1285 644533
info@scaleabilities.co.uk
TOC PREV NEXT INDEX