TOC PREV NEXT INDEX

Scale Abilities Ltd Logo


Part III
How Oracle Works
Chapter 5
Physical Oracle
5.1 Introduction
Oracle can be viewed as having two different groups of aspects that work very closely together and are difficult to separate cleanly for explanation. These two groups of aspects are the physical aspects, such as files and blocks, and the operational aspects, such as the memory architecture and the Oracle kernel.
This chapter covers the physical aspects of Oracle. It is not intended to replace, or to be a subset or superset of, the Oracle Server Concepts Guide. Instead, it is a look at some of the more important aspects of Oracle operation-aspects that are most important in building a very large system using Oracle. None of the various Oracle cache or process architectures are covered in this chapter; these features are covered in the next chapter.
5.2 Overview
What is Oracle? This chapter details the common understanding of what Oracle is-a front end to a load of data on disk. I frequently like to think of Oracle in terms of the other aspect-a complex operating system that also stores a bunch of data-and this is the aspect presented in the next chapter. Anyway, in order to cover the physical structure, we will briefly discuss the higher level and then drill down to some finer detail throughout the chapter.
The Oracle database consists of several different types of files that support its operation (as shown in Figure 5.1):
· Control files
· Initialization files
· Datafiles
· Logfiles
5.3 Control Files
The control files within Oracle store information about the physical database. Within each control file is the following information:
· Datafile information (names, locations, status, etc.)
· Operating information (database name, number of threads, number of datafiles, etc.)
· Redo log thread information
· Redo log information
· Log history
· Backup information (if using Oracle Recovery Manager)
Although these files are usually quite small (typically less than 3MB if not using Recovery Manager), their importance should not be overlooked. Therefore, it is common practice to have two or more control files for each database, placed on different disk drives and controllers to increase availability in the event of a failure. Oracle automatically mirrors (i.e., keeps two or more copies in sync with each other) the control files once they have been configured into the system.
In previous releases of Oracle, the loss or corruption of a control file while in operation would result in a database crash. This appears to have been fixed in Oracle8, where certain failures can be sustained provided that at least one of the control files is still available. If the control file that is lost is also the primary control file (i.e., the first-named control file), certain operations requiring the control file (such as queries on v$datafile) will fail, but the instance will remain in operation.
5.3.1 Datafile Information
Within the control file, there is a directory of all datafiles in the database. The control file stores
· Full pathname
· Size of file in Oracle blocks
· Oracle block size
· File status
· "Stop" SCN (system change number)
The first three of these items are self-explanatory. The file status can be set to one of many values depending on whether the file is online, offline, needing recovery, and so on. The "stop" SCN is the change number up to which media recovery will continue for that file. This is set to infinity when the file is online but to the current change number when the tablespace it belongs to is taken offline or made READ ONLY. This allows the media recovery operation to recover the tablespace only to the point in time at which it was taken offline or made READ ONLY.
5.3.2 Operating Information
This portion of the control file includes how many redo threads are currently open, how many datafiles there are, and how many logfiles there are. In addition, the maximum numbers of datafiles, logfiles, and instances specified at database creation time, and whether or not the database is currently running in ARCHIVELOG mode, are recorded.
5.3.3 Redo Log Information
This part contains information about each of the redo groups-which thread it belongs to, whether the thread is public or private, and which log group is the currently active one. It also contains more detailed information about the individual logfiles, including the full pathname, the size, the low and high SCN numbers for that log, the log sequence number, and thread allocation information.
5.3.4 Log History
This is the source of the V$LOG_HISTORY view. It contains historical information about logs that have been archived, including low and high SCN numbers and timestamp. The maximum number of entries for historical information is controlled by the MAXLOGHISTORY section of the CREATE DATABASE command.
5.3.5 Backup Information
If the database backup is performed using Oracle Recovery Manager, the control file is also used to store information about backups and file copies that have taken place. This information is used on recovery of the datafiles if the Recovery Catalog is not available.
5.4 The INIT.ORA File
I make no apology for this section of the book, because it is not going to be a verbatim regurgitation of the Server Reference Guide as found in some "tuning" books. The purpose of this section is to give real advice on putting together a safe, usable initialization file to be applied to a large-scale database system. However, explanations of some of the more important init.ora parameters are included as an appendix, because explanations are more meaningful once the concepts are understood properly.
5.4.1 Rules for init.ora Creation
On the face of it, the init.ora looks to be a fairly straightforward entity-that is, it simply contains multiple parameter=value declarations. This is almost true, and for the most part, simplistic declarations are all you need to worry about. However, there are some lesser-known facts about the way Oracle parses the init.ora that you should understand, because your life will be a good deal simpler and less confusing if you do.
Value Overrides
First of all, beware of inadvertently assigning a value to a parameter twice, and thus overriding the original value. This should not happen if you follow strict layout procedures as detailed in "Layout Recommendations" but if it does it can cost you many lost hours chasing system performance problems. A good example of this would be the setting of our old friend db_block_buffers. If you are the proud owner of a file that looks like this
.
.
db_block_buffers = 10000
#changed by JM
#db_block_buffers = 100000
.
.

then you are asking for trouble. Using this example, it would be very easy for the file to be changed accidentally (say, during maintenance) to
.
.
db_block_buffers = 10000
#changed by JM
db_block_buffers = 100000
.
.

Once the maintenance was completed and the database started up, everything would probably look good. It is likely that you wouldn't even notice the extra zero on the size of the database buffers section of the SGA on instance start-up.
The rest of this story is painfully obvious, involving "missing" memory on the database server, poor performance, and a very well hidden problem in that the initial setting all looks OK. In general, you should avoid retaining multiple settings for the same parameter, except under very specific circumstances, as detailed in the next section.
Value Concatenation
There are some init.ora parameters that can be legitimately specified several times in the file without overrides occurring. A good example of this is the event specification, used to set system-wide database events (normally for debugging purposes). If there is more than one event to set in the init.ora file, this can be achieved without overriding the previous one by keeping all the event settings bunched together in the file. This means that no other parameters can be set in between, as shown below:
# Correct way to specify events
event = "1547 trace name errorstack"
event = "4031 trace name errorstack level 1"
.
.
#Incorrect way to specify events
event = "1547 trace name errorstack"
processes = 3500
event = "4031 trace name errorstack level 1"

In the first example, both of the events are successfully set for the instance. However, in the second example we would end up with the 4031 event being the only one set, because Oracle has legitimately overridden the 1547 event.
This rule also applies to all other multiple-value parameters, such as rollback_segments and control_files.
init.ora File Sizes/Limits
When working on init.ora files, it is important to be wary of the limits Oracle puts on init.ora files. There are limitations on both the size of any individual file and the length of the string assigned to a parameter.
These limitations have been lifted in Oracle8 to the extent that they are unlikely to become a problem, even with enormous numbers of comments. However, caution should be observed on the remaining Oracle7 databases in your company.
The limits in Oracle7 are port-specific, but the values given in Table 5.1 are very typical for Oracle 7.3 on a UNIX platform
Layout Recommendations
When implementing a large system, it is important to maintain a highly readable init.ora file and to incorporate strict change controls. This helps to prevent costly mistakes when making changes in the database configuration parameters.
The first rule to follow when putting together an init.ora is to create and maintain the file in sections. This makes it easier to find specific components of the file when editing it and creates some structure in the file.
In the Oracle Reference Guide, it is recommended that all parameters be grouped together alphabetically; I don't like this idea, because it puts unrelated parameters adjacent to each other, but the choice of grouping should ultimately be something that makes sense to all the people who could edit the file.
Each of the recommended groupings will be presented individually.
Change Control Header
The first section of an init.ora file should be a change control header. This can also be formalized by the use of a source code management system such as rcs or sccs, but such a system should be used in addition to the inclusion of a header at the top of the actual file. The header should include the following information:
· Name of system
· Name of file
· Owner of file
· Purpose of file
· Change history
This is a very formal approach to file headers but makes all aspects of the file absolutely clear. Don't forget, DBAs come and go, and you don't want to have to explain every minute detail of the system to each new recruit. Here's an example of a header for an init.ora file:
########################################################
# $Id: CH05.html,v 1.1 2004/07/25 20:00:10 morlej Exp $
########################################################
# Copyright (c) 1997 xyz Limited
########################################################
# System : Reservations Management System
# File : initRMS1.ora
# Owner : Production Database Administration
# Purpose : Instance RMS1 initialisation file #
#
# Change History
# --------------
#
# Who Date Change
#=========== ========= ============
# J.A.Morle 01-Jan-94 Initial Creation
#
# J.A.Morle 08-Jan-94 Increased
# shared_pool_size->200MB
# to alleviate ORA-04031
#
# A.Einstein 01-APR-95 Added event 1547
#
########################################################
# ...etc...

It is important to keep the change descriptions meaningful, because over the life of a large system, it is likely that the reasons for change will be forgotten.
Common, Database-Wide, and Parallel Server Parameters
########################
# Instance/Thread Definition
########################
thread = 1
instance_number = 1
########################
# Common Parameters
########################
ifile = /opt/oracle/RMS/initshared.ora

This section includes all Parallel Server settings, if used. These may appear to be strange things to include as the first settings, but there is a reason for this, which will become clear when the contents of the initshared.ora file are shown. This file is located on a shared-disk volume, in order to be accessible by all nodes in the OPS cluster. It is therefore readable by all instances, which will all specify the ifile instruction in the main init.ora for that instance.
########################################################
# $Id: CH05.html,v 1.1 2004/07/25 20:00:10 morlej Exp $
########################################################
# Copyright (c) 1997 xyz Limited
########################################################
# System : Reservations Management System
# File : initshared.ora
# Owner : Production Database Administration
# Purpose : Instance independent (Common)
#   initialisation parameters
#
# Change History
# --------------
#
# Who Date Change
#=========== ========= ============
# J.A.Morle 01-Jan-94 Initial Creation
########################################################
db_name = PRDRMS
control_files = ( /opt/oracle/RMS/ctrl1.dbf,
    /opt/oracle/RMS/ctrl2.dbf,
    /opt/oracle/RMS/ctrl3.dbf )
db_files = 256
dml_locks = 0
ccf_io_size = 4194304
optimizer_mode = RULE
..contd...

This is the first part of the ifile, used for setting parameters that are common to all instances but are not Parallel Server parameters themselves. Notice the change control header at the top of the file, describing the usage of the file.
If Parallel Server were not in operation on this system, this part of the file would be included (inlined) as part of the main init.ora file, because the logic is no longer there to maintain a separate file. This is the reason for including this section first in the breakdown-it is common practice among DBAs to include the db_name, control_files, and so on at the top of the main init.ora file when running a single-instance database.
The eagle-eyed will have spotted the dml_locks parameter and will be wondering why it declared as a common parameter between instances-surely there can be different settings on each instance, can't there? Yes, and no. There can be different settings unless any instance has it set to zero, which is why it must be a common setting. More information as to why one would set it to zero can be found in the Parallel Server section of this book.
.....contd..
########################
# Parallel Server
########################
gc_lck_procs = 6
gc_freelist_groups = 100
gc_rollback_segments = 164
gc_rollback_locks = 40
gc_save_rollback_locks = 64
gc_segments = 1800
gc_db_locks = 29000
gc_files_to_locks ="1,34=4000EACH"
gc_files_to_locks = "7-8,18-19,21,30-31=1000EACH"

This is the final part of the initshared.ora file, containing all the settings specific to Parallel Server operation for this database. The gc_files_to_locks parameter is another good example of using multiple lines to build up a complex parameter value, such as demonstrated with the event specification in the next section.
Event Specification
Returning to the main init.ora file, we maintain a separate section for the event specifications :
Again, note that both of the event specifications are on contiguous lines, with no other parameters between them.
Other Groupings
Rather than waste paper by listing the remainder of the example init.ora file, the list in Table 5.2 should prove useful. It suggests categories for each of an example set of parameters.
Further Tips
Although in Section 5.4.1 it is advised that value overrides should be avoided, there is one special case in which it actually can be safer to have them than not to have them. That special case is when a special maintenance ifile is created and referenced at the bottom of the main init.ora file. The contents of the file could be
sort_area_size = 10485760
dml_locks = 1000
db_block_buffers = 100000

and the tail end of the main init.ora file could look like this:
.
.
################
# Maintenance Parameter specification
################
#ifile = maint.ora

When maintenance parameters are set in this way, they are very easy to take in and out in order to perform maintenance-only a single comment character needs to be toggled in front of the ifile specification.
Setting Up a Maintenance Instance
Another excellent way to toggle in and out of maintenance mode is to set up an alternative instance. This method is very flexible and completely removes any risk of leaving parameters unsuitably set when you enter production mode once more. The method for doing this is simple.
First, create another init.ora file with a name totally different from that used for the production init.ora file-for example, initMNT.ora file to indicate the MaiNTenance mode of the database. Within this file, use the same db_name and control_file parameters as in the production init.ora file. This will allow the newly created instance definition to access the production database. Next, set up all of the required maintenance parameters in the initMNT.ora, such as db_block_buffers and sort_area_%. If space is available, rollback segments can permanently exist in the database. These can be included in the initMNT.ora in order to bring them online for use during maintenance.
The maintenance instance is now ready for use. In order to use the maintenance instance, the following procedure should be followed.
1. Shutdown the production instance cleanly.
2. Ensure that all processes related to the production instance are cleaned up. This can take a few seconds to complete even after the instance has closed cleanly. If some of the processes do not die off naturally, manually kill off the remaining processes.
3. Ensure that the shared memory segment used to accommodate the SGA has been removed. If it has not, remove it manually.
4. Reset ORACLE_SID to point to your new instance definition (i.e., ORACLE_SID=MNT).
5. Start Oracle, and perform maintenance tasks.
6. When complete, shut down the instance and switch back to the production instance. It is vital to take good care that all traces of the maintenance instance have been cleaned up, in the same way as the production instance was cleaned up in step 2.
Using this technique, a totally different environment can be safely and easily set up for database maintenance. It is also worth noting, from a safety standpoint, that the production instance cannot be started while the maintenance instance has the database mounted, and vice versa, because Oracle takes out an exclusive lock on the datafiles.
5.5 Data Storage
5.5.1 Overview: The Physical Picture
It is important to have a full understanding of tablespaces (see Figure 5.2), segments, objects, extents, and so on. If you do not have a comprehensive understanding of these concepts, it is important that you read the Oracle Server Concepts Guide. This section deals exclusively with the different types of Oracle blocks, where they are found, and what they are used for.
5.5.2 Blocks: An Introduction
Every datafile1 is comprised of blocks, as illustrated in Figure 5.3. The size of the block is defined in the init.ora (db_block_size) when the database is created and cannot be changed thereafter. All I/O operations within Oracle are performed in some multiple of the block size. Blocks are used in order to simplify the I/O, storage, and caching algorithms used within Oracle.
The block size chosen for any particular implementation affects the fundamental operation of the database and therefore needs to be set to an optimal value at database creation. This is actually not as daunting as it sounds, because there is generally one range of sizes that are optimal for transaction-based systems and another for read-intensive systems such as data warehousing. The setting of the block size is a matter of balancing the read and caching advantages of a large block against the caching advantages of a small block when write activity is high. In this way, the determination of the correct block size is very much dependent on the operation of the database and will be covered later in this book.
Several different block types are used in Oracle.
First, there is a data block. The term "data block" is used to describe a block containing table data, index data, clustered data, and so on. Basically, it is a block that contains data.
Next, there are blocks that are associated with internal Oracle operation, such as header blocks and freelist group blocks. These blocks do not contain any user data but contain important structures that the implementor should be aware of when building a large system.
Data Blocks
The structure of a data block (see Figure 5.4) is the same whether the block belongs to a table or to an index.
Block Header
Each block contains a block header. This portion contains the block's identity in terms of
· Address of block
· Type of block (index, table, rollback, etc.)
· Interested Transaction List (ITL)
Although this piece is of variable size, it remains on the order of 100 bytes. This should be considered when assessing the efficiency of using small Oracle blocks: A 2,048-byte block would lose 5 percent in overhead, whereas a 4,096-byte block would lose only 2.5 percent.
An important part of the block header from a read consistency standpoint is the ITL portion. This is a variable list of "interested transactions" for that block, and the number of entries is determined by the settings of INITRANS and MAXTRANS. For each of these entries, the undo (rollback) block address, the transaction ID, and the system change number are recorded. This information is used by Oracle to facilitate the read consistency model, for which a small explanation is necessary.
When a query starts, the SCN is recorded in order to determine the start time of the query. This SCN is then compared against the SCN of the actual block when it is read in order to determine whether the version of the block is consistent with the query start time. If the SCN is greater than the expected SCN, a new version of the block (called a constant read [CR] block) is created in the buffer cache out of the applicable undo entries to get the prior version of the block. These undo entries are determined by reading the ITL, and so the ITL is the entry point for constructing the CR block. Note: The CR block is a cache entity only-it is never written to disk.
Once the relevant undo records have been applied, the SCN of the block is checked once again. If the SCN is still greater than the required SCN, the ITLs are reviewed again-because the entire block had undo applied to it, including the block header, the ITL entries in the new CR block also reflect the state of the block prior to the last change. Therefore, this process can be continued until the desired version of the block is created, or the required undo records are no longer available, due to forced reuse. At this point, the query session receives a "snapshot too old" message, because the read-consistent version of the block can not be constructed. 
Row Directory
The row directory is the secret of how Oracle finds a specific row within the block. It is what its name suggests: a directory of the rows contained in the block. It contains the offset in the block for each row it contains. Therefore, the block does not need to be scanned beyond the row directory before the index to the row data required is found.
Take, for example, a single-row lookup for which Oracle has already determined the rowid of the required row from the index on the table. Part of the rowid is the row number within the actual block, and therefore Oracle can read the row directory out of the block header and proceed directly to the actual row data.
It is worth noting that the row directory is not subject to any shrinking when rows are deleted. The space allocated for the row directory is never returned to the block but is reused on subsequent inserts. This does not present a problem in practice, because the rows that are subsequently inserted are typically of the same size as the deleted rows, therefore yielding the same number of rows per block. Anyway, we're not talking about a large amount of space for the row directory, unless you have tables with a very small row size.
Free Space
Free space is the currently unoccupied portion of the block that is used when new rows are inserted or when existing rows are extended.
Row Data
At last! This is where the actual data is stored for the block. The row data (or index data) is stored in length:value pairs for each column. That is, if a column for a particular row contains the value "YES," this is recorded as 3:'YES'. When a NULL column is stored, it is stored as a length zero, with no data associated with it. If these NULL columns all occur at the tail end of the column list, nothing is stored at all, because the start of the next row implies that the row had trailing NULL columns.
It is worth bearing in mind that when ordering the columns in a table it makes sense to put all the NULLable columns at the tail end of the column list in order to reduce storage and the amount of data Oracle has to read to determine that there are NULL columns.
Data Block Dump
The following block dump of a table block will help to demonstrate some of these attributes:
Starting at the top, we have the rdba, or relative data block address, of the block. The rdba is an internal rowid format for every block and is used by Oracle for every operation that addresses a specific block.
The SCN for the block is listed next, as described in "File Headers" below. The "tail" field is the last 2 bytes of the SCN, combined with the "type" and the "seq" for the block. This value is the mechanism used by Oracle to determine whether the entire block is consistent with itself after a recovery operation. The "tail" value is physically stored at the tail end of the block and should always match the SCN, seq, and type stored at the head of the block. If not, recovery needs to be performed on this block to bring the entire block into sync with itself.
Next, we have the block header information. The "Object id on Block?" piece is a throwback to version 6-it should always be "Y" now, indicating that the "seg/obj" value below it is the actual object number stored in SYS.OBJ$ (main constituent of the DBA_OBJECTS view).
The "csc" is the cleanout system change, or a short extract of the SCN used to show when block cleanout was last performed on this block. Block cleanout is covered in Section 5.5.3.
The "itc" value is simply a count of the ITLs on the block, which are listed farther down. The "flg" is either "-" or "O', where "-" means that the block is not on the freelist (i.e., more than PCTUSED used in the block), and "O" means that the block is on the freelist. The remaining fields denote which ITL's freelist the block appears on (if any).
Next are the actual ITLs themselves, containing the information described at the beginning of this section, required in order to construct CR blocks out of committed or uncommitted transactions that have modified the block. This represents the end of the transactional portion of the block header.
Then comes some information about the space usage in the block and the row directory, used to locate specific row offsets within the block. Included in the general information are the following:
· tsiz is the actual usable space in the block on which PCTUSED and PCTFREE are calculated.
· ntab is the number of tables (if this is a cluster).
· nrow is the number of actual rows in the block.
· fseo is the start of row data above header overhead.
The section beginning "0xe" is the start of the row directory, which shows once again that there are 61 rows in this block. This is immediately followed by the row directory entries, which are in the following format:
<directory_address>:pri[rownum-1] offs=<offset_address_into block>
One interesting thing to note here is that the rows are inserted into the block starting at the end of the block. In other words, the row data grows toward the row header, not away from it. This is reflected in Figure 5.4, where the free space is shown as being adjacent to the row directory.
The last thing in the block dump is the actual data, formatted as described under "Row Data" above. Above each row is a header showing the row number in the block and a summary of the content:
· tl is the total number (in bytes) of the row plus the header.
· fb is a set of flags where, in this example, we have a "H"ead of a row, and we have the "F"irst piece and "L"ast piece of the row: this row is not chained.
· cc is the column count for this row.
The actual data follows, one line per column. In this dump, the first two columns are SYSDATE and NUMBER(10), respectively, stored in Oracle internal format. The following four columns are VARCHAR2 columns, and their contents can be viewed by turning the hexadecimal values into ASCII characters.
Header Blocks
There are several different types of header blocks.2 Each of these types of header blocks has a very different purpose and will be covered in turn.
File Headers
The first block of a datafile contains the file header. Stored in this block is information such as the file number, the database that it belongs to, and the last change number for that file. The overhead of this block is important to remember when creating datafiles within Oracle using raw disk devices.
When sizing the raw disk partitions, it is important to remember to add the overhead of the file header block. It is common to find raw disk based Oracle databases with the file size within Oracle defined as 99M, 199M, 499M, and so on. This is because the raw disk partitions have been sized to exactly 100M, 200M, 500M, and so on, and no provision has been made for the file heade