
n order to build a large system of any kind, it is important to have a good understanding of some basic scaling principles. By understanding these basic principles, it becomes much easier to interpret the requirements for building a large-scale Oracle-based system, and to address performance issues in existing systems that are not scaling satisfactorily. This chapter does not aim to cover all aspects of a system that affect scaling, but does intend to cover the more important aspects, giving examples that highlight situations in which they are found in real systems.
To make this old-fashioned and generic-sounding definition a little more explicit, here are a few examples of scaling in action.
The word "increasing" appears in each listed item: that's the scaling that we are interested in for this book-increasing the system capability, whether "system" is defined as hardware, software, or, more typically, both. Indeed, the most common view of scaling is that of hardware scaling, even though scaling has at least as much to do with the software components as with the hardware. Nevertheless, it serves as a useful (if overused) example of what we mean by scaling.
If we were to double the number of processors in a system, would we achieve twice the processing capability? The answer is probably "no." Scaling, like most things, can be classified in either of two ways: good (efficient) scaling and poor (inefficient) scaling.
Good scaling is observed as a highly linear increase in capability as the system grows. Using the tired old hardware example, this means that the system would demonstrate a near-linear gain in capacity when additional resource is applied (see Figure 1.1).![]()
In Figure 1.1 the capacity of the system is increasing at almost the same rate as the resource being added. This means that, as long as the software scales equally well, the system will be able to process almost n times the data within a given amount of time if n times the resource is available.
Poor scaling demonstrates a negative deviation from linear when a system grows. This means that the system capacity diverges farther away from linear when additional resource is applied (see Figure 1.2).![]()
The poor scaling illustrated in Figure 1.2 shows that there is actually less system capacity available when all of the resource is applied than when the system had less than one-half of the total resource. This is an example of what can happen when a system scales badly: performance goes down, not up. This is true of both software and hardware scaling, and is of particular concern when a large system is being built. Good scaling is a function of both hardware and software, and unless both are optimized to scale, overall system throughput will be considerably less than expected.
Scaling can be viewed from two different perspectives: a speedup of tasks within the system; and an increase in concurrency in the system, sometimes referred to as scaleup. Both speedup and scaleup use the same principles, applied in slightly different manners.
If a job takes n seconds to run under normal circumstances, and it is desired that the job should run in n/2 seconds, then a twofold speedup is required. This speedup can be gained in either of two different ways:
Doubling the execution capacity of the existing components is great if it is possible: it allows the required speedup to be achieved with no increase in complexity and with few concerns as to the required scalability. In fact, speedups gained in this way are not really subject to any scalability concerns at all, assuming that the entire system is doubled in execution capacity (including processor speed, cache sizes, communications latencies, clock speeds of all components, and so on). Unfortunately, this is rarely an option, because speedups are frequently needed immediately after the initial rollout period or are required to be many times the speedup offered by faster parts. If the system has just been installed, it's probably already using the fastest components available, because vendors rarely sell parts that are not at the highest specification available for their system architecture. On the other hand, another vendor may have dramatically faster hardware, and a reassessment of the platform may need to be made; this, however, is an expensive option and is potentially unnecessary when other techniques, such as parallelization, are available.
The most common option is to break the job down into smaller "bite-sized" pieces and have the system process them in parallel. This technique, known as job partitioning, allows the system to be added to where necessary to increase the overall system bandwidth. This has a scalability implication for the system and makes the speedup element dependent on the concurrency available within the system. Parallelization of tasks is the common scalability requirement for decision support systems. One of the most common parallel activities in a decision support system is parallel query, which is a good example of speedup.
In a DSS system, there is a query that requires a full table scan of a very large table (1 billion rows). This table is spread over many disks and many controllers, and the system has many processors available for the execution of this query. Now, if this query were to be executed without any partitioning of the workload (see Figure 1.3), it would simply take a long time for a single processor to read and validate every row in the table against the predicate in the query. No number of additional processor boards would make this query faster, and faster processors would probably limit the speedup to two to three times, even if you were lucky and were not bound by the serial disk access anyway.![]()
Using Oracle Parallel Query, this job can be made many times faster, potentially providing a speedup of an order of magnitude or greater. This is achieved by logically breaking the large monolithic table into several smaller logical ranges. Once this has been done, additional slave processes can be used to scan each of the logical ranges in parallel, thus scaling the query across several processors (see Figure 1.4).![]()
Good scalability in this context could be defined as "achieving maximum useful concurrency from a shared system." This is the major area of concern for transaction processing systems.
In multiuser systems, the system must share out all of its resource to all the users of the system. It must do this in a fair-share manner, prevent sessions from corrupting others, and not consume excessive resource or impose artificial bottlenecks in the process. In order to achieve this desirable state of concurrency, the system must scale effectively.
Achieving concurrency is one of the greatest challenges when building a system, whether you are the hardware designer, the RDBMS vendor, or the end user. In the building of large-scale systems, it is the concurrency that must be considered at all times.
Strangely enough, concurrency is made possible through the use of enforced serialization. Through serialization, we achieve the necessary synchronization required to process tasks concurrently.
Synchronization needs to occur on any concurrent system and is most commonly handled through the use of latches and locks. The specifics of how these structures are used will be covered in more detail later in this chapter.
Latches themselves also need serialization support, in this case from the hardware itself. In order to take out a latch, a bus transaction must be performed in order to write to the memory location that contains the latch. On a multiprocessor system, many processors could be attempting to write to the same location concurrently, and therefore the bus transactions need to be arbitrated in order to allow them access one at a time. This will be discussed in more detail in Chapter 2.
In fact, synchronization sometimes also needs to occur for straight performance reasons between cooperating programs on the same or different systems, and this needs to be carefully designed if maximum concurrency is to be maintained. A good example of this is two batch jobs operating in parallel against the same data set. Oracle will automatically prevent the sessions from corrupting each other at the transactional level, and will take out implicit locks to enforce this. If these jobs collide on their respective data sets, they will get caught up in serious contention between each other as a result of Oracle protecting the data from transaction level corruption. This is especially true with Oracle Parallel Server, because Oracle needs to use locks to protect both the transaction layer and the cache layer.
If these sessions partition their data access and cooperate with each other before Oracle enforces it, they will be able to process a good deal more data. This results from the developer of the batch jobs having more knowledge of the actual data itself, and therefore knowing how to synchronize more intelligently. Either way, the synchronization has been performed in order to allow the two jobs to run concurrently without corrupting data.
Latches and locks can be viewed as very similar entities, and in many software engineering circles there is no difference between the two terms. However, in an Oracle environment, there are apparent differences between them. The basic differences are
· Duration. Latches are normally very transient, whereas locks are typically a longer-term prospect.
Their function remains identical-to prevent other sessions on the system from modifying the critical resource that they are protecting. The Webster's definition almost describes a modern software latch accurately, in that it captures the spirit of the transient operation. However, the door is most certainly bolted when a latch is taken out-it's just faster to unbolt after use.
As we started to explain in the description of concurrency, it is inherently necessary to serialize some operations on a concurrent system in order to protect different tasks from corrupting each other. This applies only to access to shared resources. Let's run through a couple of examples to help get the real point across.
Let's take a complex memory update as an example in order to add substance to the explanation. This kind of operation is typical in large transactional systems, in order to update several memory blocks atomically (i.e., as a single, indivisible operation). Figure 1.5 shows a region of memory that is protected by a latch.
The diagram in Figure 1.5 shows a memory buffer that is used to batch together many physical I/O requests before issuing a write for all of them at some later stage. Many user sessions on the system write a few hundred bytes of data into this buffer, and then update a structure at another location to reflect the new address to which other sessions can write. In this scenario, there are could be many thousands of sessions on the system that would potentially need to write into this buffer at any one time, and it is clear that some kind of serialization is required in order to prevent a race condition2 from occurring (see Table 1.1).
In the example shown in Table 1.1, the buffer is corrupted through a race occurring between the respective tasks. Not only were the 50 bytes written by user 2 overwritten, but the 300 bytes written by user 1 were effectively truncated due to user 2 setting the "next" pointer incorrectly as a result of a race condition.![]()
![]()
It can be seen that this operation must be protected from other sessions while it is in progress. This is where latches come in. Using the same example, Table 1.2 should make it clear why latches are required.
In this situation, a latch is taken before the copy, and thus the area is protected from other sessions until the copy and subsequent pointer update are complete. It is obviously important to keep the period during which the latch is held as short as possible, and also to minimize the contention for the latch. This means that this necessary serialization of the latch is of the lowest possible impact. However, as can be seen in the example, it is not uncommon for the operating system to intervene, and to context switch the latch holder off the processor and back onto the run queue. This can happen for several reasons, including standard process scheduling and receipt of an interrupt on the processor in use.
Despite user 2's process being switched off the processor, a race condition is prevented in the situation shown in Table 1.2. This kind of protection is vital in multiprocessor systems, because there are many ways in which race conditions can occur.
As stated earlier, the term latch is normally associated with memory locks that are held for short periods. The term lock is frequently used as a catchall for all types of locking, using the word as an adjective. When we are dealing with Oracle databases, however, it is correct to use the word in terms of data locking. These are longer-duration locks that are used to protect the integrity of changes made to the database, such as user transactions and system (recursive) transactions. In the case of a user transaction, the durations of these locks are sometimes controlled by the users themselves, as follows:
No other user can update or delete the row with jref = 123456789 while this lock is held. This example is very simplistic, and it is hard to see why you would need this facility. The truth is, however, that some transactions are so complex that the individual pieces of the transaction need to be protected in order to ensure that the entire transaction can be completed in a consistent, atomic fashion.
It is also worth noting that, in the example above, we explicitly lock the row that we need to update prior to updating it. This is not actually required for most types of operations, because the update statement will implicitly lock the row before the update is performed. This kind of explicit row-level locking can cause serious problems in a large-scale system and should be reserved for situations in which an atomic transaction is absolutely required and cannot be performed in any way other than SELECT ... FOR UPDATE. In fact, a recommendation will be made against the widespread use of overzealous locking in "Maintaining Concurrency" in Section 9.1.1.
An important difference between a lock and a latch within Oracle is that a lock is typically database-wide whereas a latch is typically local to the instance that is managing it. As it is database-wide, a lock affects all instances of a Parallel Server environment, not just the one executing on the local node. There are some exceptions to this rule, but in general it holds true.
Another difference is in the way that sessions wait on the locks. See Section 1.2.3 for more information on lock waits.
If something (such as a memory structure, data, or an entire program flow) requires protection with latches or locks, then it follows that other sessions, processes, or types of execution context must have some way of waiting on those latches or locks. These waits fall into two categories: active and passive.
An active wait is the method most commonly used for latch allocation. When a session cannot acquire a latch on the resource that it requires, it goes into an active wait on that latch. In active wait, the session repeatedly attempts to acquire the latch until it is successful-that is, waiting impatiently. The reasons for this are that (a) it is unlikely that the latch will be held for very long and (b) it is normally too expensive to set up any kind of queue for the resource.
Sessions that cannot acquire a latch immediately will spin until they get the latch. Spinning involves locking into a tight loop testing the latch and trying to acquire it. Spinning uses significant CPU resource but typically results in acquisition of the latch in a shorter time than it would take to establish any kind of formal queue. These types of latches are also referred to as spinlocks, owing to the nature in which they are waited on. The active wait is frequently implemented using the host's test-and-set functionality-a hardware-level serialization instruction (see "Test and Set" in Section 2.1.2).
The actual algorithms used for waiting on latches/spinlocks varies between implementations. In the Oracle kernel, for example, a process will spin on a latch for spin_count (init.ora parameter) iterations before backing off and sleeping. The process will then spin again before going to sleep once more; the actual sleep time between spins increases exponentially. The reason for the incremental backoff is to reduce the likelihood that the system will disappear off the map as a result of all the processes locking into infinite spin loops.
The alternative method of waiting on a serial resource is to perform a passive wait. Typically, these waits are used on less timing-critical components, of which a TX (row-level lock) is a good example. Such a lock is known within Oracle as an enqueue.
Enqueue is a strange word that has an opposite sibling called dequeue. Once the two are put side by side, their meanings become more apparent. They are both operations on a queue: one enqueues requests (puts them onto the queue) and the other dequeues the requests (takes them off the queue).
Oracle uses enqueue functionality to manage locks within the database. This includes the TX lock, as stated above, and the ST lock, which are more correctly termed the TX enqueues and the ST enqueue, respectively. In comparison with latches, none of these enqueues need to respond quickly to multiple requests. Several milliseconds or even several seconds would be easily fast enough for higher-level serialization such as this.
When a user takes out a row lock, a TX enqueue is created for that user. If another user subsequently attempts to update that same row, that user's session will block (wait on) the enqueue that the initial user created. If a third user tries the same update, he or she will also be blocked on the same enqueue. This is visible to the database administrator as multiple entries in V$LOCK with the same TYPE, ID1 and ID2. The first session will have an LMODE (lock mode) of 6, indicating an exclusive lock gained on that enqueue resource. All the other sessions will report an LMODE of 0 (no lock gained) and a REQUEST of 6 (exclusive lock).
At this point, all the waiters on the enqueue are using zero processor capacity on the database server-they are blocked. This is the nature of a passive wait. However, if the sessions are not actively trying to allocate the lock, how do they ever get it? The answer lies in the dequeue part of the equation, and this occurs once the first lock is cleared.
The first user will issue a commit or rollback against the updated row data protected by the TX enqueue. Once this occurs, the second user session is unblocked and is granted the LMODE of 6 on the enqueue. When that user commits or rolls back, the third session will get a piece of the action, and so on. This is the enqueue/dequeue function in operation.
Not all enqueues work in a FIFO (first in, first out) manner like the TX enqueue. The other example given, the ST enqueue, operates in a LIFO (last in, first out) mode, apparently as an optimization in its sharing algorithm.
When operating systems and relational database engines are designed, special attention is given to minimizing contention for locks on shared resources. This is of obvious importance for the scalability of these systems and merits a good deal of attention from the designers. However, we have already established that locking is necessary within these shared systems, so what can the system designers do to alleviate this contention?
There are several things that can be done where appropriate. Sometimes a good deal of reworking of the logic is necessary to implement some of these techniques, but the engineering impact can be enormous, and so these techniques are often considered as last-resort measures. We'll go through some of the more common methods of reducing lock contention in turn.
The first thing that can be affected is implementation of locks/latches with different degrees of exclusion. This is done in order to provide as much concurrency as possible even though it is necessary to lock things. In simplistic terms, there are three different states that a lock can be in:
· Acquired Shared Exclusive. Otherwise known as a multireader lock, this lock allows any number of readers, but no session can acquire an exclusive lock in order to perform a write.
The use of multireader locks allows the system to process read-only workloads while maintaining the data integrity of the system. It is frequently the case that a process may only need to prevent the resource from being updated while it performs some other task, and the use of multireader locks allows many processes to share access to the resource concurrently in read-only mode. If a process needs to modify the resource that the lock is protecting, the lock must be upgraded to an Exclusive mode before this is permitted. This would then prevent any readers or writers from accessing the region.
These three degrees of locking are used throughout both the UNIX kernel and the Oracle Server. Oracle actually defines even more locking degrees than this, in order to provide even more control over the required degree of locking. These locking degrees are described in detail in the Oracle documentation.
A large performance benefit results from maintaining different locking degrees, and this forms a fundamental part of the internal locking mechanisms used in modern systems.
A more complex technique is to provide fine-grained locking within the system. This involves reducing the sizes of the regions covered by locks in the system, so that many locks can be used independently in the place of a single lock. This allows far greater concurrency within the system. Fine-grained locking for persistent types of locks, such as transaction locks, is good for all systems (uniprocessor and multiprocessor), while fine-grained latching is most useful on multiprocessor platforms where multiple latch acquisitions can be made concurrently by different processors.
· Row-level locking with the Oracle transaction layer. Only users that need to access the same row as another user will contend for this data lock. Other RDBMS systems have relied on page-level locking or, even worse, table-level locking.
· Multiple library cache latches. In the pre-7.2 releases, a single latch enforced serialized updates to the shared SQL cache. In releases 7.2 and above, Oracle allows multiple concurrent updates by providing multiple latches. The latch used is determined by the hash value (see 1.4) of the SQL statement.
· Function locks within the UNIX process table. For example, a separate lock can be used to cover the structure that contains pending signals.
Implementation of fine-grained locking often is very difficult for system engineers and frequently requires redesign work. For this reason, it is common for portions of systems to be left for the users to "suck and see" as to whether the system requires fine-grained locking. Although you may not be able to fix any of these problems personally, it is important that you report them to the responsible party if it can be demonstrated that the lock granularity is causing a performance problem. This is usually easier to do within the Oracle product than within the UNIX kernel, because the latch activity is very well reported through the V$ views.
The best way to eliminate lock contention within a system is to use algorithmic enhancements in the software. By changing the algorithms used within the kernel, it is sometimes possible to eliminate the need to lock at all in certain areas. Many such algorithms have been produced within research circles, and some have been implemented into production software to aid scalability.
For example, assume that in a shared-memory-based application, there is a single writer and many readers of a single record in shared memory. The record is larger than the largest atomic update that is possible on the platform. A bad algorithm for dealing with updates to this record would be to lock the record and prevent the readers from accessing it while it was being updated consistently by the writer.
A better way to code this problem is to maintain a master list of pointers to the actual records. The reader processes will first read the master list and then follow the pointer to the actual record. Using this method, a newly updated record can be written to a private memory address, and the pointer in the master list can be updated atomically to point at the new location when the update is complete. This way, no locks have been taken out at any time, and all reader processes can execute in a contention-free environment.
The example shows a simplistic change that removed the need for locks in that part of an application. However, this example is about as simple as it gets: if there is more than one writer, for instance, this technique becomes more complex. The level of effort required to eliminate locks using different algorithms varies from the implementation of simple hashing algorithms to complete redesign of the software architecture. For some truly shared resources, algorithmic changes are impossible as a result of logic restrictions.
A linked list is simply a programming construct that allows the programmer to traverse a list easily without relying on the physical storage for any specific layout. The reason that linked lists are detailed in this book is because lists are used extensively within the Oracle RDBMS, and it is necessary to understand why they affect performance.
So, what does a linked list look like? In C, a linked list is composed of several simple data structures similar to the following:
Once the list has been loaded, it is very simple to manipulate it; inserting or deleting an item is almost as straightforward as adding an item to the end. Equally, shuffling the entire list in a sort operation is very efficient, because only pointers are changed-the data is never relocated.
So, it should be pretty clear by now what a linked list is and why it is found in nearly every program that manipulates data. At the end of the day, that's exactly what Oracle is.
Linked lists are used to manage working sets of information, with the classic example being an LRU (least recently used) list as commonly found within Oracle. An example of the use of LRU algorithms can be found in "LRU Algorithms" in Section 1.5.4. They will always be updated in conjunction with latches, because multiple sessions will use the lists concurrently, and without latch protection even simple list manipulations can produce corruption. Using the LRU list example, it can be seen that it is straightforward to add items to both ends of the list, if pointers to the "hot" (most recently used) and "cold" (least recently used) ends are kept in fixed locations.
Inserting an entry into the cold end of the LRU list is very straightforward. You can already assume that the entry currently on the cold end is ready to be removed from the list, because it is already the "least recently used." Therefore, you simply overwrite the contents of the entry that is currently on the cold end.
Inserting an entry into the hot end is slightly more complex, but is still straightforward using linked lists, as shown in Figure 1.6.![]()
1. Take out the latch that protects the list. Go to the cold end of the list (element one) and set the "next" item pointer to be NULL, indicating that there are no entries past this point (i.e., this is the hottest end of the LRU). Set the "previous" item pointer to point to the entry that used to be hottest (element four). Put the new contents into element one.
3. Update the "previous" item pointer here to be NULL, indicating that there are no more entries (i.e., we now have a new cold end of the list).
4. Go to the item in the list that was previously hottest (element four) and change the "next" item pointer to point to the new hot end.
Using lightweight memory manipulations like those above, lists are one of the most efficient ways of manipulating data in memory. It can be seen, though, why the list needs to be protected from other reads and writes while it is being manipulated. If another session were to start to read the LRU list while the updating process was at step 2, it would get very confused indeed owing to the interim state of the pointers. This would likely result in either data corruption or a system crash.
As well as inserting entries into the linked list, the list is often scanned to find particular entries. Using the LRU example, the LRU list may be scanned until a block has the correct status, or indeed to heat the entry to a higher (more recently used) point in the list. This can prove to be an expensive operation if the lists get too long (especially because you are holding the latch during this period to prevent other sessions from updating it under your nose), and so linked lists should generally be kept as short as possible.
Any linked lists that are managed while a latch is being held are potential bottlenecks in the system. For this reason, it is imperative that high-contention lists be kept as short as possible, unless of course the list is never fully scanned. If this is not ensured, the latches will be held for a longer duration than is desirable, and waits will build up on the latch.
The db_block_buffers parameter is absolutely typical of this management and is subsequently prone to contention. All the buffers in the cache are split up and managed by separate lists, each of which is governed by a latch. This applies to both the LRU list for the blocks (from release 7.3 onward) and the actual blocks themselves (db block hash buckets).
When the buffer cache gets very large, these lists grow with it. So while you are gaining the extra benefit of running with a larger cache, you are also incurring greater overhead from the management of the chains, and you are probably suffering latch contention from the management of the chains. This was particularly true prior to release 7.3, when there were no multiple chains for the LRU list. This meant that the LRU list did not scale at all well prior to release 7.3, and the only way to manage the contention for the cache buffer LRU chain latch was either to decrease the number of blocks in your cache or to tune the database writer. This had an obvious impact on performance if your system was relying on the enhanced cache hit ratio available with the larger cache size.
In this case, it is wise to study the work profile of the system to determine whether the block size is too small for the application. If this were the case, the option would be open to increase the block size of the database, meaning that the cache can be larger than before with no overhead in contention for the LRU chain latch, and no increase in management overhead.
It can be seen that keeping chain lengths to a minimum is a good idea until all the software vendors do it automatically. Many of the chains that are managed are not controllable by the end user and thus are not an option. It is important, however, to keep tight control of the ones that you do have control over, an example of which is demonstrated in "Avoiding High-Contention SQL" in Section 9.1.
Hashing is another technique for speeding up the management operations of a system. In particular, hashing is an approach taken to achieve dramatic speedup of searches for data within a list (usually within a cache; see Section 1.5). Hashing is used just about everywhere possible in high-performance systems because of its efficiency and its simplicity.
Hashing is the term given to performing some kind of algorithm on a piece of data in order to determine an index number for those contents, rather than performing value comparison along a potentially long list of entries. That index number can then be used as a reference to either an absolute piece of memory or a smaller list that can be searched further to find specific entries.
Hashing algorithms range from incredibly simple manipulations, such as those found in a CPU cache, to more complex ones such as those used to manage the Oracle Shared SQL cache (library cache). A simplistic algorithm might be used to split numbers into buckets known as hash buckets in order to keep some statistics up to date:
Using this algorithm, a number can be quickly assigned to the correct hash bucket, where the hash buckets are used as shown in Table 1.4.![]()
It may not be clear to you at this stage why you would want to group things together into buckets in this way. The answer is that things are easier to find once they are contained in known places. To see if the number 68 has been stored, one would only have to calculate the bucket number (68 MOD 7) and then compare 68 against the 14 possible entries in that list. This is much faster than comparing it against values in 100 possible locations.
One good example of this is the Oracle buffer cache, which manages the blocks within it using latch-managed linked lists. Rather than have a single list that covers all the cache block entries, the blocks are divided into thousands of buckets, in a way similar to that shown in Table 1.4. When a block address is known, it is hashed to determine the list (bucket number) that could contain it, and then the list is traversed. In this way, any list manipulation can be constrained to a much shorter chain while still gaining fast access to the chain though the use of hashing.
This will be covered later in this chapter in more detail, but serves as a good example here. Basically, a CPU cache is comprised of several lines. A hashing algorithm is a perfect way to identify data lines within the cache, because it is very efficient. In the case of a CPU cache, it needs to be a massively efficient operation, implemented in hardware, in order to keep the latency of a cache lookup as low as possible.
Due to the way the cache lines are organized, the line can be identified by simply hashing the memory address and obtaining the line ID (see Figure 1.7). The hash operation in this case is a simple AND operation to extract a subset of the bits in the address in order to derive the line number. This is a very efficient operation to perform within a CPU cache.![]()
From Oracle7, the Oracle server maintains a cache of SQL that can be used to reduce dramatically the workload required when another session submits the same piece of SQL. In order to implement this cache, Oracle needed a way to speed up the identification of the SQL and to check if it was already cached. It is simply much too expensive to do a text comparison of a submitted piece of SQL against all other SQL requests in the cache.
Oracle adopted a hashing algorithm to speed up the search for identical statements in the library cache. This is essentially an arithmetic operation on the actual text that makes up the piece of SQL in order to derive a single index for that statement. Don't be misled by the term index: these are large numbers, and it is actually very unusual for more than one statement to hash to the same value.3 The hash value for each new statement is then used as an index, in order to do a fast lookup in the library cache to see if an identical statement has been hard parsed already. If so, a good deal of the parsing work for this statement can be reused, reducing the overhead of parsing the "new" statement.
The alternative methods of locating previously parsed SQL statements are inefficient by comparison. In particular, if a text-based comparison was needed against every piece of cached SQL, the effect would be a greater parsing overhead than running with no cache at all.
cache n ....hiding place, for concealing and preserving provisions which it is inconvenient to carry.
A cache is a smaller, higher-speed component that is used to speed up the access to commonly used data stored in a lower-speed, higher-capacity component. There are several different applications of caching, including
The best known of these is the CPU cache, which is largely a result of the efforts of the Intel marketing machine. However, in order to build a high-performance, large-scale system, caching needs to be implemented and tuned at every opportunity.
A cache is one component in a memory hierarchy (see Figure 1.8). The memory hierarchy ranges from the very fast access speeds of a CPU register through cache, memory, disk, and network. It is viewed as a hierarchy because of the relative access speeds of the various components.
The fastest access path to storage on a system is between the CPU registers, and from there on, things slow down mostly in orders of magnitude (see Table 1.5).![]()
It can be seen from the table that as the speed of components goes down by orders of magnitude, so too does the cost of the components. It is clearly not reasonable to expect a system to be built entirely from fast SRAM (the RAM technology used in CPU cache memories) running at core CPU speeds, the cost of which would be prohibitive. Therefore, the goal of caching is to give the performance illusion of slower components running at the same rate as the fastest component. While this is impossible to achieve in reality, any step toward it gives a huge boost in system performance by reducing the waits on lower-speed components.
Cost, however, is not the only reason for caching. Another reason to keep data as close to the CPU as possible is to reduce the contention on, for example, the system bus. Clearly, if all traffic needs to be passed across a single shared resource such as a bus, limits are going to be hit before they need to be. The system is not going to scale effectively.
In order to make a cache as efficient as possible, some knowledge is needed of the access patterns within the device you are caching. In this way, assumptions can be built into the cache design in order to gain the most from the limited amount of cache space.
Temporal locality is a time-based reference pattern: if this data was used once recently, it's likely that it will be needed again soon.
Spatial locality is a location based reference pattern: if location abc was used, then it's likely that location def will be needed sometime soon.
Pretty much all types of caching follow one or both of these patterns. Some devices need to employ very complex techniques in order to achieve spatial locality, owing to limited natural association between their access pattern and spatial locality.4
Two types of caching will be concentrated on for the remainder of this chapter: I/O caching and CPU caching. I/O caching is very important to understand, because it is directly tunable by the end user. The CPU cache concepts are presented as an interesting piece of background reading to help you further understand the operation of large scalable systems.
Effective caching of disk I/O is perhaps the biggest single performance improvement that can be made to a database system by the end user. The reason for this can be clearly seen in Table 1.5-disk I/O has the poorest response time of all within a system. In fact, it takes about 100,000 times as long to access a disk block as it takes to access a memory block.
It is obviously not practical to keep the entire database in memory (although there are situations in which it is desirable to force large pieces into memory), so there needs to be a way to control the use of memory to keep the cache as efficient as possible. The most common way of keeping the cache relevant is through the use of a least recently used (LRU) algorithm, in order to benefit from temporal locality of reference.
An LRU algorithm is a very common cache management method that retains data based on frequency of use. The most frequently used blocks will always remain in the cache, while the least recently used blocks will be flushed from the cache and replaced by new entries for more active blocks as they are required.
In order to achieve this, block access is maintained by a list (see Figure 1.9). The list has two ends: the least recently used end, also referred to as the cold end, and the most recently used end, also referred to as the hot end.![]()
When a block is added to the cache, or when a block that is already in the cache is referenced, it is heated to the most recently used end of the cache management list, or LRU chain. If this block is then unused for a period of time it will be implicitly made colder by the act of heating other blocks. Once the block becomes the coldest block on the list, it is eligible for reuse when an empty block is required. The example in Section 1.3.2 demonstrates how the heating of a buffer is implemented.
This is the simplistic view of the operation of an LRU managed cache. There are many differences between this model and the real-life model, most notably the introduction of "dirty blocks" handling into the algorithm, but the principle is basically the same.
The LRU management technique works very well for I/O caching. This is because there is nearly always a great deal of repetitive access to the same blocks on disk (temporal locality). The most prominent example of this is found in B-tree indexes used by Oracle. The root block of the B-tree is the starting point for all accesses to the index, in order to determine on which side of the index the value lies. Then the branch blocks are accessed to home in further on the required leaf block. Oracle maintains very shallow B-trees, making them very efficient within an LRU cache, because a single branch block will hold references to many leaf blocks. This means that the branch block will be used frequently and will always remain in cache and be subsequently available immediately out of cache for other sessions accessing it.
It is not uncommon to achieve cache hit ratios from the Oracle buffer cache that are 95 percent or greater. Table 1.6 demonstrates the effect of the caching in the aggregate response time of the system.![]()
It can be seen that if all the disk I/Os and memory loads are performed in series (one at a time), the cached example takes around 25 minutes, compared with the noncached example, which takes a shade under 21 days. This example used serial access for reasons of clarity, but if parallel access were used, the difference could be potentially far higher: solid state memory is far more concurrent than the physical heads on a hard disk spindle.
First, the bad news: there is little that you can do about the efficiency and utilization of the CPU cache. This is the domain of the hardware engineer and very much the speciality of the compiler writer. However, it is very useful to understand how the CPU cache works, as a piece of additional information for your global understanding of system operation. If you are ever called on to assess platforms of completely different architectures, knowledge of processor and associated cache operation will aid you in understanding what actually matters.
The effectiveness of CPU caching has been well proven and should be considered essential in all high-performance systems from PCs right up to the largest UNIX server. This is attributable to the very nature of the processor's function.
If a processor is running at a clock speed of 500MHz, it is capable of processing at least 500 million instructions per second. With techniques such as pipelining, the processor may be capable of processing many times that number of instructions per second.
Pipelining is where the CPU breaks up a sequential stream of instructions into separate components, which it executes in parallel. Part of this may be "speculative execution" of pieces of code, before the result of a preceding conditional test is known.
The bad news: the CPU can perform at this rate only if it is able to get the instructions and data at the same rate.
If the processor is a 64-bit unit, each word is 8 bytes wide. This means that you need to be able to access almost 4GB of memory per second simply to supply the instructions. Even on today's highest-capacity shared bus, the maximum sustainable rate is approximately 2GB/s, and that is shared among several processors. Therefore, the latest processors on the market would spend most of their time waiting on main memory-not performing real work-if a cache were not present to provide faster access to memory locations in the system.
The simplest view of a CPU cache is that of a temporary storage area that resides between a CPU and main memory, where CPU instructions and program data are stored in very fast memory to speed subsequent, repeated access to the same locations. The cache provides much faster access to repeatedly used memory locations by
· Residing physically closer to the processor than main memory and therefore not being subject to the same degree of physical limitation (the speed of an electron)
The basic theory behind CPU caching is the same as for any other caching: if something is used once, it is likely to be used again in the near future. Also, with instruction caching, if something in a particular memory location is used, it is fairly likely that surrounding memory locations will be referenced soon. Therefore, CPU caches are able to deliver both temporal and spatial locality.
Once the instruction or piece of data has been accessed the first time (and therefore is now in cache), all subsequent references can be retrieved from cache as long as the cache entry has not been overwritten by the contents of other memory locations (see Figure 1.14).
This is achieved by loading the cache with several contiguous memory locations when a miss occurs on the first one.
In order to gain the greatest impact from these two types of locality, the mechanism for storing and retrieving data from the cache needs to be both simple and efficient so as to make its access as fast as possible.
CPU caches are organized into a fixed number of lines, and each line contains a fixed number of bytes that reflect the contents of main memory, as shown in Figure 1.11.![]()
The number of lines and the size of each line varies from processor to processor. For example, the Intel Pentium processor has an 8KB primary cache. This is split into 256 lines of 32 bytes each, totaling 8KB.
The way these lines map to physical memory varies from cache to cache. However, when a reference is made to a memory location by the CPU, it first looks in the cache to see if the address is already present. If the address is not present, a request is made over the bus (sometimes referred to as a bus transaction) to acquire that memory location and also the surrounding locations required to fill the line. A line cannot be partially filled at any time, because this would lead to corrupt data through lack of information in the tagging infrastructure (used to map cache lines to physical memory location), which will be covered a little later. Anyway, the loading of an entire line at a time is the mechanism that gives us the ability to achieve spatial locality of reference.
As a processor cache is potentially many orders of magnitude smaller in size than the main system memory, the design of the cache allows any single line to be populated with the contents of a finite number of discrete memory addresses.
As an example, we will use a CPU that has 32-bit addressing and therefore can individually address memory addresses from 0 through 4294967295, or 4GB. Any of these regions can be in the CPU cache at any one time. A hashing algorithm is used to take this memory address and locate the cache line from it, as described in "Hashing Example: CPU Direct Mapped Cache" in Section 1.4.1.
For this example, we have 128 cache lines, each of which contains 16 bytes of data. We therefore need to be able to hash uniquely a number from the range 0 to 4294967295 into a range of 0 to 127 in order to assign each memory location to a cache line. This is achieved by hashing bits 10 through 4 of the 32-bit address, as shown in Figure 1.12.![]()
Due to the use of low-order bits for the cache line hashing (the lowest 4 bits, 0-3, are used within each line to identify uniquely the byte), consecutive memory addresses are effectively striped across the cache, allowing good locality hits within the cache (i.e., all local memory regions are in cache). This is good for high cache hit ratios with consecutive calls from a program's text (code) segment.
In Figure 1.13, you can see what happens when a hashing algorithm is used as described. Sequential, line-sized memory addresses get loaded into sequential cache lines. In this way, if a piece of code is the same size as the cache or less, then all the instructions will eventually be executed directly out of the cache. Only when addresses beyond the size of the cache are accessed will lines start to be replaced.![]()
When an address is referenced within the CPU, the logic will check the cache to see if it is resident. We have already decided that a good way to select the line should be to use a range of low-order bits, so this means that any one of several different memory locations could be the one currently resident in the cache, as shown in Figure 1.14.
This reuse of cache lines means that we need some mechanism in the cache organization of determining whether or not we have a cache hit. This is where tags come in.![]()
A tag is required to identify the contents of the cache, and therefore whether or not a cache hit is achieved. As you have seen, any one cache line could represent one of several memory locations at any one time: in the example above, there would be 4294967296/(16*128), or 2,097,152 different 16-byte memory pieces that could be loaded into any location at any one time.
In order to account for this ambiguity, each cache line also stores a tag for the line. This tag is simply the remaining bits of the memory address that were not used to locate the line/byte number. In our example, this would be the remaining 21 bits of the address. These 21 bits allow for 2,097,152 different values for the tag. The tag, therefore, positively identifies for the cache search logic whether or not the correct line is loaded at that time.
While the kind of locality optimization found in cache architectures works very well for small C programs (including scientific applications), the effects are not so clear-cut for caching within an Oracle database server. The execution of Oracle code holds the following negative attributes that offset the effectiveness of caching:
Each of these attributes goes out of its way to ensure that the cache is challenged fully to provide good locality. The large binary means that even in a perfect world the entire code base is not going to cache resident (cache sizes are typically just getting into the 4MB range). The callback architecture means that the access pattern through the code is difficult to predict for the compiler. Therefore, it is not unusual that the memory requested is not in cache. Finally, the access pattern of the shared memory effectively blows away the data cache within the CPU, because the access pattern is totally unpredictable.
Performance architects often refer to the clock per instruction (CPI) index for a system. This is a measurement of the average number of clock cycles taken by the processor to process a single instruction. The lower the CPI for a given system, the greater the performance, because the system can execute a larger number of instructions per second. The whole reason for a cache's existence is to reduce the CPI for the system.
Because of the way Oracle is architected, Oracle Database Servers are often faced with a high CPI and subsequently a relatively inefficient use of the hardware. The way to improve on this situation is through continuous improvement of compiler technologies.
Compiler writers use knowledge of the physical layout of the cache, including cache line size and number of lines, in order to optimize the generated machine instructions for the specific CPU and cache architecture. In this way, huge efficiency gains can be had just from recompiling the Oracle software using a new release of the compiler. Of course, this needs to be performed by the relevant Oracle porting group and is often transparent to the end user.
Increasingly, the compiler writer will become the king of performance. This will be especially true when the first IA-64 chip (Itanium) is released. While details of the chip are still not in the public domain, it has been made public that the chip will rely heavily on explicit direction from the code generator as to how it should execute parallel instruction streams. This differs from the current technology, where the CPU must make pipelining decisions without knowledge of the global code picture. Using explicit compiler directives as to how the code should be executed optimally, more control can be obtained over the most useful parallel execution.
This chapter has introduced several concepts, mostly concentrated on software scaling. While it is unusual to have direct input into such fundamental aspects of the system and RDBMS software, it is important that you be able to understand the techniques and structures used to build scalable software.
If a system is not scaling adequately, knowledge of these principles will allow more accurate hypotheses as to the real cause of the problem. Often, a scaling issue in the system software can be mitigated procedurally or by small changes in the application. In-house changes will always be faster to implement than waiting on the software vendor, and this speed can make the difference between good and bad response times. Where an in-house fix is not possible, information can be passed more efficiently to the software vendor when the problem is well understood.
Patterson, David, and John Hennessy. Computer Architecture: A Quantitative Approach, Second Edition. Burlington, MA: Morgan Kaufmann, 1996.
Van der Linden, Peter. Expert C Programming: Deep C Secrets. Upper Saddle River, NJ: Prentice Hall, 1994.
2A race condition is any situation in which multiple sessions are able to corrupt each other through a lack of synchronization between the sessions. Many system crashes that are caused by software error are attributable to race conditions.
3There was an interesting bug in Oracle7.1.6 that was intended by Oracle to be a performance improvement.This was a change in the SQL text hashing algorithm that used only the first 64 bytes and the last 64 bytes of a piece of SQL. Unfortunately, in many applications (especially Oracle applications, such as Oracle Financials), the select list and the order by clause were identical, with just the table names and/or the first part of the WHERE clause changing. This meant that many nonidentical statements would hash to the same hash value.
4An example of this is when an EMC storage array "notices," through trend analysis, that a sequential scan is being performed. It can then perform a cached "read ahead" of the drives in question and thus hopefully eliminate transfer stalls resulting from rotational latency (waiting for the disk surface to come back around to the required sector).
![]() Scale Abilities Ltd http://www.scaleabilities.co.uk Voice: +44 1285 644533 info@scaleabilities.co.uk |
|