TOC PREV NEXT INDEX

Scale Abilities Ltd Logo


Part I
Concepts and Architecture
Chapter 1
Scaling Concepts
n order to build a large system of any kind, it is important to have a good understanding of some basic scaling principles. By understanding these basic principles, it becomes much easier to interpret the requirements for building a large-scale Oracle-based system, and to address performance issues in existing systems that are not scaling satisfactorily. This chapter does not aim to cover all aspects of a system that affect scaling, but does intend to cover the more important aspects, giving examples that highlight situations in which they are found in real systems.
1.1 What Is Scaling?
scaling n .... 2: act of measuring or arranging or adjusting according to a scale.
-Webster's Revised Unabridged Dictionary (1913)
To make this old-fashioned and generic-sounding definition a little more explicit, here are a few examples of scaling in action.
· Increasing the processor count on the system
· Increasing the number of active users on the system
· Increasing the number of records your batch job processes in a given time
The word "increasing" appears in each listed item: that's the scaling that we are interested in for this book-increasing the system capability, whether "system" is defined as hardware, software, or, more typically, both. Indeed, the most common view of scaling is that of hardware scaling, even though scaling has at least as much to do with the software components as with the hardware. Nevertheless, it serves as a useful (if overused) example of what we mean by scaling.
If we were to double the number of processors in a system, would we achieve twice the processing capability? The answer is probably "no." Scaling, like most things, can be classified in either of two ways: good (efficient) scaling and poor (inefficient) scaling.
Good Scaling
Good scaling is observed as a highly linear increase in capability as the system grows. Using the tired old hardware example, this means that the system would demonstrate a near-linear gain in capacity when additional resource is applied (see Figure 1.1).
In Figure 1.1 the capacity of the system is increasing at almost the same rate as the resource being added. This means that, as long as the software scales equally well, the system will be able to process almost n times the data within a given amount of time if n times the resource is available.
Poor Scaling
Poor scaling demonstrates a negative deviation from linear when a system grows. This means that the system capacity diverges farther away from linear when additional resource is applied (see Figure 1.2).
The poor scaling illustrated in Figure 1.2 shows that there is actually less system capacity available when all of the resource is applied than when the system had less than one-half of the total resource. This is an example of what can happen when a system scales badly: performance goes down, not up. This is true of both software and hardware scaling, and is of particular concern when a large system is being built. Good scaling is a function of both hardware and software, and unless both are optimized to scale, overall system throughput will be considerably less than expected.
Scaling can be viewed from two different perspectives: a speedup of tasks within the system; and an increase in concurrency in the system, sometimes referred to as scaleup. Both speedup and scaleup use the same principles, applied in slightly different manners.
1.1.1 Speedup
If a job takes n seconds to run under normal circumstances, and it is desired that the job should run in n/2 seconds, then a twofold speedup is required. This speedup can be gained in either of two different ways:
· Doubling the execution capacity of existing components
· Breaking the job into two parts, and assigning twice as much hardware to the job1
Doubling the execution capacity of the existing components is great if it is possible: it allows the required speedup to be achieved with no increase in complexity and with few concerns as to the required scalability. In fact, speedups gained in this way are not really subject to any scalability concerns at all, assuming that the entire system is doubled in execution capacity (including processor speed, cache sizes, communications latencies, clock speeds of all components, and so on). Unfortunately, this is rarely an option, because speedups are frequently needed immediately after the initial rollout period or are required to be many times the speedup offered by faster parts. If the system has just been installed, it's probably already using the fastest components available, because vendors rarely sell parts that are not at the highest specification available for their system architecture. On the other hand, another vendor may have dramatically faster hardware, and a reassessment of the platform may need to be made; this, however, is an expensive option and is potentially unnecessary when other techniques, such as parallelization, are available.
The most common option is to break the job down into smaller "bite-sized" pieces and have the system process them in parallel. This technique, known as job partitioning, allows the system to be added to where necessary to increase the overall system bandwidth. This has a scalability implication for the system and makes the speedup element dependent on the concurrency available within the system. Parallelization of tasks is the common scalability requirement for decision support systems. One of the most common parallel activities in a decision support system is parallel query, which is a good example of speedup.
Example
In a DSS system, there is a query that requires a full table scan of a very large table (1 billion rows). This table is spread over many disks and many controllers, and the system has many processors available for the execution of this query. Now, if this query were to be executed without any partitioning of the workload (see Figure 1.3), it would simply take a long time for a single processor to read and validate every row in the table against the predicate in the query. No number of additional processor boards would make this query faster, and faster processors would probably limit the speedup to two to three times, even if you were lucky and were not bound by the serial disk access anyway.
Using Oracle Parallel Query, this job can be made many times faster, potentially providing a speedup of an order of magnitude or greater. This is achieved by logically breaking the large monolithic table into several smaller logical ranges. Once this has been done, additional slave processes can be used to scan each of the logical ranges in parallel, thus scaling the query across several processors (see Figure 1.4).
1.1.2 Concurrency (Scaleup)
Good scalability in this context could be defined as "achieving maximum useful concurrency from a shared system." This is the major area of concern for transaction processing systems.
In multiuser systems, the system must share out all of its resource to all the users of the system. It must do this in a fair-share manner, prevent sessions from corrupting others, and not consume excessive resource or impose artificial bottlenecks in the process. In order to achieve this desirable state of concurrency, the system must scale effectively.
Achieving concurrency is one of the greatest challenges when building a system, whether you are the hardware designer, the RDBMS vendor, or the end user. In the building of large-scale systems, it is the concurrency that must be considered at all times.
Strangely enough, concurrency is made possible through the use of enforced serialization. Through serialization, we achieve the necessary synchronization required to process tasks concurrently.
Synchronization needs to occur on any concurrent system and is most commonly handled through the use of latches and locks. The specifics of how these structures are used will be covered in more detail later in this chapter.
Latches themselves also need serialization support, in this case from the hardware itself. In order to take out a latch, a bus transaction must be performed in order to write to the memory location that contains the latch. On a multiprocessor system, many processors could be attempting to write to the same location concurrently, and therefore the bus transactions need to be arbitrated in order to allow them access one at a time. This will be discussed in more detail in Chapter 2.
Digression
In fact, synchronization sometimes also needs to occur for straight performance reasons between cooperating programs on the same or different systems, and this needs to be carefully designed if maximum concurrency is to be maintained. A good example of this is two batch jobs operating in parallel against the same data set. Oracle will automatically prevent the sessions from corrupting each other at the transactional level, and will take out implicit locks to enforce this. If these jobs collide on their respective data sets, they will get caught up in serious contention between each other as a result of Oracle protecting the data from transaction level corruption. This is especially true with Oracle Parallel Server, because Oracle needs to use locks to protect both the transaction layer and the cache layer.
If these sessions partition their data access and cooperate with each other before Oracle enforces it, they will be able to process a good deal more data. This results from the developer of the batch jobs having more knowledge of the actual data itself, and therefore knowing how to synchronize more intelligently. Either way, the synchronization has been performed in order to allow the two jobs to run concurrently without corrupting data.
1.2 Latches and Locks
latch n .... the catch which holds a door or gate when closed, though it be not bolted.
-Webster's Revised Unabridged Dictionary (1913)
Latches and locks can be viewed as very similar entities, and in many software engineering circles there is no difference between the two terms. However, in an Oracle environment, there are apparent differences between them. The basic differences are
· Duration. Latches are normally very transient, whereas locks are typically a longer-term prospect.
· Degree. Latches are switches, whereas locks have levels (or degrees) of severity.
· Scope. In Oracle, latches are instance-specific, whereas locks are mostly database-wide.
Their function remains identical-to prevent other sessions on the system from modifying the critical resource that they are protecting. The Webster's definition almost describes a modern software latch accurately, in that it captures the spirit of the transient operation. However, the door is most certainly bolted when a latch is taken out-it's just faster to unbolt after use.
1.2.1 Why Lock?
As we started to explain in the description of concurrency, it is inherently necessary to serialize some operations on a concurrent system in order to protect different tasks from corrupting each other. This applies only to access to shared resources. Let's run through a couple of examples to help get the real point across.
Example: Memory Protection Using Latches
Let's take a complex memory update as an example in order to add substance to the explanation. This kind of operation is typical in large transactional systems, in order to update several memory blocks atomically (i.e., as a single, indivisible operation). Figure 1.5 shows a region of memory that is protected by a latch.
The diagram in Figure 1.5 shows a memory buffer that is used to batch together many physical I/O requests before issuing a write for all of them at some later stage. Many user sessions on the system write a few hundred bytes of data into this buffer, and then update a structure at another location to reflect the new address to which other sessions can write. In this scenario, there are could be many thousands of sessions on the system that would potentially need to write into this buffer at any one time, and it is clear that some kind of serialization is required in order to prevent a race condition2 from occurring (see Table 1.1).
In the example shown in Table 1.1, the buffer is corrupted through a race occurring between the respective tasks. Not only were the 50 bytes written by user 2 overwritten, but the 300 bytes written by user 1 were effectively truncated due to user 2 setting the "next" pointer incorrectly as a result of a race condition.
It can be seen that this operation must be protected from other sessions while it is in progress. This is where latches come in. Using the same example, Table 1.2 should make it clear why latches are required .
In this situation, a latch is taken before the copy, and thus the area is protected from other sessions until the copy and subsequent pointer update are complete. It is obviously important to keep the period during which the latch is held as short as possible, and also to minimize the contention for the latch. This means that this necessary serialization of the latch is of the lowest possible impact. However, as can be seen in the example, it is not uncommon for the operating system to intervene, and to context switch the latch holder off the processor and back onto the run queue. This can happen for several reasons, including standard process scheduling and receipt of an interrupt on the processor in use.
Despite user 2's process being switched off the processor, a race condition is prevented in the situation shown in Table 1.2. This kind of protection is vital in multiprocessor systems, because there are many ways in which race conditions can occur.
Data Protection Using Locks
As stated earlier, the term latch is normally associated with memory locks that are held for short periods. The term lock is frequently used as a catchall for all types of locking, using the word as an adjective. When we are dealing with Oracle databases, however, it is correct to use the word in terms of data locking. These are longer-duration locks that are used to protect the integrity of changes made to the database, such as user transactions and system (recursive) transactions. In the case of a user transaction, the durations of these locks are sometimes controlled by the users themselves, as follows:
SQL> SELECT pkey FROM SOME_TABLE WHERE jref = 123456789 FOR UPDATE OF status;
1 row selected.
(1 Row now also locked under a TX lock)
SQL> UPDATE SOME_TABLE SET status='INVALID WHERE jref = 123456789;
1 row updated.
SQL> commit;
(also releases TX lock)

No other user can update or delete the row with jref = 123456789 while this lock is held. This example is very simplistic, and it is hard to see why you would need this facility. The truth is, however, that some transactions are so complex that the individual pieces of the transaction need to be protected in order to ensure that the entire transaction can be completed in a consistent, atomic fashion.
It is also worth noting that, in the example above, we explicitly lock the row that we need to update prior to updating it. This is not actually required for most types of operations, because the update statement will implicitly lock the row before the update is performed. This kind of explicit row-level locking can cause serious problems in a large-scale system and should be reserved for situations in which an atomic transaction is absolutely required and cannot be performed in any way other than SELECT ... FOR UPDATE. In fact, a recommendation will be made against the widespread use of overzealous locking in "Maintaining Concurrency" in Section 9.1.1.
An important difference between a lock and a latch within Oracle is that a lock is typically database-wide whereas a latch is typically local to the instance that is managing it. As it is database-wide, a lock affects all instances of a Parallel Server environment, not just the one executing on the local node. There are some exceptions to this rule, but in general it holds true.
Another difference is in the way that sessions wait on the locks. See Section 1.2.3 for more information on lock waits.
1.2.2 Things That Need Locking/Latching
Table 1.3 presents some real-world examples of locking and latching. To make the concept clearer,
1.2.3 Waiting on Latches and Locks
If something (such as a memory structure, data, or an entire program flow) requires protection with latches or locks, then it follows that other sessions, processes, or types of execution context must have some way of waiting on those latches or locks. These waits fall into two categories: active and passive.
An active wait is the method most commonly used for latch allocation. When a session cannot acquire a latch on the resource that it requires, it goes into an active wait on that latch. In active wait, the session repeatedly attempts to acquire the latch until it is successful-that is, waiting impatiently. The reasons for this are that (a) it is unlikely that the latch will be held for very long and (b) it is normally too expensive to set up any kind of queue for the resource.
Sessions that cannot acquire a latch immediately will spin until they get the latch. Spinning involves locking into a tight loop testing the latch and trying to acquire it. Spinning uses significant CPU resource but typically results in acquisition of the latch in a shorter time than it would take to establish any kind of formal queue. These types of latches are also referred to as spinlocks, owing to the nature in which they are waited on. The active wait is frequently implemented using the host's test-and-set functionality-a hardware-level serialization instruction (see "Test and Set" in Section 2.1.2).
The actual algorithms used for waiting on latches/spinlocks varies between implementations. In the Oracle kernel, for example, a process will spin on a latch for spin_count (init.ora parameter) iterations before backing off and sleeping. The process will then spin again before going to sleep once more; the actual sleep time between spins increases exponentially. The reason for the incremental backoff is to reduce the likelihood that the system will disappear off the map as a result of all the processes locking into infinite spin loops.
The alternative method of waiting on a serial resource is to perform a passive wait. Typically, these waits are used on less timing-critical components, of which a TX (row-level lock) is a good example. Such a lock is known within Oracle as an enqueue.
Enqueue is a strange word that has an opposite sibling called dequeue. Once the two are put side by side, their meanings become more apparent. They are both operations on a queue: one enqueues requests (puts them onto the queue) and the other dequeues the requests (takes them off the queue).
Oracle uses enqueue functionality to manage locks within the database. This includes the TX lock, as stated above, and the ST lock, which are more correctly termed the TX enqueues and the ST enqueue, respectively. In comparison with latches, none of these enqueues need to respond quickly to multiple requests. Several milliseconds or even several seconds would be easily fast enough for higher-level serialization such as this.
When a user takes out a row lock, a TX enqueue is created for that user. If another user subsequently attempts to update that same row, that user's session will block (wait on) the enqueue that the initial user created. If a third user tries the same update, he or she will also be blocked on the same enqueue. This is visible to the database administrator as multiple entries in V$LOCK with the same TYPE, ID1 and ID2. The first session will have an LMODE (lock mode) of 6, indicating an exclusive lock gained on that enqueue resource. All the other sessions will report an LMODE of 0 (no lock gained) and a REQUEST of 6 (exclusive lock).
At this point, all the waiters on the enqueue are using zero processor capacity on the database server-they are blocked. This is the nature of a passive wait. However, if the sessions are not actively trying to allocate the lock, how do they ever get it? The answer lies in the dequeue part of the equation, and this occurs once the first lock is cleared.
The first user will issue a commit or rollback against the updated row data protected by the TX enqueue. Once this occurs, the second user session is unblocked and is granted the LMODE of 6 on the enqueue. When that user commits or rolls back, the third session will get a piece of the action, and so on. This is the enqueue/dequeue function in operation.
Not all enqueues work in a FIFO (first in, first out) manner like the TX enqueue. The other example given, the ST enqueue, operates in a LIFO (last in, first out) mode, apparently as an optimization in its sharing algorithm.
1.2.4 Design Considerations to Reduce Lock Contention
When operating systems and relational database engines are designed, special attention is given to minimizing contention for locks on shared resources. This is of obvious importance for the scalability of these systems and merits a good deal of attention from the designers. However, we have already established that locking is necessary within these shared systems, so what can the system designers do to alleviate this contention?
There are several things that can be done where appropriate. Sometimes a good deal of reworking of the logic is necessary to implement some of these techniques, but the engineering impact can be enormous, and so these techniques are often considered as last-resort measures. We'll go through some of the more common methods of reducing lock contention in turn.
Locking Degrees
The first thing that can be affected is implementation of locks/latches with different degrees of exclusion. This is done in order to provide as much concurrency as possible even though it is necessary to lock things. In simplistic terms, there are three different states that a lock can be in:
· NULL. No lock is acquired for this type.
· Acquired Shared Exclusive. Otherwise known as a multireader lock, this lock allows any number of readers, but no session can acquire an exclusive lock in order to perform a write.
· Acquired Exclusive. This lock is acquired to write to the region protected by the lock.
The use of multireader locks allows the system to process read-only workloads while maintaining the data integrity of the system. It is frequently the case that a process may only need to prevent the resource from being updated while it performs some other task, and the use of multireader locks allows many processes to share access to the resource concurrently in read-only mode. If a process needs to modify the resource that the lock is protecting, the lock must be upgraded to an Exclusive mode before this is permitted. This would then prevent any readers or writers from accessing the region.
These three degrees of locking are used throughout both the UNIX kernel and the Oracle Server. Oracle actually defines even more locking degrees than this, in order to provide even more control over the required degree of locking. These locking degrees are described in detail in the Oracle documentation.
A large performance benefit results from maintaining different locking degrees, and this forms a fundamental part of the internal locking mechanisms used in modern systems.
Fine-Grained Locking
A more complex technique is to provide fine-grained locking within the system. This involves reducing the sizes of the regions covered by locks in the system, so that many locks can be used independently in the place of a single lock. This allows far greater concurrency within the system. Fine-grained locking for persistent types of locks, such as transaction locks, is good for all systems (uniprocessor and multiprocessor), while fine-grained latching is most useful on multiprocessor platforms where multiple latch acquisitions can be made concurrently by different processors.
Examples of fine-grained locking include
· Row-level locking with the Oracle transaction layer. Only users that need to access the same row as another user will contend for this data lock. Other RDBMS systems have relied on page-level locking or, even worse, table-level locking.
· Multiple library cache latches. In the pre-7.2 releases, a single latch enforced serialized updates to the shared SQL cache. In releases 7.2 and above, Oracle allows multiple concurrent updates by providing multiple latches. The latch used is determined by the hash value (see 1.4) of the SQL statement.
· Function locks within the UNIX process table. For example, a separate lock can be used to cover the structure that contains pending signals.
Implementation of fine-grained locking often is very difficult for system engineers and frequently requires redesign work. For this reason, it is common for portions of systems to be left for the users to "suck and see" as to whether the system requires fine-grained locking. Although you may not be able to fix any of these problems personally, it is important that you report them to the responsible party if it can be demonstrated that the lock granularity is causing a performance problem. This is usually easier to do within the Oracle product than within the UNIX kernel, because the latch activity is very well reported through the V$ views.
Algorithmic Enhancements
The best way to eliminate lock contention within a system is to use algorithmic enhancements in the software. By changing the algorithms used within the kernel, it is sometimes possible to eliminate the need to lock at all in certain areas. Many such algorithms have been produced within research circles, and some have been implemented into production software to aid scalability.
For example, assume that in a shared-memory-based application, there is a single writer and many readers of a single record in shared memory. The record is larger than the largest atomic update that is possible on the platform. A bad algorithm for dealing with updates to this record would be to lock the record and prevent the readers from accessing it while it was being updated consistently by the writer.
A better way to code this problem is to maintain a master list of pointers to the actual records. The reader processes will first read the master list and then follow the pointer to the actual record. Using this method, a newly updated record can be written to a private memory address, and the pointer in the master list can be updated atomically to point at the new location when the update is complete. This way, no locks have been taken out at any time, and all reader processes can execute in a contention-free environment.
The example shows a simplistic change that removed the need for locks in that part of an application. However, this example is about as simple as it gets: if there is more than one writer, for instance, this technique becomes more complex. The level of effort required to eliminate locks using different algorithms varies from the implementation of simple hashing algorithms to complete redesign of the software architecture. For some truly shared resources, algorithmic changes are impossible as a result of logic restrictions.
1.3 Linked Lists
1.3.1 What Is a Linked List?
A linked list is simply a programming construct that allows the programmer to traverse a list easily without relying on the physical storage for any specific layout. The reason that linked lists are detailed in this book is because lists are used extensively within the Oracle RDBMS, and it is necessary to understand why they affect performance.
So, what does a linked list look like? In C, a linked list is composed of several simple data structures similar to the following:
struct linked_list {
struct linked_list *prev_item; /* List management */
char name[80]; /* Data */
int value; /* Data */
struct linked_list *next_item; /* List management */
};

So, using the data structure example above, the linked list could be loaded from a file as follows:
#include <stdio.h>
#define maxpoints 180
struct linked_list {
struct linked_list *prev_item; /* List management */
char name[80]; /* Data */
int value; /* Data */
struct linked_list *next_item; /* List management */
};
void
main() {
struct linked_list *list_head, *last_item, *curr_item;
FILE *fp;
list_head = curr_item = (struct linked_list *)
malloc(maxpoints*sizeof(struct linked_list));
if (!(fp=fopen("./data","r"))) {
perror("fopen");
exit(2);
}
last_item=(struct linked_list *) 0;
while(fscanf(fp,"%s %d",curr_item->name,&curr_item->value)!=EOF) {
if (last_item!=(struct linked_list *) 0 )
last_item->next_item=curr_item;
curr_item->prev_item=last_item;
last_item=curr_item;
curr_item++;
}
curr_item->next_item=(struct linked_list *) 0;
fclose(fp);
}

Once the list has been loaded, it is very simple to manipulate it; inserting or deleting an item is almost as straightforward as adding an item to the end. Equally, shuffling the entire list in a sort operation is very efficient, because only pointers are changed-the data is never relocated.
So, it should be pretty clear by now what a linked list is and why it is found in nearly every program that manipulates data. At the end of the day, that's exactly what Oracle is.
1.3.2 What Are Linked Lists Used For?
Linked lists are used to manage working sets of information, with the classic example being an LRU (least recently used) list as commonly found within Oracle. An example of the use of LRU algorithms can be found in "LRU Algorithms" in Section 1.5.4. They will always be updated in conjunction with latches, because multiple sessions will use the lists concurrently, and without latch protection even simple list manipulations can produce corruption. Using the LRU list example, it can be seen that it is straightforward to add items to both ends of the list, if pointers to the "hot" (most recently used) and "cold" (least recently used) ends are kept in fixed locations.
This is achieved by taking the following actions (based on fixed-length chains).
"Cold" End
Inserting an entry into the cold end of the LRU list is very straightforward. You can already assume that the entry currently on the cold end is ready to be removed from the list, because it is already the "least recently used." Therefore, you simply overwrite the contents of the entry that is currently on the cold end.
"Hot" End
Inserting an entry into the hot end is slightly more complex, but is still straightforward using linked lists, as shown in Figure 1.6.
1. Take out the latch that protects the list. Go to the cold end of the list (element one) and set the "next" item pointer to be NULL, indicating that there are no entries past this point (i.e., this is the hottest end of the LRU). Set the "previous" item pointer to point to the entry that used to be hottest (element four). Put the new contents into element one.
2. Step along one entry to the second-coldest entry (element two).
3. Update the "previous" item pointer here to be NULL, indicating that there are no more entries (i.e., we now have a new cold end of the list).
4. Go to the item in the list that was previously hottest (element four) and change the "next" item pointer to point to the new hot end.
5. Update the master references to reflect the address of the new cold and hot ends, if applicable.
6. Release the latch.
Using lightweight memory manipulations like those above, lists are one of the most efficient ways of manipulating data in memory. It can be seen, though, why the list needs to be protected from other reads and writes while it is being manipulated. If another session were to start to read the LRU list while the updating process was at step 2, it would get very confused indeed owing to the interim state of the pointers. This would likely result in either data corruption or a system crash.
1.3.3 Optimizing Chain Lengths
As well as inserting entries into the linked list, the list is often scanned to find particular entries. Using the LRU example, the LRU list may be scanned until a block has the correct status, or indeed to heat the entry to a higher (more recently used) point in the list. This can prove to be an expensive operation if the lists get too long (especially because you are holding the latch during this period to prevent other sessions from updating it under your nose), and so linked lists should generally be kept as short as possible.
Any linked lists that are managed while a latch is being held are potential bottlenecks in the system. For this reason, it is imperative that high-contention lists be kept as short as possible, unless of course the list is never fully scanned. If this is not ensured, the latches will be held for a longer duration than is desirable, and waits will build up on the latch.
db_block_buffers
The db_block_buffers parameter is absolutely typical of this management and is subsequently prone to contention. All the buffers in the cache are split up and managed by separate lists, each of which is governed by a latch. This applies to both the LRU list for the blocks (from release 7.3 onward) and the actual blocks themselves (db block hash buckets).
When the buffer cache gets very large, these lists grow with it. So while you are gaining the extra benefit of running with a larger cache, you are also incurring greater overhead from the management of the chains, and you are probably suffering latch contention from the management of the chains. This was particularly true prior to release 7.3, when there were no multiple chains for the LRU list. This meant that the LRU list did not scale at all well prior to release 7.3, and the only way to manage the contention for the cache buffer LRU chain latch was either to decrease the number of blocks in your cache or to tune the database writer. This had an obvious impact on performance if your system was relying on the enhanced cache hit ratio available with the larger cache size.
In this case, it is wise to study the work profile of the system to determine whether the block size is too small for the application. If this were the case, the option would be open to increase the block size of the database, meaning that the cache can be larger than before with no overhead in contention for the LRU chain latch, and no increase in management overhead.
It can be seen that keeping chain lengths to a minimum is a good idea until all the software vendors do it automatically. Many of the chains that are managed are not controllable by the end user and thus are not an option. It is important, however, to keep tight control of the ones that you do have control over, an example of which is demonstrated in "Avoiding High-Contention SQL" in Section 9.1.
1.4 Hashing
Hashing is another technique for speeding up the management operations of a system. In particular, hashing is an approach taken to achieve dramatic speedup of searches for data within a list (usually within a cache; see Section 1.5). Hashing is used just about everywhere possible in high-performance systems because of its efficiency and its simplicity.
1.4.1 What Is Hashing?
Hashing is the term given to performing some kind of algorithm on a piece of data in order to determine an index number for those contents, rather than performing value comparison along a potentially long list of entries. That index number can then be used as a reference to either an absolute piece of memory or a smaller list that can be searched further to find specific entries.
Hashing Algorithms
Hashing algorithms range from incredibly simple manipulations, such as those found in a CPU cache, to more complex ones such as those used to manage the Oracle Shared SQL cache (library cache). A simplistic algorithm might be used to split numbers into buckets known as hash buckets in order to keep some statistics up to date:
BucketID=n MOD 7
Using this algorithm, a number can be quickly assigned to the correct hash bucket, where the hash buckets are used as shown in Table 1.4.
It may not be clear to you at this stage why you would want to group things together into buckets in this way. The answer is that things are easier to find once they are contained in known places. To see if the number 68 has been stored, one would only have to calculate the bucket number (68 MOD 7) and then compare 68 against the 14 possible entries in that list. This is much faster than comparing it against values in 100 possible locations.
One good example of this is the Oracle buffer cache, which manages the blocks within it using latch-managed linked lists. Rather than have a single list that covers all the cache block entries, the blocks are divided into thousands of buckets, in a way similar to that shown in Table 1.4. When a block address is known, it is hashed to determine the list (bucket number) that could contain it, and then the list is traversed. In this way, any list manipulation can be constrained to a much shorter chain while still gaining fast access to the chain though the use of hashing. 
Hashing Example: CPU Direct Mapped Cache
This will be covered later in this chapter in more detail, but serves as a good example here. Basically, a CPU cache is comprised of several lines. A hashing algorithm is a perfect way to identify data lines within the cache, because it is very efficient. In the case of a CPU cache, it needs to be a massively efficient operation, implemented in hardware, in order to keep the latency of a cache lookup as low as possible.
Due to the way the cache lines are organized, the line can be identified by simply hashing the memory address and obtaining the line ID (see Figure 1.7). The hash operation in this case is a simple AND operation to extract a subset of the bits in the address in order to derive the line number. This is a very efficient operation to perform within a CPU cache.
Hashing Example: Oracle Shared SQL
From Oracle7, the Oracle server maintains a cache of SQL that can be used to reduce dramatically the workload required when another session submits the same piece of SQL. In order to implement this cache, Oracle needed a way to speed up the identification of the SQL and to check if it was already cached. It is simply much too expensive to do a text comparison of a submitted piece of SQL against all other SQL requests in the cache.
Oracle adopted a hashing algorithm to speed up the search for identical statements in the library cache. This is essentially an arithmetic operation on the actual text that makes up the piece of SQL in order to derive a single index for that statement. Don't be misled by the term index: these are large numbers, and it is actually very unusual for more than one statement to hash to the same value.3 The hash value for each new statement is then used as an index, in order to do a fast lookup in the library cache to see if an identical statement has been hard parsed already. If so, a good deal of the parsing work for this statement can be reused, reducing the overhead of parsing the "new" statement.
The alternative methods of locating previously parsed SQL statements are inefficient by comparison. In particular, if a text-based comparison was needed against every piece of cached SQL, the effect would be a greater parsing overhead than running with no cache at all.
1.5 Caching
cache n ....hiding place, for concealing and preserving provisions which it is inconvenient to carry.
-Webster's Revised Unabridged Dictionary (1913)
1.5.1 Cache Fundamentals
A cache is a smaller, higher-speed component that is used to speed up the access to commonly used data stored in a lower-speed, higher-capacity component. There are several different applications of caching, including
· Caching of main memory locations next to the CPU
· Caching of disk blocks in main memory
· Caching of network information on disk
The best known of these is the CPU cache, which is largely a result of the efforts of the Intel marketing machine. However, in order to build a high-performance, large-scale system, caching needs to be implemented and tuned at every opportunity.
1.5.2 Memory Hierarchies
A cache is one component in a memory hierarchy (see Figure 1.8) . The memory hierarchy ranges from the very fast access speeds of a CPU register through cache, memory, disk, and network. It is viewed as a hierarchy because of the relative access speeds of the various components.
The fastest access path to storage on a system is between the CPU registers, and from there on, things slow down mostly in orders of magnitude (see Table 1.5).
It can be seen from the table that as the speed of components goes down by orders of magnitude, so too does the cost of the components. It is clearly not reasonable to expect a system to be built entirely from fast SRAM (the RAM technology used in CPU cache memories) running at core CPU speeds, the cost of which would be prohibitive. Therefore, the goal of caching is to give the performance illusion of slower components running at the same rate as the fastest component. While this is impossible to achieve in reality, any step toward it gives a huge boost in system performance by reducing the waits on lower-speed components.
Cost, however, is not the only reason for caching. Another reason to keep data as close to the CPU as possible is to reduce the contention on, for example, the system bus. Clearly, if all traffic needs to be passed across a single shared resource such as a bus, limits are going to be hit before they need to be. The system is not going to scale effectively.
1.5.3 Cache Reference Patterns
In order to make a cache as efficient as possible, some knowledge is needed of the access patterns within the device you are caching. In this way, assumptions can be built into the cache design in order to gain the most from the limited amount of cache space.
There are two main patterns of access that directly relate to caching:
· Temporal locality of reference
· Spatial locality of reference
Temporal locality is a time-based reference pattern: if this data was used once recently, it's likely that it will be needed again soon.
Spatial locality is a location based reference pattern: if location abc was used, then it's likely that location def will be needed sometime soon.
Pretty much all types of caching follow one or both of these patterns. Some devices need to employ very complex techniques in order to achieve spatial locality, owing to limited natural association between their access pattern and spatial locality.4
Two types of caching will be concentrated on for the remainder of this chapter: I/O caching and CPU caching. I/O caching is very important to understand, because it is directly tunable by the end user. The CPU cache concepts are presented as an interesting piece of background reading to help you further understand the operation of large scalable systems.
1.5.4 I/O Caching
Effective caching of disk I/O is perhaps the biggest single performance improvement that can be made to a database system by the end user. The reason for this can be clearly seen in Table 1.5-disk I/O has the poorest response time of all within a system. In fact, it takes about 100,000 times as long to access a disk block as it takes to access a memory block.
It is obviously not practical to keep the entire database in memory (although there are situations in which it is desirable to force large pieces into memory), so there needs to be a way to control the use of memory to keep the cache as efficient as possible. The most common way of keeping the cache relevant is through the use of a least recently used (LRU) algorithm, in order to benefit from temporal locality of reference.