System/Database Monitoring
Monitoring is a frequently overlooked component of essential system development. Whether you purchase a third-party software product or develop your own techniques for monitoring the system, it is important not to underestimate either the effectiveness of good system monitoring or the effort required to develop this capability.
The benefits provided by monitoring fall into three categories:
· Providing proactive fault-detection capability
· Providing rapid fault diagnosis capability from available data
· Providing historical performance records for capacity planning and analysis
These are substantial benefits that even if taken alone would justify the cost of implementing a quality monitoring solution.
4.1.1 Proactive Fault Detection
In an ideal world, fault detection would occur before the actual problem arose, and the problem could therefore be fixed before it happened. This is, of course, complete nonsense. However, using proactive fault-detection techniques, it should be possible to preclude a selection of your least favorite problems. Very often, major problems are caused by the combined effect of several minor problems and can be prevented by providing early warnings of these problems.
For example, let's say that an Oracle Parallel Server system is capable of handling 100 distributed lock operations per second under normal loading and has 20 percent idle CPU during this loading. It is also known that servicing of 20 lock operations could take up to 5 percent of the system CPU. If something happened to increase the number of lock operations to 160 operations per second, this would not impose a response time problem on the user community, and under many operational environments this event would go unnoticed as a result. However, with a monitoring system, this would be detected and an alert would be triggered indicating that something had changed significantly and that the system was now very vulnerable to increases in CPU utilization. The system would now be running at 95 percent utilization, and even a small problem would cause user discomfort.
It can be seen that, using a suitable monitoring system, this problem can be avoided. This is one of the greatest strengths of a monitoring system.
4.1.2 Rapid Fault Diagnosis
Another advantage of a monitoring system comes into play when the system is not performing properly. In this instance, when the support person is called out, considerably more information can be made available to him or her than if the system were not monitored. For example, if one of the tablespaces in the database had filled up, the only thing the users might get (if using Forms) would be an "unhandled exception" error. If the system were not being proactively monitored, the DBA would have to work through systematic checks to discover that the tablespace was a problem. With a monitored system, however, the problem could be relayed to the support person as "tablespace full," and the DBA could rectify the problem immediately.
Typically, a DBA uses scripts, one at a time, to look through several "hot buttons" in order to determine the operational status of the database. With an adequately monitored system, the DBA has an enormous advantage over script-based DBAs for determining the performance of the system. If the monitored data is well presented to the DBA, such as through some kind of histogram-based Graphical User Interface (GUI), trends and correlations between many statistics can be immediately observed and interpreted.
For example, it is very simple to observe the relationship between the user commit rate (as measured by "user commits" in
v$sysstat) and the use of the redo allocation and redo copy latches: when one increases in rate, it is accompanied by a jump in the reported statistics from the related areas. Viewing the rate that users enter wait states, in conjunction with the usage rates for the relevant resource, can show where the pain threshold lies for key areas of database operation. A good example of this would be monitoring the number of "latch free" wait events in conjunction with the number of latch gets for each type of latch. This shows where the system starts to have problems keeping up and provides explicit targets for application tuning exercises.
4.1.3 Historical Performance Records
Whichever method you adopt for system monitoring, it is important to make sure that historical records of all the performance data are kept for later review. By doing this, you gain the ability to track gradual problems (such as adverse row chaining in the database causing excessive I/O), determine the effects of software releases in terms of system utilization, and even perform high-level capacity planning by extrapolating resource trends as business volumes increase.
Historical records are also useful for determining the impact of system upgrades and database changes. For example, a recent upgrade of Oracle on one of our production database servers allowed the use of "hard locked SGA." This is a facility that locks the SGA into real memory, removing the possibility of it being paged out onto disk. The benefit of this is more than is immediately apparent. First, because it does not need to be considered by the virtual memory system for paging, there is no swap space allocated for the SGA. Second, the UNIX kernel can start to share
Process Table Maps (ptmaps) for each process that attaches to the SGA (every user process and all Oracle background processes). A
ptmap is the space map kept by the kernel that provides VM information about all pages of memory the process has mapped. On this particular system, this means that it needs to allocate 4 bytes (i.e., a 32-bit address) for every 4KB memory page that it maps. If each of your processes on the system is mapping a 500MB SGA, this means that the effective size of their process goes up by 500KB, just in management overhead. This can become a major issue when there are 1,000 users or more connected to your system, because this equates to a 500MB memory overhead.
Historical monitoring was a useful tool here to determine how much memory we would need in the system once this facility was implemented. We were able to plot charts of memory consumption over a period of more than 1 year and to determine the impact.
4.2 Low-Intrusion Techniques
If you decide to develop your own monitoring suite, or even if you use the framework of a third-party tool with your own collectors, you will need to make sure that the collection methods are
low-intrusion methods. This means that the methods used for the collection should not impose too much on the system when they are running-that is, you should not be affecting the behavior of the system you are trying to monitor. This could result in response time problems for the user community and could also spoil the quality of any data that you collect from the system.
4.2.1 Go Easy on the System
When developing statistic-collection tools, taking care of the system should be of primary concern. Most of the time, this will mean simply minimizing the amount of CPU that the collector uses. However, when the system has especially severe scaling requirements, it is important to factor in all the types of resources that need to be minimized:
· Single-threaded resources
The first three resources may be fairly obvious, but the last one is not. It is included here as a catchall for any type of resource that is a potential bottleneck in the system. Within Oracle this typically equates to latches, but can include anything that has the same mutual exclusion effect. If it is possible to implement the collector without using any single-threaded resources, then that should be one of the primary design goals.
4.2.2 Avoiding Single-Threaded Resources
A good example of this might be the "
ST" enqueue within Oracle. This is visible in
V$LOCK and is the enqueue that needs to be acquired by Oracle before it can do a
space transaction, which includes allocation, deallocation, and coalescing of space in the database. This lock is taken out by your user process (during recursive SQL) when you require an extent on disk to extend a sort into, because the free extent table (SYS.FET$) and the used extent table (SYS.UET$) need to be updated to reflect this.
The big problem with the ST enqueue is that, by design, there is only one of them. This means by implication that only one session in the database can be doing any space transactions at any one time, and has been the reason for the ST lock to be aliased "Single Threaded" and "Seriously Throttling."
So, avoiding the need to take out the ST lock is one good consideration that you should make, both to ensure that you are not impacting the system and to ensure that the monitoring is not affected by resource contention for ST. This is often not as simple as it sounds, as the following SQL demonstrates:
FROM v$statname a,v$sesstat b
WHERE a.statistic#=b.statistic#
AND a.name like "db file%'
|
In this query, a join operation needs to occur between
v$sesstat and
v$statname. Whereas this may not be a problem in most systems, imagine what happens on a 2,000-user OLTP system, where there is a very small sort area by design. A disk sort segment needs to be allocated, because you are joining tables of approximately 400,000 rows and 200 rows, respectively, and this cannot be serviced out of the small sort area (areas as small as 16KB are not unusual).
1
Of course, you would not be running the query above in an automated monitoring system, but you may want to run queries like it in online diagnosis sessions. There are several ways around this type of problem:
· Avoid joining tables where it is not necessary; reference
v$sesstat, for example, directly by
statistic# instead of joining it against
v$statname.
· Use the "virtual" column indexes on the X$ tables found in Oracle 7.1 and above.
· Rely on the latch managed "sort segments" facility found in Oracle 7.3 and above. Note that while this helps a great deal, you should be aware that the disk sorting is still occurring (and is expensive) and you can still end up using ST if the sort segments are badly tuned.
· Use foreground (or application layer) joins. In extreme cases where you have no other option, you could implement the joins as a series of separate cursors in the monitoring application with which your application then deals. This has a resource overhead all its own, but is an option that is open to you under unusual circumstances.
Another single-threaded resource to be careful with is the library cache latch. More details on the use of the library cache latch can be found in Chapter 6, but for now we can simplify it by saying that it is taken out for each SQL
parse made on the system. Therefore, for each SQL parse made on the system, the library cache latch needs to be taken out first, preventing any other sessions from parsing. This is largely relieved in Oracle 7.2 and above with the introduction of multiple library cache latches based on the SQL hash value of the statement. However, in heavily loaded systems, it does not make sense for an automated monitoring system to allocate these latches every
n seconds when it does a sample. There is also a processing overhead in this scenario that does not need to be there.
Therefore, another good thing to do is to "parse once, execute many" for all the statements that make up the monitoring suite. Once the statement has been explicitly parsed once, it is only necessary to reexecute the cursor and suck down the results. From the OCI discussions in the preceding chapter, it should be clear how this can be easily achieved, and in fact I will be showing ways of implementing an Oracle data collector using
dbaman later in
Section 4.5.
4.3 Introduction to V$ Views
4.3.1 What Are the V$ Views ?
The V$ views are views that are based on the X$ fixed tables. Although referred to as tables, the X$ tables are not actually tables at all. An X$ table simply presents data in a format that looks like a table in order to allow standard SQL queries to be executed against it. However, the data is extracted from a variety of sources, including memory structures and file headers (see Figure 4.1).
The data available in the X$ tables is unrelated to the application data stored in the database. Instead, the data in the X$ tables describes the operational state of the database server at any point in time. The actual contents of the V$ views and X$ tables are discussed in more detail in
Section 6.10. In this section we will concentrate on the
methods of accessing them, rather than on the contents themselves.
The structure of these tables is based on the underlying data structures and subsequently has table names and column names that are unintelligible to anyone without strong source code knowledge of the RDBMS. They are also subject to a good deal of change between releases, as the architecture of the underlying software changes.
It is for these reasons that the V$ views are supplied by Oracle and that their usage is recommended over the X$ tables. This having been said, there are specific occasions for which knowledge of the underlying X$ tables is beneficial, and advice will be given accordingly when such occasions arise.
4.3.2 Accessing the V$ Views
The V$ views are all owned by the
SYS user. The
catalog.sql script creates public synonyms for all of the V$ views and grants
SELECT to the
SELECT_CATALOG_ROLE role. Therefore, anybody with a privileged role can access these views.
The method used to allow non-
SYS users to access the views is worth observing, because this is the method that is necessary for allowing non-
SYS access to the X$ tables also. The necessary steps are:
· Connected as
SYS,
create view x_$<name> as select * from x$<name>;
·
create public synonym x$<name> for x_$<name>;
When these steps are used, any of the X$ tables can be externalized for other database users to view.
4.3.3 Structure of the V$ Views
The V$ views have definitions, just as all other views in the database have. The difference is that the V$ views are not visible in the
DBA_VIEWS view, because they are not real views.
2 To see how a V$ view is defined, one needs to query the V$ view
V$FIXED_VIEW_DEFINITION. to find the definition.
The first thing you will notice when querying this view is that all the V$ views seem to be simple views based on the like-named GV$ view. A GV$ view is a new feature of Oracle8 that provides a global view of all the instances that have the database open. The way to get the real definition is to query
V$FIXED_VIEW_DEFINITION for the respective GV$ view, which will give the actual definition of the V$ view.
Many of the V$ views are composed of more than one X$ table, with a join condition. An example of this is
V$SESSION_WAIT, which joins
X$KSUSECST (the actual data) and
X$KSLED (the names of the wait conditions) to provide the readable information that is presented.
In addition to the table joins, the V$ views are frequently aggregated rollup views of an underlying X$ table. An example of this is the
V$SQLAREA view, which provides a view of summary information of the cursors stored in the library cache. Although a single cursor may have several child cursors as a result of different bind variable thresholds, for example, only one row will be returned from
V$SQLAREA for any single identical piece of SQL, because of the
GROUP BY clause in the view definition.
The performance of many of the views has been significantly improved since release 7.0. The reason for this is the inclusion of virtual indexes on selected columns within the underlying X$ tables. Another view,
V$INDEXED_FIXED_COLUMN, details which columns of the X$ tables have indexes maintained on them. Although these indexes are somewhat limited in scope, their use can significantly improve the usability of some of the views, particularly when joining against other fixed views or tables.
4.3.4 Overhead of Using V$ Views
Alhough many of the V$ views access structures that are memory- resident this does not mean that they have a low overhead on a large system. In fact, the larger the system, the more resource is consumed in processing queries on V$ views. For example, a very large buffer cache may be covered by 1,000,000+ latches and the library cache by 100,000+ more. A subsequent query on
V$LATCH involves a row for every one of these latches, making a very large data set.
Likewise, joining
V$SESSION to
V$SESSTAT on a system with a substantial user population (3,000+) results in a large join operation.
Other attributes of the V$ views that somewhat aggravate this problem include
We have seen how table joins and aggregations are introduced, through the actual views themselves, but the problem of latch allocations is not immediately obvious. As many of the X$ tables are based on rapidly changing memory structures, Oracle needs to protect the lists that it traverses to obtain the results in order to prevent the list from being changed by another session. If this were to happen, the X$ query could easily find itself in a piece of list that is no longer valid, and could return corrupt results or crash the session. Therefore, when an X$ query is executed, latch operations take place.
To demonstrate this to yourself, try the following on an idle instance:
SELECT gets FROM V$LATCH WHERE name='library cache';
SELECT gets FROM V$LATCH WHERE name='library cache';
SELECT count(*) FROM V$SQLAREA;
SELECT gets FROM V$LATCH WHERE name='library cache';
|
The first query is executed twice to demonstrate the constant number of library cache latch gets that occur just from the execution of the
V$LATCH query (ten on my test system). The query to
V$SQLAREA allocates the library cache latch many times during its execution (more than one latch get per cursor) and therefore contends directly with the users on the system that are trying to parse SQL statements.
Table joins and aggregations can cause problems greater than just the CPU overhead of the required sort. If the sort is too large for the sort area, the sort will become a disk sort in temporary tablespace. Although the
sort_area_size parameters are now dynamic parameters (they can be changed on a per-user basis as required), the number of rows involved in some of these queries often makes it impractical to extend the sort area to this size.
Now that we have seen the overhead of using the V$ views, it should be apparent why there are cases where the X$ tables should be accessed in preference. One such example is the use of
V$LATCH in a continuous sampling environment.
The
V$LATCH view can take up to 15 seconds of CPU time to execute in a large configuration, owing to the large number of rows in the underlying X$ tables.
V$LATCH aggregates this data into a summary view of all the underlying child latches (
V$LATCH_CHILDREN).
If this data is required, it makes sense to rewrite the view to reduce the amount of data returned from each X$ table used in the view. This is especially important if you intend to execute the query many times in succession, such as in an automated monitoring environment, because the amount of overhead imposed without good reason can really mount up.
Finally, it is common to join several V$ views in order to obtain the desired result. Many of the V$ tables have relationships between them, and it is beneficial to join them in this way. However, many of the views also use the same underlying X$ tables, and the entire query can be rewritten in order to eliminate any unnecessary extra accesses that may occur. When joining the X$ tables, you should also try to join on the indexed columns (usually the ADDR column), because this can make a huge difference on some of the larger views.
4.5 Using
dbaman to Optimize Data Collection
Now that we have a general-purpose scripting interface to the OCI API, it makes sense to use that interface to automate the low-intrusion data collection process. In this section, we will look at ways to implement
dbaman as a general-purpose monitoring tool.
4.5.1 Defining the Cursors
First of all, we don't want to end up with a system that needs to be reprogrammed every time we want to monitor something new. So, a standard method is required for entering SQL statements in a user-friendly manner. One way to do this is to provide a separate file containing the SQL, formatted as a simplistic Tcl script that loads up all the SQL strings:
################################################################################
# sqllist.tcl - list of statements to sample with
################################################################################
################################################################################
# Setup required procedure
################################################################################
# Actual SQL Statements. All must be of the form "add_sql { <del|abs> { SQL } }"
where statistic# in ( 4,40 ) }
where statistic# in ( 1 ) }
|
Each call to the defined
add_sql procedure loads up the SQL along with its processing requirements to the array
sql_list. The processing types adopted for this implementation are
del and
abs, denoting delta processing and absolute processing, respectively. In delta processing mode, the SQL is executed, and the delta (difference) is returned between its result set and the result set from the prior execution. In absolute mode, the absolute value is returned directly from the SQL execution.
The format of the SQL in the setup file should be such that the query always returns data in two columns, tag and value. In addition, all SQL submitted for delta processing must always return the same number of rows. If this is not the case, the deltas cannot be processed.
4.5.2 Parsing and Executing the SQL
Once the SQL is all loaded up into a local variable within
dbaman, we can go through the array in sequence, parsing all the SQL one at a time, and storing the cursor handle each time:
################################################################################
# parse_all - parses all statements in $sql_list()
################################################################################
foreach stmt [ array names sql_list ] {
set l_cur($stmt) [ oci_open ]
oci_parse $l_cur($stmt) [ lindex $sql_list($stmt) 1 ]
|
This step needs to be executed only
once, in order to keep the processing overhead (and use of the library cache latches) to a minimum.
Now we need a procedure that will go through each of the statements and do the right thing accordingly:
################################################################################
# exe_all - execute all the statements, and fetch all the data.
################################################################################
global sql_list oci_error cursor results
foreach stmt [ array names sql_list ] {
while { [ lindex $oci_error 0 ] != 1403 } {
oci_fetch $cursor($stmt) 30
|
This procedure executes each statement and fetches all the available data in array fetches of 30 rows at a time. As all the data from
oci_fetch is stored in the global namespace, all this data will be accessible from everywhere else in the program.
4.5.3 Process the Result Sets
Finally, we need a routine that will perform the specified processing on each set of data. It should put all the results into a specific array so that the output stage can just traverse the array and output in the desired format:
################################################################################
# calc - go through each statement and calculate the correct values. Put the
# results into an array when finished in order to make display easy.
################################################################################
global sql_list display prior results
set now [ clock seconds ]
foreach stmt [ array names sql_list ] {
switch [ lindex $sql_list($stmt) 0 ] {
foreach row [ array names results "$stmt,\[0-9]*" ] {
set display($row) [concat [ list \
[clock format $now] ] $results($row) ]
foreach row [ array names results "$stmt,\[0-9]*" ] {
if {[ catch { set prvval [ lindex \
$prior("$stmt,[lindex $results($row) 0]") 1 ] } ]==0} {
$prior("$stmt,[lindex $results($row) 0]") 0 ]
set nowval "[lindex $results($row) 1].0"
set difftm [ expr $now - $prvtmsp ]
set display($row) [ list [ clock format $now ] \
[ lindex $results($row) 0 ] \
[ expr ($nowval-$prvval)/$difftm]]
set prior("$stmt,[lindex $results($row) 0]") [ list \
$now [lindex $results($row) 1] ]
|
In this procedure, we attempt to set as many of the
$display array elements as possible. In the case of the delta processing, if there has been no previous execution, the
$display array element is simply not set. If there has been no prior corresponding row it is also ignored, apart from setting up the prior data for the next row. If there
has been a corresponding prior data point but there is none on this runthrough no delta will be calcutlated for the data point.
However, if the data point reappears at some point in the future, a delta will be calculated at that time, using a delta calculation based on the stored timestamp of the prior sample (whenever that was).
4.5.4 Pulling it Together with a Main Loop
Once we have all the required procedures, we can build a main loop that takes care of the repetitive execution of the procedures:
################################################################################
# Main routine for dbaman-based sampling monitor
################################################################################
# read in all the support procedures
# read in the SQL configuration file
set fd [ open [lindex $argv 0 ] a+ ]
set sample [ lindex $argv 1 ]
foreach dpoint [ array names display ] {
puts $fd $display($dpoint)
after [ expr $sample * 1000 ]
|
That's about all that is needed to get a fairly low-intrusion,
dbaman-based monitoring system. The actual monitoring process can be run on a nonproduction platform, because it will use standard Oracle client/server connections to the actual database server. This ensures that the intrusion on the actual server is kept to a minimum. Combined with well-written and well-understood SQL statements, the result is a useful monitoring system.
This main routine is very simplistic-just writing out all the data to a file. More complex monitor presentations can be created by linking the TK component into the
dbaman shell in order to gain a full GUI capability. Once this has been performed, there is practically no limit to how the data can be presented.
4.6 Processing Statistics
After going to the trouble of making a low intrusion data collector, it makes sense at least to run the database collector in client/server mode, as shown in Figure 4.

2. This means that only the actual data extraction, and none of the pattern matching and actual delta processing, is performed on the database server.
Wherever possible, all other data processing of the statistics should take place on noncritical resources such as, for example, a development server. One project that is slightly beyond the scope of this book but worthwhile nonetheless is building a network infrastructure among all the monitoring components (
dbaman, system monitors, etc.) in order to funnel all the data to a central place in real time. This allows several activities to be performed on the data, such as data analysis and data display.
The latest version of Tcl supports both client and server network sockets (
socket command) and is a good starting point for building such an infrastructure. In this example program, we have most of the makings of a network data concentrator:
The program works by first of all setting up a server socket. A server socket is a socket that listens for incoming requests and creates a new client socket for any such requests that come in. This action frees up the server socket and allows multiple connections by connecting to the same port. The new client socket is assigned a port that is guaranteed to be unique.
Every 300 milliseconds, or less if a new socket connection interrupts the sleep, the program wakes up, scans all the defines client socket connections, and writes all the data from each socket to the file supplied on the command line.
This is a fairly simplistic version of what is needed and will actually work well as long as no exceptional circumstances arise. One such exceptional circumstance is noted in the source code, where a single session can effectively lock the
write_logs procedure by never giving it a chance to finish on that socket. However, there would need to be a great deal of data flowing in for this to happen.
In order to use this framework with the
dbaman-based database data gatherer, we need to amend the main routine to use the network instead of a local file:
################################################################################
# Main routine for dbaman-based sampling monitor
################################################################################
# read in all the support procedures
# read in the SQL configuration file
set sendaddr [lindex $argv 0 ]
set sendport [lindex $argv 1 ]
set sample [ lindex $argv 2 ]
set sd [ socket $sendaddr $sendport ]
foreach dpoint [ array names display ] {
puts $sd $display($dpoint)
after [ expr $sample * 1000 ]
|
The only changes in this program relate to opening a socket instead of the file, and using
puts to write down the network instead of to the local file.
Once a network forwarding infrastructure has been implemented, this infrastructure is also an ideal vehicle for centralizing and reporting on all subsystem alerts. Useful things to include are
· The Oracle
alertXXXX.log files
· The UNIX hardware error log
· Any middleware or application logfiles
It does not make sense to ship the entire logfile, however, because this is not much more useful than the original logfile. Instead, a text filter can be applied to the logfile in order to extract only specific types of messages. These messages can then be distilled and rewritten prior to shipping across the network, with the end result being a much more useful set of messages, all in a central location.
This example code fragment of one such text filter searches its input for ORA-00600 errors, the dreaded internal error:
/^ORA-00600/ { sub("[[]"," ");
sprintf("STATUS: Oracle Internal error: type %d.",$6);
|
This
awk pattern match and action detects the ORA-00600 error, extracts the arguments of the string, and reports only the fact that there is an internal error, of the type denoted by the arguments.
This could be run as follows:
$ tail -f alertDB.ora | awk -f filter.nawk | sendmesg dev_host 12345
|
This command line filters all the future output from the alertfile and pipes the output to a command called
sendmesg. This is a fabricated program that simply takes standard input and pushes it down an open client socket to the host and port specified on the command line.
In the ORA-00600 example, we prefix the message with a string denoting the severity of the message. All that is now required to make a simplistic alert console is to tail the single journal that is now being created and
grep for the type of message required (STATUS, ALARM, etc.).
4.7 Defining the "Flight Envelope"
4.7.1 What Is a "Flight Envelope"?
A flight envelope is the way that the operating characteristics of an aircraft are specified. For example, a plane may be able to fly at 300 knots at 20,000 feet but at only 150 knots at 3,000 feet. This is very similar to the way that large complex systems work.
In your system, you may be able
either to cope with 3,000 pageouts per second
or to run at 90 percent CPU utilization. If the system were to start paging at 3,000 pageouts per second while your system was at 90 percent CPU utilization, the system would be in big trouble, with end user response times being degraded accordingly. This is a simplified version of a system flight envelope.
A flight envelope is a very useful thing to define, both for your own benefit and for the benefit of people who simply cannot understand the complex interactions of the system.
4.7.2 How Do I Define a Flight Envelope for the System?
Much of the definition of the flight envelope comes down to "getting a feel" for the particular system on which you run your database. The hardware vendor may be able to provide some raw statistics about on the system's capability, but these numbers need to be used with a good pinch of caution.
In fact, one of the best ways to define a flight envelope is to break it out into several child flight envelopes. Certain operational parameters within the system will directly compete against each other, while others will work in their own discrete groups.
For example, there is a direct correlation between the number of latch acquisitions per second and the amount of idle CPU on the system, because of the impact of spinning to obtain latches. Whereas a normal latch get is not very CPU-intensive, a spin get of a latch is very CPU-intensive. Therefore, one of the system flight envelopes may be how the latch get potential relates to idle CPU on the system (see Figure 4.3).
This envelope shows that the number of potential latch gets increases linearly up to a certain point. At this point, a proportion of the latch gets must be acquired by spinning, and therefore the act of gaining the latch starts to eat into the idle CPU. As the available CPU is reduced, the latch acquisition problem is compounded, and so the number of potential latch gets per second is further reduced.
These kinds of flight envelopes can be as scientific as you like; sometimes the best metric of all is "I can do up to
n transactions per second before the response time degrades."
4.8 Using Excel Spreadsheets for Data Visualization
Microsoft Excel spreadsheets provide an excellent way of visualizing trends in otherwise meaningless streams of data. Typically, they provide an entire spectrum of different plotting options that should always provide you with enough choices to produce meaningful results.
Unfortunately, there are also several limitations that probably don't normally affect standard bean-counter users of the spreadsheet products (unless they have unusually large numbers of very complex beans to count). The first one is the row count limitation. In Excel, this is limited to 16,384 rows in version 7.0 and to 65,536 in the 97 version. It is amazing how easy it is to exceed 16,384 rows in your data; for example, if your data collector is sampling every 15 seconds, it only takes 68 hours of one statistic to exceed the limit, and proportionately less if you are showing several instances and a small number of different statistics. This comes down to only 5.67 hours if you are trying to plot four different statistics over three database instances, for example.
The other problems with using Excel for this type of function relate to the assumptions it makes about the size and complexity of the spreadsheet that you can use on it. The spreadsheet is always memory-resident, for example, which can cause havoc with the Windows virtual memory system when large sheets are used.
Related to this is the fundamental assumption from the charting tool in Excel that you are plotting "a few" data points. It clearly is not expecting 16,383 (the first row is the heading row) points to plot, because otherwise it would not redraw the chart in the "Chart Wizard" box every time you changed the settings.
Despite these problems, however, Excel is still one of the most powerful tools in the domestic tool bag for data analysis. Using Excel to best effect requires several preliminary steps in order to make the most of the charting facilities it contains.
The most fundamental requirement is that you massage the data before trying to do anything with it in Excel. This is necessary because of the variety of different sources from which the data may originate. More often than not, data from different sources will have different sample periods, different snapshot times, and widely different numbers of samples. It is often these mismatched data points that you will want to chart together on a single chart-for example, charting transaction rate against CPU utilization.
In order to achieve this with high-quality results, several steps must be taken. First, all the data must be quantized to common snapshot times and sample periods. This can be achieved using one of the powerful text-processing languages such as awk or perl.
Quantization means time-aligning the data points from the separate data sources. Individual circumstances will determine the most suitable granularity for the quantization, but using nearest whole minutes often works very well.
The input data may look like this:
1997/12/17 23:42:26 882430946 dbmon BMA2.user_commits 14.58
1997/12/17 23:42:59 882430979 dbmon BMA2.user_commits 14.79
1997/12/17 23:43:32 882431012 dbmon BMA2.user_commits 15.30
1997/12/17 23:44:06 882431046 dbmon BMA2.user_commits 14.79
1997/12/17 23:44:39 882431079 dbmon BMA2.user_commits 13.79
1997/12/17 23:45:12 882431112 dbmon BMA2.user_commits 14.58
1997/12/17 23:45:46 882431146 dbmon BMA2.user_commits 14.33
1997/12/17 23:46:19 882431179 dbmon BMA2.user_commits 13.00
1997/12/17 23:46:53 882431213 dbmon BMA2.user_commits 14.50
1997/12/17 23:47:27 882431247 dbmon BMA2.user_commits 14.97
1997/12/17 23:48:00 882431280 dbmon BMA2.user_commits 15.27
1997/12/17 23:48:34 882431314 dbmon BMA2.user_commits 15.18
|
All of the samples lie on almost random time boundaries as a result of rounding in system timers, scheduling glitches, and so on. This makes such data very difficult to match up with any other data.
The following awk script will take data in the form above and time-align it to whole-minute boundaries:
The program takes all the samples, or portions of samples, that fall within the boundaries of the minute. A weighted average is then calculated for the minute, and the output timestamp is adjusted accordingly.
This results in data of the following form:
This data is now ready for the next step.
The next step varies depending on the amount of data analysis required. If one-off charting of a few choice statistics is all that is required, the next step is simply to load up the CSV (comma separated value) file into Excel and perform the charting.
If the requirement is in-depth analysis of hundreds of statistics, spanning multiple days and possibly multiple systems, the next step should be a little more preparation, in order to save a good deal of time later.
One of the best (and most enjoyable) ways to proceed in this case is to use Oracle and Open Data Base Connectivity (ODBC) to aid the process. I used this technique after a large benchmark and found it to be a most flexible method of rapidly comparing and cross-checking results from numerous tests from both benchmark sites.
The first step is to create a table in an Oracle database:
------------------------------- -------- ----
|
This table stored the date and time, the UNIX date/time (seconds since epoch), the vendor name, the test ID, the namespace of the statistic (e.g., db, system), the statistic name (e.g., user_commits), and the actual value.
On top of this, four bitmap indexes were built-on
VENDOR,
ID,
NAMESPACE, and
STATNAME. No other indexes were required. The total number of data points in this particular analysis was just over 1 million.
Owing to the quantization effort prior to this, any statistic for a given timestamp could be directly compared with any other for that timestamp. This is a fundamental requirement for when we start using pivot tables in the Excel product, because any inconsistencies will result in "gaps" in the data, where the value suddenly crashes down to zero every few samples. This results in false spikes in the data owing to quantization errors.
Next, an ODBC/SQL*Net connection is defined between the database server and the PC running Excel. Once this is running, a pivot table can be created that drives from a query to the Oracle database. This is defined when you initially invoke the pivot table, specifying that you want to use an "external data source." You are then given the opportunity to define the query that will act as the data set for the pivot table. The screen shot in Figure 4.4 shows one of the master pivot tables used to flip quickly between different tests at the different vendor sites.
The pull-down-menu-type cells (B1:B2) change the parameters for the data selection from the database, and the query is reexecuted in order to define the new set of data to display on the pivot table.
In this particular example, the tabs on the bottom of the workbook point to predefined chart worksheets: when a test is recalled from the database, all of the required charts are already drawn and finished on the other sheets in the workbook, by virtue of the fact that they contain static references back to the data in the pivot table.
Having the right tools is an essential part of building a large system. Without the data provided by such tools, it is very difficult to make scientific decisions and hypotheses about the system-you are quite literally in the dark.
Hopefully, some of the building blocks presented in this chapter will be of use in building some of the fundamental tools required, in addition to allowing rapid development of any subsequent tool requirements (using
dbaman).
1
This is less of an issue now that the
sort_area parameters can be dynamically altered by each user session: the monitoring session can invest in a larger sort area than the online users in order to keep the impact to a minimum.
2
However, the
V_$<name> view exists, because it was created as a standard database object in
catalog.sql.