James Morle's BlogRSS Feed
The Oracle Wait Interface Is Useless (sometimes) – part 3bPosted on 3:57 pm February 16, 2010 by James Morle
Welcome back for the concluding part of this series of blogs.
In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn't too boring for everyone - I think it's really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.
To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle's Time Model shows it is executing SQL, and that's about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don't already have a copy of Tanel's Snapper code, I suggest you grab it and get familiar with it - it's a nice implementation of an essential tool that many DBAs already have some kind of script for.
Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don't necessarily put a neon sign reading "Here's Where All The Time Went" on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.
The power of Dtrace and friends lies in a few key areas:
- It can report time-based information
- It can show user and kernel information
- It is dynamic - unlike gprof and other code profilers, it is not necessary to run everything with a profiled binary
These are really key features which, when combined, allow for tremendously powerful on-demand tracing.
Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:
CPU FUNCTION 0 -> my_function USER CODE (57 lines of dynamic library loading omitted) 0 -> printf USER CODE 0 -> _setorientation USER CODE 0 <- _setorientation USER CODE 0 -> _ndoprnt USER CODE 0 -> _findbuf USER CODE 0 -> isatty USER CODE 0 -> ___errno USER CODE 0 <- ___errno USER CODE 0 -> ioctl USER CODE 0 -> syscall_mstate KERN 0 -> gethrtime_unscaled KERN 0 -> tsc_gethrtimeunscaled_delta KERN 0 -> tsc_gethrtimeunscaled KERN 0 <- tsc_gethrtimeunscaled KERN 0 <- tsc_gethrtimeunscaled_delta KERN (180 more lines of kernel functions for ioctl call omitted) 0 <= ioctl SYSCALL: ioctl 1743296 ns (34 lines of kernel function (probably DTrace intrumentation!) omitted) 0 <- ioctl USER CODE 0 <= fstat64 SYSCALL: fstat64 352699 ns (~300 lines of kernel and dynamic library loading omitted) 0 -> _write USER CODE (~200 lines of kernel and dynamic library loading omitted) 0 <= write SYSCALL: write 2154427 ns (123 lines of kernel code omitted) 0 <- printf USER CODE
In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:
- All User mode function calls are logged
- All Kernel mode function calls are logged
- All system calls are timed
Note: The printf() itself is very much a user-mode call,part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: Initially an ioctl() and eventually the write(2) system call.
Let's look at something else - time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:
WAIT: PAGE IN: 0 ms WAIT: PAGE IN: 0 ms WAIT: PAGE IN: 0 ms WAIT: PAGE IN: 0 ms WAIT: PAGE IN: 0 ms WAIT: PAGE IN: 90 ms WAIT: PAGE IN: 836 ms WAIT: PAGE IN: 1098 ms WAIT: PAGE IN: 853 ms WAIT: PAGE IN: 816 ms FINAL WAIT: PAGE IN: 263 ms
So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don't assume that soft page faults are cheap/fast, either - with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.
These are just a few of the things that can be monitored in Dtrace. Now let's combine the Oracle stack tracing that we introduced in part 3a. In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete, and can quantitatively report the time spent in each function, and combine other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won't post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I'm trying to get to here is a really quick way to see exactly what an Oracle session is doing by only looking at the running process using Dtrace. The script currently has the following features:
- Shows all OWI wait events and durations
- Shows all system calls and durations
- Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators
The rowsource operators are taken from tracing the function calls to the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators:
- GS - Group By Sort
- JO - Join (Nested Loop)
- UA - Union All
- VW - View
- HJ - Hash Join
- TB - Table
- JOT - Unknown. Possible relation to Join, observed frequently during stack tracing
- SO - Sort
- IXT - Index
- PF - Prefetch
Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) SYSCALL: times 7104 ns SYSCALL: times 1749 ns LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO) LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)
Remember that this SQL example precisely fits our profile where the session is using 100% usermode CPU and emitting no wait information, so we cannot see any OWI waits detailed here - just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel's findings using session-statistics. We are also spending a little time acquiring latches, but it's not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF) SYSCALL: times 7810 ns SYSCALL: times 1750 ns LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF) LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)
Looks like we've found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this timing exaggeration is that there is a good deal of double accounting going on with the current version of the script - child function times are also accounted for in the parent. This is a known bug, hopefully I'll get a chance to fix that one soon.
Well that's about it for now. I hope that this journey has been an interesting one, I know it's been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact to other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:
- Multiplatform support - Systemtap version is required
- Better design to support more function tracing
- Fix double counting issue
- Add identifiers for optimiser plan line number rather than rolling up by function name