Full-stack Philosophies

James Morle's Blog

RSS Feed

And Now For Something Completely Different…

Posted on 9:34 am October 12, 2009 by James Morle

So just a short post today, more of an announcement...

At 11.05am on Wednesday 2nd December, I will co-present a talk at the UKOUG conference with Tanel Poder. OK, so nothing particularly earth-shattering there. However, we've decided to do something a little bit different with the creative process. The talk is entitled "The Oracle Wait Interface Is Useless (sometimes)", and is a subject that both Tanel and I have been working on separately for a little while. The premise is this: The wait interface is great for 'slow' waits, but what about waits that are not instrumented by the wait interface? What about waits that are not waits from Oracle's perspective, such as reading a page of memory? What about pure inefficiency? There is, of course, the concept of DB Time, but it is not currently granular enough.

So, you probably can see the idea behind the presentation: The goal is to present some alternative diagnostic techniques to determine the cause of poor performance. We happen to both have similar ideas on this, and they don't just stop at Oracle.

So here's the new concept, at least for us: We are going to write the content for the presentation as a tag team effort between our two blogs. Airing our dirty laundry in public, so to speak. We think this will give a fairly unique opportunity for public comment before the presentation is actually given!

So, over to Tanel for part one...


14 comments on “And Now For Something Completely Different…

  1. Hi,

    I totally agree that Oracle performance has to be monitored together with OS performance. A few years ago I read about Craiga Shallahamer's tool called OSM which monitor both - OS and Oracle. It was used to create a workload for capacity planning but the idea of monitoring OS and Oracle and use both results for tuning stay in my mind. I'm waiting for your new work.

    regards,
    Marcin Przepiorowski

    • Hi Marcin,

      I'm not talking about 'monitoring' so much here - I'm talking about diagnostics. So, a method that may, or may not, include host tools to find the real problem is required...
      Cheers

      James

      • Hi James,

        I mean 'monitoring' in term on gathering a diagnostic data. If you want to go further than Oracle Wait interface and strace/truss/etc it will be a real revolution. Two ideas come to my mind - use software profiler or monitor a page faults on OS level for Oracle processes.

        regards,
        Marcin

  2. The funny thing is that Precise Indepth for Oracle had some of this stuff already (some it correctly implemented and some of it wrong). One could see that a process was waiting for a memory page that had to be paged into memory. You could see that on the statement level. This was achieved by sampling not only the Oracle wait state for the session but also the process/thread state in the underlying OS. One could even see if a process was waiting for the CPU for a particular SQL statement. This technique was based on sampling (once a second), but for stuff that was frequently executed (many but short) or that took long time (a few but long) it tended to be statistically correct (enough). So one could see memory waits for Hash Joins or full table scans that were going through the File System cache but were waiting on a free buffer.

  3. James,

    You've confused me. What does the term "reading a page from memory mean?" Are you thinking of a minor page fault (which is neither a read nor a wait). Actually, I guess I want to know what "pure inefficiency" means? Is that code glut? High CPI?

    • Hi Kevin,

      Good spot, that's not very clear/incorrect. What I am referring to is "referencing something in memory, which is assumed to be 'fast', but actually results in a potentially slow page fault, aka a 'wait' but not per the Oracle Wait Interface".
      In terms of pure inefficiency, I'm mostly referring to aspects such as bad query plans. I don't think we're at the point where the external observer can cast easy aspersions over code bloat.
      So here's an example: a user complains of bad response time. First port of call: Oracle Wait Interface, perhaps via a 10046 trace. It shows NOTHING. Neither does truss/strace. By 'nothing', I mean literally no output from either of those methods, which have been the mainstay of many DBAs toolkit for years. So what's going on? That's what I want to address with this subject 🙂

      James

  4. James,

    Ok, thanks. Should be interesting. BTW, what is a "slow page fault?" Is that a major page fault or a minor page fault that triggers dirty page flushing or something along those lines?

    • "Slow" is a relative thing, and that's my meaning for 'slow page fault' 🙂
      A page fault of any kind is going to be slower than a load from L1, a major page fault slower than a minor, and so forth. We going to be limited on what we can quantify in terms of time in this study, but let's see what's out there. Maybe some counters will work in conjunction with other data.

  5. OK, thanks James. I admit being a bit of a prima donna when it comes to memory and Oracle. I don't think there is room for any demand paging overhead. If it isn't all in the resident set then tune it so it is.

    BTW, it's Oracle. Your L1 is useless 🙂

    • Nothing wrong with being a prima donna 🙂
      I totally agree on the demand paging comment, but this isn't about how things should be done, it's about how things might actually be for whatever reason (well, there's normally one reason bigger than the others 🙂 )
      Regarding L1, I remember those instruction trace/cache locality charts Jeff used to have when he was at Oracle: Pink Noise!

  6. Hi James,

    Touching alternative diagnostic techniques theme, how about technique based on frequent snapshots from v$sql (or v$sqlstat in 10g+)? Frequent means every 15 second or even faster. There are technical challenges of storing/querying few thousand snapshots to cover last 12 hours, for example, but that has been solved and implemented outside Oracle. Now, for any period of time as small as half a minute, delta can be calculated giving top queries by CPU, by gets, by reads etc. The number of executions is known, statistics per execution can be calculated (such as gets/exec to see if the query is doing insane things). Even more, individual statements can be tracked and plotted over time and compared against system-wide data.

    Of course this obtained data is as good as sql statistics reported by Oracle and you know well, nothing is perfect. But they happen to be quite accurate and providing actionable information. In practice this technique turned out to be more efficient and accurate than active session history (ASH) technique. Especially for OLTP databases with many small queries. This can be attributed to the cumulative counters used by sql statistics which don't skip even small things.

    Sorry if I deviated from the topic of this discussion.
    Thanks,

    • Hi Sergei,

      That is certainly an approach that can help diagnose SQL-related issues. However, the point of the article is more general than that - I'm beginning to wish that I used a non-SQL example now! Specifically, though, your approach is about system-level tuning, and while this can be useful in reducing the overall footprint of the system, I would normally take a session-level approach, even for very transient OLTP workloads.
      Cheers

      James

      • I agree it is SQL-related. I was intrigued by the name of article and keeping in mind that SQL-related problems are very common I wanted to share “something different”.

        But eventually everything what happens in Oracle instance comes down to the server processes and few background processes and the harmony of their concurrent functioning. I understand that you are focusing on OS process level details and very interested to hear about your findings.

        Best regards,
        Sergei

Leave a Reply