Posted on 11:00 am July 25, 2014 by James Morle
At last, the long-awaited release of Oracle Database 12.1.0.2 has arrived, including the highly anticipated In-Memory Database Option. I have had the pleasure of being involved in the early beta stages of this feature and I have to say: I'm impressed. This is truly game-changing technology, and has been extremely well designed and implemented by Oracle. But this post isn't about that; it's about the implications of this technology once it gets used in anger.
I can summarise this in one sentence for those with little time: It's going to be expensive, very expensive. I encourage those with more time to read on; the detail is kind of important.
Before I continue, a small disclaimer: I have no problem at all with anybody, and I mean anybody, making huge amounts of money in a free market economy. Everybody, and I mean everybody, that makes massive stacks of cash in a free market fully deserves every penny of it. Nobody is forced to give them that money, it is voluntarily given up in exchange for goods or services. Even Justin Bieber deserves that Lamborghini that he was caught racing (although I believe that was a rental). Oracle will make truckloads of cash from the In-Memory Database Option, and fair play to them for that - it's good enough for users to pay for. This blog post isn't a gripe about money, it's a technical evaluation of the licensing impact of using this option, including some non-obvious aspects which make the costs scale in ways that will not be adequately understood when first licensing this option.
OK, first a very, very quick background into what the In-Memory Option (IMO from now on) does, and what the price list says about it.
IMO allows the user to specify that a table should be fully cached in an In-Memory Column Store. This store is an additional cache which co-exists with the good old-fashioned buffer cache, with which it is kept consistent. The contents of the store, as the name suggests, are a columnar representation of all of the data in the nominated table(s). Without going into too much detail about this (Google "columnar format"), the data is stored so that any given block stores the rows for that block grouped by column, instead of grouped by row. So, if a table has columns "EMPID", "NAME" and "DEPTNO", all the "EMPID" column values are stored first, then all the "NAME" values, then all the "DEPTNO" values. This storage format makes filtering much more efficient due to the spatial locality of reference.
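To make the layout difference concrete, here is a minimal Python sketch of the EMPID/NAME/DEPTNO example. The sample values are invented for illustration, and this is only a toy model of the idea, not how Oracle lays out its column store internally.

```python
# Three hypothetical rows of the example table: (EMPID, NAME, DEPTNO).
rows = [
    (101, "SMITH", 10),
    (102, "JONES", 20),
    (103, "BLAKE", 10),
]

# Row format: all the values for one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Columnar format: all the values for one column sit next to each other.
empid, name, deptno = (list(col) for col in zip(*rows))
column_layout = empid + name + deptno

# A filter on DEPTNO = 10 only has to walk the contiguous DEPTNO values,
# then uses the matching positions to pick up the other columns.
matches = [i for i, d in enumerate(deptno) if d == 10]
matching_names = [name[i] for i in matches]

print(row_layout)
print(column_layout)
print(matching_names)
```

In the row layout the three DEPTNO values are scattered among the other columns; in the columnar layout they are adjacent, which is the spatial locality the filter benefits from.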
In addition to being pivoted into columnar format, the column store also has an implementation of region-based indexing, similar to Exadata's storage indexes, which allows Oracle to quickly skip whole regions of in-memory data that do not contain the requested values. On top of all this, Oracle has also implemented in-memory data compression, which reduces the total amount of data that Oracle must process to find matching keys. All of these attributes – columnar format, storage indexes, and compression – combine to reduce the amount of data to scan, whilst leaving the data in memory as a tightly packed, spatially convenient format.
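The region-skipping idea can be sketched in a few lines of Python. This assumes each region simply records the minimum and maximum value it contains, which is a common way to build this kind of index; the `Region`, `build_regions` and `scan_equal` names are hypothetical, and nothing here is taken from Oracle's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Region:
    values: list  # one column's values held in this region
    lo: int       # smallest value in the region
    hi: int       # largest value in the region


def build_regions(column, region_size):
    """Split a column into fixed-size regions, recording min/max for each."""
    regions = []
    for start in range(0, len(column), region_size):
        chunk = column[start:start + region_size]
        regions.append(Region(chunk, min(chunk), max(chunk)))
    return regions


def scan_equal(regions, target):
    """Count matches for target, skipping regions that cannot contain it."""
    hits, scanned = 0, 0
    for region in regions:
        if target < region.lo or target > region.hi:
            continue  # the whole region is skipped without reading its values
        scanned += 1
        hits += sum(1 for v in region.values if v == target)
    return hits, scanned


deptno = [10, 10, 20, 20, 30, 30, 40, 40]
regions = build_regions(deptno, region_size=2)
hits, scanned = scan_equal(regions, 30)
print(f"{hits} matches found after scanning {scanned} of {len(regions)} regions")
```

How much gets skipped depends entirely on how well the data is clustered: with the values shuffled randomly across regions, most min/max ranges would overlap the target and little would be saved.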
The actual scanning of that spatially convenient data is carried out using SIMD instructions. I don't currently know exactly which SIMD instructions are used, as there are numerous instructions that perform essentially the same function but in subtly different ways, notably in the way they interact with the processor cache. I suspect that MOVNTDQA (bulk loading from memory to CPU core) and probably the PCMP* series of instructions (parallel data comparison) are used for Intel architectures, but that's just pure conjecture at this stage. The SIMD instructions all have the same purpose - to achieve multiple operations using a single processor instruction. Using this approach, the CPI (cycles per instruction) of the processor is minimised, thus improving throughput. More goodness to add to the IMO pot.
So what does all this mean for licensing? Well, the price list says that this option is on a pricing par with Real Application Clusters at $23,000 per Processor License, which is already a big uplift, but there is more to it than that.
When using IMO, two things are clear:
- (a) Memory bandwidth, which is always much slower than CPU bandwidth, will dominate system throughput (it always has dominated CPU throughput for Oracle databases, but you will notice it much more when the majority of query response time is attributed to memory waits)
- (b) All these new IMO queries will be darn quick, regardless of (a) (though with different values for 'darn', depending on (a))
Starting from the bottom, the very fact that queries will be extremely fast means that less attention will be paid to the classic techniques of data design and optimiser plan optimisation. This is a major pitfall that will occur frequently, with the net effect of simply driving up CPU utilisation due to good ole-fashioned bad design. Unlike previous bad design mistakes, which might result in a huge amount of disk I/O and an uplift in CPU, 100% of these errors will result solely in increased CPU and therefore licensing cost.
Now how about that issue of the memory bandwidth? Here's the kicker: When the processor is waiting for memory to feed it with new data, because of point (a), how does this get reported back to the system? Think about that for a moment...
OK, here's the answer: It is reported as busy CPU. Memory waits show up in your OS statistics as CPU time, and consequently in your AWR reports as "DB CPU". As load increases on the box, all the processors will eventually end up reporting that they are 100% busy and performance on the box will decrease. Not rocket science. But what is the immediate reaction to this dilemma? Add more CPUs. There are two problems with this:
- It probably won't fix the problem, at least in the way we might intuitively expect
- It requires more Oracle Processor Licenses (Enterprise Edition, IMO, and any other options all need to be licensed on the additional CPUs)
The additional costs are obvious, but why won't it work? It won't work because it isn't a CPU problem, it's a memory problem that is reported as a CPU problem. Actually, by lucky architectural coincidence, adding CPU probably will make a difference (possibly a negative one) by implicit virtue of all system architectures being NUMA these days: In a NUMA architecture, memory is physically affined to one CPU domain (aka NUMA node). In current reality, a CPU domain is analogous to a CPU socket, but it doesn't need to be. So, adding a CPU adds a new NUMA node, and this node must also be populated with memory. This implicitly adds memory channels to the system, and thus memory bandwidth.
The whole interaction of NUMA with IMO is a subject for another day – I haven't really been through the thought process of what that means yet, nor have I looked at what Oracle have done in this area. QPI and HyperTransport have lots of point-to-point bandwidth (significantly, more bandwidth than current DDR3 memory), but it's not infinite and it is shared – memory placement must matter for IMO.
Back to the current subject: Adding CPU might help, but not because you added CPU. And adding CPUs to your system adds significant cost to your license: A single Intel® Xeon® E7-4890v2 has 15 cores, which is 8 Processor Licenses (15 cores at a 0.5 core factor, rounded up). Just EE and IMO will set you back $564,000 in Capex and $124,080 in annual support...
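The arithmetic behind those figures can be sketched as follows. Only the $23,000 IMO price and the totals come from this post; the 0.5 core factor, the $47,500 Enterprise Edition list price and the 22% support rate are assumptions that happen to reproduce the totals, so check them against the current price list before relying on them.

```python
import math

CORE_FACTOR = 0.5         # assumption: core factor for this Intel processor
EE_PER_LICENSE = 47_500   # assumption: EE list price per Processor License
IMO_PER_LICENSE = 23_000  # In-Memory Option price quoted from the price list
SUPPORT_PERCENT = 22      # assumption: annual support as a percentage of capex


def processor_licenses(cores, core_factor=CORE_FACTOR):
    """Cores times the core factor, rounded up to a whole license."""
    return math.ceil(cores * core_factor)


def license_cost(cores):
    """Return (licenses, capex, annual_support) in whole dollars for EE + IMO."""
    licenses = processor_licenses(cores)
    capex = licenses * (EE_PER_LICENSE + IMO_PER_LICENSE)
    support = capex * SUPPORT_PERCENT // 100
    return licenses, capex, support


licenses, capex, support = license_cost(15)
print(f"{licenses} licenses, ${capex:,} capex, ${support:,} annual support")
# prints "8 licenses, $564,000 capex, $124,080 annual support"
```

Any other options licensed on the database would be added to the per-license price in the same way, which is why each extra socket gets expensive so quickly.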
So what's the alternative? There are two currently:
- Ensure all memory channels are in use
- Maximise memory clock speed
The first one is easy: Make sure that the manner in which the DIMMs of your server are plugged into the sockets on your server enables all of the available memory channels. If you fully populate the DIMM sockets, you are already using all the memory channels - the danger arises when the DIMM sockets are not all populated. Choice of DIMM socket really matters to enable all the available memory channels. Memory channels directly equate to memory bandwidth.
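As a rough illustration of why channels equate to bandwidth, theoretical peak bandwidth is simply channels multiplied by transfer rate multiplied by bytes per transfer. The sketch below assumes a hypothetical socket with four DDR3 channels, each 64 bits (8 bytes) wide, running at 1333 MT/s; real servers, particularly those using memory buffers, will differ, and sustained bandwidth is always lower than the theoretical peak.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_second = channels * mega_transfers_per_s * 1_000_000 * bytes_per_transfer
    return bytes_per_second / 1e9


# Hypothetical socket with four channels, all populated.
all_channels = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1333)

# Same socket, but DIMMs plugged in so that only two channels are in use.
half_channels = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=1333)

print(f"4 channels: {all_channels:.1f} GB/s, 2 channels: {half_channels:.1f} GB/s")
```

The same amount of memory plugged into the wrong sockets leaves half of the theoretical bandwidth on the table in this example.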
The second one will probably take some getting used to for most IT shops. It is fairly common, in my experience, that memory is selected for systems based upon a sweet spot of DIMM pricing, in isolation from any other considerations. As with most things in IT, the fastest, most dense DIMMs with the most channels cost disproportionately more than just the memory capacity would imply. Consider the following extract from the HP DL580 Gen8 Quickspecs:
The price differential between the bottom two DIMM modules is huge: $179.99 for the 16GB/Dual Rank DIMM module compared to $669.99 for the 32GB/Quad Rank DIMM module. However, only the more expensive module allows the full clock speed of 1333MHz when fully populated. If you fully populate this example server with those modules, you would pay $17,279.04 for the 16GB modules and a whopping $64,319.04 for the 32GB modules. This is why IT system builders often favour the less cutting-edge DIMM.
That pricing differential is chicken feed in the case of a heavy IMO system, because it's a deathmatch fight between memory speed and Oracle license costs.
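To put a number on "chicken feed", the sketch below compares the premium for the faster DIMMs with the license cost of one extra 15-core socket quoted earlier. The slot count of 96 is not stated above; it is inferred from the fully populated totals ($17,279.04 / $179.99). Prices are held in cents to keep the arithmetic exact.

```python
DIMM_SLOTS = 96                # inferred: 17,279.04 / 179.99
PRICE_16GB_CENTS = 179_99      # 16GB/Dual Rank module
PRICE_32GB_CENTS = 669_99      # 32GB/Quad Rank module
SOCKET_LICENSE_DOLLARS = 564_000  # EE + IMO capex for one 15-core socket

cost_16gb_cents = DIMM_SLOTS * PRICE_16GB_CENTS
cost_32gb_cents = DIMM_SLOTS * PRICE_32GB_CENTS
premium_cents = cost_32gb_cents - cost_16gb_cents

# Premium for the faster memory as a share of one socket's license capex.
share = (premium_cents / 100) / SOCKET_LICENSE_DOLLARS

print(f"Memory premium: ${premium_cents / 100:,.2f}")
print(f"Share of one socket's EE + IMO capex: {share:.1%}")
```

On these figures the whole memory premium is well under a tenth of the capex for licensing a single additional socket, before annual support is even considered.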
The availability of this technology might/should drive changes in system architectures. Memory channels matter now even more than they ever did and, although it will increase the costs of hardware, having more of them is going to be worth every penny when offset against the Oracle license savings of doing so.
I'm hoping that my esteemed colleague, Jeff, will write a much more qualified post regarding memory speed and SIMD instructions at some point. Keep an eye on his blog!
Posted on 10:05 am May 23, 2014 by James Morle
It's been a while since I provided any public updates regarding Simora, our Oracle workload simulation product. It's finally time to unveil the status of Simora and our steps moving forwards.
We have been working extensively on the Simora engine and infrastructure over the last several months, with a view to transforming it into a commercially viable product. This phase of development has now completed and we have recruited a stellar group of Alpha testers to give the code a good thrashing! Once they are done with it, and the bugs fixed, we will be looking for a wider set of technical users for the beta test phase. If you are interested in joining the beta programme, please scroll down for the application form.
So What Is Simora?
Simora is a processing and playback framework for the accurate replay of Oracle workloads. Workloads can be captured from real Oracle databases (any recent version, any edition) and replayed against another Oracle database (again, any recent version and any edition - assuming the features present in the workload are supported by the target database). If you think that sounds a little bit like Oracle's Real Application Testing (RAT), you would not be far from the truth - Simora supports the simplistic playback of workloads that RAT supports. However, this is not the main design goal of Simora, which is to be more of a utility to allow full control over the workloads that are played back. Simora was developed with this in mind, and this is evident in the approaches taken to the design. So, what kind of workload manipulation can be achieved with Simora?
Once a workload is created, it can be quickly turned into a Playback workload, just like RAT. This comes with all the necessary caveats for this kind of approach: the playback database must be an exact copy of the capture database, same SCN, and the database must be restored to the same start-SCN for each playback of the workload. There are also a few limitations on how aggressively the workload can be accurately played back against the test database, and these limitations are inherent to both RAT and Simora. These limitations can all be overcome by switching into Advanced Mode.
In Advanced Mode, Simora facilitates more of a workload DevOps workflow. It allows the captured workload to be fully manipulated, starting with the fact that the Simora workloads are presented as an easy-to-use script language based upon the popular Tcl language. The workload scripts are instrumented to make an easily readable view of the database operations associated with the workload. These scripts can then be fully modified and take advantage of the Simora playback engine facilities to, for example:
- Add loops
- Add variable substitutions
- Experiment with different SQL statements
- Implement inter-workload interaction through message passing
Once the modified workload scripts have completed testing using Simora's debugging harness, they can be presented to the main engine as part of an aggregate workload. This workload might have, as an example, the following components:
- Product selections and browsing transaction
- Order entry transaction
- Picklist generation
These transaction types can then be scaled as much or as little as desired, including the splits between them. The think times and active user counts can be manipulated fully dynamically while the simulation is running.
Anyway, that's a brief introduction to Simora and its advanced capabilities. We will shortly be launching a Simora-specific microsite with significantly more information, including licensing information. The current thinking is that we would like to make the product freely available for non-commercial use, but that's subject to further thinking!
Posted on 5:36 pm November 25, 2013 by James Morle
It's just under a week to go before the doors open for the UKOUG Tech13 conference and the adjoining OakTable World UK 2013 sessions, so I thought I would write a very short blog post about what I will be doing there, where I'll be, and what I'm looking forward to.
This year I will be presenting four different presentations, which will keep me fairly busy. The schedule is as follows (but please check for last minute changes in schedule and room assignments nearer the time):
- Sunday 1st December, 13:40: "Optimal Oracle Configuration for Efficient Table Scanning" at Tech13, room Exchange 11
- Monday 2nd December, 08:45: OakTable World UK 2013 Opening Address
- Monday 2nd December, 10:15: "What the Heck is Async I/O and Why Should I Care?" at Tech13, room Exchange 10
- Monday 2nd December, 14:10: "Diagnosing ASMlib" at OakTable World UK 2013
- Tuesday 3rd December, 09:00: "The Least You Need to Know When Evaluating Exadata" at Tech13, room Exchange 6
Those that are interested in understanding more about optimising their I/O stack will be interested to hear that my first three talks all flow together. The "Table Scanning" talk takes the listener through all the essential components for configuring Oracle for high bandwidth table/index scanning. The "Async I/O" talk drills in further to one of the really important parts of this: Async I/O. Then in "Diagnosing ASMlib", we get really deep and dirty into the oracleasm kernel module, looking at module source code and putting together diagnostic tools for viewing ASMlib I/O (which does not show up clearly in strace!).
My Exadata talk is aimed at DBAs, managers and Architects, and is focused on covering the most important aspects of the Exadata architecture in an open and honest way. That means without any sales hype or hidden agendas. It includes my usual candid opinions on what Exadata is really good at, and what it isn't great at.
I am heavily involved in the organisation of OakTable World UK 2013, and so I am likely to be spending a large amount of my time over at that event rather than attending many sessions. If you'd like to meet for a coffee/tea, please seek me out at #OTWUK13, or drop me an email to arrange something prior to the event on firstname.lastname@example.org. It's a great opportunity to talk about platform design/architecture, performance, etc. I'd love to catch up with you!
Posted on 3:03 pm August 19, 2013 by James Morle
When I wrote this article for The Register in October 2010, there was a torrent of naysayers and witch hunters spewing their opinions in the comments section. I don't have a problem with that, I was only expressing an opinion myself, after all. I don't actually own a time machine and so any of my predictions are only based upon observations in the industry. It is heartening, though, to see that the prediction is coming true a little bit sooner than I expected with the announcement from EMC that VMAX will be relegated to the Capacity Tier. At least it seems that EMC 'get it', and are doing the right thing.
Like it or not, solid-state storage is the future of high-performance storage, and the hardware/software architectures have to change to keep up with the dramatic reduction in latency solid-state storage brings. Flash storage isn't perfect, but the next technologies in solid-state storage remove many of its deficiencies and retain or improve on those latency numbers. There is no going back, only forwards.