Full-stack Philosophies

James Morle

James Morle's Blog

RSS Feed

Latest Post

Oracle’s In-Memory Database: The True Cost Of Licensing

Posted on 11:00 am July 25, 2014 by James Morle

At last, the long-awaited release of Oracle Database has arrived, including the highly anticipated In-Memory Database Option. I have had the pleasure of being involved in the early beta stages of this feature and I have say: I'm impressed. This is truly game changing technology, and has been extremely well designed and implemented by Oracle. But this post isn't about that, it's about the implications of this technology once it gets used in anger.

I can summarise this in one sentence for those with little time: It's going to be expensive, very expensive. I encourage those with more time to read on, the detail is kind of important.

Before I continue, a small disclaimer: I have no problem at all with anybody, and I mean anybody, making huge amounts of money in a free market economy. Everybody, and I mean everybody, that makes massive stacks of cash in a free market fully deserves every penny of it. Nobody is forced to give them that money, it is voluntarily given up in exchange for goods or services. Even Justin Bieber deserves that Lamborghini that he was caught racing (although I believe that was a rental). Oracle will make truckloads of cash from the In-Memory Database Option, and fair play to them for that - it's good enough for users to pay for. This blog post isn't a gripe about money, it's a technical evaluation of the licensing impact of using this option, including some non-obvious aspects which make the costs scale in ways that will not be adequately understood when first licensing this option.

OK, first a very, very quick background into what the In-Memory Option (IMO from now on) does, and what the price list says about it.

IMO allows the user to specify that a table should be fully cached in a In Memory Column Store. This store is an additional cache which co-exists with the good old-fashioned buffer cache, with which it is kept consistent. The contents of the store, as the name suggests, are a columnar representation of all of the data in the nominated table(s). Without going into too much detail about this (Google "columnar format"), the data is stored so that any given block stores the rows for that block grouped by column, instead of grouped by row. So, if a table has columns "EMPID", "NAME" and "DEPTNO", all the "EMPID" column values are stored first, then all the "NAME" values, then all the "DEPTNO" values. This storage format makes filtering much more efficient due to the spatial locality of reference.

In addition to being pivoted into columnar format, the column store also has an implementation of region-based indexing, similar to Exadata's storage indexes, which allows Oracle to quickly skip whole regions of in-memory data that do not contain the requested values. On top of all this, Oracle has also implemented in-memory data compression, which reduces the total amount of data that Oracle must process to find matching keys.  All of these attributes – columnar format, storage indexes, and compression – combine to reduce the amount of data to scan, whilst leaving the data in memory as a tightly packed, spatially convenient format.

The actual scanning of that spatially convenient data is carried out using SIMD instructions. I don't currently know exactly which SIMD instructions are used, as there are numerous instructions that perform essentially the same function but in subtly different ways, notably in the way they interact with the processor cache. I suspect that MOVNTDQA (bulk loading from memory to CPU core) and probably the PCMP* series of instructions (parallel data comparison) are used for Intel architectures, but that's just pure conjecture at this stage. The SIMD instructions all have the same purpose - to achieve multiple operations using a single processor instruction. Using this approach, the CPI of the processor are minimised, thus improving throughput. More goodness to add to the IMO pot.

So what does all this mean for licensing? Well, the price list says that this option is on a pricing par with Real Application Clusters at $23,000 per Processor License, which is already a big uplift, but there is more to it than that.

When using IMO, two things are clear:

  1. Memory bandwidth, which is always much slower than CPU bandwidth, will dominate system throughput (it always has dominated CPU throughput for Oracle databases, but you will notice it much more when the majority of query response time is attributed to memory waits)
  2. All these new IMO queries will be darn quick, regardless of (a) (though with different values for 'darn', depending on (a))

Starting from the bottom, the very fact that queries will be extremely fast means that less attention will be paid to the classic techniques of data design and optimiser plan optimisation. This is a major pitfall that will occur frequently, with the net effect of simply driving up CPU utilisation due to good ole-fashioned bad design. Unlike previous bad design mistakes, which might result in a huge amount of disk I/O and an uplift in CPU, 100% of these errors will result solely in increased CPU and therefore  licensing cost.

Now how about that issue of the memory bandwidth? Here's the kicker: When the processor is waiting for memory to feed it with new data, because of point (a), how does this get reported back to the system? Think about that for a moment...

OK, here's the answer: It is reported as busy CPU. Memory waits show up in your OS statistics as CPU time, and as a consequence into your AWR reports as "DB CPU". Eventually, as load increases on the box, all the processors will eventually end up reporting that they are 100% busy and performance on the box will decrease. Not rocket science. But what is the immediate reaction to this dilemma: Add more CPUs. There are two problems with this:

  1. It probably won't fix the problem, at least in the way we might intuitively expect
  2. It requires more Oracle Processor Licenses (Enterprise Edition, IMO, and any other options all need to be licensed on the additional CPUs)

The additional costs are obvious, but why won't it work? It won't work because it isn't a CPU problem, it's a memory problem that is reported as a CPU problem. Actually, by lucky architectural coincidence, adding CPU probably will make a difference (possibly a negative one) by implicit virtue of all system architectures being NUMA these days: In a NUMA architecture, memory is physically affined to one CPU domain (aka NUMA node). In current reality, a CPU domain is analogous to a CPU socket, but it doesn't need to be. So, adding a CPU adds a new NUMA node, and this node must also be populated with memory. This implicitly adds memory channels to the system, and thus memory bandwidth.

The whole interaction of NUMA with IMO is a subject for another day – I haven't really been through the thought process of what that means yet, not have I looked at what Oracle have done in this area. QPI and Hypertransport have lots of point to point bandwidth (significantly, more bandwidth than current DDR3 memory), but it's not infinite and it is shared – memory placement must matter for IMO.

Back to the current subject: Adding CPU might help, but not because you added CPU. And adding CPUs to our system adds significant cost to your license: A single Intel® Xeon® E7-4890v2 has 15 cores, which is 8 Processor Licenses. Just EE and IMO will set you back $564,000 in Capex and $124,080 in annual support...

So what's the alternative? There are two currently:

  1. Ensure all memory channels are in use
  2. Maximise memory clock speed

The first one is easy: Make sure that the manner in which the DIMMs of your server are plugged into the sockets on your server enables all of the available memory channels. If you fully populate the DIMM sockets, you are already using all the memory channels - the danger arises when the DIMM sockets are not all populated. Choice of DIMM socket really matters to enable all the available memory channels. Memory channels directly equate to memory bandwidth.

The second one will probably take some getting used to for most IT shops. It is fairly common, in my experience, that memory is selected for systems based upon a sweet spot of DIMM pricing, in isolation of any other considerations. As with most things in IT, the fastest, most dense DIMMs with the most channels cost disproportionately more than just the memory capacity would imply.  Consider the following extract from the HP DL580 Gen8 Quickspecs:

Screen Shot 2014-07-25 at 10.18.05

The price differential between the bottom two DIMM modules is huge: $179.99 for the 16GB/Dual Rank DIMM module compared to $669.99 for the 32GB/Quad Rank DIMM module. However, only the more expensive module allows the full clock speed of 1333MHz when fully populated. If you fully populate this example server with those modules, you would pay $17,279.04 for the 16GB modules and a whopping $64,319.04 for the 32GB modules. This is why IT system builders often favour the less cutting-edge DIMM.

That pricing differential is chicken feed in the case of a heavy IMO system, because it's a deathmatch fight between memory speed and Oracle license costs.

The availability of this technology might/should drive changes in system architectures. Memory channels matter now even more than they ever did and, although it will increase the costs of hardware, having more of them is going to be worth every penny when offset against the Oracle license savings of doing so.

I'm hoping that my esteemed colleague, Jeff, will write a much more qualified post regarding memory speed and SIMD instructions at some point. Keep an eye on his blog!