Full-stack Philosophies

James Morle's Blog


Oracle’s In-Memory Database: The True Cost Of Licensing

Posted on 11:00 am July 25, 2014 by James Morle

At last, the long-awaited release of Oracle Database 12.1.0.2 has arrived, including the highly anticipated In-Memory Database Option. I have had the pleasure of being involved in the early beta stages of this feature and I have to say: I'm impressed. This is truly game-changing technology, and it has been extremely well designed and implemented by Oracle. But this post isn't about that; it's about the implications of this technology once it gets used in anger.

I can summarise this in one sentence for those with little time: it's going to be expensive, very expensive. I encourage those with more time to read on; the detail is kind of important.

Before I continue, a small disclaimer: I have no problem at all with anybody, and I mean anybody, making huge amounts of money in a free market economy. Everybody, and I mean everybody, that makes massive stacks of cash in a free market fully deserves every penny of it. Nobody is forced to give them that money, it is voluntarily given up in exchange for goods or services. Even Justin Bieber deserves that Lamborghini that he was caught racing (although I believe that was a rental). Oracle will make truckloads of cash from the In-Memory Database Option, and fair play to them for that - it's good enough that users will pay for it. This blog post isn't a gripe about money; it's a technical evaluation of the licensing impact of using this option, including some non-obvious aspects which make the costs scale in ways that will not be adequately understood when first licensing this option.

OK, first a very, very quick background into what the In-Memory Option (IMO from now on) does, and what the price list says about it.

IMO allows the user to specify that a table should be fully cached in an In-Memory Column Store. This store is an additional cache which co-exists with the good old-fashioned buffer cache, with which it is kept consistent. The contents of the store, as the name suggests, are a columnar representation of all of the data in the nominated table(s). Without going into too much detail about this (Google "columnar format"), the data is stored so that any given block stores the rows for that block grouped by column, instead of grouped by row. So, if a table has columns "EMPID", "NAME" and "DEPTNO", all the "EMPID" column values are stored first, then all the "NAME" values, then all the "DEPTNO" values. This storage format makes filtering much more efficient due to the spatial locality of reference.
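To make that layout concrete, here is a minimal sketch in C (my own illustration of the general row-versus-column idea, not Oracle's actual implementation) contrasting the familiar row format with the columnar format of the store:

```c
#include <stdint.h>

#define NROWS 1024

/* Row format (as in the buffer cache): each row's columns sit together,
 * so a scan of DEPTNO drags EMPID and NAME through the cache as well. */
struct emp_row {
    int32_t empid;
    char    name[16];
    int32_t deptno;
};
struct emp_row row_store[NROWS];

/* Column format (as in the In-Memory Column Store): each column's values
 * sit together, so a scan of DEPTNO touches one dense array. */
struct emp_columns {
    int32_t empid[NROWS];
    char    name[NROWS][16];
    int32_t deptno[NROWS];
};
struct emp_columns column_store;

/* Filtering on DEPTNO now walks a contiguous run of int32 values: every
 * cache line fetched from memory is packed with useful data. */
int count_dept(const struct emp_columns *cs, int32_t dept)
{
    int matches = 0;
    for (int i = 0; i < NROWS; i++)
        if (cs->deptno[i] == dept)
            matches++;
    return matches;
}
```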

In addition to being pivoted into columnar format, the column store also has an implementation of region-based indexing, similar to Exadata's storage indexes, which allows Oracle to quickly skip whole regions of in-memory data that do not contain the requested values. On top of all this, Oracle has also implemented in-memory data compression, which reduces the total amount of data that Oracle must process to find matching keys.  All of these attributes – columnar format, storage indexes, and compression – combine to reduce the amount of data to scan, whilst leaving the data in memory as a tightly packed, spatially convenient format.
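The region-skipping idea is simple enough to sketch (again, a simplification of my own, not Oracle's code): keep a min/max summary for each fixed-size region of the column, and don't even touch the regions that cannot contain the key:

```c
#include <stdint.h>
#include <stddef.h>

#define REGION_ROWS 512  /* hypothetical region size, for illustration only */

struct region_summary {
    int32_t min;  /* lowest column value present in the region */
    int32_t max;  /* highest column value present in the region */
};

/* Scan only the regions whose [min,max] range could contain 'key'. */
long count_matches(const int32_t *col, size_t nrows,
                   const struct region_summary *summaries, int32_t key)
{
    long matches = 0;
    for (size_t r = 0; r * REGION_ROWS < nrows; r++) {
        if (key < summaries[r].min || key > summaries[r].max)
            continue;  /* whole region skipped: none of its memory is read */
        size_t end = (r + 1) * REGION_ROWS;
        if (end > nrows)
            end = nrows;
        for (size_t i = r * REGION_ROWS; i < end; i++)
            if (col[i] == key)
                matches++;
    }
    return matches;
}
```

For well-clustered data, most regions are pruned outright, and that pruning is where much of the scan saving comes from.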

The actual scanning of that spatially convenient data is carried out using SIMD instructions. I don't currently know exactly which SIMD instructions are used, as there are numerous instructions that perform essentially the same function but in subtly different ways, notably in the way they interact with the processor cache. I suspect that MOVNTDQA (bulk loading from memory to the CPU core) and probably the PCMP* series of instructions (parallel data comparison) are used for Intel architectures, but that's just pure conjecture at this stage. The SIMD instructions all have the same purpose: to achieve multiple operations using a single processor instruction. Using this approach, the CPI (cycles per instruction) of the processor is minimised, thus improving throughput. More goodness to add to the IMO pot.
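I can't see which instructions Oracle actually emits, but to make the idea concrete, here is a sketch using the Intel intrinsics that correspond to my guesses above (_mm_stream_load_si128 wraps MOVNTDQA, _mm_cmpeq_epi32 wraps PCMPEQD). Treat it as illustrative only: the function and its assumptions (16-byte-aligned column data, a row count that is a multiple of four) are mine, not Oracle's.

```c
/* Compile with: gcc -O2 -msse4.1 */
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#include <stdint.h>
#include <stddef.h>

/* Count occurrences of 'key' in a 16-byte-aligned int32 column.
 * nrows is assumed to be a multiple of 4 to keep the sketch short. */
long simd_count(int32_t *col, size_t nrows, int32_t key)
{
    const __m128i needle = _mm_set1_epi32(key);  /* key in all 4 lanes */
    long matches = 0;
    for (size_t i = 0; i < nrows; i += 4) {
        /* Streaming load of 4 column values in one go (MOVNTDQA). */
        __m128i v = _mm_stream_load_si128((__m128i *)(col + i));
        /* Compare all 4 lanes against the key at once (PCMPEQD). */
        __m128i eq = _mm_cmpeq_epi32(v, needle);
        /* Collapse the 4 lane results into a 4-bit mask and count them
         * (__builtin_popcount is a GCC/Clang builtin). */
        matches += __builtin_popcount(_mm_movemask_ps(_mm_castsi128_ps(eq)));
    }
    return matches;
}
```

Four column values compared per instruction instead of one is exactly how the instruction count of a scan comes down and throughput goes up.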

So what does all this mean for licensing? Well, the price list says that this option is on a pricing par with Real Application Clusters at $23,000 per Processor License, which is already a big uplift, but there is more to it than that.

When using IMO, two things are clear:

  1. Memory bandwidth, which is always much slower than CPU bandwidth, will dominate system throughput (it always has dominated CPU throughput for Oracle databases, but you will notice it much more when the majority of query response time is attributed to memory waits)
  2. All these new IMO queries will be darn quick, regardless of (1) (though with different values for 'darn', depending on (1))

Starting from the bottom, the very fact that queries will be extremely fast means that less attention will be paid to the classic techniques of data design and optimiser plan optimisation. This is a major pitfall that will occur frequently, with the net effect of simply driving up CPU utilisation due to good ole-fashioned bad design. Unlike previous bad design mistakes, which might result in a huge amount of disk I/O and an uplift in CPU, 100% of these errors will result solely in increased CPU, and therefore licensing cost.

Now how about that issue of the memory bandwidth? Here's the kicker: when the processor is waiting for memory to feed it with new data, because of point (1), how does this get reported back to the system? Think about that for a moment...

OK, here's the answer: it is reported as busy CPU. Memory waits show up in your OS statistics as CPU time, and consequently in your AWR reports as "DB CPU". As load increases on the box, all the processors will eventually end up reporting that they are 100% busy, and performance on the box will decrease. Not rocket science. But what is the immediate reaction to this dilemma? Add more CPUs. There are two problems with this:

  1. It probably won't fix the problem, at least in the way we might intuitively expect
  2. It requires more Oracle Processor Licenses (Enterprise Edition, IMO, and any other options all need to be licensed on the additional CPUs)

The additional costs are obvious, but why won't it work? It won't work because it isn't a CPU problem, it's a memory problem that is reported as a CPU problem. Actually, by lucky architectural coincidence, adding CPU probably will make a difference (possibly a negative one) by implicit virtue of all system architectures being NUMA these days: In a NUMA architecture, memory is physically affined to one CPU domain (aka NUMA node). In current reality, a CPU domain is analogous to a CPU socket, but it doesn't need to be. So, adding a CPU adds a new NUMA node, and this node must also be populated with memory. This implicitly adds memory channels to the system, and thus memory bandwidth.
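If you want to convince yourself that memory stalls really do masquerade as CPU time, here is a tiny stand-alone demonstration (nothing Oracle-specific, just a deliberately memory-bound loop of my own devising):

```c
#include <stdio.h>
#include <stdlib.h>

#define N (64UL * 1024 * 1024)  /* 64M pointers = 512MB: far bigger than cache */

int main(void)
{
    size_t *next = malloc(N * sizeof(size_t));
    if (!next)
        return 1;

    /* Sattolo's algorithm: shuffle the slots into one big random cycle,
     * so every hop lands on an unpredictable cache line and the hardware
     * prefetcher cannot help. */
    for (size_t i = 0; i < N; i++)
        next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;  /* j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    /* Chase the chain: every load depends on the previous one, so the
     * core stalls on DRAM at each step - yet the OS accounts the whole
     * run as 100% busy CPU. */
    size_t p = 0;
    for (long iter = 0; iter < 100000000L; iter++)
        p = next[p];

    printf("%zu\n", p);  /* keep the loop from being optimised away */
    free(next);
    return 0;
}
```

While this runs, top shows the core pegged at 100%, even though nearly every cycle is a memory stall; perf stat gives the game away with an instructions-per-cycle figure far below 1.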

The whole interaction of NUMA with IMO is a subject for another day – I haven't really been through the thought process of what that means yet, nor have I looked at what Oracle have done in this area. QPI and HyperTransport have lots of point-to-point bandwidth (significantly more bandwidth than current DDR3 memory), but it's not infinite and it is shared – memory placement must matter for IMO.
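In the meantime, if you want to see how your own box is laid out, numactl --hardware will tell you, or a few lines of C against libnuma will do the same (a sketch; it assumes the libnuma development package is installed):

```c
/* Compile with: gcc -o numainfo numainfo.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int max = numa_max_node();
    printf("%d NUMA node(s)\n", max + 1);
    for (int node = 0; node <= max; node++) {
        long long free_bytes;
        long long total = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, total >> 20, free_bytes >> 20);
    }
    return 0;
}
```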

Back to the current subject: adding CPU might help, but not because you added CPU. And adding CPUs to the system adds significant cost to your license: a single Intel® Xeon® E7-4890 v2 has 15 cores, which is 8 Processor Licenses (15 cores × the 0.5 Intel core factor = 7.5, rounded up). Just EE and IMO will set you back $564,000 in Capex and $124,080 in annual support...
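For the record, here is the arithmetic behind those numbers as a snippet (the 0.5 Intel core factor, the $47,500 EE and $23,000 IMO per-processor list prices, and the 22% annual support rate are the published figures at the time of writing; check your own price list):

```c
/* Compile with: gcc -o imocost imocost.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cores       = 15;    /* one Xeon E7-4890 v2 */
    double core_factor = 0.5;   /* Oracle's core factor for Intel Xeon */
    double licenses    = ceil(cores * core_factor);  /* 7.5 rounds up to 8 */
    double per_proc    = 47500.0 + 23000.0;   /* EE + IMO list price, $ */
    double capex       = licenses * per_proc; /* 8 x $70,500 = $564,000 */
    double support     = capex * 0.22;        /* 22%/year    = $124,080 */

    printf("licenses: %.0f, capex: $%.0f, support: $%.0f/year\n",
           licenses, capex, support);
    return 0;
}
```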

So what's the alternative? There are two currently:

  1. Ensure all memory channels are in use
  2. Maximise memory clock speed

The first one is easy: make sure that the DIMMs are plugged into your server's sockets in a way that enables all of the available memory channels. If you fully populate the DIMM sockets, you are already using all the memory channels; the danger arises when the DIMM sockets are not all populated, because then the choice of socket determines which channels are active. Memory channels directly equate to memory bandwidth.

The second one will probably take some getting used to for most IT shops. It is fairly common, in my experience, that memory is selected for systems based upon a sweet spot of DIMM pricing, in isolation of any other considerations. As with most things in IT, the fastest, most dense DIMMs with the most channels cost disproportionately more than just the memory capacity would imply.  Consider the following extract from the HP DL580 Gen8 Quickspecs:

[Image: extract from the HP DL580 Gen8 QuickSpecs, listing the available DIMM options and their prices]

The price differential between the bottom two DIMM modules is huge: $179.99 for the 16GB/Dual Rank DIMM module compared to $669.99 for the 32GB/Quad Rank DIMM module. However, only the more expensive module allows the full clock speed of 1333MHz when fully populated. If you fully populate this example server's 96 DIMM sockets with those modules, you would pay $17,279.04 for the 16GB modules and a whopping $64,319.04 for the 32GB modules. This is why IT system builders often favour the less cutting-edge DIMM.

That pricing differential is chicken feed in the case of a heavy IMO system, because this is a deathmatch between memory speed and Oracle license costs.

The availability of this technology might, and should, drive changes in system architectures. Memory channels matter now even more than they ever did and, although having more of them will increase the cost of the hardware, it is going to be worth every penny when offset against the Oracle license savings.

I'm hoping that my esteemed colleague, Jeff, will write a much more qualified post regarding memory speed and SIMD instructions at some point. Keep an eye on his blog!


24 comments on “Oracle’s In-Memory Database: The True Cost Of Licensing”

  1. Nice post James. NUMA has only been mildly interesting to me due to the proliferation of 2 socket boxes scaled out via RAC. I agree with you that this will become a major area of interest and a place where Oracle (and SAP) will really have to focus. We've been talking to the SGI guys about their UV line for example. (I thought they had gone out of business in the 90's) They are claiming the upcoming 300 model will support 64TB. It's actually several blades connected via what they call "NUMAlink". Memory latencies are claimed to be no more than 500ns, no matter how far away it's located from the CPU, but that's still 4X slower than local memory. So it seems like it's all going to be about how effective the code (Oracle or Hana) is with affinitization. I'm looking forward to what Jeff has to say about the SIMD stuff. Tanel is looking into it as well. We should have access to one of the SGI racks in the next month or so to do some testing on by the way. :)

    • Hi Kerry, thanks for taking the time to read this longer-than-expected post!
      I'm familiar with NUMAlink from back in the SGI Origin days, I recall writing about that early incarnation in my book, in fact.
My guess is that latency will matter with NUMA transports, but not as much as it does for non-SIMD instructions. The very fact that the memory access is 'batched' makes it slightly less important. Bandwidth is going to be the killer, and I can't think of an obvious way for Oracle to optimise it for NUMA any more than simply striping the regions across all NUMA nodes. Any other approach will result in hotspots in one (or a few) NUMA domains. Closson needs to wade in on this one.
      Memory channels are going to really matter. Even with true point-to-point transport between nodes, the memory is going to get congested and...appear to be CPU bottlenecked.

  2. Since the topic is in-memory database, I think we should all expect that customers will look for the largest memory capacity they can get. Today's Xeons require socket proliferation to get DRAM capacity. That won't always be the case, as future Xeon memory densities--even at low socket counts--will be tremendous. What I hope does not happen is folks quickly jotting down some DRAM sizings and matching that up to a Xeon server without concern for EX versus EP. 4S EP boxes are 2-hop memory systems. If you want to watch CPI shoot through the roof, run elegant SIMD instructions (like Oracle's In-Memory Column Store code) against 2-hop DRAM.

    In summary, don't run Oracle 12.1.0.2 In-Memory Column Store on 4S EP.

  3. Hi, excellent article. I have to re-read it one more time.
    One quick question: if IMO means more CPU time is spent waiting for memory, can we say that we will take greater advantage of Hyper-Threading or SMT - which do not count for licensing?

    • Franck,
      There is an important distinction to make here, and apologies if it's not clear in my blog: I'm not saying IMO has more memory wait than any other query execution. In fact, it will have less than any other access path that is based upon scanning. My point is that a much higher proportion of the wait time will be attributed to that, as opposed to the waits we spend most of our time on today.

  4. Often the conversation limits itself to x86, but the platform that maximises Oracle DB - and, from the sound of it, IMO - will continue to be Power servers. Power8 is the latest offering, currently with the 1 & 2 socket servers being available (larger boxes coming soon). These 1S & 2S servers scale up to 24 cores with 1 TB of RAM, also expected to double. Each socket delivers 200 GB/s of memory bandwidth with all RAS features, so there is no compromise between RAS and performance mode like x86 must consider. This is almost a 4X increase in memory bandwidth per socket - 400 GB/s for a 2 socket server is flat out ridiculous. The bandwidth between the L1/L2/L3/L4 caches is equally ridiculous and impressive. With 8 threads per core, dual SIMD pipelines and servers running up to ~4.6 GHz, it meets all of the requirements to be the best platform to maximise IMO. It will be interesting to start seeing some test results to see how it compares to IBM's DB2 with BLU Acceleration, which has been available on AIX since early 2013. There is more memory coming, and Active Memory Expansion allows for a >2X increase in what the server sees as memory capacity (i.e. 1 TB of physical is seen as up to 2.25 TB using hardware accelerators). There are 2 memory controllers per socket with 4 channels per controller. Hopefully we will see more partnering between Oracle and IBM to help both companies drive innovation, excitement and revenues.

      • Hello James - I enjoy your blog - it seems anybody linked to Kevin is top shelf. I work for a Business Partner (IBM and other technologies), but my direct job is that I lead the Power practice for the Power platform on AIX, Linux and IBM i. I am an Enterprise Architect who is also fairly technical. I am Power centric and focused, and often IBM centric, but that is just as much to protect my install as for any other competitive reasons. With Cisco, Oracle, EMC, NetApp and others all now having significant investments in solutions that compete against Power, it sometimes seems hard to catch a break. By the way, I did work for IBM for 4 years, Sun for 10 years and in the Army for 8 years, where I was an instructor on HPC platforms in the mid 90's focusing on SunOS & Solaris.

        Power8 may seem like hyperbole because its capabilities are so high in comparison to x86 and SPARC, and I would say more if this wasn't a public blog. What part of it seems like hyperbole to you? Everything I said is the real capability. I would be happy to send you references if needed. If it isn't hyperbole, what do you think of the platform? It's incredible, isn't it? We now offer not only our venerable PowerVM but also PowerKVM as hypervisor options. Linux with Little or Big Endian support (today with Ubuntu, later this Q with SuSE, and RH has a SOD). IBM i just rocks on and on - it is the original integrated infrastructure solution. AIX (and Linux), for all of its ability to exploit the Power server, can't touch IBM i when it comes to how few cores are required and how few resources are required to manage it - it is as close to a microwave appliance as I know of in the commercial space - they just work, and forever.

        I'll add this - I am in sales, albeit a technical seller who supports roughly 28 sales reps for my company. I do often switch between sales and technical mode, but I never intend to misstate or insinuate something is possible when it isn't - credibility is more important than a PO, because that money will evaporate but your credibility is earned and easy to lose. My motto is to do everything with balanced scales (Proverbs 20:10) and so I sleep well at night. If anything you may see passion and enthusiasm. If you would like to know more, feel free to email me directly at bmurphy@thinksis.com.

        If you allow me to plug my inaugural blog at powertheenterprise.wordpress.com, I do a financial analysis of Oracle on x86 vs Oracle on Power, as well as DB2 on Power. The Oracle comparison itself is amazing and hard for most people in x86 land to believe, but that is why we have references. The DB2 comparison is interesting because it shows the challenge Oracle has with their à la carte pricing, which is further impacted by IMO. I need to update the blog to reflect that now, as I didn't have the data or enough understanding of it to include it. If we could set aside the Ford vs Chevy debates and look at the pure technology and finances of each, we all could learn a lot.

        Thank you

    • Mike, I wouldn't suggest any particular benchmark - or rather, I would suggest them all. Clearly this is a feature that is designed for DW work, but it is designed to sit comfortably with OLTP work at the same time. I would personally create composite workloads (possibly from existing/known benchmarks) that carefully provoke the various dimensions of the feature.

  5. Hi James,

    Excellent post as always, I think the most important points to notice in this post are (IMHO):

    1. The design techniques used can get relaxed due to the throughput provided (I've seen this on Exadata or T4-5).

    2. The main wait will definitely be hidden and presented as high DB CPU usage.

    3. Because of (1) and (2), an eventual refactor or fine-tune might be hard to do/find, so the "obvious solution" might be to add more CPU power to the servers; that might cause the real issue to be hidden again, and the cycle might start again. And, as you stated, the licensing costs will increase - but not only the licensing costs, also consultancy costs and probably many SRs, etc.

  6. This subtle difference between fully and partly populated NUMA systems was brought home to me with our IBM P6 a few years ago. We started with half the CPU population and the performance was so-so. Once we added a full house of CPU+memory cards, the speed of memory access nearly doubled! Most of the CPU-bound operations ran nearly twice as fast!
    I guess that was due to fewer trips to the backplane connectors, but also due to the memory channels now available. When we got the new P7+, the first thing I did was get on the "all slots populated" warhorse. And it paid off in spades!
    I'm glad I do remember the lessons I learned from your book!

  7. Nuno, you are right. That has historically been the case with Power servers, at least back to Power5. The more slots you filled, the better the performance. It varies between the entry and enterprise servers. Each has (had) a different number of memory controllers (MC) and memory channels (MCH) per MC. Also, the number of DIMMs per MCH impacts performance and capacity. Power tends not to stack more than 2 DIMMs per MCH - usually 1 DIMM for enterprise and 2 for entry. Rarely fewer than 2 MCH per MC, but usually 4. As you populate the DIMMs, there are definite rules, just like my days at Sun on SPARC servers: balance memory across all slots; use the same size, speed & type of DIMM; put a DIMM on all sockets; usually place in pairs. Fill all slots to get the maximum performance and bandwidth. As you decrease from this you decrease both performance and bandwidth, plus the obvious capacity. The iSeries guys do a great job of actual "computer science" by measuring latency and optimal performance with certain DIMM sizes, etc.

    • >The iSeries guys do a great job of actual "computer science" by measuring latency and optimal performance with certain DIMM sizes, etc.

      ...that's because, um, well, Hal Kossman! That's why the S7A was an earth-shattering system for its era.

