Full-stack Philosophies

Jeffrey Needham's Blog

RSS Feed

Hadoop as a Cloud

Posted on 5:08 pm June 3, 2014 by Jeffrey Needham

The idea of clouds “meeting" big data or big data "living in" clouds isn’t simply marketing hype. Because big data followed so closely on the trend of cloud computing, both customers and vendors still struggle to understand the differences from their enterprise-centric perspectives.

Everyone assumes that Hadoop can work in conventional clouds as easily as it works on bare iron, but that’s only true if they’re willing to ignore the high cost and complexity to manage and triage Hadoop and its inability to scale when imposed in conventional clouds.

Conventional cloud infrastructure is neither architecturally nor economically optimal for running supercomputers like Hadoop. And Hadoop is already a proven, self-contained, highly resilient cloud in its own right.

On the surface there are physical similarities in the two technologies – racks of conventional cloud servers and racks of Hadoop servers are constructed from the same physical components. But Hadoop transforms those servers into a single supercomputer, whereas conventional clouds consist of applications such as mailboxes and Windows desktops and web servers because single instances of those applications no longer saturate commodity servers.

Because conventional clouds are designed for a class of applications that are not throughput-intensive, they consume few resources. These “do-little” clouds exploit the resource profile of many front-end applications such as ecommerce and web mail. As generations of hardware servers evolved to be more powerful, it became possible to stack several server instances of web mail onto a single physical chassis.

Since conventional clouds are designed to consolidate instances of applications running on very idle servers, they deploy and manage applications at an OS-image level. These clouds support several variants of Windows and Linux and can restart the OS image (and sometimes the application itself) in the event of a failure. But when a Hadoop cluster is imposed into a conventional cloud, many platform assertions quickly fail.

If a Hadoop cluster is deployed in a conventional cloud, then what appears to be a 100-node cluster might only be provisioned on 20 physical machines, and so it has the actual throughput of a 20-node cluster. A 100-node Hadoop cluster is not a collection of 100 standalone instances that can be relocated and restarted; it is one unified 100-node data processing instance.

The other assertion that wreaks havoc with running Hadoop in a conventional cloud is that data is not elastic. When a Windows image fails, the private data that needs to be relocated is relatively small so it can be moved quickly. But a Hadoop data processing node typically reads hundreds of terabytes of private data during the course of a day. Provisioning or relocating a Hadoop data processing node involves terabytes of inelastic data.

Whatever cloud you pick, data is always inelastic and those terabytes always move at the speed of infrastructure bottlenecks. Even Hadoop can't break the laws of physics, although HDFS does a better job of provisioning and relocating data across its own cloud because it is optimized for a different set of platform assertions.

Conventional clouds rely heavily on shared storage (such as the S3 file system on AWS or traditional NFS pods) which is not required to be high-performance; it’s more important that shared storage accommodates provisioning and relocation of modest amounts of private data than throughput and scalability. I/O is not required for “do-little” clouds.

However, if Hadoop data processing nodes each expect to draw their private data throughput from shared storage, then many more shared storage pods for each Hadoop processing node will be required to meet the demand and then Hadoop strains the economics of the conventional cloud infrastructure. Hadoop is far less seamless and more resource invasive if it’s forced into a conventional cloud infrastructure.

Conventional clouds are just one way to provision and schedule computing resources for a specific, mostly low-throughput category of applications, but even conventional clouds are deployed on bare iron. This type of cloud converts thousands of servers into a flexible pool of computing resources that allow a broad, mostly heterogeneous collection of Windows and Linux servers to operate more efficiently. By this definition, conventional clouds are heterogeneous, general-purpose clouds.

Hadoop, however, when deployed across the same bare-iron servers, is a homogeneous, purpose-built, supercomputing analytics cloud. When viewed as an alternative deployment model of a cloud, Hadoop is a superior topology for an expanding family of Apache analytics applications.

A Hadoop analytics cloud differs in how it commissions and de-commissions individual data processing nodes. It cannot provision a copy of Windows or Linux, but it can provision, relocate and reschedule any jobs within the expanding ecosystem of Apache technologies such as HBASE, HIVE, Accumulo, Stinger, MapReduce, Storm, Impala, Spark, Shark, etc.

Groups who are compelled to run Hadoop in conventional clouds must understand how the assertions differ between heterogeneous conventional clouds and homogeneous Hadoop analytics clouds and discover what types of analytics applications can work well within the constraints of a conventional cloud. The art to running any type of cloud on any hardware infrastructure is to make sure platform and workload assertions hold, or if they don't, understand the implications of the trade-offs.

blog posts 2014.05.30 - 2 clouds


Leave a Reply