Full-stack Philosophies

Jeffrey Needham's Blog


More Than Just a Lake

Posted on 5:07 pm June 3, 2014 by Jeffrey Needham

Data lakes, like legacy storage arrays, are passive. They only hold data. Hadoop is an active reservoir, not a passive data lake. HDFS is a computational file system that can digest, filter and analyze any data in the reservoir, not just store it.

HDFS is the only economically sustainable computational file system in existence. Some file systems, like NFS, share. Some, like Ceph, scale. But most file systems require the user to supply the processing capabilities, such as a database or a pile of scripts.

Hadoop comes with a scheduling capability that enables several diverse forms of processing across the entire file system. Sometimes that processing is for analytics, sometimes for transcoding, and sometimes it enables databases and their SQL queries.
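To make the idea of scheduled processing across the file system a little more concrete, here is a toy sketch in Python. It is not Hadoop code and uses no Hadoop API. The node names, the log lines and the functions `count_levels` and `run_job` are all made up for illustration, and a thread pool stands in for the cluster scheduler. The point is only the shape of the work: one small task per block of data, run side by side, with the partial results merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for file blocks spread across cluster nodes.
BLOCKS = {
    "node-1": ["error disk full", "info job started"],
    "node-2": ["error timeout", "info job finished"],
    "node-3": ["warn slow read", "error disk full"],
}


def count_levels(lines):
    """Per-block task: summarize one block by counting log levels."""
    return Counter(line.split()[0] for line in lines)


def run_job(blocks):
    """Schedule one task per block, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        partials = list(pool.map(count_levels, blocks.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    print(dict(run_job(BLOCKS)))
```

Swap `count_levels` for a filter, a transcoder or a query operator and the scheduling pattern stays the same, which is roughly the flexibility described above.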

Hadoop comes with processing built in, such as Pig (a pile of scripts) and Hive (a simple SQL database), both of which are scheduled in parallel across the cluster. HDFS is not a general-purpose, transactional file system like NFS, but it is a flexible, purpose-built, hyper-scalable analytics reservoir. This reservoir must make it possible for traditional database products to directly access HDFS and still provide a canal for enterprises to channel their old data sources into the new reservoir, allowing old and new data to coexist and intermingle.

HDFS contains a feature called name node federation that, over time, could be used to create a reservoir of reservoirs, making it possible to create planetary file systems that can act locally but think globally.
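One way to picture a reservoir of reservoirs is a mount table: each name node owns one slice of the namespace, and a client looks up which name node to ask based on the path. The Python sketch below is only an illustration under that assumption. The table entries, the name node names and the `resolve` function are hypothetical, and nothing here calls a real HDFS API.

```python
# Hypothetical mount table: path prefix -> name node that owns that namespace.
MOUNT_TABLE = {
    "/logs": "namenode-a",
    "/warehouse": "namenode-b",
    "/sensors": "namenode-c",
}


def resolve(path, table=MOUNT_TABLE):
    """Return the name node responsible for a path, matching whole components."""
    for prefix, namenode in table.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise KeyError(f"no name node mounted for {path}")


if __name__ == "__main__":
    print(resolve("/logs/2014/06/03"))
    print(resolve("/warehouse/sales"))
```

Each name node stays local to its own slice, while the table gives clients a single, global view of the paths.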

 

