The Elephant in the Room
Quy Ta 23 Mar 2015
This blog explores a hybrid computing environment that integrates Lustre®, a high-performance parallel file system, with
Hadoop®, a framework for storing and processing big data in a distributed environment. We will look at the reasons for
and benefits of such a hybrid approach, and provide a foundation for implementing the solution quickly and easily using
Bright Cluster Manager® (BCM) to deploy and configure the hybrid cluster.
First, let’s establish some definitions and technologies for our discussion. Hadoop is a software framework for distributed storage
and processing of typically very large data sets on compute clusters. The Lustre file system is a parallel distributed file system
that is often the choice for large-scale computing clusters. In the context of this blog, we define a hybrid cluster as a
traditional HPC cluster integrated with a Hadoop computing environment capable of processing MapReduce jobs on the
Lustre file system. The hybrid solution that we will use as an example in this blog was jointly developed and consists of
components from Dell, Intel, Cloudera and Bright Computing.
Why would you want to use the Lustre file system with Hadoop? Why not just use the native Hadoop file system, HDFS?
Scientists and researchers have been looking for ways to use both Lustre and Hadoop within a shared HPC infrastructure.
This hybrid approach allows them to use Lustre as the file system for both their Hadoop analytics work and their
general HPC workloads. They can also avoid standing up two different clusters (HPC and Hadoop), and the associated
resources required, by re-provisioning existing HPC cluster resources as a small to medium-sized
self-contained Hadoop cluster. This solution would typically target HPC users who need to run Hadoop-specific
jobs periodically.
A key component in connecting the Hadoop and Lustre ecosystems is the Intel Hadoop Adapter for Lustre plug-in, or Intel HAL
for short. Intel HAL is bundled with the Intel Enterprise Edition for Lustre software and allows users to run MapReduce jobs
directly on a Lustre file system. The immediate benefit is that Lustre delivers fast, stable and easily managed storage
directly to MapReduce applications. A potential long-term benefit of using Lustre as the underlying Hadoop storage is
more usable capacity from the same raw storage, because HDFS by default keeps three replicas of every block, along with
the performance benefits of running Lustre over InfiniBand. The following architectural diagram illustrates a typical
topology for the hybrid solution.
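Before turning to the topology, the capacity argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the 300 TB raw figure is hypothetical, and the RAID 6 (8+2) layout for the Lustre storage targets is an assumption made for the example, not a detail of the solution described here. The only figure taken from the discussion above is the three-way replication used by HDFS.

```python
def usable_capacity_tb(raw_tb: float, overhead_factor: float) -> float:
    """Return usable capacity in TB.

    overhead_factor is the number of raw bytes consumed per byte of user data
    (3.0 for three-way replication, 1.25 for an 8+2 RAID 6 layout).
    """
    if overhead_factor < 1:
        raise ValueError("overhead_factor must be at least 1")
    return raw_tb / overhead_factor


RAW_TB = 300.0                    # hypothetical raw disk capacity
HDFS_REPLICATION = 3              # three replicas of every block
RAID6_DATA, RAID6_PARITY = 8, 2   # ASSUMPTION: Lustre targets on RAID 6 (8+2)

hdfs_usable = usable_capacity_tb(RAW_TB, HDFS_REPLICATION)
lustre_usable = usable_capacity_tb(
    RAW_TB, (RAID6_DATA + RAID6_PARITY) / RAID6_DATA
)

print(f"HDFS usable:   {hdfs_usable:.0f} TB")
print(f"Lustre usable: {lustre_usable:.0f} TB")
```

Under these assumptions the same raw disks yield roughly 100 TB of usable space with HDFS and roughly 240 TB with Lustre. The real ratio depends on the replication factor and the RAID layout actually deployed.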
