AWS EMR (Elastic MapReduce) is a managed Hadoop framework.
It provides an easy, cost-effective, and highly scalable way to process large amounts of data.
It can be used for many workloads, such as indexing, log analysis, financial analysis, scientific simulation, and machine learning.
Cluster and Nodes
The centerpiece of EMR is the cluster.
A cluster is a collection of EC2 instances, also called nodes.
All nodes of an EMR cluster are launched in the same Availability Zone.
Each node has a role in the cluster.
Types of EMR Cluster Nodes
Master Node:- It’s the main boss, which manages the cluster by running software components and distributing tasks to the other nodes. The master node also monitors task status and the health of the cluster.
Core Node:- It’s a slave node which runs tasks and stores data in HDFS (Hadoop Distributed File System).
Task Node:- This is also a slave node, but it only runs tasks and doesn’t store any data. Task nodes are optional.
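To make the three roles concrete, here is a minimal sketch of how they could be described when creating a cluster through the Python SDK (boto3). The `InstanceRole` values `MASTER`, `CORE`, and `TASK` are the ones boto3's `run_job_flow` accepts inside `Instances["InstanceGroups"]`; the helper name `build_instance_groups`, the instance type, and the counts are assumptions made up for illustration.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Build an InstanceGroups list: one master, N core, optional task nodes."""
    groups = [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",      # manages the cluster
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",        # runs tasks and stores HDFS data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Task nodes are optional, so the group is added only when requested.
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",    # runs tasks only, no HDFS data
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=3)
print([g["InstanceRole"] for g in groups])
```

The resulting list would be passed as `Instances={"InstanceGroups": groups, ...}` in a real `run_job_flow` call, which also needs a name, a release label, and IAM roles that are not shown here.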
EMR has two types of clusters
1) Transient :- These clusters are shut down once their jobs are done. They are useful when you don’t need the cluster running all day long and can save money by shutting it down.
2) Persistent :- These clusters stay up, either because they must always be available to process a continuous stream of jobs or because you want the data to remain available on HDFS.
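In the Python SDK (boto3), the difference between the two comes down to one flag: `KeepJobFlowAliveWhenNoSteps` inside the `Instances` argument of `run_job_flow`. A rough sketch, where the helper name `lifecycle_settings` is an assumption for illustration:

```python
def lifecycle_settings(transient):
    """Return the Instances fragment that decides whether the cluster stays up.

    transient=True  -> the cluster terminates once its steps have finished
    transient=False -> the cluster keeps running and waits for more steps
    """
    return {"KeepJobFlowAliveWhenNoSteps": not transient}


print(lifecycle_settings(transient=True))
print(lifecycle_settings(transient=False))
```

You would merge this fragment into the rest of the `Instances` dictionary before launching the cluster.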
Different Cluster States
An EMR cluster goes through multiple states, as described below:-
STARTING – The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING – The cluster is in the process of shutting down.
TERMINATED – The cluster was shut down without error.
TERMINATED_WITH_ERRORS – The cluster was shut down with errors.
Below is the complete EMR cluster lifecycle.
(Image reference: AWS EMR Management Guide)
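The states listed above are what you see when you poll a cluster from a script. Below is a small sketch of such a polling loop. The function `wait_until_settled` and its parameters are hypothetical names; it takes any callable that returns the current state, so it can be tried without a real cluster. With boto3 that callable could be something like `lambda: client.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]`.

```python
import time

# State names taken from the list above.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
SETTLED_STATES = TERMINAL_STATES | {"WAITING"}


def wait_until_settled(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_state() until the cluster is WAITING or has terminated.

    get_state is any zero-argument callable returning a state string.
    Returns the last state seen; raises TimeoutError after max_polls polls.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in SETTLED_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("cluster did not settle in time")


# Demo with a canned sequence of states instead of a real cluster.
_states = iter(["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"])
print(wait_until_settled(lambda: next(_states), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is only there to keep the demo fast; in real use you would leave the default.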
Types of filesystem in EMR
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster.
EMR File System (EMRFS)
Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.
Local File System
The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
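In practice you pick between these file systems through the URI scheme of the path you hand to a job. The sketch below maps a path to the storage layer described above. The function name `storage_layer` and the sample paths are made up for illustration, and it assumes the common convention of `s3://` for EMRFS, `hdfs://` for HDFS, and `file://` for the local disk.

```python
from urllib.parse import urlparse


def storage_layer(uri):
    """Name the EMR storage layer a path points to, based on its URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "EMRFS (Amazon S3)"
    if scheme == "hdfs":
        return "HDFS"
    if scheme == "file":
        return "local file system"
    if scheme == "":
        # No scheme: Hadoop resolves the path against the cluster default.
        return "cluster default file system"
    raise ValueError(f"unrecognised scheme: {scheme!r}")


for path in ("s3://my-bucket/input/", "hdfs:///user/hadoop/tmp", "file:///mnt/scratch"):
    print(path, "->", storage_layer(path))
```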
Programming languages supported by EMR
EMR Security
EMR integrates with IAM to manage permissions.
EMR uses separate master and slave security groups to control network traffic to the nodes.
EMR supports S3 server-side and client-side encryption with EMRFS.
You can launch EMR clusters in your VPC to make them more secure.
EMR integrates with CloudTrail, so you have a log of all activities performed on the cluster.
You can log in to EMR cluster nodes via SSH using EC2 key pairs.
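As an illustration of the encryption point, here is a sketch of an EMR security configuration that turns on S3 server-side encryption (SSE-S3) for EMRFS data. The key names follow the security configuration JSON as I understand it from the EMR documentation, so treat them as an assumption and check them against the current guide before using them. The helper name `build_s3_encryption_config` is made up.

```python
import json


def build_s3_encryption_config():
    """Security configuration enabling SSE-S3 at-rest encryption for EMRFS data."""
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                # Assumed key names; only server-side SSE-S3 is shown here.
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"}
            },
        }
    }


config_json = json.dumps(build_s3_encryption_config(), indent=2)
print(config_json)
```

With boto3, a JSON string like this would be passed as the `SecurityConfiguration` argument of `create_security_configuration`, together with a `Name` of your choice.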
EMR Interfaces
AWS CLI :- The command line gives you a rich way of controlling EMR. Refer to the EMR CLI reference.
Software Development Kits (SDKs) :- SDKs provide functions that call Amazon EMR to create and manage clusters. They are currently available only for the supported languages mentioned above. You can check the sample code and libraries for examples.
Web Service API :- You can use this interface to call the web service directly using JSON. You can get more information from the API Reference Guide.
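As a taste of the SDK route, here is a minimal sketch that lists active clusters with the Python SDK (boto3). `list_clusters` and its `ClusterStates` filter are real boto3 EMR client calls; the wrapper name `active_clusters` is an assumption. The client is passed in as an argument so the function can be exercised with a stand-in object, and for brevity it reads only the first page of results.

```python
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def active_clusters(emr_client):
    """Return (id, name, state) for clusters that are not shut down.

    emr_client is expected to behave like boto3.client("emr").
    Only the first page of results is read (no Marker pagination).
    """
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    return [
        (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for cluster in response.get("Clusters", [])
    ]


# Real usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   client = boto3.client("emr", region_name="us-east-1")
#   for cluster_id, name, state in active_clusters(client):
#       print(cluster_id, name, state)
```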
EMR Pricing
You pay for the EC2 instances used in the cluster plus a separate EMR charge.
Charges are based on how long each instance runs, with rates quoted per instance-hour.
EMR supports On-Demand, Spot, and Reserved Instances.
As a cost-saving measure, it is recommended that task nodes be Spot Instances.
It’s not a good idea to use Spot Instances for master or core nodes: core nodes store HDFS data that is lost once the node is terminated, and losing the master node brings down the whole cluster.
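The pricing rules above boil down to simple arithmetic: instances x hours x (EC2 rate + EMR rate). Here is a rough sketch comparing On-Demand and Spot task nodes. The hourly rates are hypothetical placeholders, not real AWS prices, and the function names are made up for illustration. `SPOT` and `ON_DEMAND` are the `Market` values boto3 accepts for an instance group.

```python
def market_for_role(role):
    """Use Spot only for task nodes; master and core nodes stay On-Demand."""
    return "SPOT" if role == "TASK" else "ON_DEMAND"


def estimate_cost(instance_count, hours, ec2_rate, emr_rate):
    """Cost = instances x hours x (EC2 hourly rate + EMR hourly rate)."""
    return instance_count * hours * (ec2_rate + emr_rate)


# Hypothetical hourly rates in USD, NOT real AWS prices.
EMR_RATE = 0.05
ON_DEMAND_RATE = 0.20
SPOT_RATE = 0.06

on_demand = estimate_cost(3, 10, ON_DEMAND_RATE, EMR_RATE)
spot = estimate_cost(3, 10, SPOT_RATE, EMR_RATE)
print(f"3 task nodes for 10 hours: on-demand ${on_demand:.2f}, spot ${spot:.2f}")
```

Note that the EMR charge applies on top of the EC2 price in both cases; Spot only lowers the EC2 part.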
If you want to try some EMR hands-on, refer to this tutorial.
This AWS Crash Course series is created to give you a quick snapshot of AWS technologies. You can read about other AWS services in this series here.