AWS Crash Course – EMR

What is EMR?

  • AWS EMR (Elastic MapReduce) is a managed Hadoop framework.
  • It provides an easy, cost-effective, and highly scalable way to process large amounts of data.
  • It can be used for many purposes, such as indexing, log analysis, financial analysis, scientific simulation, machine learning, etc.

Cluster and Nodes

  • The centerpiece of EMR is the cluster.
  • A cluster is a collection of EC2 instances, also called nodes.
  • All nodes of an EMR cluster are launched in the same Availability Zone.
  • Each node has a role in the cluster.

Types of EMR Cluster Nodes

Master Node:- The master node manages the cluster by running software components and distributing tasks to the other nodes. It also monitors task status and the overall health of the cluster.

Core Node:- A slave node that both runs tasks and stores data in HDFS (Hadoop Distributed File System).

Task Node:- Also a slave node, but it only runs tasks and doesn't store any data. Task nodes are optional.
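
To make the roles concrete, here is a minimal boto3 (Python SDK) sketch of launching a cluster with all three node types. The cluster name, release label, instance types, and counts are hypothetical choices, not prescriptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a cluster with all three node roles; names, instance types,
# and counts here are hypothetical choices.
emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2},                    # run tasks + store HDFS data
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},  # run tasks only, no HDFS
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # see "Cluster Types" below
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # the default EMR roles
    ServiceRole="EMR_DefaultRole",
)
```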

Cluster Types

EMR has two types of clusters:

1) Transient :- Clusters that are shut down once their jobs are done. They are useful when you don't need the cluster running all day and can save money by shutting it down.

2) Persistent :- Clusters that need to be always available, either to process a continuous stream of jobs or because you want the data to remain available on HDFS.
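
In boto3 terms, a transient cluster is one whose `KeepJobFlowAliveWhenNoSteps` flag is set to `False` in the `run_job_flow` call sketched above, so it terminates itself once its steps finish. A persistent cluster stays up until you shut it down yourself; the cluster ID below is a hypothetical placeholder:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Persistent clusters keep running (and billing) until explicitly terminated.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])  # hypothetical cluster id
```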

Different Cluster States

An EMR cluster goes through multiple states, as described below:-

STARTING – The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING – The cluster is in the process of shutting down.
TERMINATED – The cluster was shut down without error.
TERMINATED_WITH_ERRORS – The cluster was shut down with errors.
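
You can inspect these states programmatically. A small boto3 sketch, with a hypothetical cluster ID:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"  # hypothetical cluster id

# Read the current state: STARTING, BOOTSTRAPPING, RUNNING, WAITING, ...
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print(state)

# Or block until the cluster is up (RUNNING/WAITING) using a built-in waiter.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
```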

Below is the complete EMR cluster lifecycle.

(Image reference: AWS EMR Management Guide)

Types of file systems in EMR

Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster.

EMR File System (EMRFS)
Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.

Local File System
The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
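
In practice, the file systems show up as different URI schemes in your job arguments. Below is a hedged sketch of submitting a Hadoop streaming step that reads input from S3 through EMRFS and writes intermediate output to HDFS; the bucket, paths, and cluster ID are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-input", "s3://my-input-bucket/logs/",  # EMRFS (S3)
                "-output", "hdfs:///tmp/wordcount-out",  # HDFS (ephemeral)
                "-mapper", "/bin/cat",
                "-reducer", "/usr/bin/wc",
            ],
        },
    }],
)
```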

Programming languages supported by EMR

  • Perl
  • Python
  • Ruby
  • C++
  • PHP
  • R

EMR Security

  • EMR integrates with IAM to manage permissions.
  • EMR has master and slave security groups for nodes to control traffic access.
  • EMR supports S3 server-side and client-side encryption with EMRFS.
  • You can launch EMR clusters in your VPC to make them more secure.
  • EMR integrates with CloudTrail, so you will have a log of all activities performed on the cluster.
  • You can log in to EMR cluster nodes via SSH using EC2 key pairs.

EMR Management Interfaces

  • Console :- You can manage your EMR clusters from the AWS EMR Console.
  • AWS CLI :- The command line provides a rich way of controlling EMR. Refer to the EMR CLI reference here.
  • Software Development Kits (SDKs) :- SDKs provide functions that call Amazon EMR to create and manage clusters. They are currently available only for the supported languages mentioned above. You can check some sample code and libraries here.
  • Web Service API :- You can use this interface to call the Web Service directly using JSON. You can get more information from the API Reference Guide.

EMR Billing

  • You pay for the EC2 instances used in the cluster, plus an EMR charge.
  • You are charged per instance-hour.
  • EMR supports On-Demand, Spot, and Reserved Instances.
  • As a cost-saving measure, it is recommended to run task nodes on Spot instances.
  • It's not a good idea to use Spot instances for the master or core nodes, as they store data; you would lose that data once the node is terminated.

If you want to try some EMR hands-on, refer to this tutorial.

This AWS Crash Course series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

AWS Crash Course – Redshift

Redshift is a data warehouse from Amazon. It’s like a virtual place where you store a huge amount of data.

  • Redshift is a fully managed, petabyte-scale system.
  • Amazon Redshift is based on PostgreSQL 8.0.2.
  • It is optimized for data warehousing.
  • It supports integrations and connections with various applications, including Business Intelligence tools.
  • Redshift provides custom JDBC and ODBC drivers.
  • Redshift can be integrated with CloudTrail for auditing purposes.
  • You can monitor Redshift performance from CloudWatch.

Features of Amazon Redshift

Supports VPC − Users can launch Redshift within a VPC and control access to the cluster through the virtual networking environment.

Encryption − Data stored in Redshift can be encrypted; this is configured while creating tables in Redshift.

SSL − SSL encryption is used to encrypt connections between clients and Redshift.

Scalable − With a few simple clicks, you can choose vertical scaling (increasing instance size) or horizontal scaling (increasing the number of compute nodes).

Cost-effective − Amazon Redshift is a cost-effective alternative to traditional data warehousing practices: there are no up-front costs, no long-term commitments, and pricing is on-demand.

MPP (Massively Parallel Processing) – Massively parallel refers to the use of a large number of processors (or separate computers) to perform coordinated computations in parallel. Redshift leverages this to reduce computing time and improve query performance.

Columnar Storage – Redshift uses columnar storage, storing data tables by column rather than by row. The goal of a columnar database is to efficiently write and read data to and from disk storage, speeding up the time it takes to return a query.

Advanced Compression – Compression is a column-level operation that reduces the size of data when it is stored, thus helping save space.

Types of nodes in Redshift

  • Leader Node
  • Compute Node

What do these nodes do?

Leader Node:- The leader node receives queries from client applications, parses the queries, and develops execution plans, which are an ordered set of steps to process those queries. It then coordinates the parallel execution of these plans with the compute nodes. The good part is that you are not charged for leader node hours; only compute nodes incur charges. If you run a single-node Redshift cluster, there is no separate leader node.

Compute Node:- Compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve queries. The intermediate results are sent back to the leader node for aggregation before being returned to the client applications. You can have 1 to 128 compute nodes.
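
As a concrete illustration, here is a minimal boto3 sketch of creating a two-node cluster; the identifier, node type, and credentials are hypothetical placeholders (in practice you would pull the password from a secrets store):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="demo-warehouse",   # hypothetical name
    ClusterType="multi-node",             # "single-node" has no separate leader
    NodeType="dc2.large",
    NumberOfNodes=2,                      # compute nodes; 1 to 128 allowed
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",    # hypothetical; don't hardcode secrets
)
```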

From which sources can you load data into Redshift?

You can load data from multiple sources, like:-

  • Amazon S3
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Data Pipeline
  • Any SSH-enabled host on Amazon EC2 or on-premises
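
The usual S3 route uses Redshift's COPY command over a regular PostgreSQL connection. A hedged sketch with psycopg2; the endpoint, table, bucket, and IAM role are all hypothetical placeholders:

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol on port 5439.
conn = psycopg2.connect(
    host="demo-warehouse.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439, dbname="dev", user="admin", password="ChangeMe123!",
)
with conn, conn.cursor() as cur:
    # Bulk-load CSV files from S3 into the (hypothetical) sales table.
    cur.execute("""
        COPY sales
        FROM 's3://my-input-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV;
    """)
```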

Redshift Backups and Restores

  • Redshift can take automatic snapshots of the cluster.
  • You can also take manual snapshots of the cluster.
  • Redshift continuously backs up its data to S3.
  • Redshift attempts to keep at least three copies of your data.
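
Taking a manual snapshot is a one-liner with boto3 (both identifiers are hypothetical):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Manual snapshots persist until you delete them, unlike automated ones.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="demo-warehouse-2024-01-01",  # hypothetical name
    ClusterIdentifier="demo-warehouse",
)
```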

Hope the above snapshot gives you a decent understanding of Redshift. If you want to try some hands-on, check this tutorial.

This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

AWS Crash Course – SQS

Today we will discuss an AWS messaging service called SQS.

  • SQS is the Simple Queue Service.
  • It's a message queue service which acts as a buffer between message-producing and message-receiving components.
  • Using SQS you can decouple the components of an application.
  • Messages can contain up to 256 KB of text in any format.
  • Any component can later retrieve the messages programmatically using the SQS API.
  • SQS queues are dynamically created and scale automatically, so you can build and grow applications quickly and efficiently.
  • You can combine SQS with Auto Scaling of EC2 instances, scaling out and in as the queue grows and shrinks.
  • It is used by companies like Vodafone, BMW, RedBus, and Netflix.
  • You can use Amazon SQS to exchange sensitive data between applications, using server-side encryption (SSE) to encrypt each message body.
  • SQS is a pull (or poll) based system, so messages are pulled from SQS queues.
  • Multiple copies of every message are stored redundantly across multiple Availability Zones.
  • Amazon SQS is deeply integrated with other AWS services such as EC2, ECS, RDS, Lambda, etc.
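
Here is a minimal producer/consumer sketch with boto3; the queue name and message body are hypothetical:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

queue_url = sqs.create_queue(QueueName="orders")["QueueUrl"]  # hypothetical name

# Producer side: bodies are plain text, up to 256 KB.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Consumer side: SQS is poll-based, so the consumer asks for messages.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    print(msg["Body"])
```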

Two types of SQS queues:-

  • Standard Queue
  • FIFO Queue

Standard Queue :-

  • The standard queue is the default type offered by SQS.
  • It allows a nearly unlimited number of transactions per second.
  • It guarantees that a message will be delivered at least once.
  • However, it can occasionally deliver a message more than once.
  • It provides best-effort ordering.
  • Messages can be kept from 1 minute to 14 days. The default is 4 days.
  • It has a visibility timeout window. If a message is not processed and deleted within that time, it becomes visible again and may be processed by another reader (see the sketch below).
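
The visibility timeout works roughly like this in code; the queue URL and the `handle_order` function are hypothetical:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical


def handle_order(body: str) -> None:
    print("processing", body)  # hypothetical business logic


# Hide the received message from other readers for 60 seconds while we work.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, VisibilityTimeout=60
)
for msg in resp.get("Messages", []):
    handle_order(msg["Body"])
    # Delete before the timeout expires; otherwise the message becomes
    # visible again and another reader may process it a second time.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```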

FIFO Queue :-

  • The FIFO queue complements the standard queue.
  • It has a First-In-First-Out delivery mechanism.
  • Messages are processed exactly once.
  • The order of messages is strictly preserved.
  • Duplicates are not introduced into the queue.
  • It supports message groups.
  • It is limited to 300 transactions per second.
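
A minimal FIFO sketch; the queue name (which must end in `.fifo`) and the group ID are hypothetical:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

queue_url = sqs.create_queue(
    QueueName="orders.fifo",  # FIFO queue names must end in ".fifo"
    Attributes={"FifoQueue": "true", "ContentBasedDeduplication": "true"},
)["QueueUrl"]

# Ordering is preserved within a message group; duplicates within the
# deduplication window are dropped since deduplication is content-based here.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"order_id": 42}',
    MessageGroupId="customer-7",  # hypothetical group id
)
```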

Hope the above snapshot gives you a decent understanding of SQS. If you want to try some hands-on, check this tutorial.

This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

AWS Crash Course – RDS

Welcome to AWS Crash Course.

What is RDS?

  • RDS is the Relational Database Service of Amazon.
  • It is part of Amazon's PaaS offering.
  • A new DB instance can easily be launched from the AWS Management Console.
  • Complex administration processes like patching, backups, etc. are managed automatically by RDS.
  • Amazon has its own relational database called Amazon Aurora.
  • RDS also supports other popular database engines like MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB.

RDS supports Multi-AZ (Availability Zone) failover.

What does that mean?

It means that if your primary DB goes down, service will automatically fail over to the secondary DB in another AZ.

  • Multi-AZ deployments for the MySQL, Oracle, and PostgreSQL engines utilize synchronous physical replication to keep the data on the standby up to date with the primary.
  • Multi-AZ deployments for the SQL Server engine use synchronous logical replication to achieve the same result, employing SQL Server's native mirroring technology.
  • Both approaches safeguard your data in the event of a DB instance failure or the loss of an AZ.
  • Backups and restores are taken from the secondary DB, which avoids I/O suspension on the primary.
  • You can force a failover from one AZ to another by rebooting your DB instance (see the sketch below).
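
A hedged boto3 sketch of both ideas; the identifier, instance class, and credentials are hypothetical placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Launch a MySQL instance with a synchronous standby in another AZ.
rds.create_db_instance(
    DBInstanceIdentifier="demo-db",      # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=20,                 # GiB
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",   # hypothetical; don't hardcode secrets
    MultiAZ=True,
)

# Later: force a failover to the standby by rebooting with ForceFailover.
rds.reboot_db_instance(DBInstanceIdentifier="demo-db", ForceFailover=True)
```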

But RDS Multi-AZ failover is not a scaling solution.

Read replicas are for scaling.

What are Read Replicas?

As we discussed above, Multi-AZ is synchronous replication of the DB, while read replicas are asynchronous replicas of the DB.

  • You can have 5 read replicas for both MySQL and PostgreSQL.
  • You can have read replicas in a different region, but for MySQL only.
  • Read replicas can be built off Multi-AZ databases.
  • You can have read replicas of read replicas, but only for MySQL, and this further increases replication latency.
  • You can use read replicas for generating reports; this way you don't put reporting load on the primary DB.
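
Creating a replica of the (hypothetical) instance above is a single call; reporting jobs then connect to the replica's endpoint instead of the primary's:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Asynchronous copy of the source; it lags the primary slightly.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="demo-db-replica-1",  # hypothetical name
    SourceDBInstanceIdentifier="demo-db",
)
```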

RDS supports automated backups.

But keep these things in mind.

  • There is a performance hit during backups if Multi-AZ is not enabled.
  • If you delete an instance, all automated backups are deleted; manual DB snapshots, however, are not deleted.
  • All snapshots are stored on S3.
  • When you do a restore, you can change the engine type (e.g. SQL Server Standard to SQL Server Enterprise), provided you have enough storage space.
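
A sketch of restoring a manual snapshot into a new instance; the identifiers are hypothetical, and the `Engine` value assumes a SQL Server Standard snapshot being restored as Enterprise:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="demo-db-restored",        # hypothetical name
    DBSnapshotIdentifier="demo-db-final-snapshot",  # hypothetical snapshot
    Engine="sqlserver-ee",  # edition change on restore (was sqlserver-se)
)
```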

Hope the above snapshot gives you a decent understanding of RDS. If you want to try some hands-on, check this tutorial.

This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

AWS Crash Course – S3

Welcome to AWS Crash Course.

What is S3?

S3 is the Simple Storage Service. It's object storage, which means it's used for storing objects like photos, videos, etc.

  • S3 provides 11 9's of durability (99.999999999%), which means losing about 1 out of every 100 billion objects.
  • S3 provides 99.99% availability.
  • Files can be 1 byte to 5 TB in size.
  • Read-after-write consistency for PUTs of new objects – you can immediately read what you have written.
  • Eventual consistency for overwrite PUTs and DELETEs – if you overwrite or delete an existing object, it takes time for the change to propagate globally in S3.
  • You can secure your data using ACLs and bucket policies.
  • S3 is designed to sustain the loss of 2 facilities concurrently, i.e. 2 Availability Zone failures.
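
Basic object storage in code; the bucket, key, and local file are hypothetical:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Upload an object (1 byte to 5 TB; very large files use multipart upload).
with open("cat.jpg", "rb") as f:  # hypothetical local file
    s3.put_object(Bucket="my-demo-bucket", Key="photos/cat.jpg", Body=f)

# Read-after-write consistency for new objects: this GET sees the data.
obj = s3.get_object(Bucket="my-demo-bucket", Key="photos/cat.jpg")
print(len(obj["Body"].read()), "bytes")
```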

S3 has multiple classes. One is S3 Standard, which we discussed above. The other classes are:-

S3-IA (Infrequently Accessed)
  • S3-IA is for data which is not frequently accessed but still needs immediate access when required.
  • You get the same durability as S3 Standard, at a reduced price.
  • It can tolerate up to 2 concurrent facility failures.
S3-RRS (S3 Reduced Redundancy Storage)
  • 99.99% durability and availability.
  • Use RRS if you are storing non-critical data that can be easily reproduced, like thumbnails of images.
  • No concurrent facility fault tolerance.
S3 Glacier
  • Data is stored in Amazon Glacier in "archives".
  • An archive can be any data, such as a photo, video, or document.
  • A single archive can be as large as 40 terabytes.
  • You can store an unlimited number of archives in Glacier.
  • Amazon Glacier uses "vaults" as containers to store archives.
  • Under a single AWS account, you can have up to 1,000 vaults.
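
A minimal vault-and-archive sketch with boto3; the vault name and local file are hypothetical, and note that retrieving an archive later is an asynchronous job, not an instant GET:

```python
import boto3

glacier = boto3.client("glacier", region_name="us-east-1")

glacier.create_vault(vaultName="photo-archive")  # hypothetical vault name

# Upload one archive; keep the returned archiveId, you need it to retrieve.
with open("photos-2023.tar", "rb") as f:  # hypothetical local file
    resp = glacier.upload_archive(vaultName="photo-archive", body=f)
print(resp["archiveId"])
```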

Here is a quick comparison of the different S3 classes, as per Amazon.

S3 Supports Versioning

What does that mean?
It means that if you change a file, S3 can keep both the old and the new versions of it.

  • If you enable versioning, S3 keeps all versions of an object, even when you delete or overwrite the old version.
  • It is a great backup tool.
  • Once enabled, versioning cannot be disabled, only suspended.
  • It integrates with lifecycle rules.
  • Versioning's MFA Delete capability, which uses multi-factor authentication, can be used to provide an additional layer of security.
  • Cross-region replication requires versioning to be enabled on the source bucket.
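
Enabling versioning and listing versions, as a short boto3 sketch (the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "my-demo-bucket"  # hypothetical

# Turn versioning on; afterwards it can only be suspended, not disabled.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Every overwrite or delete now leaves the old version retrievable.
for v in s3.list_object_versions(Bucket=bucket).get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])
```
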
S3 Supports Lifecycle Management

What does that mean?
It means you can move objects from one storage class to another after a set number of days, per your schedule. This is used to reduce cost by moving less critical data to a cheaper storage class.

  • Lifecycle configuration enables you to specify the lifecycle management of objects in a bucket.
  • It can be used with versioning.
  • It can be applied to current versions and previous versions.
  • Transition from Standard to the Infrequent Access storage class can be done only after the data has been in Standard storage for 30 days.
  • You can transition data directly from Standard to Glacier.
  • A lifecycle policy will not transition objects smaller than 128 KB (see the sketch below).
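
A hedged sketch of a lifecycle rule that ages objects out of Standard; the bucket name and the `logs/` prefix are hypothetical:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Move objects under logs/ to Standard-IA after 30 days, then Glacier after 90.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-demo-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }],
    },
)
```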

If you want to try some hands-on, try this exercise. You can also refer to this AWS S3 CLI cheat sheet to get your hands dirty with the command line.

This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.