Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol

Introduction to
Microsoft Azure HDInsight
Dattatrey Sindhol

2
Agenda
Introduction
Hadoop Distributions
Microsoft Azure HDInsight
Microsoft BI and Data Platform
HDInsight - Use Cases
HDInsight - Typical Implementation
Further Learning

3
Introduction
What is Big Data?
“Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”.

4
Introduction
What is Hadoop?
Hadoop is an open source framework from the Apache Software Foundation, capable of processing very large volumes of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model.
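The "simplified programming model" mentioned on this slide is MapReduce. As a rough illustration only, the sketch below runs the three phases (map, shuffle, reduce) in memory on a single machine; it is not Hadoop code, and the function names are made up for this example. On a real cluster the framework distributes the map and reduce calls across nodes and performs the shuffle itself.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values collected for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is the challenge", "hadoop is the solution"]))
```

The developer writes only the map and reduce functions; partitioning, grouping, and fault tolerance are left to the framework.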

5
Introduction
Conclusion
In simple terms, Big Data is the Challenge and Hadoop is the Solution.

6
Hadoop Distributions
• Amazon Elastic MapReduce (EMR)
• Cloudera
• Hortonworks
• IBM InfoSphere BigInsights
• MapR
• Pivotal
• Teradata
• Intel
• Azure HDInsight
Reference: How the 9 Leading Commercial Hadoop Distributions Stack Up

7
Which Distribution Should I Use?
Cost
Scalability
Availability
Existing Technology Stack
Existing Infrastructure
Existing Skillset

8
HDInsight - Overview
• Microsoft’s Hadoop Distribution in the Cloud
• Offers Hadoop on Windows Platform
• Based on Hortonworks Data Platform (HDP)
• Tightly integrated with Microsoft Technology Stack

10
Microsoft Data Platform and Enterprise BI Ecosystem

11
Why HDInsight?
• Microsoft Stack
• Runs on Windows
• Create & Destroy On-Demand
• DFS Implementation in Blob Storage
• Store data on Blob Storage for Later Use
• Automation using PowerShell
• Orchestration/Workflow using SSIS
• Scheduling using SQL Agent
• BI & Analytics with Power BI
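Because the distributed file system is implemented on Blob Storage, jobs address data through a `wasb://` (or `wasbs://` over SSL) URI instead of a local HDFS path. The helper below is a small sketch of how such a URI is put together; the function name and the container, account, and path values are placeholders invented for this example.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build a Blob Storage URI of the form
    wasb[s]://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "wasbs" if secure else "wasb"
    # Strip leading slashes so the path is not doubled after the host name.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder names, not real resources.
    print(wasb_uri("weblogs", "examplestore", "/raw/2014/clicks.txt"))
```

Since the data lives in the storage account rather than on the cluster nodes, it remains available after the cluster is destroyed.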

12
Considerations
• Requires dropping and re-creating the cluster to scale up/down
• Storage and Cluster should be in the same Data Center

13
HDInsight Versions
COMPONENT | VERSION 1.6 | VERSION 2.1 | VERSION 3.0 | VERSION 3.1 (Current/Default)
Hortonworks Data Platform (HDP) | 1.1 | 1.3 | 2.0 | 2.1.7
Apache Hadoop & YARN | 1.0.3 | 1.2.0 | 2.2.0 | 2.4.0
Tez | - | - | - | 0.4.0
Apache Pig | 0.9.3 | 0.11.0 | 0.12.0 | 0.12.1
Apache Hive & HCatalog | 0.9.0 | 0.11.0 | 0.12.0 | 0.13.1
HBase | - | - | - | 0.98.0
Apache Sqoop | 1.4.2 | 1.4.3 | 1.4.4 | 1.4.4
Apache Oozie | 3.2.0 | 3.3.2 | 4.0.0 | 4.0.0
Apache HCatalog | 0.4.1 | Merged with Hive | Merged with Hive | Merged with Hive
Apache Templeton | 0.1.4 | Merged with Hive | Merged with Hive | Merged with Hive
Ambari | - | API v1.0 | 1.4.1 | >=1.5.1
Zookeeper | - | - | 3.4.5 | 3.4.5
Storm | - | - | - | 0.9.1
Mahout | - | - | - | 0.9.0
Phoenix | - | - | - | 4.0.0.2.1.7.0-2162

14
HDInsight Use Case - Iterative Exploration

15
HDInsight Use Case - Data Warehouse on Demand

16
HDInsight Use Case - ETL Automation

17
HDInsight Use Case - BI Integration

18
Typical Implementation
Data Sources
• Transactional
• Social
• Warehouse
• Web Logs
• Clickstream
• Files (TXT, XML, JSON, ..)
Azure Blob Storage
Multi-Node HDInsight Cluster (MapReduce)
• Hive
• Java
Reporting and Analytics
• SSRS
• Excel
• Power BI
Collaboration
• Office 365 / SharePoint

19
Typical Implementation (Contd…)
Data Sources
• E-Commerce (Customers)
• Internal Systems (Internal Systems Team)
• OLTP
• Transactional
• Warehouse
• Web Logs
• Social
Data Movement
• Sqoop or AzCopy
Azure Blob Storage
Hive Metastore
Multi-Node HDInsight Cluster (MapReduce)
• Hive
• Pig
• Java
• Python
Collaboration, Reporting, and Analytics
• SSRS
• Excel
• Power BI
PowerShell / SSIS / SQL Agent
Subscription & Cluster Management | Data Movement | Job Execution
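The management layer in this diagram (cluster management, data movement, job execution) combines with the "Create & Destroy On-Demand" point from the earlier slide: data is copied to Blob Storage, a cluster is created, jobs run, and the cluster is dropped while the results stay in storage. The sketch below only models that ordering in Python. Every function and name in it is hypothetical, and each step merely records a message where a real deployment would call PowerShell, SSIS, or SQL Agent.

```python
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def on_demand_cluster(name: str, log: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for creating a cluster and always dropping it."""
    log.append(f"create cluster {name}")
    try:
        yield name
    finally:
        # Runs even if a job raises, so an idle cluster is never left behind.
        log.append(f"drop cluster {name}")


def run_pipeline(log: list[str]) -> None:
    """Record the steps of one on-demand processing run, in order."""
    log.append("copy source data to blob storage")  # e.g. Sqoop or AzCopy
    with on_demand_cluster("demo-cluster", log) as cluster:
        log.append(f"run hive job on {cluster}")
    log.append("report from results in blob storage")  # e.g. Excel, Power BI


if __name__ == "__main__":
    steps: list[str] = []
    run_pipeline(steps)
    print("\n".join(steps))
```

Keeping input and output in Blob Storage is what makes the cluster disposable: reporting can continue after the compute nodes are gone.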

20
Further Reading and Learning Resources
• HDInsight Emulator
• http://azure.microsoft.com
• Learning map for HDInsight: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map

21
References
• http://msdn.microsoft.com/en-us/library/dn749804.aspx
• http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/

22
Related Apache Projects
Term | Description
Ambari / HUE | Deployment, Configuration, and Monitoring
Avro / Parquet / RC / Sequence | Data serialization system
Flume / S4 / Storm | Collection and import of log and event data
HBase / Cassandra | Column-oriented database scaling to billions of rows
HCatalog | Schema and Data Type Sharing over Pig, Hive, and MapReduce
Hive / Drill / Impala | Data Warehouse with SQL-Like Access
Hive-QL/HQL | SQL-Like Language to Query Hive
Mahout | Library of machine learning and data mining algorithms
Pig | High-level programming for Hadoop computations
Oozie | Orchestration and workflow management
Sqoop | Imports data from relational databases
Tez | Application framework for graph
Whirr | Cloud-agnostic deployment of clusters
MapReduce / YARN | MapReduce is a programming model for distributed data processing. MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have Map-Reduce 2.0 (MRv2) or YARN.
Zookeeper | Configuration management and coordination

