3. 3
Introduction
What is Big Data?
“Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
4. 4
Introduction
What is Hadoop?
Hadoop is an open source framework from the Apache Software Foundation, capable of processing very large volumes of heterogeneous data sets in a distributed fashion across clusters of commodity hardware using a simplified programming model.
11. 11
Why HDInsight?
Microsoft Stack
Runs on Windows
Create & Destroy On-Demand
DFS Implementation in Blob Storage
Store Data on Blob Storage for Later Use
Automation using PowerShell
Orchestration/Workflow using SSIS
Scheduling using SQL Agent
BI & Analytics with Power BI
22. 22
Related Apache Projects
Term                            Description
Ambari / HUE                    Deployment, configuration, and monitoring
Avro / Parquet / RC / Sequence  Data serialization systems and file formats
Flume / S4 / Storm              Collection and import of log and event data
HBase / Cassandra               Column-oriented databases scaling to billions of rows
HCatalog                        Schema and data type sharing over Pig, Hive, and MapReduce
Hive / Drill / Impala           Data warehouse with SQL-like access
HiveQL / HQL                    SQL-like language to query Hive
Mahout                          Library of machine learning and data mining algorithms
Pig                             High-level programming for Hadoop computations
Oozie                           Orchestration and workflow management
Sqoop                           Imports data from relational databases
Tez                             Application framework for executing directed acyclic graphs (DAGs) of data-processing tasks
Whirr                           Cloud-agnostic deployment of clusters
MapReduce / YARN                MapReduce is a programming model for distributed data processing. It was completely overhauled in hadoop-0.23, resulting in MapReduce 2.0 (MRv2), also known as YARN.
Zookeeper                       Configuration management and coordination
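The MapReduce programming model named in the table can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; not data from the presentation.
    sample = ["big data is big", "hadoop processes big data"]
    print(word_count(sample))
```

Because each call to `map_phase` depends on only one line, and each call to `reduce_phase` on only one key, both steps can in principle be spread over many workers, which is the property that lets the model scale.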
24. 24
Top 10 Mobile Companies
Top 5 Outsourced Product Development Companies
2012 Partner of the Year, Windows Azure, Finalist
40 Global Offices
7500 Employees
23 Countries
Excellence Award
Technology Agency of the Year