Hadoop introduction

Transcript

  • Training (Day 1)

    Introduction

  • Big data: four parameters

    Velocity: streaming data and large-volume data movement.

    Volume: scale from terabytes to zettabytes.

    Variety: manage the complexity of multiple relational and non-relational data types and schemas.

    Voracity: produced data has to be consumed fast before it becomes meaningless.

  • Not just internet companies: big data shouldn't be a silo; it must be an integrated part of the enterprise information architecture.

  • Data >> Information >> Business Value

    Retail: by combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.

    Financial services: by combining data across various groups and services, such as financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets.

    Government: by collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.

    Healthcare: big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.

  • Processing Granularity

    Single-core: single-core, single processor; single-core, multi-processor.

    Multi-core: multi-core, single processor; multi-core, multi-processor.

    Cluster: cluster of processors (single- or multi-core) with shared memory; cluster of processors with distributed memory. Broadly, the approach in HPC is to distribute the work across a cluster of machines which access a shared file system, hosted by a SAN.

    Grid of clusters: embarrassingly parallel processing; MapReduce and distributed file systems; cloud computing.

    Processing granularity levels, from small data sizes to large: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level).

    Reference: Bina Ramamurthy, 2011

  • How to Process Big Data?

    Need to process large datasets (>100 TB). Just reading 100 TB of data can be overwhelming:

    It takes ~11 days to read on a standard computer.

    It takes about a day across a 10 Gbit link (a very high-end storage solution).

    On a single node (at 50 MB/s): ~23 days.

    On a 1000-node cluster: ~33 minutes.
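    A rough back-of-the-envelope check of these figures, as a small Java sketch (the 50 MB/s per-node rate, the 10 Gbit link and the 1000-node cluster are the assumptions quoted above):

    // Approximate time to read 100 TB at the rates quoted above.
    public class ReadTime {
        public static void main(String[] args) {
            double dataBytes = 100e12;           // 100 TB
            double nodeRate = 50e6;              // 50 MB/s from a single node's disk
            double linkRate = 10e9 / 8;          // 10 Gbit/s link, roughly 1.25 GB/s
            int clusterNodes = 1000;             // nodes scanning disjoint parts in parallel

            System.out.printf("Single node @ 50 MB/s : %.0f days%n",
                    dataBytes / nodeRate / 86400);                 // ~23 days
            System.out.printf("10 Gbit link          : %.1f days%n",
                    dataBytes / linkRate / 86400);                 // ~0.9 days
            System.out.printf("1000-node cluster     : %.0f minutes%n",
                    dataBytes / (nodeRate * clusterNodes) / 60);   // ~33 minutes
        }
    }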

  • Examples: web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.

  • Not so easy

    Moving data from a storage cluster to a computation cluster is not feasible.

    In large clusters, failure is expected rather than exceptional: computers fail every day, data is corrupted or lost, and computations are disrupted. The number of nodes in a cluster may not be constant, and nodes can be heterogeneous.

    It is very expensive to build reliability into each application: the programmer has to worry about errors, data motion and communication, and traditional debugging and performance tools don't apply.

    We need a common infrastructure and a standard set of tools to handle this complexity: efficient, scalable, fault-tolerant and easy to use.

  • Why are Hadoop and MapReduce needed?

    The answer to this question comes from another trend in disk drives: seek time is improving more slowly than transfer rate.

    Seeking is the process of moving the disk's head to a particular place on the disk to read or write data.

    It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.

    If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
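    A minimal Java sketch of that argument, using assumed drive characteristics (10 ms average seek, 100 MB/s transfer rate; these numbers are illustrative, not taken from the slides): reading the dataset as many small, randomly placed records is dominated by seek latency, while streaming the same bytes runs at the transfer rate.

    // Seek-dominated vs. streaming access to the same 1 TB of data (assumed drive numbers).
    public class SeekVsStream {
        public static void main(String[] args) {
            double seekSecs = 0.010;       // 10 ms average seek (assumption)
            double transferRate = 100e6;   // 100 MB/s sustained transfer (assumption)
            double dataBytes = 1e12;       // 1 TB dataset
            double recordBytes = 100e3;    // read as 100 KB records, one seek per record

            double seeks = dataBytes / recordBytes;
            double randomSecs = seeks * seekSecs + dataBytes / transferRate;
            double streamSecs = dataBytes / transferRate;

            System.out.printf("Random access: %.1f hours%n", randomSecs / 3600); // ~31 hours, seek-bound
            System.out.printf("Streaming    : %.1f hours%n", streamSecs / 3600); // ~2.8 hours, transfer-bound
        }
    }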

  • Why are Hadoop and MapReduce needed?

    On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well.

    For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.

    MapReduce can be seen as a complement to an RDBMS.

    MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
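    To put rough numbers on the B-Tree vs. sort/merge trade-off above, here is a sketch under the same assumed drive characteristics (10 ms per random update, 100 MB/s streaming, a 1 TB database of 100-byte records; all of these numbers are assumptions for illustration):

    // At what update fraction does a full streaming rewrite beat per-record seeks? (assumed numbers)
    public class UpdateCost {
        public static void main(String[] args) {
            double seekSecs = 0.010;          // 10 ms per in-place, B-Tree-style record update (assumption)
            double transferRate = 100e6;      // 100 MB/s streaming rate (assumption)
            double dbBytes = 1e12;            // 1 TB database
            long records = 10_000_000_000L;   // 100-byte records

            double rewriteSecs = 2 * dbBytes / transferRate;        // read + write everything once
            double breakEven = rewriteSecs / (records * seekSecs);  // fraction where per-record seeks cost the same
            System.out.printf("Break-even update fraction: %.2f%%%n", breakEven * 100);  // ~0.02%
        }
    }

    With these assumed numbers, once more than a few hundredths of a percent of the records change, rewriting the whole dataset by streaming (the sort/merge approach) is already cheaper than seeking to each record individually, which is the regime MapReduce targets.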

  • Hadoop distributions

    Apache Hadoop

    Apache Hadoop-based Services for Windows Azure

    Cloudera's Distribution Including Apache Hadoop (CDH)

    Hortonworks Data Platform

    IBM InfoSphere BigInsights

    Platform Symphony MapReduce

    MapR Hadoop Distribution

    EMC Greenplum MR (using MapR's M5 Distribution)

    Zettaset Data Platform

    SGI Hadoop Clusters (uses Cloudera distribution)

    Grand Logic JobServer

    OceanSync Hadoop Management Software

    Oracle Big Data Appliance (uses Cloudera distribution)

  • What's up with the names?

    When naming software projects, Doug Cutting seems to have been inspired by his family.

    Lucene is his wife's middle name, and her maternal grandmother's first name.

    His son, as a toddler, used "Nutch" as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop.

    Doug said he was looking for a name that wasn't already a web domain and wasn't trademarked: "so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words."

  • Hadoop features

    Distributed Framework for processing and storing data generally on commodity hardware.

    Completely Open Source.

    Written in Java; runs on Linux, Mac OS X, Windows, and Solaris. Client apps can be written in various languages.

    Scalable: store and process petabytes, scale by adding Hardware

    Economical: 1000s of commodity machines

    Efficient: run tasks where data is located

    Reliable: data is replicated, failed tasks are rerun

    Primarily used for batch data processing, not real-time / user-facing applications.

  • Components of Hadoop

    HDFS (Hadoop Distributed File System): modeled on GFS; a reliable, high-bandwidth file system that can store terabytes and petabytes of data.

    MapReduce: uses the map/reduce metaphor from the Lisp language; a distributed processing framework that processes the data stored in HDFS as key-value pairs.

    (Diagram: clients submit jobs to the processing framework, which runs over the DFS; input data flows through Input -> Map -> Shuffle & Sort -> Reduce -> Output.)
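    As a concrete illustration of that Input -> Map -> Shuffle & Sort -> Reduce -> Output flow, here is a minimal word-count job written against the org.apache.hadoop.mapreduce API (a sketch along the lines of the standard Hadoop example; the input and output HDFS paths are passed on the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: after the shuffle & sort, sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    Packaged into a jar, it would be run with something like hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (the jar name and paths here are hypothetical).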

  • HDFS

    A very large distributed file system: 10K nodes, 100 million files, 10 PB; linearly scalable; supports large files (in GBs or TBs).

    Economical: uses commodity hardware; nodes fail every day (failure is expected, rather than exceptional); the number of nodes in a cluster is not constant.

    Optimized for batch processing.

  • HDFS Goals

    Highly fault-tolerant: runs on commodity hardware, which can fail frequently.

    High throughput of data access: streaming access to data.

    Large files: a typical file is gigabytes to terabytes in size; support for tens of millions of files.

    Simple coherency: write-once-read-many access model.

  • HDFS: Files and Blocks

    Data organization: data is organized into files and directories; files are divided into uniformly sized large blocks (typically 128 MB); blocks are distributed across cluster nodes.

    Fault tolerance: blocks are replicated (default 3) to handle hardware failure; replica placement is rack-aware for performance and fault tolerance.

    Checksums of data are kept for corruption detection and recovery: the client reads both checksum and data from the DataNode, and if the checksum fails it tries other replicas.

  • HDFS: Files and Blocks

    High throughput: the client talks to both the NameNode and DataNodes; data is not sent through the NameNode; the throughput of the file system scales nearly linearly with the number of nodes.

    HDFS exposes block placement so that computation can be migrated to the data.
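    For example, with the Java org.apache.hadoop.fs.FileSystem API the NameNode is consulted only for metadata and block locations, while the file bytes themselves are streamed directly to and from DataNodes. A minimal sketch (the path /user/demo/hello.txt is hypothetical; the Configuration picks up the cluster's core-site.xml/hdfs-site.xml from the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // cluster settings from the classpath
            FileSystem fs = FileSystem.get(conf);           // HDFS when fs.defaultFS points at it

            Path path = new Path("/user/demo/hello.txt");   // hypothetical path, for illustration

            // Write: the NameNode hands out block locations; the bytes go straight to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: again, the data is read directly from the DataNodes holding the replicas.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }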

  • HDFS Components

    NameNode: manages namespace operations such as opening, creating and renaming files; maps each file name to a list of blocks and their locations; holds file metadata and handles authorization and authentication; collects block reports from DataNodes on block locations and re-replicates missing blocks; keeps ALL of the namespace in memory, plus checkpoints and a journal.

    DataNode: handles block storage on multiple volumes and data integrity; clients access blocks directly from DataNodes for reads and writes; DataNodes periodically send block reports to the NameNode; blocks are created, deleted and replicated upon instruction from the NameNode.

  • HDFS Architecture

    (Diagram: the NameNode, the master, holds the namespace mapping, e.g. name:/users/joeYahoo/myFile -> blocks:{1,3} and name:/users/bobYahoo/someData.gzip -> blocks:{2,4,5}; the DataNodes, the slaves, store the replicated blocks; the client gets metadata from the NameNode and performs I/O directly against the DataNodes.)

  • Simple commands: hdfs dfs -ls, -du, -rm, -rmr

    Uploading files: hdfs dfs -copyFromLocal foo mydata/foo

    Downloading files: hdfs dfs -copyToLocal mydata/foo foo

    Viewing a file: hdfs dfs -cat mydata/foo

    Adm