Hadoop ecosystem

Transcript

  1. Headfirst Hadoop
  2. Figure source: http://aryannava.com/2014/02/19/apache-hadoop-ecosystem/hadoopecosystem/
  3. Case: Hadoop ecosystem
  4. Flume: getting logs into Hadoop. Figure source: http://image.slidesharecdn.com/flume-120314204418-phpapp01/95/apache-flume-4-728.jpg?cb=1338404245
  5. Apache Flume: moves logs into HDFS, set up through a config file. Source types: netcat, exec, syslog, spooldir, seq, http, avro. Sink types: logger, hdfs, file_roll, hbase, solr, avro. Channel types: memory, jdbc, file. Figure source: https://flume.apache.org/FlumeUserGuide.html
  6. Pig: shell-style scripting on Hadoop. The data (logs with an id) already sits in Hadoop HDFS; processing it with MapReduce means writing Java against the Hadoop API.
  7. Pig instead of hand-written Map-Reduce? Apache Pig versus Java/C++ (16, 20). Pig consists of the Pig shell (Grunt), the Pig language (Pig Latin), libraries (Piggy Bank), and UDFs (user-defined functions). Figure source: http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing
  8. Pig Latin at a glance. Load and store: LOAD, STORE. Operators and functions: REGEX_EXTRACT, FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, AVG, COUNT, MAX, MIN, SIZE, ABS, RANDOM, ROUND, INDEXOF, SUBSTRING. Debug: DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE. HDFS commands: cat, ls, cp, mkdir. Example session:
      $ pig
      grunt> A = LOAD 'file1' AS (x, y, z);
      grunt> B = FILTER A BY y > 10000;
      grunt> STORE B INTO 'output';
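  9. The example as MapReduce code. First input table (nm, dp, id): ids A1, B1, B2. Second input table (id, dt, hr): (A1, 7/7, 13), (A1, 7/8, 12), (A1, 7/9, 4). Java code (Map-Reduce) produces the intermediate rows (A1, 7/8, 13), (A1, 7/9, 12) and the result (A1, Jul, 12.5).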
  10. The example in Pig. Result: (A1, 12.5). Logical plan: LOAD, LOAD, FILTER, JOIN, GROUP, FOREACH, STORE. Schemas along the plan: (nm, dp, id); (nm, dp, id); (id, dt, hr); (nm, dp, id, id, dt, hr); (group, {(nm, dp, id, id, dt, hr)}); (group, ., AVG(hr)); (dp, group, nm, hr). Input tables: (nm, dp, id) with ids A1, B1, B2, and (id, dt, hr) with rows (A1, 7/7, 13), (A1, 7/8, 12), (A1, 7/9, 4). Pig Latin:
      A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
      B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
      C = FILTER B BY hr > 8;
      D = JOIN C BY id, A BY id;
      E = GROUP D BY A::id;
      F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
      STORE F INTO '/tmp/pig_output/';
      Tips: run with pig -x local, and use DUMP or ILLUSTRATE to inspect a relation.
  11. Hive: the data is a DB table exported as csv onto Hadoop HDFS, and the skills at hand are SQL and DB rather than Pig or MapReduce. A SQL server for Hadoop: Hive.
  12. Hive: an RDB on Hadoop. Hive = Hadoop + RDB-style SQL (the SQL is translated into MapReduce).
  13. Hive components. Interfaces: CLI, Web UI, API (JDBC and ODBC). Thrift Server (hiveserver): client API. HiveQL. Metastore: DB, table, and partition metadata. Figure source: http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive
  14. Hive session:
      $ hive
      hive> CREATE TABLE A (x INT, y INT, z INT);
      hive> LOAD DATA LOCAL INPATH 'file1' INTO TABLE A;
      hive> SELECT * FROM A WHERE y > 10000;
      hive> INSERT OVERWRITE TABLE B SELECT * FROM A WHERE y > 10000;
      Figure source: http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
  15. The example in Hive. Result: (A1, 12.5). Input tables: (nm, dp, id) with ids A1, B1, B2, and (id, dt, hr) with rows (A1, 7/7, 13), (A1, 7/8, 12), (A1, 7/9, 4). HiveQL:
      > CREATE TABLE A (nm STRING, dp STRING, id STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      > CREATE TABLE B (id STRING, dt DATE, hr INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      > CREATE TABLE final (id STRING, dp ARRAY<STRING>, nm ARRAY<STRING>, avg DOUBLE);
      > LOAD DATA INPATH 'file1' INTO TABLE A;
      > LOAD DATA INPATH 'file2' INTO TABLE B;
      > INSERT OVERWRITE TABLE final SELECT a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) FROM a, b WHERE b.hr > 8 AND b.id = a.id GROUP BY a.id;
      Tips: create table and load data can be handled by a tool.
  16. Hive vs. SQL on an RDBMS (Hive / RDBMS). Query language: HQL / SQL. Data storage: HDFS / raw device or local FS. Execution: MapReduce / Executor. No / Yes. Index, bitmap index. Source: http://sishuok.com/forum/blogPost/list/6220.html
  17. Pig vs Hive (Hive / Pig). Language: SQL-like / Pig Latin. Schemas/types: yes / yes. Partitions: yes / no. Thrift Server: no in Pig. Web interface: yes / no. JDBC/ODBC: yes (limited) / no. HDFS: no / yes. Source: http://f.dataguru.cn/thread-33553-1-1.html
  18. HCatalog: lets MapReduce, Pig, and Hive share the "metastore"; used from the command line. Figure source: http://wiki.gurubee.net/pages/viewpage.action?pageId=26739793
  19. Sqoop: exporting to csv, copying to HDFS, and loading into Hive is a chain of manual steps. For DB table -> csv -> Hadoop HDFS: sqoop.
  20. Sqoop: moving data between an RDB and Hadoop. Apache Sqoop = "SQL to Hadoop". Works with RDBMS, data warehouses, NoSQL, and with Hive, HBase, Oozie. Figure source: http://bigdataanalyticsnews.com/data-transfer-mysql-cassandra-using-sqoop/
  21. Sqoop. Figure source: http://hive.3du.me/slide.html
  22. Hive + Sqoop. Result: (A1, 12.5). Input tables: (nm, dp, id) with ids A1, B1, B2, and (id, dt, hr) with rows (A1, 7/7, 13), (A1, 7/8, 12), (A1, 7/9, 4). HiveQL with Hive alone:
      > create
      > LOAD DATA INPATH 'file1' INTO TABLE A;
      > LOAD DATA INPATH 'file2' INTO TABLE B;
      > INSERT OVERWRITE TABLE final SELECT a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) FROM a, b WHERE b.hr > 8 AND b.id = a.id GROUP BY a.id;
      HiveQL with Sqoop:
      > create
      > INSERT OVERWRITE TABLE final SELECT a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) FROM a, b WHERE b.hr > 8 AND b.id = a.id GROUP BY a.id;
  23. Impala and HBase: sqoop brings the RDB's DBs and tables into hive; for what hive alone does not cover: Impala, HBase.
  24. Impala: near-real-time SQL; 6~60x compared with hive. Figure source: http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/
  25. NoSQL: HBase. HBase is a BigTable-style NoSQL store (a multi-dimensional map). Why HBase: random read/write on Hadoop.
  26. HBase compared with an RDBMS. Data lives in tables whose (primary) index is the row key. No SQL (no join); access is through Java, REST, or Thrift with getRow() and Scan(). getRow(): row, range (timestamp). Scan(): row range (start key, end key). Insert, update, delete: an HBase insert is put() on a cell; put() on an existing cell => update; Delete() removes a cell. Row key design matters in HBase.
  27. HBase data model: row key, column family, column qualifier, timestamp, cell. Figure source: http://www.slideshare.net/hanborq/h-base-introduction
  28. Machine learning and data mining, beyond hive and pig: machine learning => Mahout => RHadoop.
  29. Mahout = data mining on MapReduce. Areas shown in the figure: regression, recommenders, clustering, classification, frequent pattern mining, vector similarity, non-MR algorithms, examples, dimension reduction, evolution. Figure source: http://www.slideshare.net/chaoyu0513/hit20130928-apache-mahout
  30. R: R on hadoop. R has CRAN (compare Perl). Revolution's RHadoop: rmr2, rhdfs, rhbase. Figure source: http://www.r-project.org/
  31. Workflow: a hadoop ecosystem application chains src txt -> { flume => MR => hive, pig => sqoop } -> dst DB, and has to track results and error messages. Instead of shell scripts: oozie.
  32. Hadoop workflow: oozie. A job combines control nodes (start, end, kill, fork, join) and actions (mapreduce/java/pig/hive); instead of code, the workflow is written in xml. Figure source: http://www.slideshare.net/martyhall/hadoop-tutorial-oozie
  33. Oozie web interface: http://oozie_server:11000/
  34. ETL: Apache Flume, Apache Sqoop. DB: Apache HBase, Apache Hive, Apache Impala. Calculate: Apache Pig, Apache Mahout, RHadoop. Workflow: Apache Oozie.
  35. Advice: Hadoop (2014 4 334) (Yahoo)
  36. Backup
  37. Pig example result:
      A = LOAD '/user/waue/pig_input/file1.txt' USING PigStorage(',') AS (nm, dp, id);
      B = LOAD '/user/waue/pig_input/file2.txt' USING PigStorage(',') AS (id, dt, hr);
      C = FILTER B BY hr > 8;
      D = JOIN C BY id, A BY id;
      E = GROUP D BY A::id;
      F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
      STORE F INTO '/tmp/pig_output/';
  38. Hive example result:
      INSERT OVERWRITE TABLE final SELECT a.id, collect_set(a.dp), collect_set(a.nm), avg(b.hr) FROM a, b WHERE b.hr > 8 AND b.id = a.id GROUP BY a.id;
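
The example that runs through slides 9, 10, and 15 (keep rows with hr > 8, join the two tables on id, group by id, average hr) can be checked by hand without a cluster. The Python sketch below is not part of the deck; it only mirrors that logic on the three (id, dt, hr) rows the slides list. The names and departments in TABLE_A are hypothetical placeholders, because only the ids of that table appear in the transcript, and average_hours is an illustrative helper name.

```python
from collections import defaultdict

# Table A (nm, dp, id). Only the ids come from the slides; the names and
# departments are hypothetical placeholders.
TABLE_A = [
    ("name1", "dept1", "A1"),
    ("name2", "dept2", "B1"),
    ("name3", "dept3", "B2"),
]

# Table B (id, dt, hr), as listed on the slides.
TABLE_B = [
    ("A1", "7/7", 13),
    ("A1", "7/8", 12),
    ("A1", "7/9", 4),
]


def average_hours(table_a, table_b, min_hr=8):
    """Return {id: average hr} over rows of B with hr > min_hr whose id is in A."""
    known_ids = {row[2] for row in table_a}        # ids available for the join
    hours = defaultdict(list)
    for row_id, _dt, hr in table_b:
        if hr > min_hr and row_id in known_ids:    # FILTER, then JOIN on id
            hours[row_id].append(hr)               # GROUP BY id
    # AVG(hr) per group
    return {row_id: sum(v) / len(v) for row_id, v in hours.items()}


result = average_hours(TABLE_A, TABLE_B)
print(result)  # {'A1': 12.5}
```

The rows with hr 13 and 12 pass the filter and the row with hr 4 does not, so the average for A1 is 12.5, matching the result shown on the slides.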