Spark and Hadoop, the Perfect Combination (Korean)

  • Published on
    15-Apr-2017

Transcript

<ul><li><p>Spark HDP, (Hortonworks Data Platform)</p><p>Hortonworks Korea</p><p>Hortonworks Inc. 2011-2015. All Rights Reserved</p></li>
<li><p>Hadoop?</p></li>
<li><p>4ZB DATA</p><p>MOBILE DEVICES</p><p>HUMAN CONTENT</p><p>INTERNET OF THINGS</p><p>44ZB DATA</p><p>Source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm</p></li>
<li><p>Apache Hadoop</p></li>
<li><p>Hadoop</p><p>App App App App App App</p><p>HADOOP</p></li>
<li><p>Hadoop</p></li>
<li><p>Payment Tracking</p><p>Call Analysis</p><p>Machine Data</p><p>Product Design</p><p>Social Mapping</p><p>Factory Yields</p><p>Defect Detection</p><p>M &amp; A Due Diligence</p><p>Proactive Repair</p><p>Disaster Mitigation</p><p>Investment Planning</p><p>Next Product Recs</p><p>Store Design</p><p>Risk Modeling</p><p>Ad Placement</p><p>Inventory Predictions</p><p>Sentiment Analysis</p><p>Ad Placement</p><p>Basket Analysis Segments</p><p>Customer Support</p><p>Supply Chain</p><p>Cross-Sell</p><p>Customer Retention</p><p>Vendor Scorecards</p><p>Optimize Inventories</p></li>
<li><p>Historical Records</p><p>OPEX Reduction</p><p>Mainframe Offloads</p><p>Fraud Prevention</p><p>Data as a Service</p><p>Public Data Capture</p><p> IT Hadoop . , ETL , . 
</p><p>Digital Protection</p><p>Device Data Ingest</p><p>Rapid Reporting</p><p>ETL</p></li>
<li><p>Hortonworks</p><p>Social Mapping</p><p>Payment Tracking</p><p>Factory Yields</p><p>Defect Detection</p><p>Call Analysis</p><p>Machine Data</p><p>Product Design</p><p>M &amp; A Due Diligence</p><p>Next Product Recs</p><p>Store Design</p><p>Risk Modeling</p><p>Ad Placement</p><p>Proactive Repair</p><p>Disaster Mitigation</p><p>Investment Planning</p><p>Inventory Predictions</p><p>Customer Support</p><p>Sentiment Analysis</p><p>Supply Chain</p><p>Ad Placement</p><p>Basket Analysis Segments</p><p>Cross-Sell</p><p>Customer Retention</p><p>Vendor Scorecards</p><p>Optimize Inventories</p><p>OPEX Reduction</p><p>Mainframe Offloads</p><p>Historical Records</p><p>Data as a Service</p><p>Public Data Capture</p><p>Fraud Prevention</p><p>Device Data Ingest</p><p>Rapid Reporting</p><p>Digital Protection</p><p>ETL</p></li>
<li><p>Hortonworks?</p><p>Hadoop</p></li>
<li><p>Hortonworks Hadoop</p><p>HORTONWORKS DATA PLATFORM</p><p>YARN:</p></li>
<li><p>Hortonworks Apache</p><p>Apache Hadoop 1/3</p><p>Hadoop</p><p>Hadoop</p><p>APACHE HADOOP COMMITTERS</p></li>
<li><p>STORAGE</p><p>STORAGE</p><p>Hortonworks</p><p>Hortonworks Hadoop</p><p>Apache</p><p>Project 1</p><p>Project 5</p><p>Project 4</p><p>Project 3</p><p>Project 2</p><p>Project 6</p></li>
<li><p>Page 15 Hortonworks Inc. 2011 2015. 
All Rights Reserved </p><p>Hortonworks</p><p>Hortonworks SmartSense</p><p>Hortonworks SmartSense</p></li>
<li><p>Hortonworks</p><p>Hortonworks Hadoop, 100 40%</p><p>F100 75% F100 65% F100 55% F100 46% F100 40%</p><p>Hortonworks</p><p>2014 Forrester Wave</p><p>The Forrester Wave: Big Data Hadoop Solutions</p></li>
<li><p>Hortonworks</p><p>556 (2015 8 5 ) 2015 2 119 NASDAQ: HDP</p><p>Hortonworks Data Platform</p><p>Hadoop</p><p>2011</p><p>Yahoo! 24 Hadoop</p><p>740+</p><p>1350+</p></li>
<li><p>Hortonworks IT</p><p>IT</p><p>Hortonworks</p><p>Hadoop</p><p>2015 6 Shared Accounts of Hortonworks (A, I) (All Cut, n=35)</p><p>Hortonworks, Big Data #1</p><p>Microsoft, Hosting #2</p><p>MongoDB, Warehousing #3</p><p>Tableau, Big Data #4</p><p>20</p><p>Source: https://hortonworks.com/blog/cio-survey-hortonworks-data-platform-now-a-top-it-imperative/</p></li>
<li><p>Spark HDP</p><p>Spark on YARN</p></li>
<li><p>API DataFrames, SQL</p><p>Hive Hadoop SQL Spark Hadoop</p><p>Hadoop</p><p>Hortonworks Spark</p><p>Storage</p><p>YARN: Data Operating System</p><p>Governance Security</p><p>Operations</p><p>Resource Management</p></li>
<li><p>YARN SLA</p><p>RDD HDFS</p><p>SQL SparkSQL Hive, HS2; ORC</p><p>Spark NoSQL RDDs for predicate pushdown HBase</p><p>Apache Zeppelin</p><p>Spark Hadoop?</p><p>Storage</p><p>YARN: Data Operating System</p><p>Governance Security</p><p>Operations</p><p>Resource Management</p></li>
<li><p>Page 22 Hortonworks Inc. 2011 2015. 
All Rights Reserved </p><p>Apache Atlas, Spark Apache Falcon</p><p>Apache Ranger, Apache Ambari</p><p>Linux, Windows</p><p>Spark Cloudbreak Ambari - Azure, AWS, GCP, OpenStack, Docker</p><p>Spark Hadoop?</p><p>Storage</p><p>YARN: Data Operating System</p><p>Governance Security</p><p>Operations</p><p>Resource Management</p></li>
<li><p>CDO</p></li>
<li><p>Cloudbreak 1. 2. Spark blueprint</p><p>3. HDP</p><p>Microsoft Azure</p></li>
<li><p>Log in to launch.hortonworks.com, a self-service portal for launching HDP clusters in the cloud (cloudbreak.sequenceiq.com).</p></li>
<li><p>Select a cloud provider, then start the process of creating your cluster.</p></li>
<li><p>Name the cluster, choose your region, and pick your blueprint; in this case, we want hdp-spark-cluster for our data science work.</p></li>
<li><p>We clicked "create cluster", and Cloudbreak is now provisioning our Spark environment on Azure.</p></li>
<li><p>We can now access Zeppelin, a data science notebook for Spark that's similar to the IPython notebook.</p></li>
<li><p>Let's look at our data. We can see eventType, whether the driver is certified, how many hours were driven, and weather data such as foggy, rainy, etc.</p></li>
<li><p>Let's start asking questions of our data, such as: does fatigue cause violations? 
</p></li>
<li><p>Let's view the data as a pie chart to see how violations look by hours driven.</p></li>
<li><p>How are violations impacted by fog?</p></li>
<li><p>Does location have an impact on incidents?</p></li>
<li><p>OK, we've learned enough about the data and which features we want to include in our model, so we'll run a logistic regression on the training data.</p></li>
<li><p>Let's run our code.</p></li>
<li><p>Let's look at our model. The next step is to hand the model off to the Enterprise Architect to integrate into our real-time application.</p></li>
<li><p>YARN</p><p>HDFS</p><p>BI</p><p>(ActiveMQ)</p><p>SQL NoSQL Use</p><p>Model</p></li>
<li><p>Storm Spark</p></li>
<li><p>HDFS</p><p>YARN</p><p>(Storm)</p><p>(Kafka)</p><p>(Hive on Tez)</p><p>(HBase)</p><p>Prediction Bolt</p><p>Spark Storm</p><p>(Spark)</p><p>Spark</p></li>
<li><p>Pig</p><p>HDFS</p><p>HCatalog</p><p>DB</p><p>Tableau</p></li>
<li><p>Spark Bolt</p><p>HDFS</p><p>YARN</p><p>Storm</p><p>Kafka Spout</p><p>HBase</p><p>HBase</p><p>Bolt HDFS Bolt</p><p>Active MQ</p><p>Bolt</p><p>T(1) T(2) T(N)</p><p>(Kafka)</p><p>Bolt</p><p>MQ</p></li>
<li><p>Page 44 Hortonworks Inc. 2011 2015. 
All Rights Reserved </p><p>HDP</p><p>Tableau</p><p>1</p><p>2</p><p>Spark MLlib</p><p>3</p><p>YARN Spark</p><p>4</p><p>Spark MLlib Storm bolt</p><p>5</p></li>
<li><p>Spark MLlib</p><p>45 2721 -91.3 38.14</p><p>72 4152 -94.23 37.09</p><p>Spark MLlib</p><p>0 1 1 0.45 0.2721 0 0 0</p><p>1 0 0 0.72 0.4152 1 1 0</p><p>0, 1</p><p>0 1</p></li>
<li><p>Spark YARN</p><p>1</p><pre>spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  truckml.jar \
  --algorithm LR --regType L2 --regParam 1.0 \
  /user/root/truck_training \
  --numIterations 100</pre><p>spark-submit Spark job YARN</p><p>HDFS</p><p>2 YARN UI Spark job</p></li>
<li><p>Spark</p><p>: 87.5% : 88%</p></li>
<li><p>Spark Storm</p><p>Kafka Spout</p><p>Storm Bolt</p><p>Spark HBase</p><p>(HBase)</p><p>Active MQ</p></li>
<li><p>(CDO)</p><p>40%</p></li>
<li><p>HDP</p><p>TB</p><p>TB</p><p>BI</p><p>Storm</p><p>YARN YARN HDP</p></li>
<li><p>Page 51 Hortonworks Inc. 2011 2015. 
All Rights Reserved </p><p>Storm Bolt Spark</p><pre>import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// `training` is the RDD[LabeledPoint] prepared earlier; it is not defined on this slide.
val algorithm = new LogisticRegressionWithSGD()
// clearThreshold() makes predict() return a raw probability instead of a 0/1 label.
val model = algorithm.run(training).clearThreshold()
println(model.weights)
println(model.intercept)</pre><p>Weights [-0.40819922025591465,0.06392530395655666,-0.1346227352186122,-0.07188217286407801,0.7277326276521062,0.508779221680863,-0.024689093098281954]</p><p>Intercept 0.0</p><p>Storm bolt:</p><pre>import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
..
// Rebuild the model inside the bolt from the weights and intercept printed by the training job.
Vector weights = Vectors.dense(new double[] { /* trained weights, elided in the source */ });
LogisticRegressionModel model = new LogisticRegressionModel(weights, 0.0);
// A model built this way keeps the default 0.5 threshold, so predict() returns 0.0 or 1.0.
double prediction = model.predict(/* feature vector, elided in the source */);</pre></li>
<li><p>This presentation contains forward-looking statements involving risks and uncertainties. Such forward-looking statements in this presentation generally relate to future events, our ability to increase the number of support subscription customers, the growth in usage of the Hadoop framework, our ability to innovate and develop the various open source projects that will enhance the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general business outlook. In some cases, you can identify forward-looking statements because they contain words such as may, will, should, expects, plans, anticipates, could, intends, target, projects, contemplates, believes, estimates, predicts, potential or continue or similar terms or expressions that concern our expectations, strategy, plans or intentions. You should not rely upon forward-looking statements as predictions of future events. 
We have based the forward-looking statements contained in this presentation primarily on our current expectations and projections about future events and trends that we believe may affect our business, financial condition and prospects. We cannot assure you that the results, events and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements. The forward-looking statements made in this prospectus relate only to events as of the date on which the statements are made and we undertake no obligation to update any of the information in this presentation. Trademarks: Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other names used herein may be trademarks of their respective owners.</p></li></ul>
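<p>The Storm bolt on slide 51 rebuilds a logistic regression model from the weights printed by the Spark MLlib training job. The sketch below, in plain Java with no Spark dependency, illustrates the arithmetic such a model applies once its threshold is cleared: the dot product of weights and features plus the intercept, passed through the logistic function. The class name TruckRiskScorer, its method names, and the sample feature vector are hypothetical and exist only for illustration; the sample values assume that the leading 0 in the first scaled row on slide 45 is the label, which the transcript does not confirm. Only the weights and the intercept come from the slides.</p>

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the original deck or of Spark MLlib. */
final class TruckRiskScorer {
    // Weights and intercept as printed by the training job on slide 51.
    private static final double[] WEIGHTS = {
        -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
        -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
        -0.024689093098281954
    };
    private static final double INTERCEPT = 0.0;

    private TruckRiskScorer() {}

    /** Returns the logistic-regression probability for one feature vector. */
    static double score(double[] features) {
        if (features.length != WEIGHTS.length) {
            throw new IllegalArgumentException(
                "expected " + WEIGHTS.length + " features, got " + features.length);
        }
        double margin = INTERCEPT;
        for (int i = 0; i < WEIGHTS.length; i++) {
            margin += WEIGHTS[i] * features[i];
        }
        // Logistic (sigmoid) function maps the margin to the interval (0, 1).
        return 1.0 / (1.0 + Math.exp(-margin));
    }

    /** Applies a cut-off to the probability, as a thresholded model would. */
    static boolean isViolation(double[] features, double threshold) {
        return score(features) > threshold;
    }

    public static void main(String[] args) {
        // Assumed sample: first scaled row from slide 45 with the leading value dropped.
        double[] sample = {1, 1, 0.45, 0.2721, 0, 0, 0};
        System.out.println("features    = " + Arrays.toString(sample));
        System.out.println("probability = " + score(sample));
        System.out.println("violation   = " + isViolation(sample, 0.5));
    }
}
```

<p>Inside a bolt, a call such as score(features) would run once per incoming event, with the cut-off chosen by whoever owns the alerting policy. Keeping the raw probability rather than a 0/1 label leaves that choice open.</p>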