Spark Hadoop

  • Published on
    17-Jan-2017

Transcript

Presentation Template for Sigma Software Purposes

DIFFERENCE BETWEEN SPARK AND HADOOP MAPREDUCE

SPARK IS MUCH FASTER

Spark tries to keep data in memory, whereas MapReduce writes intermediate results to disk and reads them back between steps.

LOGISTIC REGRESSION PERFORMANCE

WORDCOUNT WITH HADOOP

WORDCOUNT WITH SPARK

It's easier to develop for Spark.
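The slide's own code is not part of this transcript. As a rough illustration of why the Spark version tends to be shorter, the sketch below mimics the usual flatMap, map, reduceByKey word count pipeline in plain Python so that it runs without a cluster. The function name `word_count` and the sample lines are made up for this example.

```python
from collections import defaultdict


def word_count(lines):
    """Count words the way a flatMap -> map -> reduceByKey pipeline would."""
    # flatMap: split every line into words and flatten into one sequence
    words = (word for line in lines for word in line.split())
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts of identical words
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# In PySpark the same pipeline is typically written roughly as:
#   sc.textFile(path).flatMap(lambda line: line.split())
#     .map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

sample = ["spark is fast", "hadoop is stable", "spark is easy"]
result = word_count(sample)
```

The three stages fit into one chained expression in Spark, whereas a Hadoop MapReduce job usually needs separate mapper, reducer, and driver classes.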

Spark also adds libraries for machine learning, streaming, graph processing, and SQL.

SPARK GENERAL FLOW

SOME ACTIONS AND TRANSFORMATIONS

Transformations: map(func), flatMap(func), groupByKey(), reduceByKey(func), mapValues(func), sample(), union(other), distinct(), sortByKey(), ...

Actions: reduce(func), collect(), count(), first(), take(n), saveAsTextFile(path), countByKey(), foreach(func)
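In Spark, transformations are lazy: they only describe a new dataset, and nothing is computed until an action asks for a result. The toy class below is a sketch of that idea in plain Python. `MiniRDD` and `tokenize` are hypothetical names for this illustration and are not part of the Spark API; only the method names mirror the ones listed on the slide.

```python
from itertools import chain


class MiniRDD:
    """Hypothetical toy stand-in for an RDD; not the Spark API."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data (a reusable iterable)
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: record the step and return a new MiniRDD.
    def map(self, func):
        return MiniRDD(self._source, self._steps + (("map", func),))

    def flatMap(self, func):
        return MiniRDD(self._source, self._steps + (("flatMap", func),))

    def _run(self):
        # Apply the recorded steps in order, only when an action needs them.
        data = iter(self._source)
        for kind, func in self._steps:
            mapped = map(func, data)
            data = chain.from_iterable(mapped) if kind == "flatMap" else mapped
        return data

    # Actions: execute the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def tokenize(line):
    calls.append(line)  # side effect so we can see when work happens
    return line.split()


words = MiniRDD(["spark is fast", "so is this"]).flatMap(tokenize).map(str.upper)
calls_before_action = len(calls)  # still 0: no transformation has run yet
collected = words.collect()       # the action triggers the computation
total = words.count()             # a second action recomputes the steps
```

Because each action re-runs the recorded steps, `tokenize` is called again for `count()`; in real Spark this recomputation is what caching an RDD avoids.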

CREATE INPUT RDDs

SPLIT INTO TRAINING, VALIDATION AND TEST DATASETS
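The slide's code is not included in the transcript. A minimal sketch of a random three-way split in plain Python is shown below; Spark RDDs offer `randomSplit(weights, seed)` for the same purpose. The 60/20/20 weights, the seed, and the synthetic `ratings` tuples are assumptions made for this example only.

```python
import random


def random_split(records, weights, seed=42):
    """Assign each record to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for record in records:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(record)
                break
        else:
            splits[-1].append(record)  # guard against floating-point round-off
    return splits


# Synthetic (user, product, rating) tuples, made up for illustration.
ratings = [(user, product, (user * product) % 5 + 1)
           for user in range(20) for product in range(10)]
training, validation, test = random_split(ratings, [0.6, 0.2, 0.2])
```

The split sizes are only approximately 60/20/20, because each record is assigned independently at random.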

FIND OUT OPTIMAL RANK AND NUMBER OF ITERATIONS
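One common way to pick these two values is a small grid search that trains a model for every combination and keeps the one with the lowest validation error. The sketch below assumes that approach; `grid_search`, `fit_and_score`, and the stand-in scorer `fake_score` (which returns made-up numbers) are hypothetical and exist only to make the example executable.

```python
from itertools import product


def grid_search(ranks, iteration_counts, fit_and_score):
    """Return (best_rmse, best_rank, best_iterations) over all combinations.

    fit_and_score(rank, iterations) is a caller-supplied function that trains
    a model on the training set and returns its RMSE on the validation set.
    """
    best = None
    for rank, iterations in product(ranks, iteration_counts):
        rmse = fit_and_score(rank, iterations)
        if best is None or rmse < best[0]:
            best = (rmse, rank, iterations)
    return best


def fake_score(rank, iterations):
    # Stand-in with invented numbers; a real scorer would train a model.
    return abs(rank - 8) * 0.01 + 1.0 / iterations


best_rmse, best_rank, best_iterations = grid_search([4, 8, 12], [10, 20], fake_score)
```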

RMSE (ROOT MEAN SQUARE ERROR) CALCULATION METHOD
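RMSE is the square root of the mean of the squared differences between predicted and actual values. The slide's own implementation is not in the transcript; the following is a minimal plain-Python sketch, with the function name `rmse` and the sample numbers chosen for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences of numbers."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Squared errors are 0, 4 and 0, so the result is sqrt(4 / 3).
example = rmse([3.0, 4.0, 5.0], [3.0, 2.0, 5.0])
```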

EVALUATE THE BEST MODEL ON THE TEST SET

CREATE A NAIVE BASELINE AND COMPARE IT WITH THE BEST MODEL

OUTPUT
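The slide's baseline code and output are not in the transcript. A frequently used naive baseline predicts the mean training rating for every user and product; the sketch below assumes that choice. The function names, the sample ratings, and the model RMSE of 0.8 are invented for this example.

```python
import math


def baseline_rmse(train_ratings, test_ratings):
    """RMSE on the test set when every prediction is the mean training rating."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    squared = [(mean_rating - r) ** 2 for r in test_ratings]
    return math.sqrt(sum(squared) / len(squared))


def improvement_percent(baseline, model):
    """Relative reduction in RMSE of the model over the baseline, in percent."""
    return (baseline - model) / baseline * 100.0


train_ratings = [1.0, 3.0, 5.0]  # mean rating is 3.0
test_ratings = [2.0, 4.0]        # both are off by 1.0 from the mean
base = baseline_rmse(train_ratings, test_ratings)
gain = improvement_percent(base, 0.8)  # 0.8 is a made-up model RMSE
```

A model is only worth keeping if its test RMSE is clearly below this baseline.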

RECOMMEND SOME NEW PRODUCTS FOR USER WITH ID #150

AND SOME OUTPUT...

USER ALREADY REACTED ON SOME CAMPAIGNS

USE THIS INFORMATION FOR PREDICTION

AND SOME OUTPUT...

RDD FAULT TOLERANCE

SPARK DEPLOYMENT

MACHINE LEARNING

Types of Machine Learning

ALS Algorithm

ALS MODEL AND ALGORITHM

Model ratings as the product of user (A) and movie feature (B) matrices of size UxK and MxK.
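Written out, the factorization described on this slide can be stated as follows. The symbols r_{um} (rating of movie m by user u), a_u and b_m (rows of A and B), and Ω (the set of observed ratings) are notation introduced here. The regularization term with weight λ is a common addition and an assumption, since the slide does not mention it.

```latex
R \approx A B^{\top}, \qquad
A \in \mathbb{R}^{U \times K}, \quad
B \in \mathbb{R}^{M \times K}, \qquad
r_{um} \approx a_u^{\top} b_m

\min_{A,\,B} \; \sum_{(u,m) \in \Omega} \left( r_{um} - a_u^{\top} b_m \right)^2
  + \lambda \left( \sum_{u} \lVert a_u \rVert^2 + \sum_{m} \lVert b_m \rVert^2 \right)
```

With B held fixed, the objective is an ordinary least-squares problem in each a_u, and vice versa, which is what makes the alternating scheme practical.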

Alternating Least Squares (ALS)

  • Start with random A and B vectors
  • Optimize user vectors (A) based on movies
  • Optimize movie vectors (B) based on users
  • Repeat until converged
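The loop above can be sketched in a few lines of plain Python for the simplest case of a single latent feature (K = 1), where each least-squares update has a closed form. This is a simplified illustration, not Spark's MLlib implementation; the function name `als_rank1`, the regularization weight `reg`, the fixed iteration count in place of a convergence check, and the synthetic ratings are all assumptions.

```python
import random


def als_rank1(ratings, num_users, num_movies, iterations=20, reg=0.01, seed=0):
    """Alternating least squares with one latent feature per user and movie.

    ratings: list of (user, movie, rating) with 0-based integer ids.
    Returns (a, b) such that a[user] * b[movie] approximates the rating.
    """
    rng = random.Random(seed)
    # Start with random A and B (A is replaced by the first update below).
    a = [rng.uniform(0.5, 1.5) for _ in range(num_users)]
    b = [rng.uniform(0.5, 1.5) for _ in range(num_movies)]
    for _ in range(iterations):
        # Optimize user values with the movie values held fixed.
        num = [0.0] * num_users
        den = [reg] * num_users
        for u, m, r in ratings:
            num[u] += r * b[m]
            den[u] += b[m] * b[m]
        a = [n / d for n, d in zip(num, den)]
        # Optimize movie values with the user values held fixed.
        num = [0.0] * num_movies
        den = [reg] * num_movies
        for u, m, r in ratings:
            num[m] += r * a[u]
            den[m] += a[u] * a[u]
        b = [n / d for n, d in zip(num, den)]
    return a, b


# Synthetic ratings generated from known rank-1 factors, for illustration.
true_a = [1.0, 2.0, 3.0]
true_b = [1.0, 2.0]
ratings = [(u, m, true_a[u] * true_b[m]) for u in range(3) for m in range(2)]
a, b = als_rank1(ratings, num_users=3, num_movies=2, iterations=50)
max_error = max(abs(a[u] * b[m] - r) for u, m, r in ratings)
```

For K greater than 1, each update becomes a small linear system per user or movie instead of a single division.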

ALS ALGORITHM