The paradox of big data - dataiku / oxalide APEROTECH

  • Published on
    12-Jul-2015


Transcript

  • The Paradox of Big Data

  • 2001 Programming Languages, 2004 Natural Language Processing, 2006 Social Recommendation, 2008 Distributed Computing, 2009 Web Mining, 2010, 2011 Social Gaming, 2012 Advertising, 2013 Dataiku

    Time Spent Coding: 100%, 100%, 80%, 50%, 20%, 0%, 10%, 50%, 20%

    Favorite Language: C, Exascript, Exascript, Exascript, Python, Powerpoint, Python, Java, None

    Largest Dataset: 100GB, 100GB, 10GB, 10TB, 100TB, 100kB, 500GB, 100TB, 10TB

    I'm Florian and I like data

  • www.dataiku.com

    Dataiku in short: software vendor behind Data Science Studio, the Photoshop for Data Science

    COMMUNITY EDITION

    http://www.dataiku.com/dss/trynow/

  • Goals For Today

    Big Data, with the bias of what I know of it (Analytics)

    Big Data: History and Feelings

    What are the key technologies to watch ?

    Some practical use cases ?

    How to get started ?

  • Dataiku

    Motivation

    First hard drive: 3.75 megabytes. Access time: 1 second

  • IN 2008 man

    invented big data

    Volume Variety Velocity

  • WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?

    Capacity Complexity Celerity

  • OR SIMPLER

    Size Serendipity Speed

  • OR AFTER A DRINK

    Big Blur Blazing

  • Or Combine C B.. S.

  • Or Combine

    Complete Bull Sh..

  • SOOO WHAT IS

    BIG DATA ?

  • PARADOX #1 SIMPLEXITY

  • SUBTLE PATTERNS

  • "MORE BUSINESS" BUTTONS

  • PARADOX #2 SELF-AWARE

  • DATA SCIENTIST AT NIGHT

  • DATA CLEANER BY DAY

  • DATA PLUMBER ON THE WEEKEND

  • WAITING FOR COMPUTATION BETWEEN COFFEES

  • PARADOX #3 WHERE TO STORE DATA?

  • MY DATA IS WORTH MILLIONS

  • I SEND IT TO THE

    MARKETING CLOUD

    AND BACK IT UP TO GOOGLE

  • PARADOX #4 IS IT BIG OR NOT ?

  • WE ALL LIVE IN A BIG DATA

    LAKE

  • ALL MY DATA MAY FIT IN HERE

  • PARADOX #5 (at last) HUMAN OR NOT ?

  • TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE

    US ALL

  • I JUST WANT MORE REPORTS

  • BIG DATA TECH TRENDS

  • ELEPHANTS MAKE BABIES

  • Dataiku - Pig, Hive and Cascading

    WELCOME TO TECHNOSLAVIA

    Tools: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, scikit-learn, GraphLab, PredictionIO, Jubatus, Mahout, WEKA, MLbase, LibSVM, RapidMiner, pandas, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend, R, Storm

    Regions: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island, NoSQL Nihiland

  • DRIVER 1: BACK TO THE BASICS

    RAM - CPU - DISK

  • 2000 vs 2013

    Memory: $1000 / GB in 2000, $6 / GB in 2013 (memory cost divided by 150)

    Disk: $10 / GB in 2000, $0.06 / GB in 2013 (disk cost divided by 250)

    MAP REDUCE times

    HACK REDUCE times

    A PERSISTENT MEMORY PROBLEM

  • DATA IS BIGGER

  • IS USEFUL DATA BIGGER ?

    WHOLE DATA

    REFINED DATA

  • GOLD

    NEEDLE IN HAYSTACK ?

  • OIL

    REFINE BEFORE USE

  • HOW BIG IS BIG DATA? Web Site

    $1 billion revenue per year

    10 million unique visitors per month

    100 million orders / actions per day

    10TB RAW DATA

    1TB REFINED DATA

  • 1 TERABYTE

    FITS IN MEMORY

    1TB

  • DRIVER 2 : ECOSYSTEM GROWS

    GOOGLE

    1st Circle: OPEN SOURCE, YAHOO, IBM, LINKEDIN, FACEBOOK

    2nd Circle: STANFORD, BERKELEY, STARTUPS

  • STARTUPS

    Funding rounds shown, 2009 to 2013: $64M, $6.75M, $14M, $2M, $40M, $20M, $20.5M, $19M, $4M, $100M, $1.8M, $17M, $11M, $7.75M, $1.7M

    $1B per year invested in Big Data

    TECH $223M, $301M

  • ALL > SPARK

    Real-Time Resilient Distributed Memory Framework

    Abstraction with any DAG operation on data:
    - Filter
    - Map
    - Reduce
    - Cache
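
    To make the Filter / Map / Reduce / Cache idea concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark API: the `LazyDataset` class, the `parse` helper and the sample log lines are all hypothetical, chosen only to show that transformations are recorded lazily and that a cached step is computed once and then reused.

```python
import functools


class LazyDataset:
    """Toy, single-machine stand-in for a distributed dataset (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable returning an iterator
        self._cache_enabled = False
        self._materialized = None      # rows kept in memory once cache() has been used

    def _rows(self):
        # Reuse the in-memory copy if this step was cached and already computed.
        if self._materialized is not None:
            return iter(self._materialized)
        rows = self._compute()
        if self._cache_enabled:
            self._materialized = list(rows)
            return iter(self._materialized)
        return rows

    def filter(self, predicate):
        # Nothing runs here: we only describe a new step that depends on this one.
        return LazyDataset(lambda: (row for row in self._rows() if predicate(row)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(row) for row in self._rows()))

    def cache(self):
        self._cache_enabled = True
        return self

    def reduce(self, fn):
        # reduce is the only operation that triggers the computation.
        return functools.reduce(fn, self._rows())


# Hypothetical device log lines: device id, status, duration in milliseconds.
LOG_LINES = ["a,ok,120", "b,error,340", "a,ok,80", "c,error,15"]
calls = {"parsed": 0}


def parse(line):
    calls["parsed"] += 1
    device, status, ms = line.split(",")
    return device, status, int(ms)


logs = LazyDataset(lambda: iter(LOG_LINES))
errors = logs.map(parse).filter(lambda rec: rec[1] == "error").cache()

total_error_ms = errors.map(lambda rec: rec[2]).reduce(lambda a, b: a + b)
error_count = errors.map(lambda rec: 1).reduce(lambda a, b: a + b)
```

    Both results are derived from the same cached `errors` step, so the log lines are parsed only once even though two reductions are run.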

  • SPARK AND ITS ECOSYSTEM

    SPARK

    SHARK: Real-Time Queries

    STREAMING: Real-Time Updates

    MLBASE: In-Memory Learning

  • SooOOo WHAT IS IT IN PRACTICE?

  • Turn Device Logs Into Next Year's Business

    Parking ticket machine data

    OpenStreetMap data

    Cleaning and enrichment of data; crossing data

    Data Science Studio

    Creation of a predictive algorithm

    Availability of the predictions

    Each street is segmented into small pieces that are enriched with geospatial information.

    The parking ticket history is joined with the points of interest from OpenStreetMap.

    The availability of parking spaces is predicted by street segment from the joined data.

    The algorithm is finally integrated into the iPhone app "Find me a space".
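
    The join-then-predict flow described on this slide can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the segment ids, ticket counts, point-of-interest counts and capacities are invented, and the predictor is a naive historical average swapped in for the real predictive algorithm, which the slides do not describe.

```python
from collections import defaultdict

# Hypothetical ticket history: (street segment id, hour of day, tickets sold).
TICKET_HISTORY = [
    ("seg-1", 9, 14), ("seg-1", 9, 16),
    ("seg-2", 9, 3), ("seg-2", 9, 5),
]
# Hypothetical number of OpenStreetMap points of interest near each segment.
POI_COUNT = {"seg-1": 12, "seg-2": 2}
# Hypothetical number of parking spaces on each segment.
CAPACITY = {"seg-1": 20, "seg-2": 10}


def join_features(history, poi_count):
    """Join the ticket history with points of interest on the segment id."""
    return [
        {"segment": seg, "hour": hour, "tickets": tickets, "poi": poi_count.get(seg, 0)}
        for seg, hour, tickets in history
    ]


def predict_availability(rows, capacity):
    """Naive baseline: capacity minus the average tickets sold per (segment, hour).

    A real model would also learn from the joined features such as "poi";
    this baseline ignores them on purpose to stay short.
    """
    sold = defaultdict(list)
    for row in rows:
        sold[(row["segment"], row["hour"])].append(row["tickets"])
    return {
        key: max(0.0, capacity[key[0]] - sum(values) / len(values))
        for key, values in sold.items()
    }


joined = join_features(TICKET_HISTORY, POI_COUNT)
predictions = predict_availability(joined, CAPACITY)
```

    With this toy data, `predictions` maps each (segment, hour) pair to an expected number of free spaces, which is the kind of value an app could display per street segment.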


  • Optimizing Last Mile with Data Science Studio

    Data Science Studio

    Historical delivery and retrieval data

    Modeling of a score for each delivery

    Cleaning and temporal enrichment of data

    Data aggregation by geographic location

    Incorporation of new deliveries to the existing model


  • Search reformulation

    No answer

    Click on a business listing, Top search, Navigation or filter click

    HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?

    >200M searches

    20M

    1.4M queries with >10 occurrences

    0.5M prioritized queries

    Analysis & corrections

    Automation
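
    The ">10 occurrences" cut on this slide amounts to counting how often each query appears in the search log and keeping only the frequent ones. Here is a minimal sketch with Python's standard `collections.Counter`; the `normalize` rule and the sample queries are assumptions for illustration, not the pipeline actually used (the next slide mentions Hadoop with Pig and Hive).

```python
from collections import Counter


def normalize(query):
    """Assumed normalization: lowercase and collapse repeated whitespace."""
    return " ".join(query.lower().split())


def frequent_queries(search_log, min_occurrences=10):
    """Return (query, count) pairs seen strictly more than min_occurrences times."""
    counts = Counter(normalize(q) for q in search_log)
    return [(q, n) for q, n in counts.most_common() if n > min_occurrences]


# Hypothetical search log.
search_log = (
    ["plumber paris"] * 12
    + ["Plumber  Paris"]          # same query after normalization
    + ["dentist lyon"] * 10       # exactly 10 occurrences, so not kept
    + ["xyz"]
)

kept = frequent_queries(search_log)
```

    Only the frequent queries are then worth analyzing and correcting, which keeps the workload far below the total number of searches.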

  • SOLUTION

    Machine

    Management, Exploration

    pagesjaunes.fr, Directory

    Hadoop, Pig + Hive

    Export, Indexing

    Interpretation engine

    Crawl, Other reference data

    Scikit-learn

  • 1970: Birth of Computer Analytics

    Analyst

    Panels

    Computer

    Expensive Software

    Marketing Studies

  • 2015: BUILD YOUR FACTORY

    Multiple Data Sources

    CRM

    Logs

    Analyst Team

    Server Cluster

    Light Software

    Many Models

    Personalised Experience Model

    Acquisition Cost Opportunity Model

    Stock Optimisation Model

    Optimize Delivery

  • A MODEL: an automated way to make a computer take a decision from raw (historical) data

    Churn, Volume Forecast, Recommender, Segmentation, Lifetime Value, Risk Score, Hot Location, Pricing, Ranking, Fraud, Event Paths

    The model can be used to take immediate (real-time) actions through an API
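
    As a rough illustration of "a decision from historical data", here is a deliberately tiny churn model in plain Python. It is only a sketch: the history, the single `days_since_last_visit` feature, the one-threshold rule and the action names are all hypothetical, and `decide` stands in for what would sit behind a real-time API.

```python
# Hypothetical history: (days since last visit, did the customer churn?).
HISTORY = [(2, False), (5, False), (9, False), (30, True), (45, True), (60, True)]


def train_threshold(history):
    """Learn the rule "churn if days > threshold" with the fewest errors on history."""
    candidates = sorted({days for days, _ in history})

    def errors(threshold):
        return sum((days > threshold) != churned for days, churned in history)

    return min(candidates, key=errors)


def decide(threshold, days_since_last_visit):
    """Turn the learned rule into an immediate action for one customer."""
    if days_since_last_visit > threshold:
        return "send_retention_offer"
    return "do_nothing"


threshold = train_threshold(HISTORY)
action_for_inactive = decide(threshold, 40)
action_for_active = decide(threshold, 3)
```

    Training happens once on the historical data; `decide` is cheap enough to call for every incoming request, which is what makes real-time use possible.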

  • SooOOo HOW DO I ENTER WONDERLAND?

  • STEP 1 : LEARN

    PYTHON + PANDAS + SCIKIT

    R

    SCALA

    http://scikit-learn.org/

    https://www.coursera.org/course/rprog

  • STEP 2 : PRACTICE

    Enter a contest on kaggle.com or datascience.net

    Join a meetup

  • www.dataiku.com

    http://www.dataiku.com/dss/trynow/

    Dataiku HQ

    2 rue Jean Lantier

    75001 Paris France

    Dataiku West

    2423A Durant Avenue

    Berkeley, CA 94704

    Florian florian.douetteau@dataiku.com

    You have ideas

    My data is too dirty. I don't even know where to start

    We could probably understand our users better. But how?

    There's a trend here, but our full historical data is just too big

    You have data

    You need a tool