The paradox of big data - dataiku / oxalide APEROTECH

The Paradox of Big Data

Year  Focus                        Time spent coding  Favorite language  Largest dataset
2001  Programming Languages        100%               C                  100GB
2004  Natural Language Processing  100%               Exascript          100GB
2006  Social Recommendation        80%                Exascript          10GB
2008  Distributed Computing        50%                Exascript          10TB
2009  Web Mining                   20%                Python             100TB
2010                               0%                 PowerPoint         100kB
2011  Social Gaming                10%                Python             500GB
2012  Advertising                  50%                Java               100TB
2013  Dataiku                      20%                None               10TB

I'm Florian and I like data
www.dataiku.com

DATAIKU IN SHORT
Software editor behind Data Science Studio, the Photoshop for Data Science
COMMUNITY EDITION

FOR TODAY
- Big Data, with the bias of what I know of it (Analytics)
- Big Data: History and Feelings
- What are the key technologies to watch?
- Some practical use cases?
- How to get started?

MOTIVATION (Dataiku, 1/8/14)

First hard drive: 3.75 megabytes, access time: 1 second

IN 2008 MAN INVENTED BIG DATA
Volume, Variety, Velocity

WHAT IF THE MARKETING GUY HAD CHOSEN ANOTHER LETTER?
Capacity, Complexity, Celerity
OR SIMPLER
Size, Serendipity, Speed
OR AFTER A DRINK
Big, Blur, Blazing
OR COMBINE
C B.. S.: Complete Bull Sh..

SOOO WHAT IS BIG DATA?
PARADOX #1: SIMPLEXITY
- SUBTLE PATTERNS
- "MORE BUSINESS" BUTTONS

PARADOX #2: SELF-AWARE
- DATA SCIENTIST AT NIGHT
- DATA CLEANER BY DAY
- DATA PLUMBER ON THE WEEKEND
- WAITING FOR COMPUTATIONS BETWEEN COFFEES

PARADOX #3: WHERE TO STORE DATA?
- MY DATA IS WORTH MILLIONS
- I SEND IT TO THE MARKETING CLOUD
- AND BACK IT UP TO GOOGLE

PARADOX #4: IS IT BIG OR NOT?
- WE ALL LIVE IN A BIG DATA LAKE
- ALL MY DATA MAY FIT IN HERE

PARADOX #5 (at last): HUMAN OR NOT?
- TECHCRUNCH SAYS THAT MACHINE LEARNING WILL SAVE US ALL
- I JUST WANT MORE REPORTS

BIG DATA TECH TRENDS

ELEPHANTS MAKE BABIES
Dataiku - Pig, Hive and Cascading

WELCOME TO TECHNOSLAVIA
[Map of the big data technology landscape.]
Regions: Machine Learning Mystery Land, Scalability Central, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House (R), Real-time Island (Storm), NoSQL Nihiland.
Technologies: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Scikit-Learn, GraphLab, Jubatus, Mahout, WEKA, MLBase, LibSVM, RapidMiner, Pandas, Kibana, InfiniDB, Drill, Spark SQL, Hive, Impala, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend.

DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK

        2000        2013
RAM     $1000 / GB  $6 / GB
Disk    $10 / GB    $0.06 / GB

- Memory cost divided by 150
- Disk cost divided by 250

MAP REDUCE times
HACK REDUCE times
A PERSISTENT MEMORY PROBLEM

DATA IS BIGGER. IS USEFUL DATA BIGGER?
WHOLE DATA vs. REFINED DATA
- GOLD: NEEDLE IN A HAYSTACK?
- OIL: REFINE BEFORE USE

HOW BIG IS BIG DATA?
Web site:
- $1 billion revenue per year
- 10 million unique visitors per month
- 100 million orders / actions per day
- 10TB RAW DATA
- 1TB REFINED DATA
1 TERABYTE FITS IN MEMORY

DRIVER 2: ECOSYSTEM GROWS
- Google
- 1st circle: open source, Yahoo, IBM, LinkedIn, Facebook
- 2nd circle: Stanford, Berkeley, startups

STARTUPS
[Figure: funding per startup by year, 2009-2013. Amounts shown: $64M, $6.75M, $14M, $2M, $40M, $20M, $20.5M, $19M, $4M, $100M, $1.8M, $17M, $11M, $7.75M, $1.7M. Totals shown: $223M and $301M.]
$1B per year invested in Big Data tech

ALL > SPARK
Real-time, resilient, distributed memory framework
Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache

SPARK AND ITS ECOSYSTEM
- Shark: real-time queries
- Streaming: real-time updates
- MLBase: in-memory learning

SooOOo WHAT IS IT IN PRACTICE?

TURN DEVICE LOGS INTO NEXT YEAR'S BUSINESS
Pipeline in Data Science Studio:
- Parking ticket machine data and OpenStreetMap data
- Cleaning and enrichment of data
- Crossing data
- Creation of a predictive algorithm
- Availability of the predictions
Each street is segmented into small pieces that are enriched with geospatial information. The parking ticket history is joined with the points of interest from OpenStreetMap. The availability of parking spaces is predicted per street segment from the joined data. The algorithm is finally integrated in the iPhone app "Find me a space".

OPTIMIZING LAST MILE WITH DATA SCIENCE STUDIO
- Historical delivery and retrieval data
- Cleaning and temporal enrichment of data
- Data aggregation by geographic location
- Modeling of a score for each delivery
- Incorporation of new deliveries into the existing model

HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR? (pagesjaunes.fr)
User signals: search reformulation, no answer, click on a business listing, top search, navigation or filter click.
[Figure: query funnel. Values shown: >200M searches, 20M, 1.4M queries, >10 occurrences, 0.5M prioritized queries. Steps shown: analysis & corrections, automation.]
SOLUTION
[Diagram: machine, management, exploration, pagesjaunes.fr, directory, Hadoop, Pig + Hive, export, indexing, interpretation engine, crawl, other reference data, scikit-learn.]

1970: BIRTH OF COMPUTER ANALYTICS
Analyst, panels, computer, expensive software, marketing studies

2015: BUILD YOUR FACTORY
- Multiple data sources (CRM, logs)
- Analyst team
- Server cluster
- Light software
- Many models: personalised experience model, acquisition cost opportunity model, stock optimisation model, optimize delivery

Churn, Volume Forecast, Recommender, Segmentation, Lifetime Value, Risk Score, Hot Location, Pricing, Ranking, Fraud, Event Paths

A MODEL
An automated way to make a computer take a decision from raw (historical) data.
The model can be used to take immediate (real-time) actions through an API.

SooOOo HOW DO I ENTER WONDERLAND?
STEP 1: LEARN
- Python + pandas + scikit-learn
- R
- Scala
STEP 2: PRACTICE
- Try to enter a contest
- Or join a meetup

www.dataiku.com
HQ: 2 rue Jean Lantier, 75001 Paris, France
Dataiku West: 2423A Durant Avenue, Berkeley, CA 94704
Florian: florian.douetteau@dataiku.com

YOU HAVE IDEAS
- "My data is too dirty. I don't even know where to start."
- "We could probably better understand our users. But how?"
- "There's a trend here, but our full historical data is just too big."
YOU HAVE DATA
YOU NEED A TOOL
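
To make the "A MODEL" slide concrete, here is a minimal sketch of a churn model in the sense used in the talk: something learned from historical data that then takes a decision automatically. Everything in it is an assumption for illustration: the toy history, the single feature (days since last order), and the function names fit_threshold and predict_churn are hypothetical, and it is written in plain Python so it runs without pandas or scikit-learn. A real project would more likely use those libraries.

```python
# Hypothetical historical data: (days since last order, did the customer churn?)
HISTORY = [
    (3, False), (7, False), (12, False), (20, False),
    (35, True), (41, True), (60, True), (90, True),
]


def fit_threshold(history):
    """Learn the cut-off on 'days since last order' that misclassifies
    the fewest historical rows (a one-feature decision stump)."""
    values = sorted({days for days, _ in history})
    # Candidate cut-offs: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        # Rule under test: predict churn when days > threshold.
        return sum((days > threshold) != churned for days, churned in history)

    return min(candidates, key=errors)


def predict_churn(threshold, days_since_last_order):
    """Decision step: this is the part an API endpoint would call."""
    return days_since_last_order > threshold


threshold = fit_threshold(HISTORY)
print(threshold)                     # 27.5
print(predict_churn(threshold, 45))  # True
print(predict_churn(threshold, 10))  # False
```

The split between fit_threshold (run offline on history) and predict_churn (run per request) mirrors the slide's distinction between learning from historical data and taking real-time actions.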