"Why Spark Is Not as Good as It Seems"

Published on 19-Jul-2015

Transcript

Spark
Alexey Diomin, diominay@gmail.com

Intro

Basics
  • RDD (Resilient Distributed Dataset)
  • SchemaRDD
  • DAG (diagram slides)
  • mapValues (diagram)

Mythology
  • Spark is not MapReduce
  • "Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk"
  • In-memory processing
  • Spark Streaming is real-time streaming
  • Lightning-fast cluster computing

MapReduce / Not MapReduce (diagram slides)

Spark
  • Claimed: "Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk"
  • The actual wording on http://spark.apache.org/ is "up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk"
  • * that is, Hadoop without Tez

In-memory
  • "The MapReduce and Spark shuffles use a pull model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data."
  • http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

Spark Streaming
  • RDD + DAG (diagram slides)
  • Receiver.store(...)

Google Cloud Dataflow
  • "One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts."
  • http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

Lightning-fast cluster computing (diagram slides)

Spark: problem areas
  • Logging
  • Pipeline
  • Indexes
  • Job progress
  • Effective memory
  • Network

Example
  • Staged (batch) execution (diagram)
  • Pipelined execution (diagram)

Indexes
  • Netflix
  • https://github.com/amplab/spark-indexedrdd

Job Progress
  • Accumulators
  • Broadcast

Memory
    // Inside the executor, the task result is serialized once...
    val value = task.run(taskId, attemptNumber)
    val valueBytes = resultSer.serialize(value)
    // ...then the serialized bytes are wrapped and serialized a second time.
    val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
    val serializedDirectResult = ser.serialize(directResult)

  Default JavaSerializer:
    // ByteArrayOutputStream.toByteArray returns a copy of the internal buffer.
    public synchronized byte[] toByteArray() {
        return Arrays.copyOf(buf, count);
    }

Network
  • Problems with firewalls, NAT, multiple IPs, etc.

SQL
  • Shark (dead)
  • Spark SQL
  • Spark on Hive

SparkR
  • Unstable API
  • Minimal docs
  • RStudio Server

Links
  • Spark: http://spark.apache.org/
  • Flink: http://flink.apache.org/
  • Tez: http://tez.apache.org/
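For the "Basics" slides (RDD, DAG, mapValues): a minimal sketch, not from the deck, of how transformations only build the lineage (the DAG) and an action triggers execution. The object, data and file-free setup are invented; it assumes a local Spark 1.x-style job.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddDagSketch {
      def main(args: Array[String]): Unit = {
        // Local setup for illustration only; master and app name are placeholders.
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-dag"))

        // Each transformation only adds a node to the lineage; nothing runs yet.
        val counts = sc.parallelize(Seq("a b", "b c", "a a"))
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)   // wide dependency: introduces a shuffle boundary
          .mapValues(_ * 2)     // narrow dependency: stays in the same stage

        println(counts.toDebugString)     // prints the lineage / DAG
        counts.collect().foreach(println) // the action finally triggers execution

        sc.stop()
      }
    }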
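For the "In-memory" slide and the quoted pull-model behaviour: a hedged sketch showing that a shuffle still writes map output to local disk even when the input RDD is fully cached in memory. The spark.local.dir path and all names are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object ShuffleSpillSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("shuffle-sketch")
          .set("spark.local.dir", "/tmp/spark-shuffle") // hypothetical path: shuffle files land here

        val sc = new SparkContext(conf)

        // Caching keeps the input partitions in memory, but it does not make the shuffle in-memory.
        val pairs = sc.parallelize(1 to 1000000)
          .map(i => (i % 100, 1))
          .persist(StorageLevel.MEMORY_ONLY)

        // reduceByKey forces a shuffle: map tasks write their output to local disk,
        // and reduce tasks pull those files over the network (the pull model from the slide).
        pairs.reduceByKey(_ + _).count()

        sc.stop()
      }
    }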
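For the "Spark Streaming" slides and Receiver.store(...): a sketch of a custom receiver. The class name, the 100 ms emit rate and the 5-second interval are invented; the point is that store() only hands records to Spark, which turns them into RDDs at each batch interval rather than at arrival time, i.e. micro-batching rather than per-record streaming.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver

    // A minimal custom receiver: data enters Spark Streaming through store(...).
    class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {
      @volatile private var running = true

      def onStart(): Unit = {
        new Thread("counter-receiver") {
          override def run(): Unit = {
            var i = 0L
            while (running && !isStopped()) {
              store(s"record-$i")  // the call the slide refers to
              i += 1
              Thread.sleep(100)
            }
          }
        }.start()
      }

      def onStop(): Unit = { running = false }
    }

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")
        val ssc = new StreamingContext(conf, Seconds(5)) // records surface only every 5 seconds
        ssc.receiverStream(new CounterReceiver).count().print()
        ssc.start()
        ssc.awaitTerminationOrTimeout(30000)
        ssc.stop()
      }
    }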
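For the "Staged (batch) execution" vs "Pipelined execution" slides: a hedged illustration, not necessarily the exact point of the diagrams, of how Spark pipelines narrow transformations inside one stage while a shuffle closes the stage and materializes its output before the next stage starts. Names and data are made up.

    import org.apache.spark.{SparkConf, SparkContext}

    object PipeliningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("pipelining"))

        val words = sc.parallelize(Seq("spark", "flink", "tez", "hadoop"))

        // map and filter are narrow transformations: Spark pipelines them within one stage,
        // pushing each record straight through without materializing intermediate RDDs.
        val pipelined = words.map(_.toUpperCase).filter(_.startsWith("S"))

        // groupBy needs a shuffle, so it ends the stage: everything before it runs as one
        // pipelined stage, its output is written out, and a new stage starts after it.
        val staged = pipelined.groupBy(_.length)

        println(staged.toDebugString) // the stage boundary is visible in the lineage
        staged.collect()

        sc.stop()
      }
    }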
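For the "Indexes" slide: plain RDDs carry no index, so a point lookup has to scan data; spark-indexedrdd, the project linked on the slide, exists to address exactly this. The sketch below uses only the stock RDD API; the dataset and key are made up.

    import org.apache.spark.{SparkConf, SparkContext}

    object PointLookupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lookup"))

        // A pair RDD with no index structure at all.
        val users = sc.parallelize((1L to 1000000L).map(id => (id, s"user-$id")))

        // lookup(key) still has to visit partitions to find the key; it is only narrowed
        // down to a single partition when the RDD has a known partitioner.
        val name = users.lookup(42L)
        println(name)

        sc.stop()
      }
    }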
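For the "Job Progress" slide: a sketch of the two building blocks it names, accumulators and broadcast variables; the names and data are invented. Accumulators give at best a coarse progress signal, since the driver only sees updates as tasks complete.

    import org.apache.spark.{SparkConf, SparkContext}

    object ProgressSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("progress"))

        // Accumulator: tasks add to it, only the driver can read it -- a crude progress counter.
        val processed = sc.accumulator(0L)

        // Broadcast: read-only data shipped once to each executor.
        val stopWords = sc.broadcast(Set("a", "the", "and"))

        sc.parallelize(Seq("a quick fox", "the lazy dog"), 2)
          .flatMap(_.split(" "))
          .filter(w => !stopWords.value.contains(w))
          .foreach { _ => processed += 1L } // side effect, visible on the driver as tasks finish

        println(s"processed = ${processed.value}")
        sc.stop()
      }
    }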
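For the "Network" slide (firewall/NAT/multiple-IP problems): by default the driver RPC endpoint, block manager and other services listen on random ports, which is what makes restrictive networks painful. A sketch of pinning the relevant settings in SparkConf; the host and port values are placeholders.

    import org.apache.spark.SparkConf

    object NetworkConfSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("network-sketch")
          .set("spark.driver.host", "10.0.0.5")      // address executors use to reach the driver
          .set("spark.driver.port", "36000")         // fixed driver RPC port
          .set("spark.blockManager.port", "36001")   // fixed block manager port
        println(conf.toDebugString)
      }
    }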
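For the "SQL" slide: a minimal Spark 1.x-era sketch of Spark SQL via SQLContext, where what the "Basics" slide calls SchemaRDD became the DataFrame; the Person record and data are invented, and "Spark on Hive" would swap in HiveContext instead.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type for the example.
    case class Person(name: String, age: Int)

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sql-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Build a DataFrame (the former SchemaRDD) and query it with SQL.
        val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
        people.registerTempTable("people")
        sqlContext.sql("SELECT name FROM people WHERE age > 26").show()

        // "Spark on Hive" uses org.apache.spark.sql.hive.HiveContext, which talks to an
        // existing Hive metastore and accepts HiveQL, in place of the plain SQLContext.
        sc.stop()
      }
    }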