rdd | 易学教程

java.io.NotSerializableException in Spark Streaming with enabled checkpointing

Code below:

    def main(args: Array[String]) {
      val sc = new SparkContext
      val sec = Seconds(3)
      val ssc = new StreamingContext(sc, sec)
      ssc.checkpoint("./checkpoint")
      val rdd = ssc.sparkContext.parallelize(Seq("a","b","c"))
      val inputDStream = new ConstantInputDStream(ssc, rdd)
      inputDStream.transform(rdd => {
        val buf = ListBuffer[String]()
        buf += "1"
        buf += "2"
        buf += "3"
        val other_rdd = ssc.sparkContext.parallelize(buf) // create a new rdd
        rdd.union(other_rdd)
      }).print()
      ssc.start()
      ssc.awaitTermination()
    }

It throws this exception:

    java.io.NotSerializableException: DStream checkpointing has been

Return RDD of largest N values from another RDD in SPARK

Question: I'm trying to filter an RDD of tuples to return the largest N tuples based on key values. I need the return format to be an RDD. So the RDD:

    [(4, 'a'), (12, 'e'), (2, 'u'), (49, 'y'), (6, 'p')]

filtered for the largest 3 keys should return the RDD:

    [(6, 'p'), (12, 'e'), (49, 'y')]

Doing a sortByKey() and then take(N) returns the elements as a local collection rather than an RDD, so that won't work. I could return all of the keys, sort them, find the Nth largest value, and then filter the RDD for key values
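The selection described in this question can be checked locally in plain Python. This is only a sketch of the logic, not Spark code: `largest_n_by_key` is a hypothetical helper name, and an ordinary list stands in for the contents of the RDD.

```python
import heapq

def largest_n_by_key(pairs, n):
    """Return the n pairs with the largest keys, in ascending key order."""
    # heapq.nlargest avoids sorting the whole input when n is small.
    top = heapq.nlargest(n, pairs, key=lambda kv: kv[0])
    # nlargest returns descending order; the question's example is ascending.
    return sorted(top, key=lambda kv: kv[0])

# Sample data from the question (a local list, not an RDD).
pairs = [(4, 'a'), (12, 'e'), (2, 'u'), (49, 'y'), (6, 'p')]
result = largest_n_by_key(pairs, 3)
print(result)
```

In PySpark, `RDD.top` performs a comparable selection and returns a local list, which `SparkContext.parallelize` can turn back into an RDD. That pairing is a suggestion here, not something the question confirms.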

Get Top 3 values for every key in an RDD in Spark

Question: I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format:

    (key, String, value)

So imagine I had an RDD with content like this:

    [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
     ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

I can currently display the top 3 values in the RDD like so:

    ("K1", "ddd", 9) (
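The difference between "top 3 overall" and "top 3 per key" can be illustrated with a small plain-Python sketch. It is not Spark code and not the answer from the original thread: `top_n_per_key` is a hypothetical helper, and a local list stands in for the RDD.

```python
import heapq
from collections import defaultdict

def top_n_per_key(records, n=3):
    """Group (key, label, value) records by key and keep the n largest values per key."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[0]].append(record)
    # For each key, keep the n records with the largest value (third field).
    return {
        key: heapq.nlargest(n, rows, key=lambda r: r[2])
        for key, rows in grouped.items()
    }

# Sample data from the question (a local list, not an RDD).
records = [
    ("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
    ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1),
]
top3 = top_n_per_key(records)
for key, rows in top3.items():
    print(key, rows)
```

The same shape (group by key, then keep a bounded number of rows per group) is what a distributed version would need to reproduce.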

How does Spark recover the data from a failed node?

Question: Suppose we have an RDD that is used multiple times. To avoid recomputing it again and again, we persist it with the rdd.persist() method. When the RDD is persisted, the nodes that compute it store their partitions. Now suppose a node holding a persisted partition fails. What happens then? How will Spark recover the lost data? Is there any replication mechanism, or some other mechanism?

Answer 1: When you do rdd.persist, rdd doesn't

Spark: Expansion of RDD(Key, List) to RDD(Key, Value)

So I have an RDD of something like RDD[(Int, List)], where a single element in the RDD looks like

    (1, List(1, 2, 3))

My question is how can I expand the key-value pair to something like this:

    (1,1)
    (1,2)
    (1,3)

Thank you

Answer (Scala):

    rdd.flatMap { case (key, values) => values.map((key, _)) }

And in Python (based on @seanowen's answer):

    rdd.flatMap(lambda x: map(lambda e: (x[0], e), x[1]))

Source: https://stackoverflow.com/questions/36392938/spark-expansion-of-rddkey-list-to-rddkey-value
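To see what the flatMap answers produce without a Spark cluster, the same expansion can be written in plain Python. This is a local sketch only: `expand` is a hypothetical helper, and the second input pair is added here for illustration.

```python
def expand(pairs):
    """Turn (key, [v1, v2, ...]) pairs into a flat list of (key, v) pairs."""
    return [(key, value) for key, values in pairs for value in values]

# First pair is from the question; the second is an illustrative extra.
expanded = expand([(1, [1, 2, 3]), (2, [4])])
print(expanded)
```

In PySpark, `rdd.flatMapValues(lambda v: v)` should give the same pairs, since it flattens each value list while keeping the key.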

RDD Basics: Notes

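RDD programming basics

An RDD in Spark is an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which are computed on different nodes of the cluster. An RDD can contain objects of any type in Python, Java, or Scala, including user-defined objects.

Two ways to create an RDD:
1. Load an external dataset.
2. Distribute a collection of objects (such as a list or set) from the driver program.

Operations supported by RDDs:
1. Transformations: produce a new RDD from an existing RDD.
2. Actions: compute a result from an RDD and either return it to the driver program or write it to external storage.
3. Although new RDDs can be defined at any time, Spark computes them lazily. They are actually computed only the first time they are used in an action.
4. By default, Spark recomputes an RDD each time you run an action on it. (Being able to recompute at any time is why RDDs are described as "resilient".) To reuse the same RDD in multiple actions, call RDD.persist() so that Spark caches it.

Every Spark program or shell session works as follows:
1. Create input RDDs from external data.
2. Use transformations such as filter() to transform RDDs and define new ones.
3. Tell Spark to persist() any intermediate RDDs that need to be reused.
4.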
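The lazy-evaluation and persist() behaviour described in these notes can be modelled with a toy class in plain Python. This is not Spark's implementation or API: `LazyDataset` and its `computations` counter are hypothetical names invented for this sketch, which only mimics the rule that transformations record work and actions run it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe and runs it only on an action."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._persisted = False
        self._cache = None        # filled by the first action after persist()
        self.computations = 0     # how many times the recipe has actually run

    def filter(self, predicate):
        # Transformation: returns a new dataset; nothing is computed yet.
        return LazyDataset(lambda: [x for x in self._materialize() if predicate(x)])

    def persist(self):
        # Mark the dataset so the next action keeps its result in memory.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache
        self.computations += 1
        data = self._compute()
        if self._persisted:
            self._cache = data
        return data

    def count(self):
        # Action: forces the computation.
        return len(self._materialize())

    def collect(self):
        # Action: forces the computation and returns the data.
        return list(self._materialize())


lines = LazyDataset(lambda: ["error: disk", "ok", "error: net"])

errors = lines.filter(lambda s: s.startswith("error"))
runs_before_action = errors.computations   # still 0: only a recipe so far
errors.count()
errors.collect()
uncached_runs = errors.computations        # recomputed once per action

cached = lines.filter(lambda s: s.startswith("error")).persist()
cached.count()
cached.collect()
cached_runs = cached.computations          # computed once, then served from cache

print(runs_before_action, uncached_runs, cached_runs)
```

The counter makes the trade-off visible: without persist() every action repeats the work, while a persisted dataset pays for it once.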

Spark: difference when read in .gz and .bz2

I normally read and write files in Spark using .gz, where the number of files is the same as the number of RDD partitions, i.e. one giant .gz file is read into a single partition. However, if I read in one single .bz2, would I still get one single giant partition? Or does Spark support automatically splitting one .bz2 into multiple partitions? Also, how do I know how many partitions there will be when Hadoop reads a single bz2 file? Thanks!

However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark support automatic split one .bz2 to multiple

Update: Spark Streaming Principles, Execution Process, and Advanced Features

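Spark Streaming

Overview: introduction, getting started, principles, operations

Table of Contents
1. Introduction to Spark Streaming
2. Getting started with Spark Streaming
3. Principles
4. Operations

1. Introduction to Spark Streaming

Overview: scenarios for stream computing, stream computing frameworks, characteristics of Spark Streaming

New scenarios

Looking over some requirements that are common today, we should ask ourselves one question: how can these requirements be met?

Scenario: Product recommendation
Explanation: Online stores such as JD.com and Taobao show product recommendation modules in the shopping cart, on product detail pages, and elsewhere.
Requirements for product recommendation:
1. Fast processing: recommendations must appear promptly after an item is added to the cart.
2. Large data volumes.
3. Recommendation algorithms are needed.

Scenario: Industrial big data
Explanation: In today's factories, equipment can be networked and can report its own operating status. The application layer can use this data to analyze operating condition and robustness, and to display workpiece completion, operating status, and so on.
Requirements for industrial big data:
1. Fast response, predicting problems in time.
2. Data is generated and reported dynamically in the form of events.
3. Because the data is operating status information, usually from tens to hundreds of machines, the volume of reported data is large.

Scenario: Monitoring
Explanation: Large clusters and platforms generally need to be monitored.
Requirements for monitoring:
1. Monitor the various databases, including MySQL and HBase.
2. Monitor applications, such as Tomcat, Nginx, and Node.js.
3. Monitor hardware metrics, such as CPU, memory, and disk.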
