rdd

Use combineByKey to get output as (key, iterable[values])

橙三吉。 submitted on 2019-11-28 12:57:20
Question: I am trying to transform RDD(key, value) into RDD(key, Iterable[value]), the same output as returned by the groupByKey method. But since groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used: val data = List("abc,2017-10-04,15.2", "abc,2017-10-03,19.67", "abc,2017-10-02,19.8", "xyz,2017-10-09,46.9", "xyz,2017-10-08,48.4", "xyz,2017-10-07,87.5", "xyz,2017-10-04,83.03", "xyz,2017-10-03,83.41", "pqr,2017-09-30,18.18", "pqr,2017
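The code in the excerpt is cut off, so here is a minimal, hypothetical sketch of the kind of combineByKey call the question appears to be after, assuming each line is parsed into (key, (date, value)). Note that building a per-key collection this way still shuffles every value, so it is not meaningfully cheaper than groupByKey.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object CombineByKeyGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKeyGroup").setMaster("local[*]"))
    val data = List("abc,2017-10-04,15.2", "abc,2017-10-03,19.67", "xyz,2017-10-09,46.9")
    val pairs = sc.parallelize(data).map { line =>
      val Array(key, date, value) = line.split(",")
      (key, (date, value.toDouble))
    }
    // Build an ArrayBuffer per key: createCombiner, mergeValue, mergeCombiners
    val grouped = pairs.combineByKey(
      (v: (String, Double)) => ArrayBuffer(v),
      (buf: ArrayBuffer[(String, Double)], v: (String, Double)) => buf += v,
      (b1: ArrayBuffer[(String, Double)], b2: ArrayBuffer[(String, Double)]) => b1 ++= b2
    ).mapValues(_.toIterable)   // RDD[(String, Iterable[(String, Double)])]
    grouped.collect().foreach(println)
    sc.stop()
  }
}
```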

Handling out-of-memory (OOM) errors in Spark

对着背影说爱祢 submitted on 2019-11-28 12:43:38
Overview: OOM problems in Spark fall into two broad cases: memory overflow during map execution, and memory overflow after a shuffle. The map case covers all map-style operations, including flatMap, filter, mapPartitions, and so on. The shuffle case covers shuffle operations such as join, reduceByKey, and repartition. Below I first summarize my understanding of the Spark memory model, then summarize the solutions to the various OOM situations along with performance-tuning notes. If my understanding is wrong, please point it out in the comments. Spark memory model: within an Executor, Spark divides memory into three regions: execution memory, storage memory, and other memory. Execution memory is where, according to the documentation, joins and aggregations run; shuffle data is also buffered here first and spilled to disk when it fills up, which reduces IO. The map phase actually runs in this memory as well. Storage memory is where broadcast, cache, and persist data is kept. Other memory is what the program reserves for itself at run time. Execution and storage are the big memory consumers in a Spark Executor; other takes comparatively little, so I won't discuss it here. In versions before spark-1.6.0, the memory split between execution and storage was fixed, configured with the parameters
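For the unified memory model (Spark 1.6+) that the post starts to describe, the split between execution and storage is governed by two configuration keys; a minimal sketch (the fractions shown are the defaults, adjust to your workload):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  // Fraction of (heap - 300 MB) shared by execution and storage (unified memory, Spark 1.6+)
  .set("spark.memory.fraction", "0.6")
  // Portion of that shared region protected for storage before execution can evict it
  .set("spark.memory.storageFraction", "0.5")
```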

In Spark, can I find out the machine in the cluster which stores a given element in RDD and then send message to it?

廉价感情. submitted on 2019-11-28 12:28:34
Question: I am new to Spark. I want to know whether, for an RDD such as RDD = {"0", "1", "2", ... "99999"}, I can find out which machine in the cluster stores a given element (e.g. 100). And then, during a shuffle, can I aggregate some data and send it to that particular machine? I know that RDD partitioning is transparent to users, but could I use some key/value-based method to achieve this? Answer 1: Generally speaking the answer is no, or at least not with the RDD API. If you can express your logic using graphs
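The answer excerpt says this is generally not possible through the RDD API. As a hypothetical diagnostic only, one can at least observe (not control) which host processed each partition, for example:

```scala
// Report, per partition, the host that happened to process it; hostnames depend on your cluster.
val rdd = sc.parallelize(0 to 99999, 8).map(_.toString)
val hostPerPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  Iterator((idx, host, iter.size))
}.collect()
hostPerPartition.foreach(println)   // e.g. (0,worker-1,12500)
```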

Enforce that a partition be stored on a specific executor

99封情书 submitted on 2019-11-28 11:43:41
I have a 5-partition RDD and 5 workers/executors. How can I ask Spark to store each of the RDD's partitions on a different worker (IP)? Am I right that Spark may put several partitions on one worker and zero partitions on another? That is, I can specify the number of partitions, but Spark could still cache everything on a single node. Replication is not an option, since the RDD is huge. Workarounds I have found: getPreferredLocations. An RDD's getPreferredLocations method does not give a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterwards Spark
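One related (and equally non-binding) workaround is the makeRDD overload that accepts location hints per element; a sketch with hypothetical hostnames, keeping in mind these are preferences that Spark only honours within spark.locality.wait:

```scala
// Each (element, hosts) pair becomes its own partition with the given preferred locations.
val withHints = sc.makeRDD(Seq(
  (Seq(1, 2, 3), Seq("worker-1")),
  (Seq(4, 5, 6), Seq("worker-2"))
))
val flat = withHints.flatMap(identity)   // back to an RDD[Int], same partitioning
```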

An illustrated summary of Spark terminology

落爺英雄遲暮 submitted on 2019-11-28 10:49:57
Reference: http://www.raincent.com/content-85-11052-1.html 1. Application: the Spark application written by the user, comprising the Driver code and the Executor code that runs on multiple nodes across the cluster. A Spark application consists of one or more jobs, as shown in the figure below. 2. Driver: the driver program. The Driver runs the Application's main() function and creates the SparkContext; the SparkContext is created to prepare the application's runtime environment. In Spark, the SparkContext is responsible for communicating with the cluster manager to request resources, assign tasks, and monitor them; once the Executors have finished, the Driver closes the SparkContext. The SparkContext usually stands for the Driver, as shown in the figure below. 3. Cluster Manager: the external service that acquires resources on the cluster. Common options are: Standalone, Spark's built-in resource manager, where the Master allocates resources; Hadoop YARN mode, where YARN's ResourceManager allocates resources; and Mesos, where the Mesos Master manages resources. 4. Executor: the executor. Application
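As a small illustration of the Driver's role described above, a generic sketch (not tied to the referenced article): main() creates the SparkContext, which negotiates with the cluster manager, and the driver closes it when the work is done.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver's main() creates the SparkContext, which requests resources from the cluster manager
    val sc = new SparkContext(new SparkConf().setAppName("MyApp"))
    try {
      println(sc.parallelize(1 to 10).sum())   // a trivial job, just to give the Executors some work
    } finally {
      sc.stop()   // the driver is responsible for shutting the SparkContext down
    }
  }
}
```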

Differences between Spark and MapReduce

戏子无情 submitted on 2019-11-28 10:35:42
Spark evolved by drawing on Hadoop MapReduce: it inherits MapReduce's strengths in distributed parallel computing while fixing its obvious shortcomings, specifically in the following respects. 1. Spark keeps intermediate results in memory, cutting down on data landing on disk between iterations, which enables efficient data sharing and fast iterative computation. In MapReduce, intermediate results are written to disk, which inevitably hurts overall speed. 2. Spark has strong fault tolerance. Spark supports distributed parallel computation over a DAG (briefly, a Spark DAG is a directed acyclic graph describing the ordering dependencies between tasks: an RDD goes through a number of transformations, and because transformations are lazy, the chain of conversions between RDDs is only submitted when an action is invoked; from the resulting dependencies, the job is divided into stages). Spark introduces the RDD, the resilient distributed dataset: a read-only collection of objects spread across a set of nodes. If part of the dataset is lost, it can be rebuilt from its lineage; in addition, fault tolerance during RDD computation can be achieved through checkpointing, which comes in two forms, checkpointing data and logging the updates. 3. Spark is more general. Hadoop provides only the map and reduce operations, whereas Spark offers many operation types, broadly divided into transformations and actions. Transformations include: map
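A small sketch of the in-memory sharing and checkpointing described in points 1 and 2 (the paths here are hypothetical):

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")    // hypothetical checkpoint location
val base    = sc.textFile("hdfs:///data/input")  // hypothetical input path
val cleaned = base.filter(_.nonEmpty).map(_.toLowerCase).persist()  // keep the intermediate result in memory
cleaned.checkpoint()                             // checkpointed data lets Spark cut the lineage
val errors = cleaned.filter(_.contains("error")).count()
val warns  = cleaned.filter(_.contains("warn")).count()   // reuses the cached data instead of re-reading
```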

take top N after groupBy and treat them as RDD

Deadly submitted on 2019-11-28 10:25:17
I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], with the List[Int] values flattened. The data is val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2, "bar"->6, "bar"->5, "bar"->4)). The top N items per group are computed as: val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map { case (key, numbers) => key -> numbers.toList.sortBy(-_).take(2) }. The result is (bar,List(6, 5)) (foo,List(3, 2)), which was printed by topNPerGroup.collect.foreach(println). If I achieve, topNPerGroup.collect.foreach(println)
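The excerpt cuts off before the goal is restated, but flattening the grouped result into RDD[(String, Int)] can be done with flatMapValues; a sketch built on the question's own data:

```scala
import org.apache.spark.rdd.RDD

val data = sc.parallelize(Seq("foo" -> 3, "foo" -> 1, "foo" -> 2, "bar" -> 6, "bar" -> 5, "bar" -> 4))
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) => key -> numbers.toList.sortBy(-_).take(2)
}
// Flatten each (key, List(v1, v2)) into (key, v1), (key, v2)
val flattened: RDD[(String, Int)] = topNPerGroup.flatMapValues(identity)
flattened.collect().foreach(println)   // e.g. (bar,6) (bar,5) (foo,3) (foo,2)
```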

Spark knowledge summary: tuning (part 1)

久未见 submitted on 2019-11-28 09:40:40
Cluster setup: SPARK_WORKER_CORES: when the machine has 32 cores with two hardware threads each, SPARK_WORKER_CORES should be set to 64. SPARK_WORKER_MEMORY: Job submission: ./spark-submit --master node:port --executor-cores --class ..jar xxx. --executor-cores: the number of cores each executor uses. --executor-memory: the maximum memory each executor uses. --total-executor-cores: the total number of cores used by the Spark application in a standalone cluster. --num-executors: the number of executors launched for the Spark application on YARN. --driver-cores: the cores used by the driver. --driver-memory: the memory used by the driver. These parameters are specified when submitting a job with spark-submit, and can also be configured in spark-defaults.conf. Tuning Spark parallelism (generally used when testing): sc.textFile(xx, minnum), sc.parallelize(seq, num), sc.makeRDD(seq, num), sc
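A sketch of the parallelism knobs listed at the end of the excerpt (paths and partition counts are illustrative):

```scala
val fromFile = sc.textFile("hdfs:///data/input", 200)   // second argument: minimum number of partitions
val fromSeq  = sc.parallelize(1 to 1000000, 100)        // second argument: explicit partition count
val made     = sc.makeRDD(Seq("a", "b", "c"), 3)        // same idea as parallelize
```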

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

偶尔善良 submitted on 2019-11-28 09:37:45
Question: RDD has a very useful aggregate method that lets you accumulate with a zero value and combine the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scala doc, there is nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same? Thanks a lot! VK Answer 1: There are two different classes which can be
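The answer is cut off, but the usual typed counterpart on the Dataset side is org.apache.spark.sql.expressions.Aggregator, which mirrors aggregate's zero / accumulate / combine shape; a minimal sketch summing Ints:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

object SumAgg extends Aggregator[Int, Int, Int] {
  def zero: Int = 0                            // like aggregate's zero value
  def reduce(buf: Int, a: Int): Int = buf + a  // like the per-partition accumulator
  def merge(b1: Int, b2: Int): Int = b1 + b2   // like the cross-partition combiner
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

val spark = SparkSession.builder().appName("agg-sketch").master("local[*]").getOrCreate()
import spark.implicits._
val ds: Dataset[Int] = Seq(1, 2, 3, 4).toDS()
println(ds.select(SumAgg.toColumn).first())    // 10
```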

Lazy foreach on a Spark RDD

帅比萌擦擦* submitted on 2019-11-28 09:36:47
Question: I have a big RDD of Strings (obtained through a union of several sc.textFile(...) calls). I now want to search for a given string in that RDD, and I want the search to stop once a "good enough" match has been found. I could retrofit foreach, filter, or map for this purpose, but all of these will iterate through every element in the RDD, regardless of whether the match has already been reached. Is there a way to short-circuit this process and avoid iterating through the whole RDD? Answer 1: I could
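The answer is truncated, but a common workaround is to filter lazily and ask for a single result with take(1), so Spark stops launching further partitions once a match has been found (the path and search string below are hypothetical):

```scala
val bigRdd = sc.textFile("hdfs:///logs/part-*")               // hypothetical union of inputs
val firstMatch = bigRdd.filter(_.contains("needle")).take(1)  // runs over partitions incrementally
if (firstMatch.nonEmpty) println(s"found: ${firstMatch.head}")
```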