rdd

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

时光总嘲笑我的痴心妄想 submitted on 2019-12-03 04:51:46
I have the following Spark job, trying to keep everything in memory:

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
  :
  tuple2List
}.persist(StorageLevel.MEMORY_ONLY)
 .reduceByKey { (p1, p2) => myMergeFunction(p1, p2) }
 .persist(StorageLevel.MEMORY_ONLY)

However, when I looked into the job tracker, I still have a lot of Shuffle Write and Shuffle spill to disk ...

Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB
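For orientation, a minimal sketch of the kind of tuning usually discussed for this situation, not taken from the question itself: persist() cannot remove the Shuffle write, because reduceByKey always shuffles its input. The config keys below belong to the pre-Spark-1.6 shuffle memory model and the values are illustrative only.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")    // more memory for the shuffle sorter before it spills
  .set("spark.shuffle.spill.compress", "true")   // compress data that is spilled to disk during shuffles

// Raising the reducer-side parallelism also shrinks each task's shuffle block, e.g. (illustrative):
// myInRDD.flatMap(...).reduceByKey(myMergeFunction(_, _), 1000)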

Implicit Conversions in Scala

筅森魡賤 submitted on 2019-12-03 03:44:02
Overview: put simply, an implicit conversion means that when the Scala compiler performs type matching and cannot find a suitable candidate, implicit conversions provide another way to tell the compiler how to convert the current type into the expected type. Original source of this article: http://blog.csdn.net/bluishglc/article/details/50866314 — reproduction in any form is strictly prohibited, otherwise CSDN will be entrusted to enforce the author's rights! Implicit conversions have four common use cases: converting a type into an expected type; type enrichment and extension; simulating new syntax; type classes. Syntax: implicit conversions can be defined in an old and a new way. The old way is the "implicit def" form, which was how they were written before Scala 2.10; from Scala 2.10 on, Scala introduced "implicit classes" to replace the old syntax, because an implicit class is a safer mechanism whose scope over the converted type is clearer and easier to control. Next we look at both styles through examples. As mentioned above, the most basic use case of an implicit conversion is converting a type into an expected type, so the examples below use this simplest scenario; each of them implicitly converts a String variable to Int. The "implicit def" form of implicit conversion: package com.github.scala.myimplicit /** * A demo about scala implicit type conversion. * @author
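The excerpt is cut off above; a minimal, self-contained sketch of the "implicit def" style it describes might look like the following (the object name and the small demo in main are illustrative, not taken from the original post):

import scala.language.implicitConversions

object StringToIntDemo {
  // Whenever an Int is expected but a String is supplied, the compiler inserts a call to this method.
  implicit def stringToInt(s: String): Int = Integer.parseInt(s)

  def main(args: Array[String]): Unit = {
    val n: Int = "123"   // "123" is implicitly converted to the Int 123
    println(n + 1)       // prints 124
  }
}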

Big Data - SparkSQL

狂风中的少年 submitted on 2019-12-03 03:43:40
SparkSQL uses the Spark-on-Hive mode: Hive is only responsible for data storage, while Spark parses and executes the SQL statements. SparkSQL is implemented on top of Dataset. A Dataset is a distributed data container that stores both the raw data and its metadata (schema). Under the hood, a Dataset wraps an RDD; an RDD of Row is a Dataset<Row>, i.e. a DataFrame. Dataset data sources include: json, JDBC, hive, parquet, hdfs, hbase, avro...
API
Built-in API: Dataset comes with its own set of API calls for manipulating the data; the processing logic matches the corresponding SQL.

// ds is a Dataset<Row> with the columns age and name
// select name from table
ds.select(ds.col("name")).show();
// select name, age+10 as addage from table
ds.select(ds.col("name"), ds.col("age").plus(10).alias("addage")).show();
// select name, age from table where age > 19
ds.select(ds.col("name"), ds.col("age")).where(ds.col("age").gt(19))
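For contrast, a minimal sketch of the SQL route mentioned above, where Spark itself parses and executes the statement (Scala API; the file name and column names are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sparksql-demo").getOrCreate()
val ds = spark.read.json("people.json")   // assumed to contain the columns name and age
ds.createOrReplaceTempView("people")
spark.sql("select name, age + 10 as addage from people where age > 19").show()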

Big Data - SparkStreaming

北城余情 submitted on 2019-12-03 03:43:26
SparkStreaming
SparkStreaming is a micro-batch, near-real-time streaming framework. Data sources include Kafka, Flume, TCP sockets, Twitter, ZeroMQ, and others.
Differences between SparkStreaming and Storm: SparkStreaming processes data in micro-batches, while Storm processes it record by record; SparkStreaming supports somewhat more complex logic; both SparkStreaming and Storm support dynamic resource adjustment and transaction mechanisms.
SparkStreaming's processing architecture: a receiver task continuously pulls data at an interval called the batch interval; the data pulled in each interval is packaged into a batch, the batch is wrapped in an RDD, and the RDDs are wrapped in a DStream. SparkStreaming then processes the DStream.
Data processing and data ingestion run concurrently, so the processing speed must keep pace with the ingestion rate. With the memory_only storage level, if processing is slower than ingestion, data piles up and eventually causes an OOM; if the storage level includes disk, latency increases.
Code implementation: use TCP sockets for a test; the Linux command nc -lk 9999 simulates sending data to port 9999. A sketch of the wiring follows below.
batch interval + SparkConf/SparkContext => JavaStreamingContext (streaming context)
data source configuration + streaming context =
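A minimal sketch of that setup against nc -lk 9999 (the post uses the Java API; StreamingContext is the Scala equivalent of JavaStreamingContext, and the host, port, and 5-second batch interval here are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("socket-wordcount")
val ssc = new StreamingContext(conf, Seconds(5))      // batch interval + SparkConf => streaming context
val lines = ssc.socketTextStream("localhost", 9999)   // data source + streaming context => DStream
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()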

How to add a new column to a Spark RDD?

Anonymous (unverified) submitted on 2019-12-03 03:10:03
Question: I have an RDD with MANY columns (e.g., hundreds); how do I add one more column at the end of this RDD? For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
How can I add a column to it, whose value is the sum of the second and the third columns? Thank you very much.
Answer 1: You do not have to use Tuple* objects at all for adding a new column to an RDD. It can be done by mapping each row, taking its original contents plus the
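The answer is truncated above; a minimal sketch of the mapping approach it describes, assuming the rows are held as Array[Int] (the variable names are illustrative):

val withExtraColumn = rdd.map { row =>
  row :+ (row(1) + row(2))   // append one value: the sum of the second and third columns
}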

How to compute cumulative sum using Spark

Anonymous (unverified) submitted on 2019-12-03 03:08:02
Question: I have an RDD of (String, Int) which is sorted by key:
val data = Array(("c1",6), ("c2",3), ("c3",4))
val rdd = sc.parallelize(data).sortByKey
Now I want to start the value for the first key with zero and the subsequent keys as the sum of the previous keys. E.g.: c1 = 0, c2 = c1's value, c3 = (c1 value + c2 value), c4 = (c1 + .. + c3 value); expected output: (c1,0), (c2,6), (c3,9)... Is it possible to achieve this? I tried it with map but the sum is not preserved inside the map.
var sum = 0 ; val t = keycount.map{ x => { val temp = sum; sum = sum + x.
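A sketch of one common distributed-safe way to do this, assuming the pair RDD is sorted by key: compute each partition's total first, then shift each partition's running sum by the totals of the partitions before it (variable names are illustrative):

val sorted = rdd.sortByKey().cache()

// Pass 1: total of each partition, collected to the driver.
val partSums = sorted
  .mapPartitionsWithIndex { (i, it) => Iterator((i, it.map(_._2).sum)) }
  .collect()
  .toMap

// Exclusive prefix of the partition totals: the offset each partition starts from.
val offsets = (0 until sorted.getNumPartitions)
  .scanLeft(0)((acc, i) => acc + partSums.getOrElse(i, 0))

// Pass 2: running (exclusive) sum inside each partition, shifted by its offset.
val cumulative = sorted.mapPartitionsWithIndex { (i, it) =>
  var acc = offsets(i)
  it.map { case (k, v) =>
    val out = (k, acc)   // value = sum of everything that came before this element
    acc += v
    out
  }
}
// cumulative.collect()  => Array((c1,0), (c2,6), (c3,9))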

Spark Streaming Job is not recoverable

Anonymous (unverified) submitted on 2019-12-03 03:05:02
Question: I'm using a Spark Streaming job that uses mapWithState with an initial RDD. When restarting the application and recovering from the checkpoint it fails with the error:
This RDD lacks a SparkContext. It could happen in the following cases: RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK
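For context, a minimal sketch of the standard checkpoint-recovery pattern such a job is usually built around (the checkpoint path and batch interval are illustrative, and the mapWithState/state-spec details are omitted; this pattern alone does not address the initial-RDD failure described above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-job")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build the DStream graph here, including the mapWithState call ...
  ssc
}

// On a clean start this calls createContext(); on restart it rebuilds the
// context (and the DStream graph) from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()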

Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]

Anonymous (unverified) submitted on 2019-12-03 02:51:02
Question: Is it possible in Spark to implement the '.combinations' function from the Scala collections?

/** Iterates over combinations.
 *
 * @return An Iterator which traverses the possible n-element combinations of this $coll.
 * @example `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
 */

For example, how can I get from RDD[X] to RDD[List[X]] or RDD[(X,X)] for combinations of size = 2? And let's assume that all values in the RDD are unique.
Answer 1: Cartesian product and combinations are two different things; the cartesian product will create an RDD of size rdd
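A sketch of one way to get size-2 combinations, assuming (as the question does) that all elements are unique: index each element, take the cartesian product, and keep only one ordering of every pair.

val indexed = rdd.zipWithIndex()                     // RDD[(X, Long)]

val pairs = indexed.cartesian(indexed)
  .filter { case ((_, i), (_, j)) => i < j }         // keep each unordered pair once and drop (x, x)
  .map { case ((x, _), (y, _)) => (x, y) }           // RDD[(X, X)]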

How to checkpoint DataFrames?

Anonymous (unverified) submitted on 2019-12-03 02:49:01
Question: I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDD but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) are available for DataFrame but they do not "break the lineage" and are thus unsuitable for methods that could loop for hundreds (or thousands) of iterations. As an example, suppose that I have a list of functions whose signature is DataFrame => DataFrame. I want to have a way to compute the following even when myfunctions has hundreds or thousands of
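Two commonly suggested approaches, sketched under the assumption that a checkpoint directory has been set (the path is illustrative; spark stands for a SparkSession, or the equivalent SQLContext on 1.x):

spark.sparkContext.setCheckpointDir("/tmp/df-checkpoints")   // illustrative path

// Spark 2.1+: checkpoint the Dataset/DataFrame directly (eager by default, truncates the lineage).
val df2 = df.checkpoint()

// Older versions: drop to the RDD, checkpoint it, and rebuild the DataFrame with the same schema.
val rdd = df.rdd
rdd.checkpoint()
rdd.count()                                                  // force the checkpoint to materialize
val df3 = spark.createDataFrame(rdd, df.schema)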

Convert RDD to Dataframe in Spark/Scala

Anonymous (unverified) submitted on 2019-12-03 02:45:02
Question: The RDD has been created in the format Array[Array[String]] and has the following values:
Array[Array[String]] = Array(Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1), Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1), Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1), Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1), Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1))
I want to create a DataFrame with the schema:
val schemaString = "callId oCallId callTime duration calltype swId"
Next steps:
scala> val rowRDD = rdd.map(p => Array(p
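The snippet is cut off above; a sketch of the usual row-plus-schema route, assuming rdd is an RDD[Array[String]] with the six columns listed and that sqlContext stands in for whatever SQL entry point the code already has:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schemaString = "callId oCallId callTime duration calltype swId"
val schema = StructType(schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

val rowRDD = rdd.map(arr => Row.fromSeq(arr))        // one Row per Array[String]
val df = sqlContext.createDataFrame(rowRDD, schema)
df.printSchema()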