rdd

How to write Pyspark UDAF on multiple columns?

对着背影说爱祢 submitted on 2019-12-30 06:59:10
Question: I have the following data in a PySpark dataframe called end_stats_df:

    values  start  end  cat1  cat2
    10      1      2    A     B
    11      1      2    C     B
    12      1      2    D     B
    510     1      2    D     C
    550     1      2    C     B
    500     1      2    A     B
    80      1      3    A     B

And I want to aggregate it in the following way: I want to use the "start" and "end" columns as the aggregate keys. For each group of rows, I need to compute the number of unique values across both cat1 and cat2 for that group. E.g., for the group with start=1 and end=2, this number would be 4 because …
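The excerpt cuts off mid-sentence, but for the part that is shown, a built-in aggregation can often stand in for a custom UDAF. Below is a minimal sketch in Scala (the same functions exist in pyspark.sql.functions); the sample DataFrame is recreated from the rows above, and the output column name is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("unionDistinctSketch").getOrCreate()
import spark.implicits._

// Recreate end_stats_df from the sample rows in the question
val endStatsDf = Seq(
  (10, 1, 2, "A", "B"), (11, 1, 2, "C", "B"), (12, 1, 2, "D", "B"),
  (510, 1, 2, "D", "C"), (550, 1, 2, "C", "B"), (500, 1, 2, "A", "B"),
  (80, 1, 3, "A", "B")
).toDF("values", "start", "end", "cat1", "cat2")

// Stack cat1 and cat2 into one column, then count distinct values per (start, end)
val result = endStatsDf
  .select($"start", $"end", explode(array($"cat1", $"cat2")).as("cat"))
  .groupBy("start", "end")
  .agg(countDistinct($"cat").as("n_unique_cats"))

result.show()
// For start=1, end=2 this yields 4 (A, B, C, D), matching the example in the question.
```

In PySpark the same shape would use F.explode(F.array("cat1", "cat2")) followed by F.countDistinct.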

How spark handles object

◇◆丶佛笑我妖孽 submitted on 2019-12-30 06:30:08
Question: To test the serialization exception in Spark I wrote a task in 2 ways. First way:

    package examples

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    object dd {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf
        val sc = new SparkContext(sparkConf)
        val data = List(1, 2, 3, 4, 5)
        val rdd = sc.makeRDD(data)
        val result = rdd.map(elem => {
          funcs.func_1(elem)
        })
        println(result.count())
      }
    }

    object funcs {
      def func_1(i: Int): Int = {
        i + 1
      }
    }

This way Spark works …
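The second variant is cut off above, so the sketch below is only an illustration of the mechanism the question is probing, not the asker's actual code: when the closure passed to map references an instance of a plain, non-serializable class instead of a top-level object, Spark has to ship that instance to the executors and fails.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant: func_1 lives in a regular (non-serializable) class,
// so the map closure captures the enclosing instance.
class Funcs {
  def func_1(i: Int): Int = i + 1
}

object SerializationDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialization-demo"))
    val helper = new Funcs              // not Serializable
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5))

    // Throws org.apache.spark.SparkException: Task not serializable,
    // because `helper` would have to be serialized into the task closure.
    val result = rdd.map(elem => helper.func_1(elem))
    println(result.count())

    // Fix: make Funcs extend Serializable, or keep func_1 in a top-level object
    // as in the first way shown above (objects are referenced statically, not captured).
  }
}
```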

Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?

爱⌒轻易说出口 submitted on 2019-12-30 02:31:04
Question: I have an RDD that is too large to consistently perform a distinct statement without spurious errors (e.g. SparkException: stage failed 4 times, ExecutorLostFailure, HDFS Filesystem closed, Max number of executor failures reached, Stage cancelled because SparkContext was shut down, etc.). I am trying to count distinct IDs in a particular column, for example:

    print(myRDD.map(a => a._2._1._2).distinct.count())

Is there an easy, consistent, less shuffle-intensive way to do the command above, …
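One less shuffle-heavy pattern is to deduplicate inside each partition first with mapPartitions, so only already-unique values are shuffled for the global deduplication; a sketch, with myRDD and the a._2._1._2 access path taken from the question:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Per-partition dedup first, then reduceByKey as the global dedup.
// Trade-off: each partition's distinct values must fit in memory.
def distinctCount[T: ClassTag](ids: RDD[T], numPartitions: Int = 200): Long =
  ids
    .mapPartitions(iter => iter.toSet.iterator)   // local, per-partition dedup
    .map(id => (id, 1))                           // key by the value itself
    .reduceByKey((a, _) => a, numPartitions)      // global dedup with a bounded shuffle width
    .count()

// Usage, following the access path from the question:
// println(distinctCount(myRDD.map(a => a._2._1._2)))
```

If an exact count is not required, myRDD.map(a => a._2._1._2).countApproxDistinct(0.01) avoids the shuffle entirely at the cost of a small approximation error.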

scala spark dataframe explode is slow - so, alternate method - create columns and rows from arrays in a column

a 夏天 submitted on 2019-12-29 09:32:38
Question: Scala 2.11.8, Spark 2.0.1. The explode function is very slow, so I am looking for an alternate method. I think it is possible with RDDs and flatMap, and help is greatly appreciated. I have a UDF that returns a List[(String, String, String, Int)] of varying lengths. For each row in the dataframe, I want to create multiple rows and make multiple columns.

    def Udf = udf((s: String) => {
      if (s == "1") Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2)).toList
      else Seq(("a", "b", …
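The UDF in the excerpt is cut off, so the else branch below is an assumption; the point of the sketch is the pattern being asked about: skip explode, drop to the underlying RDD, and let flatMap turn one input row into several output rows before converting back to a DataFrame. The input column "s" and the output column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flatMapInsteadOfExplode").getOrCreate()
import spark.implicits._

// Hypothetical input plus the expansion logic from the question's Udf
val df = Seq("1", "2").toDF("s")

val expand: String => Seq[(String, String, String, Int)] = s =>
  if (s == "1") Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2))
  else Seq(("a", "b", "c", 0))          // the excerpt ends here; shape assumed

// One input row becomes several output rows, each with the new columns attached.
val expanded = df.rdd
  .flatMap(row => expand(row.getAs[String]("s")).map { case (c1, c2, c3, n) =>
    (row.getAs[String]("s"), c1, c2, c3, n)
  })
  .toDF("s", "col1", "col2", "col3", "idx")

expanded.show()
```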

How to force Spark to evaluate DataFrame operations inline

只愿长相守 submitted on 2019-12-29 08:36:07
Question: According to the Spark RDD docs: "All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently." There are times when I need to do certain operations on my dataframes right then and now. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute those operations inline with the rest of the code. For …
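A common way to force the pending transformations to run at a specific point is to call an action there, usually after caching so the work is not redone downstream; a minimal sketch, with df standing in for whatever DataFrame the surrounding code builds:

```scala
import org.apache.spark.sql.DataFrame

// Force Spark to execute the pending transformations right now.
def forceEvaluation(df: DataFrame): DataFrame = {
  val materialized = df.cache()   // keep the result so later code reuses it
  materialized.count()            // an action: triggers the actual computation here
  materialized
}
```

count() is only a trigger; any action (collect, foreach, write) has the same effect, and without the cache() the same work would simply be recomputed by the next action.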

Why does RDD.foreach fail with “SparkException: This RDD lacks a SparkContext”?

只谈情不闲聊 submitted on 2019-12-29 08:08:07
Question: I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.

    val RSet = datasetRdd.
      flatMap(x => RSetForAttr(x, alLevel, hieDict)).
      map(x => (x, 1)).
      reduceByKey((x, y) => x + y)
    val Rp: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
    val Rc: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
    val RpSv: RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
    val RcSv: RDD[(String, Int)] = RSet.filter(x => x …
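The excerpt ends before the foreach that triggers the error, but this particular exception is raised when an RDD is referenced inside a closure that runs on another RDD's executors; RDDs are driver-side handles only. Below is a hedged sketch of the failing shape and one standard workaround; the variable names reuse the question's, and the lookup-by-first-field logic is an assumption.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Failing shape (sketch): an RDD used inside another RDD's foreach closure.
//
//   RcSv.foreach { x =>
//     Rp.filter(r => r._1 == x._1).count()   // SparkException: This RDD lacks a SparkContext
//   }
//
// Workaround: collect the smaller RDD to the driver, broadcast it, and do a
// plain map lookup inside the closure instead of a nested RDD operation.
def lookupWithoutNestedRdds(sc: SparkContext,
                            Rp: RDD[(String, Int)],
                            RcSv: RDD[(String, Int)]): Unit = {
  val rpMap = sc.broadcast(Rp.collectAsMap())
  RcSv.foreach { x =>
    rpMap.value.get(x._1).foreach(v => println(s"${x._1} -> $v"))
  }
}
// Alternatively, keep everything distributed and express the lookup as RcSv.join(Rp).
```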

0905 - Real-Time Ad Click Statistics

孤街浪徒 submitted on 2019-12-28 21:08:41
0905 - Real-Time Ad Click Statistics

Requirement 7: Maintain the blacklist in real time
  7.1 Requirement overview
  7.2 Brief execution flow
  7.3 Detailed execution flow
  7.4 Code implementation
    7.4.1 Load and transform the user dataset
    7.4.2 Filter out users already on the blacklist
    7.4.3 Aggregate the real-time data and update the click-count table
    7.4.4 Add abnormal users to the blacklist
Requirement 8: Real-time ad click counts per province and city
  8.1 Requirement overview
  8.2 Brief execution flow
  8.3 Detailed execution flow
  8.4 Code implementation
    8.4.1 Transform the key
    8.4.2 Aggregate
    8.4.3 Wrap in a case class and write to the database
Requirement 9: Daily Top 3 most popular ads per province
  9.1 Requirement overview
  9.2 Brief execution flow
  9.3 Detailed execution flow
  9.4 Code implementation
    9.4.1 Build the key
    9.4.2 Aggregate
    9.4.3 Convert the format
    9.4.4 Create a temporary table and run the query
    9.4.5 Wrap in a case class and write to the database
Requirement 10: Real-time ad click counts for the last hour
  10.1 Requirement overview
  10.2 Brief execution flow
  10.3 Detailed execution flow
  10.4 Code implementation
    10.4.1 Build the key
    10.4.2 Compute with a window operation
    10.4.4 Wrap in a case class and write to the database
Summary

Requirement 7: Maintain the blacklist in real time

7.1 Requirement overview

Consume real-time data from Kafka and accumulate each user's click count, writing the totals to MySQL. When a user clicks the same ad more than 100 times within one day, add that user to the blacklist.

7.2 Brief execution flow …
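The excerpt ends right after the overview of Requirement 7, so the following is a reconstruction of the per-batch logic that overview describes rather than the post's actual code: the AdClick schema, the dateOf helper, and the MySQL calls (updateClickCount, addToBlacklist) are all stand-ins, and the blacklist filtering step of 7.4.2 is omitted for brevity.

```scala
import org.apache.spark.streaming.dstream.DStream

object BlacklistSketch {
  // Assumed shape of a record already parsed from Kafka
  case class AdClick(timestamp: Long, province: String, city: String, userId: Long, adId: Long)

  // Placeholder persistence layer (the post writes these to MySQL)
  def updateClickCount(date: String, userId: Long, adId: Long, delta: Long): Long = 0L
  def addToBlacklist(userId: Long): Unit = ()

  def maintainBlacklist(clicks: DStream[AdClick], dateOf: Long => String): Unit = {
    clicks.foreachRDD { rdd =>
      // 7.4.3: per-batch click counts keyed by (day, user, ad)
      val counts = rdd
        .map(c => ((dateOf(c.timestamp), c.userId, c.adId), 1L))
        .reduceByKey(_ + _)
        .collect()

      counts.foreach { case ((date, userId, adId), cnt) =>
        val total = updateClickCount(date, userId, adId, cnt)  // accumulate the daily total
        if (total > 100) addToBlacklist(userId)                // 7.4.4: flag abnormal users
      }
    }
  }
}
```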

Advanced Spark

China☆狼群 submitted on 2019-12-28 13:04:56
Advanced Spark

1. RDD dependencies

Narrow dependency (no shuffle): the relationship between the parent RDD's partitions and the child RDD's partitions is one-to-one or many-to-one.

Wide dependency (produces a shuffle, i.e. intermediate results are written out, which hurts computation efficiency): one parent partition is consumed by multiple child partitions (one-to-many).

Diagram of wide vs. narrow dependencies (figure not included in this excerpt).

2. Stages

Based on the dependencies between RDDs, Spark builds a directed acyclic graph (DAG). The DAG is submitted to the DAGScheduler, which splits it into multiple mutually dependent stages; the splitting rule is to walk the DAG from back to front and cut a new stage at every wide dependency. Each stage contains tasks, and those tasks are handed to the TaskScheduler as a TaskSet to run.

Splitting rule: going from back to front, cut a new stage whenever a wide dependency is encountered.

Computation model of a stage: pipelined computation, where each record flows through the stage's whole chain of transformations ("one road to the end").

Notes on stages:
When does the data inside the pipeline get materialized? When the RDD is persisted, and when the shuffle write happens.
The parallelism of a stage's tasks is determined by the number of partitions of the stage's last RDD.
How do you change an RDD's partition count? For example, reduceByKey(func, 3) or groupByKey(4).

3. Spark resource scheduling and task scheduling

Taking standalone client mode as an example: after the cluster starts, the Worker nodes report their local resources to the Master node. …
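A small example tying these notes together, with input.txt and output as placeholder paths: flatMap and map are narrow dependencies and stay pipelined inside one stage, while reduceByKey is a wide dependency, so the DAGScheduler cuts a stage boundary there; the second argument to reduceByKey sets the partition count of the resulting RDD and hence the task parallelism of that stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo"))

    val words = sc.textFile("input.txt")        // placeholder input path
      .flatMap(_.split(" "))                    // narrow dependency: pipelined
      .map(word => (word, 1))                   // narrow dependency: pipelined

    val counts = words.reduceByKey(_ + _, 3)    // wide dependency: shuffle, new stage, 3 partitions

    println(counts.getNumPartitions)            // 3
    counts.saveAsTextFile("output")             // action: triggers the whole DAG
    sc.stop()
  }
}
```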