rdd | 易学教程

Convert RDD of Array(Row) to RDD of Row?

Question: I have data like this in a file and I'd like to compute some statistics on it using Spark. File content: aaa|bbb|ccc ddd|eee|fff|ggg I need to assign each line an id. I read the lines as an RDD and use zipWithIndex(). They should then look like: (0, aaa|bbb|ccc) (1, ddd|eee|fff|ggg) I need to associate each string with its id. I can get an RDD of Array(Row), but I can't get the rows out of the array. How should I modify my code? import org.apache.spark.sql.{Row, SparkSession} val fileRDD = spark.sparkContext.textFile
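The excerpt stops before the asker's code, so the following is only a minimal plain-Python sketch of one plausible reading: the goal is one (id, field) record per `|`-separated field instead of one array per line. Here `enumerate` stands in for `zipWithIndex()`, the nested comprehension stands in for a `flatMap`, and the names `lines`, `indexed`, and `rows` are assumptions made for illustration.

```python
# Local stand-in for the file content shown in the question.
lines = ["aaa|bbb|ccc", "ddd|eee|fff|ggg"]

# enumerate plays the role of zipWithIndex(): it pairs each line with an id.
indexed = [(idx, line) for idx, line in enumerate(lines)]

# Flatten instead of map: emit one (id, field) record per field,
# rather than one array of records per line.
rows = [(idx, field) for idx, line in indexed for field in line.split("|")]

for row in rows:
    print(row)
```

The design point is the flattening step: a plain map yields one collection per input line, while a flat map yields the individual records.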

Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey

Question: These three Apache Spark transformations are a little confusing. Is there any way to determine when to use which one and when to avoid one? Answer 1: I think the official guide explains it well enough. I will highlight the differences (you have an RDD of type (K, V)): if you need to keep the values, use groupByKey; if you do not need to keep the values but need some aggregated information about each group (the items of the original RDD that have the same K), you have two choices: reduceByKey or
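The distinction the answer draws can be modeled on a local list. This is a plain-Python sketch of the semantics only, not the Spark API: `group_by_key` and `reduce_by_key` are hypothetical helper names, and the sample `pairs` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (K, V) records.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]


def group_by_key(records):
    """Keep every value: each key maps to the list of all its values."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(records, func):
    """Keep only one running aggregate per key."""
    result = {}
    for key, value in records:
        result[key] = func(result[key], value) if key in result else value
    return result


grouped = group_by_key(pairs)
summed = reduce_by_key(pairs, lambda x, y: x + y)
print(grouped)
print(summed)
```

In Spark itself, reduceByKey additionally combines values within each partition before the shuffle, which this local sketch does not model.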

How to merge two presorted rdds in spark?

Question: I have two large CSV files presorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without fully sorting again? Answer 1: The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark. The long answer: Under certain conditions, there might be a better way than using sortBy or sortByKey . The most obvious case is when the input RDDs are
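For intuition about what "using the existing order" would mean, here is a local, single-machine sketch of a linear-time merge of two presorted sequences. It is not a Spark facility and is not the approach of the truncated answer; the sample records in `left` and `right` are assumptions.

```python
import heapq

# Two hypothetical inputs, each already sorted by its first field.
left = [(1, "a"), (4, "d"), (7, "g")]
right = [(2, "b"), (3, "c"), (9, "i")]

# heapq.merge walks both inputs once and never re-sorts them.
merged = list(heapq.merge(left, right, key=lambda rec: rec[0]))
print(merged)
```

The merge only works because each input is consumed in order; a distributed dataset gives no such guarantee across partitions unless the partitioning itself is range-based.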

Spark - Nested RDD Operation

Question: I have two RDDs, say

rdd1 =
id | created | destroyed | price
1 | 1 | 2 | 10
2 | 1 | 5 | 11
3 | 2 | 3 | 11
4 | 3 | 4 | 12
5 | 3 | 5 | 11

rdd2 = [1,2,3,4,5] # let's call these values timestamps (ts)

rdd2 is basically generated using range(initial_value, end_value, interval). The parameters can vary. Its size can be the same as or different from that of rdd1. The idea is to fetch records from rdd1 into rdd2 based on the values of rdd2 using a filtering criterion (records from rdd1 can repeat while fetching as you

Question about RDD, partitions, stages, parallel computation, clusters, pipelined computation, shuffle (join??), task, executor

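An RDD is the most basic data abstraction in Spark, and a task is Spark's smallest unit of code execution? Isn't data a resource for the code??? Then why is an RDD stored in partitions? And why do nodes work on partitions (pipelining the computation over parent partitions)? An RDD supports only transformations, yet it can be split into multiple partitions, those partitions can be stored on different nodes of the cluster, and they can be computed in parallel on different nodes, so is an RDD still highly restricted? Parent partitions with narrow dependencies are computed in pipelined fashion within a single node, so is an RDD still highly restricted? What is the point of dividing an RDD into stages? Allocating resources? Improving efficiency? Hash partitioning and range partitioning? What is a shuffle??? And what is a task??? Pipelined computation? Is that a transformation?? Then is it just filtering data?? No. What is the purpose of using it for machine learning algorithms and interactive data mining? Understanding this helps in understanding the pipelined computation over parent partitions! In a shuffle, the reduce tasks need to pull data across nodes (why across nodes? Because the different partitions of an RDD are stored on different nodes, and with a wide dependency one partition of the child RDD needs all partitions of the parent RDD, so it must cross nodes. With a narrow dependency, one partition of the child RDD needs only one partition of the parent RDD, so no cross-node transfer is needed. But join????? Is the precondition that the parent partitions making up a child RDD partition are all on the same node??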
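The narrow versus wide dependency distinction raised above can be made concrete with a small local model. This is a plain-Python sketch, not Spark code: an RDD is modeled as a list of partitions, `k % num_child` is an assumed stand-in for hash partitioning, and all data values are hypothetical.

```python
# Parent RDD modeled as a list of partitions, each a list of (key, value).
parent = [
    [(1, "a"), (2, "b")],
    [(3, "c"), (4, "d")],
    [(5, "e"), (6, "f")],
]

# Narrow dependency (map-like): child partition i reads only parent
# partition i, so consecutive narrow steps can be pipelined in one task.
narrow_child = [[(k, v.upper()) for k, v in part] for part in parent]

# Wide dependency (repartition by key): a child partition may need records
# from every parent partition, so data must move between partitions.
num_child = 2
wide_child = [[] for _ in range(num_child)]
sources = [set() for _ in range(num_child)]
for parent_idx, part in enumerate(parent):
    for k, v in part:
        target = k % num_child  # stand-in for a hash partitioner
        wide_child[target].append((k, v))
        sources[target].add(parent_idx)

print(narrow_child)
print(wide_child)
print(sources)  # which parent partitions fed each child partition
```

In this model every child partition of the wide step draws from all three parent partitions, which is why such a step marks a stage boundary, while the narrow step never looks outside its own parent partition.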

General Spark Performance Tuning

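1.1.1 General performance tuning 1: optimal resource configuration

The first step in Spark performance tuning is to allocate more resources to the job. Within a certain range, performance improves in proportion to the resources allocated. Only once the resource configuration is optimal should you consider the tuning strategies discussed later. Resources are specified when a Spark job is submitted with a script; a standard Spark submit script is shown in Listing 2-1:

    /usr/opt/modules/spark/bin/spark-submit \
      --class com.atguigu.spark.Analysis \
      --num-executors 80 \
      --driver-memory 6g \
      --executor-memory 6g \
      --executor-cores 3 \
      /usr/opt/modules/spark/jar/spark.jar

The resources that can be allocated are shown in Table 2-1.

Table 2-1 Allocatable resources
Name | Description
--num-executors | Number of executors
--driver-memory | Driver memory (little impact)
--executor-memory | Memory per executor
--executor-cores | Number of CPU cores per executor

Tuning principle: allocate to the job as much of the available resources as possible. For the specific allocation of resources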
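As a quick sanity check on the listing above, the totals implied by its flag values can be computed directly. This is simple arithmetic on the numbers from the script; the variable names are illustrative, and the figures ignore any per-executor memory overhead the cluster manager adds on top.

```python
# Values taken from the spark-submit listing.
num_executors = 80
executor_cores = 3
executor_memory_gb = 6
driver_memory_gb = 6

# Upper bound on tasks that can run at the same time (one task per core).
total_cores = num_executors * executor_cores

# Executor heap requested across the cluster, excluding overhead.
total_executor_memory_gb = num_executors * executor_memory_gb

print(total_cores, total_executor_memory_gb, driver_memory_gb)
```

Comparing these totals with what the resource queue actually offers shows whether the job is close to the available maximum, which is the tuning principle stated above.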

How to convert a case-class-based RDD into a DataFrame?

Question: The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass) , but my DataFrame ends up empty. Here's my Scala code: // sc is the SparkContext, while sqlContext is the SQLContext. // Define the case class and raw data case class Dog(name: String) val data = Array( Dog("Rex"), Dog("Fido") ) // Create an RDD from the raw data val dogRDD = sc.parallelize(data

Big Data Study Day 22: spark05: 1. Supplementary solution for the most popular teacher per subject 2. Custom sorting

1. Supplementary solution for the most popular teacher per subject

Solution 4 of this case from day 21 still has a problem: when several teachers are equally popular, its sorting rule cannot handle the tie. Below is an improved implementation, solution 5.

FavoriteTeacher5

    package com._51doit.spark04

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    object FavoriteTeacher5 {
      def main(args: Array[String]): Unit = {
        val isLocal = args(0).toBoolean
        // Create the SparkConf, then the SparkContext
        val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
        if (isLocal) {
          conf.setMaster("local[*]")
        }
        val sc = new SparkContext(conf)
        // Specify where to read the data from to create the RDD
        val lines: RDD[String] = sc.textFile(args(1))
        // Split the data
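The listing is cut off before the sorting logic, so the actual tie-breaking rule of solution 5 is not visible here. As a hedged illustration of the general idea only, the plain-Python sketch below orders by popularity descending and, as an assumption, breaks ties by teacher name ascending. The records and names are hypothetical.

```python
# Hypothetical (subject, teacher, count) records with a tie on count.
records = [
    ("bigdata", "laozhao", 15),
    ("bigdata", "laoduan", 15),
    ("bigdata", "laozhang", 9),
]

# Composite sort key: count descending, then teacher name ascending,
# so equal counts still produce a deterministic order.
ranked = sorted(records, key=lambda rec: (-rec[2], rec[1]))

for rec in ranked:
    print(rec)
```

A composite key makes the ordering total, so repeated runs return the same ranking even when counts are equal.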

Spark Data Skew Solutions and Shuffle Internals

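Data skew tuning and shuffle tuning

Symptoms of data skew
1) A few tasks run noticeably slower than the vast majority of tasks (the common case)
2) The Spark job suddenly fails with an OOM exception (the rare case)

How data skew arises
During a shuffle, records with the same key on every node must be pulled to a single task on some node for processing. If the amount of data for one key is especially large, data skew occurs: most tasks finish in a few minutes while a few tasks take hours, so the whole job takes hours to complete. If a single task's data volume is especially large, it can even cause an out-of-memory error.

Locating data skew
Data skew only occurs during a shuffle, so first determine which stage the skew occurs in. Use the Web UI to see which stage is currently running and how much data each task in that stage was assigned, to determine whether uneven data distribution is causing the skew. Once that is confirmed and the stage is known, use the shuffle operators in the code to work out the mapping between stages and code, and so locate where the skew occurs.

Solutions to data skew
1) Preprocess the data with Hive ETL
Applicable scenario: the source data in Hive is itself unevenly distributed, and the Hive table needs to be shuffled frequently.
Solution: pre-aggregate the data by key in Hive, or join it with other tables in advance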
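The mechanism described under "How data skew arises" can be reproduced on a small local example. This is a plain-Python sketch, not Spark code: the key distribution is hypothetical, and `task_for` is an assumed deterministic stand-in for a hash partitioner.

```python
from collections import Counter

# Hypothetical key distribution: one key dominates the data set.
keys = ["hot"] * 90 + ["k1"] * 4 + ["k2"] * 3 + ["k3"] * 3


def task_for(key, num_tasks):
    """Deterministic stand-in for hash partitioning: same key, same task."""
    return sum(ord(ch) for ch in key) % num_tasks


num_tasks = 4
task_load = [0] * num_tasks
for key in keys:
    task_load[task_for(key, num_tasks)] += 1

# Records per key reveal the hot key; records per task reveal the slow task.
key_counts = Counter(keys)
hottest_key, hottest_count = key_counts.most_common(1)[0]

print(task_load)
print(hottest_key, hottest_count)
```

Because every record of a key must land on the same task, adding more tasks does not shrink the largest one; only changing the key distribution does, which is what pre-aggregation by key aims at.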

How to perform Standard Deviation and Mean operations on a Java Spark RDD?

Question: I have a JavaRDD which looks like this: [ [A,8] [B,3] [C,5] [A,2] [B,8] ... ... ] I want my result to be the mean: [ [A,5] [B,5.5] [C,5] ] How do I do this using Java RDDs only? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames. Answer 1: Here you go: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.util.StatCounter; import scala
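The answer's code is cut off after its imports, so it is not reproduced here. As a hedged illustration of computing per-key statistics without grouping all values, the plain-Python sketch below keeps a small (count, sum, sum of squares) accumulator per key, which is an assumption about the general approach rather than the answer's method. It reports the population standard deviation; `add_value` is a hypothetical helper.

```python
import math

# Sample records from the question.
data = [("A", 8), ("B", 3), ("C", 5), ("A", 2), ("B", 8)]


def add_value(acc, value):
    """Fold one value into a (count, sum, sum_of_squares) accumulator."""
    count, total, total_sq = acc
    return (count + 1, total + value, total_sq + value * value)


# One fixed-size accumulator per key; no per-key list of values is built.
stats = {}
for key, value in data:
    stats[key] = add_value(stats.get(key, (0, 0.0, 0.0)), value)

means = {k: total / count for k, (count, total, _) in stats.items()}
stdevs = {
    k: math.sqrt(max(total_sq / count - (total / count) ** 2, 0.0))
    for k, (count, total, total_sq) in stats.items()
}

print(means)
print(stdevs)
```

Because two accumulators can be combined by adding their fields, the same shape fits a per-key aggregation that merges partial results from different partitions.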
