rdd

Stage task division of RDDs in Spark

爱⌒轻易说出口 submitted on 2020-01-13 13:41:56
1) DAG (Directed Acyclic Graph): a DAG is a topological graph made up of vertices and edges; the edges have a direction and the graph never closes into a cycle. In Spark, the DAG records the chain of RDD transformations and the stages of the job.
2) RDD task division breaks down into Application, Job, Stage, and Task:
(1) Application: initializing a SparkContext creates one Application;
(2) Job: each action operator produces one Job;
(3) Stage: the number of Stages equals the number of wide (shuffle) dependencies plus 1;
(4) Task: within a Stage, the number of partitions of the last RDD is the number of Tasks.
Note: each level of Application -> Job -> Stage -> Task is a 1-to-n relationship.
Stage task division; introduction to the YarnClient run mode.
3) Code implementation:
object Stage01 {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the app name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Application: initializing a SparkContext creates one Application
    val sc:
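The excerpt is cut off mid-example. As a rough, self-contained sketch of the counting rules above (the word-count pipeline and the input path are illustrative additions, not the original post's code):

import org.apache.spark.{SparkConf, SparkContext}

object Stage01Sketch {
  def main(args: Array[String]): Unit = {
    // Application: created when the SparkContext is initialized
    val conf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Two narrow transformations followed by one wide (shuffle) dependency
    val words  = sc.textFile("input.txt").flatMap(_.split(" ")) // input.txt is a placeholder
    val pairs  = words.map((_, 1))
    val counts = pairs.reduceByKey(_ + _)                       // wide dependency: adds a shuffle

    // Job: one per action. This job has 1 wide dependency, so it runs in 2 stages;
    // each stage has as many tasks as the partition count of its last RDD.
    counts.collect().foreach(println)

    sc.stop()
  }
}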

Methods for creating a DataFrame in Spark

为君一笑 submitted on 2020-01-13 09:00:05
Way 1 to create a DataFrame from an RDD: turn rdd[T] into RDD[case class] and call toDF directly
Create a DataFrame from RDD[tuple]
Create a DataFrame from RDD[JavaBean]
Create a DataFrame from RDD[scala bean]
Create a DataFrame from RDD[Row]
1. Creating a DataFrame from an RDD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
 * @description:
 * Way 1 to create a DataFrame from an RDD: turn rdd[T] into RDD[case class] and call toDF directly
 */
// 1,张飞,21,北京,80.0
case class Stu(id: Int, name: String, age: Int, city: String, score: Double)
object Demo2_CreateDF_1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
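The code in the excerpt stops at the SparkSession builder. A minimal runnable sketch of the case-class approach (the parallelized sample line stands in for the real data source, which the excerpt does not show) could look like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Stu(id: Int, name: String, age: Int, city: String, score: Double)

object Demo2_CreateDF_1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides the toDF conversion for RDD[case class]

    // Parse raw lines shaped like "1,张飞,21,北京,80.0" into the case class
    val lines: RDD[String] = spark.sparkContext.parallelize(Seq("1,张飞,21,北京,80.0"))
    val stuRdd: RDD[Stu] = lines.map { line =>
      val f = line.split(",")
      Stu(f(0).toInt, f(1), f(2).toInt, f(3), f(4).toDouble)
    }

    // An RDD of a case class can be converted to a DataFrame directly
    val df: DataFrame = stuRdd.toDF()
    df.show()

    spark.stop()
  }
}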

Spark streaming and mutable broadcast variable

点点圈 submitted on 2020-01-13 07:15:52
Question: I found this link https://gist.github.com/BenFradet/c47c5c7247c5d5d0f076 which shows an implementation where a broadcast variable is updated in Spark. Is this a valid implementation, i.e. will executors see the latest value of the broadcast variable?
Answer 1: The code you are referring to uses the Broadcast.unpersist() method. If you check the Spark API, the Broadcast.unpersist() method says "Asynchronously delete cached copies of this broadcast on the executors. If the broadcast is used after this
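For context, the usual driver-side pattern behind implementations like that gist is to unpersist the old broadcast and create a new one between micro-batches, then capture the current broadcast inside each batch's closure. A minimal sketch under those assumptions (the socket source, the blacklist loader, and the refresh-every-batch policy are placeholders, not the gist's exact code):

import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RefreshBroadcastSketch {
  // Hypothetical loader; in practice this might read a file or a database
  def loadBlacklist(): Set[String] = Set("bad-user")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RefreshBroadcastSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    var blacklist: Broadcast[Set[String]] = ssc.sparkContext.broadcast(loadBlacklist())

    val stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD { rdd =>
      // Runs on the driver once per batch: drop the executors' cached copies
      // and publish a brand-new broadcast holding the latest value.
      blacklist.unpersist(blocking = false)
      blacklist = ssc.sparkContext.broadcast(loadBlacklist())

      val current = blacklist // capture the current broadcast in the task closure
      rdd.filter(line => !current.value.contains(line)).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Executors see the new value because each refresh creates a new Broadcast object; the old one is only unpersisted, never mutated in place.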

Spark SQL

本小妞迷上赌 submitted on 2020-01-12 12:03:25
Spark SQL
1. Overview
http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark SQL is a module in Spark for processing structured data. On top of RDDs, Spark SQL abstracts out Dataset/DataFrame. These two classes provide RDD-like functionality, meaning users can apply higher-order operators such as map, flatMap, and filter, and they also support column-based named queries. In other words, Dataset/DataFrame offer two sets of APIs for working with data, and these APIs give the Spark engine more information that the system can use to optimize the computation. Spark SQL currently provides two ways of interacting with it: 1) SQL scripts, 2) the Dataset API (strongly-typed and untyped operations).
Datasets & DataFrames
A Dataset is a distributed collection of data. Dataset is an API introduced in Spark 1.6; it is built on top of RDDs (strongly typed, using lambda expressions) while also benefiting from Spark SQL's optimized execution engine, so using Datasets for data transformations improves on using RDD operators directly in both functionality and performance. Therefore we can consider ==Python does not have the support for the Dataset API.
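To make the two interaction modes above concrete, here is a small illustrative sketch (the data and names are invented for the example, not taken from the original post):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object SparkSqlOverviewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlOverview").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("zhangsan", 21), Person("lisi", 30)).toDS()

    // 1) SQL script style: register a view and query it with a SQL string
    ds.createOrReplaceTempView("person")
    spark.sql("SELECT name FROM person WHERE age > 25").show()

    // 2) Dataset API: strongly-typed (lambda) and untyped (column-based) operations
    ds.filter(p => p.age > 25).map(_.name).show() // strongly typed
    ds.select($"name").where($"age" > 25).show()  // untyped / DataFrame style

    spark.stop()
  }
}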

Spark: How to join RDDs by time range

萝らか妹 submitted on 2020-01-12 03:10:09
Question: I have a delicate Spark problem that I just can't wrap my head around. We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is that the two tables have an N:N relationship. Actions contains multiple rows with the same id and so does Historic. Here is some example data from both tables. Actions time is actually a timestamp
id | time  | valueX
1  | 12:05 | 500
1  | 12:30 | 500
2  | 12
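The excerpt is cut off before the desired output is described. Assuming the usual goal of pairing each action with the most recent historic row at or before the action's time, one cogroup-based sketch (with simplified data shapes and times encoded as minutes since midnight) is:

import org.apache.spark.{SparkConf, SparkContext}

object TimeRangeJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TimeRangeJoin").setMaster("local[*]"))

    // (id, (actionTime, valueX)) -- sample rows shaped like the excerpt
    val actions  = sc.parallelize(Seq((1, (725, 500)), (1, (750, 500)), (2, (720, 300))))
    // (id, (validFrom, historicValue))
    val historic = sc.parallelize(Seq((1, (700, "A")), (1, (740, "B")), (2, (600, "C"))))

    // cogroup gathers every action and every historic row for one id together;
    // each action is then matched with the latest historic row at or before its time.
    val joined = actions.cogroup(historic).flatMap { case (id, (acts, hist)) =>
      val histSorted = hist.toSeq.sortBy(_._1)
      acts.map { case (time, value) =>
        val matching = histSorted.takeWhile(_._1 <= time).lastOption
        (id, time, value, matching.map(_._2))
      }
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}

Note that this materializes each id's full history in memory inside flatMap, which is fine for modest per-key cardinalities but would need a different strategy for very skewed keys.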

Spark RDD: How to calculate statistics most efficiently?

懵懂的女人 submitted on 2020-01-11 10:38:28
Question: Assuming the existence of an RDD of tuples similar to the following:
(key1, 1)
(key3, 9)
(key2, 3)
(key1, 4)
(key1, 5)
(key3, 2)
(key2, 7)
...
What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance, in particular.) As I understand it, my options amount to: Use the colStats function in MLLib: This approach has the advantage of being easily adaptable to use other mllib.stat
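One distributed option the truncated list does not reach is a single aggregateByKey pass that accumulates count, sum, and sum of squares per key, from which mean, variance, and standard deviation follow. A hedged sketch using the sample tuples above:

import org.apache.spark.{SparkConf, SparkContext}

object PerKeyStatsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PerKeyStats").setMaster("local[*]"))

    val data = sc.parallelize(Seq(
      ("key1", 1.0), ("key3", 9.0), ("key2", 3.0),
      ("key1", 4.0), ("key1", 5.0), ("key3", 2.0), ("key2", 7.0)))

    // Accumulate (count, sum, sum of squares) per key in a single pass
    val moments = data.aggregateByKey((0L, 0.0, 0.0))(
      { case ((n, sum, sq), x)                  => (n + 1, sum + x, sq + x * x) },
      { case ((n1, sum1, sq1), (n2, sum2, sq2)) => (n1 + n2, sum1 + sum2, sq1 + sq2) })

    // Derive mean, population variance, and standard deviation from the moments
    val stats = moments.mapValues { case (n, sum, sq) =>
      val mean = sum / n
      val variance = sq / n - mean * mean
      (mean, variance, math.sqrt(variance))
    }

    stats.collect().foreach(println)
    sc.stop()
  }
}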

If the one partition is lost, we can use lineage to reconstruct it. Will the base RDD be loaded again?

落花浮王杯 submitted on 2020-01-11 05:36:04
Question: I read the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The authors say that if one partition is lost, we can use the lineage to reconstruct it. However, the original RDD may no longer exist in memory at that point. So will the base RDD be loaded again to rebuild the lost RDD partition?
Answer 1: Yes, as you mentioned, if the RDD that was used to create the partition is not in memory anymore, it has to be loaded again from disk and recomputed. If the original RDD
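As a small illustration of the practical takeaway (the names and paths are placeholders): persisting an intermediate RDD, or checkpointing it, limits how far back the lineage replay has to go when a lost partition is rebuilt.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageSketch").setMaster("local[*]"))

    val base   = sc.textFile("data/input.txt") // placeholder path
    val parsed = base.map(_.split(",")).filter(_.length > 1)

    // If a partition of `parsed` is lost, Spark replays its lineage: the matching
    // split of `base` is re-read from the source and map/filter are re-run on it.
    // Persisting keeps copies around; checkpointing truncates the lineage so the
    // replay starts from the checkpoint files instead of the original source.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path
    parsed.checkpoint()

    println(parsed.toDebugString) // inspect the lineage
    println(parsed.count())
    sc.stop()
  }
}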

How to get Histogram of all columns in a large CSV / RDD[Array[double]] using Apache Spark Scala?

五迷三道 submitted on 2020-01-11 01:43:11
Question: I am trying to calculate the histogram of all columns from a CSV file using Spark and Scala. I found that DoubleRDDFunctions supports histogram. So I coded the following to get the histogram of all columns:
1. Get the column count
2. Create an RDD[Double] for each column and calculate the histogram of each RDD using DoubleRDDFunctions
var columnIndexArray = Array.tabulate(rdd.first().length) (_ * 1)
val histogramData = columnIndexArray.map(columns => {
  rdd.map(lines => lines(columns)).histogram(6)
})
Is it a good
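Completed into a self-contained form (the small parallelized dataset stands in for the CSV-derived RDD[Array[Double]]), the approach from the question looks like this; note that each histogram call is a separate pass over the data, which is exactly the efficiency concern the question raises:

import org.apache.spark.{SparkConf, SparkContext}

object HistogramAllColumnsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HistogramAllColumns").setMaster("local[*]"))

    // Stand-in for the CSV-derived RDD[Array[Double]]
    val rdd = sc.parallelize(Seq(
      Array(1.0, 10.0, 0.5),
      Array(2.0, 20.0, 0.7),
      Array(3.0, 15.0, 0.1),
      Array(4.0, 30.0, 0.9)))
    rdd.cache() // reused once per column below

    val columnIndexArray = Array.tabulate(rdd.first().length)(_ * 1)

    // One 6-bucket histogram per column; each call triggers its own job over the RDD
    val histogramData: Array[(Array[Double], Array[Long])] = columnIndexArray.map { col =>
      rdd.map(row => row(col)).histogram(6)
    }

    histogramData.zipWithIndex.foreach { case ((buckets, counts), i) =>
      println(s"column $i: buckets=${buckets.mkString(",")} counts=${counts.mkString(",")}")
    }
    sc.stop()
  }
}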

Transforming PySpark RDD with Scala

戏子无情 submitted on 2020-01-10 19:50:30
Question: TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though. I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not