
Spark RDD Operators and Types

耗尽温柔 posted on 2019-11-27 05:49:56
Resilient Distributed Dataset (RDD)

1. What is an RDD?

val lines: RDD[String] = sc.textFile("hdfs://hadoop01:9000/da")

An RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. It represents a collection that is immutable, partitionable, and whose elements can be computed in parallel. RDDs have the characteristics of a data-flow model: automatic fault tolerance (through parent-child RDD lineage dependencies), locality-aware scheduling, and scalability. RDDs let users explicitly cache a working set in memory across multiple queries, so that subsequent queries can reuse it, which greatly improves query speed.

2. Properties of an RDD (from the comments in the Spark source code):

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
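To make these properties concrete, here is a minimal Scala sketch (assuming a live SparkContext named sc; the HDFS path is reused from the example above and is only illustrative):

// Build an RDD from a text file; nothing is read until an action runs.
val lines = sc.textFile("hdfs://hadoop01:9000/da")

// Transformations produce new (immutable) RDDs and record lineage.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))

// Inspect the properties listed above.
println(pairs.partitions.length)                        // the list of partitions
println(pairs.dependencies)                             // dependencies on parent RDDs
println(pairs.partitioner)                              // Partitioner, if any (None here)
println(pairs.preferredLocations(pairs.partitions(0)))  // preferred locations for one split

// Explicitly cache the working set so later actions can reuse it.
pairs.cache()
println(pairs.reduceByKey(_ + _).count())               // first action computes and caches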

Tips for properly using large broadcast variables?

六月ゝ 毕业季﹏ posted on 2019-11-27 05:43:45
Question: I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with:

>>> data = list(range(int(10*1e6)))
>>> import cPickle as pickle
>>> len(pickle.dumps(data))
98888896

Running on a cluster with 3 c3.2xlarge executors and an m3.large driver, with the following command launching the interactive session:

IPYTHON=1 pyspark --executor-memory 10G --driver-memory 5G --conf spark.driver.maxResultSize=5g

In an RDD, if I persist a reference to this broadcast variable, the
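For context, a minimal sketch of the usual broadcast pattern, shown in Scala rather than PySpark; the data, sizes, and names below are illustrative and not taken from the original post:

// Broadcast a large read-only lookup structure once from the driver.
val lookup: Map[Int, String] = (0 until 1000).map(i => i -> s"value-$i").toMap
val bc = sc.broadcast(lookup)

// Reference bc.value inside closures so executors use the broadcast copy
// instead of serializing `lookup` with every task.
val rdd = sc.parallelize(0 until 10000)
val joined = rdd.map(i => (i, bc.value.getOrElse(i % 1000, "missing")))
println(joined.count())

// Release the broadcast when it is no longer needed.
bc.unpersist()   // removes cached copies on executors (destroy() would remove it everywhere)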

好程序员 shares: the big data architecture stack

不问归期 posted on 2019-11-27 05:24:58
Flume collects the data; MapReduce and HBase (on HDFS) process and store it; YARN is the resource scheduling system; on top sit a presentation platform and a data platform: 1) submit jobs, 2) display result data. Spark is the analysis engine (S3 can also serve as storage): it can perform all kinds of data analysis, it can be integrated with Hive, and Spark jobs can run on YARN. The entry class for submitting jobs to the cluster is the SparkContext (SC).

Why Spark: it is fast, easy to use, general-purpose, and highly compatible. Prerequisites: Hadoop, Scala, JDK, Spark. If a result is a fixed-length collection, toBuffer converts it into a variable-length one (see the sketch after this section).

Startup flow: Spark cluster startup and job submission. There is one master node (master) and multiple worker nodes (worker). The start-all.sh script first starts the master service and then the workers; each worker submits its registration information to the master, the master responds, and the workers then send heartbeat messages periodically.

Cluster startup flow:
1. The start-all script is invoked and starts the Master.
2. After the master starts, its preStart method sets up a timer that periodically checks for timed-out workers.
3. The startup script parses the slaves configuration file, finds the nodes on which workers should run, and starts the workers.
4. After a worker service starts, it calls its preStart method (a lifecycle method) and registers with all masters.
5. The master receives the registration information sent by the worker; the master
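A small Scala sketch of the two code-level points above, namely SparkContext (the "SC" entry class) and toBuffer; the master URL and application name are assumptions for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

// SparkContext is the entry point for submitting work to the cluster.
val conf = new SparkConf().setAppName("demo").setMaster("spark://hadoop01:7077")
val sc = new SparkContext(conf)

val result: Array[Int] = sc.parallelize(1 to 10).map(_ * 2).collect()

// collect() returns a fixed-length Array; toBuffer converts it to a mutable, variable-length buffer.
val buf = result.toBuffer
buf += 42
println(buf)

sc.stop()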

Why does partition parameter of SparkContext.textFile not take effect?

蹲街弑〆低调 posted on 2019-11-27 04:55:39
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // i've 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21

scala> p.partitions.size
res33: Int = 729

I was expecting 8 to be printed, and I see 729 tasks in the Spark UI.

EDIT: After calling repartition() as suggested by @zero323:

scala> p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8

scala> p1.count

I still see 729 tasks in the Spark UI even though the spark-shell prints 8.

Answer: If you take a look at the signature textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD
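A short sketch of the usual explanation: minPartitions is only a lower-bound hint handed to the Hadoop InputFormat (which may create many more splits), and repartition() inserts a shuffle, so the stage that reads the file still runs one task per original split. coalesce() without a shuffle is one way to actually reduce the read tasks. Paths below are illustrative:

// minPartitions is a hint: the Hadoop InputFormat may still create more splits,
// which is where a number like 729 comes from.
val p = sc.textFile("file:///c:/_home/so-posts.xml", 8)
println(p.partitions.size)      // may be much larger than 8

// repartition() adds a shuffle: the new RDD has 8 partitions, but the stage
// that reads the file still launches one task per original partition.
val p8 = p.repartition(8)
println(p8.partitions.size)     // 8

// coalesce() without a shuffle merges partitions in the same stage,
// so the read itself also runs with fewer tasks.
val c8 = p.coalesce(8)
println(c8.partitions.size)     // 8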

Spark : DB connection per Spark RDD partition and do mapPartition

冷暖自知 posted on 2019-11-27 04:35:30
Question: I want to do a mapPartitions on my Spark RDD:

val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection /* creates a db connection per partition */
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  })
  connection.close()
  newPartition
})

But this gives me a "connection already closed" exception, as expected, because partition.map is lazy: the connection is already closed by the time the records inside .map() are actually processed. I want to create a connection per RDD partition, and close it
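One common fix, sketched here with the question's hypothetical DbConnection and readMatchingFromDB helpers, is to materialize the mapped partition before closing the connection:

val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection // hypothetical helper from the question

  // Force the lazy iterator to be fully consumed while the connection is still open.
  val mapped = partition.map(record => readMatchingFromDB(record, connection)).toList

  connection.close()
  mapped.iterator
})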

Spark: subtract two DataFrames

江枫思渺然 posted on 2019-11-27 03:43:53
In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content that differs from the first one:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD. How can this be achieved with DataFrames in Spark version 1.3.0?

Answer: According to the API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataFrame2. In the PySpark docs it would be subtract:

df1.subtract(df2)

I tried subtract, but the result was not consistent. If I
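A minimal Scala sketch of except on DataFrames; the column names and rows are made up, and a SQLContext named sqlContext is assumed:

import sqlContext.implicits._   // assumes a SQLContext named sqlContext

val today = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val yesterday = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Rows present in `today` but not in `yesterday` (a distinct set difference).
val onlyNewData = today.except(yesterday)
onlyNewData.show()   // only the row (3, "c")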

take top N after groupBy and treat them as RDD

强颜欢笑 posted on 2019-11-27 03:35:27
Question: I'd like to get the top N items after groupByKey of an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened. The data is:

val data = sc.parallelize(Seq("foo" -> 3, "foo" -> 1, "foo" -> 2, "bar" -> 6, "bar" -> 5, "bar" -> 4))

The top N items per group are computed as:

val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) => key -> numbers.toList.sortBy(-_).take(2)
}

The result is:

(bar,List(6, 5))
(foo,List(3, 2))

which was
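A sketch of the flattening step the question asks for, reusing topNPerGroup from above; flatMapValues emits one (key, value) pair per list element:

import org.apache.spark.rdd.RDD

// Flatten RDD[(String, List[Int])] into RDD[(String, Int)].
val flattened: RDD[(String, Int)] = topNPerGroup.flatMapValues(identity)

flattened.collect().foreach(println)
// e.g. (bar,6) (bar,5) (foo,3) (foo,2) -- order may vary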

Spark SQL Internals — Study Notes (1)

时光总嘲笑我的痴心妄想 posted on 2019-11-27 03:28:29
This article follows the table of contents and content of 《Spark SQL内核剖析》 (Spark SQL Internals, by 朱峰, 张韶全, 黄明, et al.). The book concentrates on dissecting the implementation of the SQL engine and on learning the relevant distributed-computing and database techniques from the source code; it is well worth studying and buying for professionals with such needs. My goal in writing this article is to keep a set of study notes on Spark SQL based on the book and to share some of my own understanding.

What is Spark SQL? Spark SQL is one of the standouts among recent SQL-on-Hadoop solutions (which also include Hive, Presto, and Impala). It combines database-style SQL processing with Spark's distributed computing model, and it aims to replace the traditional data warehouse.

1. Spark basics

This section briefly introduces a few of the building blocks Spark relies on: the RDD programming model and the DataFrame and Dataset user interfaces.

1.1. The RDD programming model

The RDD is Spark's core data structure. Its full name is Resilient Distributed Dataset; in essence it is a distributed in-memory abstraction that represents a read-only collection of data partitions (Partition). The principles and techniques behind creating, computing, and transforming RDDs are outside the scope of this article; interested readers can look into them on their own. What matters here is that, as a resilient dataset, the RDD conveniently supports MapReduce-style applications, relational data processing, streaming data processing, and iterative applications (graph computation, machine learning, etc.).

1.2. DataFrame and
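As a quick illustration of the two user interfaces mentioned in this section (a sketch only, assuming a SparkSession named spark; the sample data is invented):

import spark.implicits._

// RDD programming model: a read-only, partitioned collection with functional operators.
val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))
val adults = rdd.filter { case (_, age) => age >= 18 }

// DataFrame / Dataset interface: the same data with a schema, queryable via SQL.
val df = rdd.toDF("name", "age")
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 18").show()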

Spark : How to use mapPartition and create/close connection per partition

☆樱花仙子☆ posted on 2019-11-27 02:56:53
Question: So, I want to perform certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:

import sqlContext.implicits._

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection
  iterator.map(row => {
    addRowToBatch(row)
    convertRowToObject(row)
  })
  conn.writeTheBatchToDB()
  conn.close()
}).toDF()

This gives me an error because mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with
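One way to make the closure return an iterator while still flushing and closing the connection once per partition, sketched with the question's hypothetical DbConnection, addRowToBatch, and convertRowToObject helpers:

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection   // hypothetical helper from the question

  // Force evaluation so every row is batched before the connection is used and closed.
  val objects = iterator.map { row =>
    addRowToBatch(row)
    convertRowToObject(row)
  }.toList

  conn.writeTheBatchToDB()
  conn.close()

  objects.iterator              // mapPartitions must return an Iterator
}).toDF()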

How to share Spark RDD between 2 Spark contexts?

丶灬走出姿态 posted on 2019-11-27 02:52:08
Question: I have an RMI cluster. Each RMI server has a Spark context. Is there any way to share an RDD between different Spark contexts?

Answer 1: As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications, you have to use shared contexts (see, for example, spark-jobserver, Livy, or Apache Zeppelin)