
Spark RDD Operators and Types

耗尽温柔 posted on 2019-11-27 05:49:56
Resilient Distributed Dataset (RDD)

1. What is an RDD?

val lines: RDD[String] = sc.textFile("hdfs://hadoop01:9000/da")

An RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. It represents a collection that is immutable, partitionable, and whose elements can be computed in parallel. RDDs have the characteristics of a data-flow model: automatic fault tolerance (through parent-child RDD lineage dependencies), locality-aware scheduling, and scalability. RDDs let users explicitly cache a working set in memory across multiple queries, so that subsequent queries can reuse it, which greatly improves query speed.

2. Properties of an RDD (from the comments in the Spark source code):

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
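To make these properties concrete, here is a minimal Scala sketch (assuming a live SparkContext named sc; the HDFS path is reused from the example above and is only illustrative):

// Build an RDD from a text file; nothing is read until an action runs.
val lines = sc.textFile("hdfs://hadoop01:9000/da")

// Transformations produce new (immutable) RDDs and record lineage.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))

// Inspect the properties listed above.
println(pairs.partitions.length)                        // the list of partitions
println(pairs.dependencies)                             // dependencies on parent RDDs
println(pairs.partitioner)                              // Partitioner, if any (None here)
println(pairs.preferredLocations(pairs.partitions(0)))  // preferred locations for one split

// Explicitly cache the working set so later actions can reuse it.
pairs.cache()
println(pairs.reduceByKey(_ + _).count())               // first action computes and caches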

Tips for properly using large broadcast variables?

六月ゝ 毕业季﹏ posted on 2019-11-27 05:43:45
Question: I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with:

>>> data = list(range(int(10*1e6)))
>>> import cPickle as pickle
>>> len(pickle.dumps(data))
98888896

Running on a cluster with 3 c3.2xlarge executors and an m3.large driver, with the following command launching the interactive session:

IPYTHON=1 pyspark --executor-memory 10G --driver-memory 5G --conf spark.driver.maxResultSize=5g

In an RDD, if I persist a reference to this broadcast variable, the
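For context, a minimal sketch of the usual broadcast pattern, shown in Scala rather than PySpark; the data, sizes, and names below are illustrative and not taken from the original post:

// Broadcast a large read-only lookup structure once from the driver.
val lookup: Map[Int, String] = (0 until 1000).map(i => i -> s"value-$i").toMap
val bc = sc.broadcast(lookup)

// Reference bc.value inside closures so executors use the broadcast copy
// instead of serializing `lookup` with every task.
val rdd = sc.parallelize(0 until 10000)
val joined = rdd.map(i => (i, bc.value.getOrElse(i % 1000, "missing")))
println(joined.count())

// Release the broadcast when it is no longer needed.
bc.unpersist()   // removes cached copies on executors (destroy() would remove it everywhere)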

好程序员 shares: the big data architecture stack

不问归期 posted on 2019-11-27 05:24:58
Flume collects the data; MapReduce and HBase (on HDFS) process and store it; YARN is the resource scheduling system; on top sit a presentation platform and a data platform: 1) submit jobs, 2) display result data. Spark is the analysis engine (S3 can also serve as storage): it can perform all kinds of data analysis, it can be integrated with Hive, and Spark jobs can run on YARN. The entry class for submitting jobs to the cluster is the SparkContext (SC).

Why Spark: it is fast, easy to use, general-purpose, and highly compatible. Prerequisites: Hadoop, Scala, JDK, Spark. If a result is a fixed-length collection, toBuffer converts it into a variable-length one (see the sketch after this section).

Startup flow: Spark cluster startup and job submission. There is one master node (master) and multiple worker nodes (worker). The start-all.sh script first starts the master service and then the workers; each worker submits its registration information to the master, the master responds, and the workers then send heartbeat messages periodically.

Cluster startup flow:
1. The start-all script is invoked and starts the Master.
2. After the master starts, its preStart method sets up a timer that periodically checks for timed-out workers.
3. The startup script parses the slaves configuration file, finds the nodes on which workers should run, and starts the workers.
4. After a worker service starts, it calls its preStart method (a lifecycle method) and registers with all masters.
5. The master receives the registration information sent by the worker; the master
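A small Scala sketch of the two code-level points above, namely SparkContext (the "SC" entry class) and toBuffer; the master URL and application name are assumptions for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

// SparkContext is the entry point for submitting work to the cluster.
val conf = new SparkConf().setAppName("demo").setMaster("spark://hadoop01:7077")
val sc = new SparkContext(conf)

val result: Array[Int] = sc.parallelize(1 to 10).map(_ * 2).collect()

// collect() returns a fixed-length Array; toBuffer converts it to a mutable, variable-length buffer.
val buf = result.toBuffer
buf += 42
println(buf)

sc.stop()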

Why does partition parameter of SparkContext.textFile not take effect?

蹲街弑〆低调 posted on 2019-11-27 04:55:39
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // i've 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21

scala> p.partitions.size
res33: Int = 729

I was expecting 8 to be printed, and I see 729 tasks in the Spark UI.

EDIT: After calling repartition() as suggested by @zero323:

scala> p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8

scala> p1.count

I still see 729 tasks in the Spark UI even though the spark-shell prints 8.

Answer: If you take a look at the signature textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD
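A short sketch of the usual explanation: minPartitions is only a lower-bound hint handed to the Hadoop InputFormat (which may create many more splits), and repartition() inserts a shuffle, so the stage that reads the file still runs one task per original split. coalesce() without a shuffle is one way to actually reduce the read tasks. Paths below are illustrative:

// minPartitions is a hint: the Hadoop InputFormat may still create more splits,
// which is where a number like 729 comes from.
val p = sc.textFile("file:///c:/_home/so-posts.xml", 8)
println(p.partitions.size)      // may be much larger than 8

// repartition() adds a shuffle: the new RDD has 8 partitions, but the stage
// that reads the file still launches one task per original partition.
val p8 = p.repartition(8)
println(p8.partitions.size)     // 8

// coalesce() without a shuffle merges partitions in the same stage,
// so the read itself also runs with fewer tasks.
val c8 = p.coalesce(8)
println(c8.partitions.size)     // 8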

Spark : DB connection per Spark RDD partition and do mapPartition

冷暖自知 posted on 2019-11-27 04:35:30
Question: I want to do a mapPartitions on my Spark RDD:

val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection /* creates a db connection per partition */
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  })
  connection.close()
  newPartition
})

But this gives me a "connection already closed" exception, as expected, because partition.map is lazy: the connection is already closed by the time the records inside .map() are actually processed. I want to create a connection per RDD partition, and close it
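One common fix, sketched here with the question's hypothetical DbConnection and readMatchingFromDB helpers, is to materialize the mapped partition before closing the connection:

val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection // hypothetical helper from the question

  // Force the lazy iterator to be fully consumed while the connection is still open.
  val mapped = partition.map(record => readMatchingFromDB(record, connection)).toList

  connection.close()
  mapped.iterator
})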

Spark: subtract two DataFrames

江枫思渺然 posted on 2019-11-27 03:43:53
In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content that differs from the first one:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD. How can this be achieved with DataFrames in Spark version 1.3.0?

Answer: According to the API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataFrame2. In the PySpark docs it would be subtract:

df1.subtract(df2)

I tried subtract, but the result was not consistent. If I
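A minimal Scala sketch of except on DataFrames; the column names and rows are made up, and a SQLContext named sqlContext is assumed:

import sqlContext.implicits._   // assumes a SQLContext named sqlContext

val today = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val yesterday = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Rows present in `today` but not in `yesterday` (a distinct set difference).
val onlyNewData = today.except(yesterday)
onlyNewData.show()   // only the row (3, "c")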

take top N after groupBy and treat them as RDD

强颜欢笑 posted on 2019-11-27 03:35:27
Question: I'd like to get the top N items after groupByKey of an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened. The data is:

val data = sc.parallelize(Seq("foo" -> 3, "foo" -> 1, "foo" -> 2, "bar" -> 6, "bar" -> 5, "bar" -> 4))

The top N items per group are computed as:

val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) => key -> numbers.toList.sortBy(-_).take(2)
}

The result is:

(bar,List(6, 5))
(foo,List(3, 2))

which was
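A sketch of the flattening step the question asks for, reusing topNPerGroup from above; flatMapValues emits one (key, value) pair per list element:

import org.apache.spark.rdd.RDD

// Flatten RDD[(String, List[Int])] into RDD[(String, Int)].
val flattened: RDD[(String, Int)] = topNPerGroup.flatMapValues(identity)

flattened.collect().foreach(println)
// e.g. (bar,6) (bar,5) (foo,3) (foo,2) -- order may vary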

Spark SQL Internals — Study Notes (1)

时光总嘲笑我的痴心妄想 posted on 2019-11-27 03:28:29
This article follows the table of contents and content of 《Spark SQL内核剖析》 (Spark SQL Internals, by 朱峰, 张韶全, 黄明, et al.). The book concentrates on dissecting the implementation of the SQL engine and on learning the relevant distributed-computing and database techniques from the source code; it is well worth studying and buying for professionals with such needs. My goal in writing this article is to keep a set of study notes on Spark SQL based on the book and to share some of my own understanding.

What is Spark SQL? Spark SQL is one of the standouts among recent SQL-on-Hadoop solutions (which also include Hive, Presto, and Impala). It combines database-style SQL processing with Spark's distributed computing model, and it aims to replace the traditional data warehouse.

1. Spark basics

This section briefly introduces a few of the building blocks Spark relies on: the RDD programming model and the DataFrame and Dataset user interfaces.

1.1. The RDD programming model

The RDD is Spark's core data structure. Its full name is Resilient Distributed Dataset; in essence it is a distributed in-memory abstraction that represents a read-only collection of data partitions (Partition). The principles and techniques behind creating, computing, and transforming RDDs are outside the scope of this article; interested readers can look into them on their own. What matters here is that, as a resilient dataset, the RDD conveniently supports MapReduce-style applications, relational data processing, streaming data processing, and iterative applications (graph computation, machine learning, etc.).

1.2. DataFrame and
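As a quick illustration of the two user interfaces mentioned in this section (a sketch only, assuming a SparkSession named spark; the sample data is invented):

import spark.implicits._

// RDD programming model: a read-only, partitioned collection with functional operators.
val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))
val adults = rdd.filter { case (_, age) => age >= 18 }

// DataFrame / Dataset interface: the same data with a schema, queryable via SQL.
val df = rdd.toDF("name", "age")
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 18").show()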

Spark : How to use mapPartition and create/close connection per partition

☆樱花仙子☆ posted on 2019-11-27 02:56:53
Question: So, I want to perform certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:

import sqlContext.implicits._

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection
  iterator.map(row => {
    addRowToBatch(row)
    convertRowToObject(row)
  })
  conn.writeTheBatchToDB()
  conn.close()
}).toDF()

This gives me an error because mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with
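One way to make the closure return an iterator while still flushing and closing the connection once per partition, sketched with the question's hypothetical DbConnection, addRowToBatch, and convertRowToObject helpers:

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection   // hypothetical helper from the question

  // Force evaluation so every row is batched before the connection is used and closed.
  val objects = iterator.map { row =>
    addRowToBatch(row)
    convertRowToObject(row)
  }.toList

  conn.writeTheBatchToDB()
  conn.close()

  objects.iterator              // mapPartitions must return an Iterator
}).toDF()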

How to share Spark RDD between 2 Spark contexts?

丶灬走出姿态 posted on 2019-11-27 02:52:08
Question: I have an RMI cluster. Each RMI server has a Spark context. Is there any way to share an RDD between different Spark contexts?

Answer 1: As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications, you have to use shared contexts (see, for example, spark-jobserver, Livy, or Apache Zeppelin)