rdd

The difference between flatMap and map in Spark

安稳与你 submitted on 2019-12-07 19:28:34
I used to always confuse flatMap and map in Spark. Now that I have finally got it straight, here is a summary, starting with the definitions. map() applies a function to every element of the RDD and builds a new RDD from the return values. flatMap() applies a function to every element of the RDD and builds a new RDD from all the elements of the returned iterators. That sounds a bit convoluted, so an example makes it clear.

val rdd = sc.parallelize(List("coffee panda", "happy panda", "happiest panda party"))

Input:
rdd.map(x => x).collect
Result:
res9: Array[String] = Array(coffee panda, happy panda, happiest panda party)

Input:
rdd.flatMap(x => x.split(" ")).collect
Result:
res8: Array[String] = Array(coffee, panda, happy, panda, happiest, panda, party)

Put plainly, flatMap is a map followed by a flatten. Here is another example:

val rdd1 = sc.parallelize(List(1, 2, 3, 3))
scala> rdd1.map(x =>
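The second example is cut off above. A minimal sketch of how the comparison usually continues (re-using the rdd1 just defined, and assuming a Spark shell where sc already exists):

val rdd1 = sc.parallelize(List(1, 2, 3, 3))

rdd1.map(x => x.to(3)).collect()
// one range per input element: (1..3), (2..3), (3..3), (3..3)

rdd1.flatMap(x => x.to(3)).collect()
// the ranges are flattened into a single collection: Array(1, 2, 3, 2, 3, 3, 3)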

Understanding map and flatMap in Spark

坚强是说给别人听的谎言 submitted on 2019-12-07 19:06:40
Note: this post records the difference between map and flatMap.

Function signatures:
1. data.map(function) is a method on data that takes a function as its argument. It applies function to every item of data and returns an RDD whose number of items equals that of the original data.
2. data.flatMap(function) is similar to map, except that each input item can produce 0 or more output items; in effect it is a map followed by a flattening step.

A visual way to understand map and flatMap:
3. map chaining: calling map repeatedly, in the form data.map().map().map().

Sample code for the map part (source: "Spark MLlib 机器学习实战 (2nd edition)" by 王晓华, p. 59):

import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.textFile("c://test.txt")    // create an RDD from the file path
  .map(_.split(' '))                     // split each line on " "
  .map(_.map(_.toDouble))                // convert each token to Double
  .map(line => Vectors.dense(line))      // convert each Array[Double] to a Vector

Reference: https://www.sogou.com/link?url=hedJjaC291MxDp_bleWQj5pP-YHblviUM3un4Y7gPgD-TMBYoIJyGI_2LFoNtBLn
Source: CSDN. Author:
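To illustrate the "0 or more output items" point above, here is a minimal sketch. The lines RDD and its blank line are made up for illustration; the Vectors import matches the book example, and sc is assumed to be an existing SparkContext:

import org.apache.spark.mllib.linalg.Vectors

val lines = sc.parallelize(Seq("1.0 2.0 3.0", "", "4.0 5.0 6.0"))

val vectors = lines
  .flatMap(l => if (l.trim.isEmpty) Nil else List(l))   // the empty line yields 0 outputs
  .map(_.split(' ').map(_.toDouble))                    // one Array[Double] per surviving line
  .map(arr => Vectors.dense(arr))                       // one Vector per surviving line

vectors.collect().foreach(println)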

Spark: How to split an RDD[T] into Seq[RDD[T]] and preserve the ordering

爷，独闯天下 submitted on 2019-12-07 13:01:24
Question: How can I effectively split up an RDD[T] into a Seq[RDD[T]] / Iterable[RDD[T]] with n elements and preserve the original ordering?

I would like to be able to write something like this

RDD(1, 2, 3, 4, 5, 6, 7, 8, 9).split(3)

which should result in something like

Seq(RDD(1, 2, 3), RDD(4, 5, 6), RDD(7, 8, 9))

Does Spark provide such a function? If not, what is a performant way to achieve this?

val parts = rdd.length / n
val rdds = rdd.zipWithIndex().map{ case (t, i) => (i - (i % parts), t)}
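As far as I know Spark does not ship such an operator (randomSplit distributes elements randomly rather than contiguously). A minimal sketch of one way to do it, as a hypothetical helper that is not from the original post (note that RDD has count() rather than length, and that zipWithIndex is used to get a stable global index):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def splitPreservingOrder[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  val indexed = rdd.zipWithIndex().cache()               // (element, global index)
  val total   = indexed.count()
  val chunk   = math.max(1L, math.ceil(total.toDouble / n).toLong)
  (0 until n).map { i =>
    indexed
      .filter { case (_, idx) => idx / chunk == i }      // keep the i-th contiguous slice
      .map(_._1)                                         // drop the index again
  }
}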

Spark: Group RDD Sql Query

こ雲淡風輕ζ submitted on 2019-12-07 11:24:18
I have 3 RDDs that I need to join.

val event1001RDD: schemaRDD = [eventtype,id,location,date1]
[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]

val event2009RDD: schemaRDD = [eventtype,id,celltype,date1] (not grouped by id since I need 4 dates from this depending on celltype)
[2009,4929101,R01,2015-01-20 20:44:39]
[2009,4929102,R02,2015-01-20 14:00:00] (RPM)
[2009,4929102,P01,2015-01-20 12:00:00] (PPM)
[2009,4929102,R03,2015-01-20 15:00:00] (RPM)
[2009,4929102,C01,2015-01-20 13:00:00] (RPM)
[2009,4929103,R01,2015-01-20
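The question is cut off above, but since it revolves around joining event RDDs on id, a minimal sketch of the usual pattern follows. It assumes the id column is the second field, as in the schemas shown, that each RDD is an RDD of Rows, and that the pair-RDD functions are in scope (automatic in Spark 1.3+):

// key each event RDD by its id field, then join so rows with the same id line up
val event1001ById = event1001RDD.map(row => (row(1).toString, row))
val event2009ById = event2009RDD.map(row => (row(1).toString, row))

val joined = event1001ById.join(event2009ById)   // RDD[(id, (event1001Row, event2009Row))]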

What is RDD dependency in Spark?

谁说我不能喝 submitted on 2019-12-07 11:17:41
Question: As I know there are two types of dependencies: narrow and wide. But I don't understand how the dependency affects the child RDD. Is the child RDD only metadata that contains info on how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data that was created from the parent RDD?

Answer 1: Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD. Consider org/apache/spark/rdd/MappedRDD.scala for example:

private[spark] class MappedRDD[U: ClassTag, T
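The quoted MappedRDD source is cut off above. As a rough user-level sketch in the same spirit (my own stand-in, not Spark's actual class, which overrides a few more members), a mapped child RDD carries only its parent and the function, and computes data lazily per partition:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

class MyMappedRDD[U: ClassTag, T](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {                          // registers a one-to-one (narrow) dependency on prev

  override protected def getPartitions: Array[Partition] =
    prev.partitions                               // same partitioning as the parent

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context).map(f)          // no data stored here; computed on demand per partition
}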

Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY

余生长醉 submitted on 2019-12-07 08:32:24
Question: This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table-generating UDFs for custom aggregations, but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required. Is there an efficient way to apply pair-RDD operations such as aggregateByKey to a DataFrame which has been grouped using GROUP BY or ordered using
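One common pattern is to drop to the underlying RDD, run the pair-RDD aggregation, and convert back. A minimal sketch with made-up column names "key" and "value", using the later SparkSession API rather than the SchemaRDD-era one the question refers to:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pair-rdd-aggregation").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("key", "value")

val aggregated = df.rdd
  .map(row => (row.getString(0), row.getInt(1)))   // DataFrame rows -> (key, value) pairs
  .aggregateByKey(0)(_ + _, _ + _)                 // within-partition and cross-partition sums
  .toDF("key", "sum")

aggregated.show()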

Spark (Part 3): Performance Optimization

∥☆過路亽.° submitted on 2019-12-06 20:35:05
Parameter configuration
1. spark-env.sh
2. Programmatically, via SparkConf or System.setProperty

Performance observation and logs
1) The Web UI.
2) The driver program's console logs.
3) Logs under the logs folder.
4) Logs under the work folder.
5) Profiler tools.

Scheduling and partition tuning
1. Merging small partitions
Frequent filtering, or filtering away a very large share of the data, creates a large number of small partitions. Spark assigns one task per data partition, so with too many tasks each task processes very little data: thread-switching overhead becomes large, many tasks sit waiting to run, and parallelism stays low. Solution: use Spark's repartitioning functions to compact the data, reducing the partition count by merging small partitions into larger ones.
Use the coalesce function to reduce the number of partitions. It returns a new RDD with numPartitions partitions, i.e. the whole RDD is repartitioned. When repartitioning from 10000 partitions down to 100, the two stages have a narrow dependency, so no shuffle is produced. But if the partition count drops drastically, for example from 10000 partitions to a single one, a problem appears: all the data ends up on one node for computation, and the cluster's parallel computing capacity cannot be exploited at all. To avoid this, set shuffle=true. Because a shuffle separates stages, the upstream tasks of the previous stage still run as 10000 partitions in parallel. Without the shuffle
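A minimal sketch of the two coalesce variants described above (the partition counts are only illustrative, and sc is assumed to be an existing SparkContext):

val wide = sc.parallelize(1 to 1000000, 10000)    // 10000 small partitions

val narrow    = wide.coalesce(100)                // narrow dependency, no shuffle
val collapsed = wide.coalesce(1, shuffle = true)  // the shuffle splits the stage, so the upstream work stays 10000-way parallel

println(narrow.getNumPartitions)     // 100
println(collapsed.getNumPartitions)  // 1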

A brief introduction to Spark's basic concepts and features

独自空忆成欢 submitted on 2019-12-06 17:55:03
1. What is Spark?
○ Highly scalable
○ Highly fault-tolerant
○ In-memory computation

2. The Spark ecosystem (BDAS, the Berkeley Data Analytics Stack)
○ MapReduce belongs to the Hadoop ecosystem, while Spark belongs to the BDAS ecosystem
○ Hadoop includes MapReduce, HDFS, HBase, Hive, Zookeeper, Pig, Sqoop, and so on
○ BDAS includes Spark, Shark (the counterpart of Hive), BlinkDB, Spark Streaming (a framework for real-time message processing, similar to Storm), and so on
○ BDAS ecosystem diagram:

3. Spark vs. MapReduce
Advantages:
○ MapReduce usually writes intermediate results to HDFS, whereas Spark is an in-memory parallel big-data framework that keeps intermediate results in memory, which makes Spark much more efficient for iterative workloads.
○ MapReduce always spends a lot of time sorting, even in scenarios where sorting is unnecessary; Spark can avoid the overhead of such unneeded sorting.
○ Spark models the computation as a DAG (a directed acyclic graph: a topology in which no path leads back to its starting point) and optimizes it.

4. APIs supported by Spark
Scala, Python, Java, etc.

5. Run modes
○ Local (for testing and development)
○ Standalone (independent cluster mode)
○ Spark on YARN (Spark running on YARN)
○ Spark on Mesos (Spark running on Mesos)

6. Spark at runtime

How to convert a JavaPairRDD to Dataset?

一世执手 submitted on 2019-12-06 16:46:23
SparkSession.createDataset() only allows a List, RDD, or Seq, but it doesn't support JavaPairRDD. So if I have a JavaPairRDD<String, User> that I want to create a Dataset from, would a viable workaround for the SparkSession.createDataset() limitation be to create a wrapper UserMap class that contains two fields, String and User, and then call spark.createDataset(userMap, Encoders.bean(UserMap.class))?

If you can convert the JavaPairRDD to List<Tuple2<K, V>>, then you can use the createDataset method that takes a List. See the sample code below.

JavaPairRDD<String, User> pairRDD = ...;
Dataset<Row> df = spark
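The Java answer above is cut off. For comparison, the same idea is short in Scala, since a pair RDD is just an RDD of tuples and an encoder for the tuple can be derived automatically. A minimal sketch with a hypothetical User case class standing in for the post's User bean:

import org.apache.spark.sql.SparkSession

case class User(name: String, age: Int)   // hypothetical stand-in for the post's User type

val spark = SparkSession.builder().appName("pairs-to-dataset").master("local[*]").getOrCreate()
import spark.implicits._

val pairRDD = spark.sparkContext.parallelize(Seq(("u1", User("Ann", 30)), ("u2", User("Bob", 25))))

val ds = pairRDD.toDS()                    // Dataset[(String, User)]
ds.show()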

Passing class functions to PySpark RDD

丶灬走出姿态 submitted on 2019-12-06 16:33:46
I have a class named some_class() in a Python file here:

/some-folder/app/bin/file.py

I am importing it into my code here:

/some-folder2/app/code/file2.py

by

import sys
sys.path.append('/some-folder/app/bin')
from file import some_class

clss = some_class()

I want to use this class's function named some_function in a Spark map:

sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))

This is giving me an error:

No module named file

clss.some_function works when I call it outside of PySpark's map function, i.e. normally, just not inside PySpark's RDD map. I think this has something to do