rdd

Why Spark doesn't allow map-side combining with array keys?

痞子三分冷 submitted on 2019-11-30 04:08:58
Question: I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow using array keys with map-side combining. A piece of the combineByKey function:
if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
}
Answer 1: Basically for the same reason the default partitioner cannot partition array keys. A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on its contents:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1,
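A common workaround, sketched below under the assumption that content-based key equality is what you actually want: convert each Array key to an immutable List (or wrap it in a case class) whose equals/hashCode look at the contents, after which hash partitioning and map-side combining are allowed. The object and variable names are illustrative, not from the question.

import org.apache.spark.{SparkConf, SparkContext}

object ArrayKeyWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("array-key-demo").setMaster("local[*]"))

    // Array keys hash by object identity, so Spark rejects them for hash partitioning
    // and map-side combining. Converting each key to a List gives content-based
    // equals/hashCode, so reduceByKey behaves as expected.
    val data = sc.parallelize(Seq((Array(1, 2), 10), (Array(1, 2), 5)))
    val combined = data
      .map { case (k, v) => (k.toList, v) }   // List(1, 2) == List(1, 2), unlike Array
      .reduceByKey(_ + _)                     // map-side combining is now allowed

    combined.collect().foreach(println)       // (List(1, 2),15)
    sc.stop()
  }
}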

Spill to disk and shuffle write spark

拜拜、爱过 submitted on 2019-11-30 04:00:42
I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an AppendOnlyMap for aggregating and combining partition records, right? Then, when execution memory fills up, we start sorting the map, spill it to disk, and then clear the map for the next spill (if one occurs). My questions are: What is the difference between spill to disk and shuffle write? Both basically consist of creating files on the local file system and writing records. Assuming they are different: spill records are sorted because they are passed through the map, whereas shuffle write records are not
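A small sketch of where the two terms show up in practice, assuming a plain reduceByKey job: the files the map stage finally hands to the reducers are the shuffle write, while any intermediate sorted files written because the in-memory map ran out of execution memory are spills. The data volume and the spark.memory.fraction value below are only there to make spills more likely, not a recommendation.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleWriteDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-write-demo")
      .setMaster("local[*]")
      .set("spark.memory.fraction", "0.2")   // less execution memory -> earlier spills

    val sc = new SparkContext(conf)

    val counts = sc.parallelize(1 to 10000000, 8)
      .map(i => (i % 100000, 1))   // map side: records are aggregated in an in-memory map;
                                   // when execution memory fills up they are sorted and spilled to disk
      .reduceByKey(_ + _)          // the merged output the map stage writes for the reducers
                                   // is what the UI reports as "Shuffle Write"

    counts.count()                 // compare the "Shuffle Write" and "Spill" columns in the Spark UI
    sc.stop()
  }
}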

foldLeft or foldRight equivalent in Spark?

纵然是瞬间 submitted on 2019-11-30 03:20:44
Question: In Spark's RDDs and DStreams we have the 'reduce' function for collapsing an entire RDD into one element. However, the reduce function takes (T, T) => T. If we want to reduce a List in Scala we can use foldLeft or foldRight, which take (B)((B, A) => B). This is very useful because you can start folding with a type other than what is in your list. Is there a way in Spark to do something similar, where I can start with a value whose type differs from that of the elements in the RDD itself
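The usual answer is RDD.aggregate, which, like foldLeft, lets the accumulator type differ from the element type, but needs an extra combine function because partitions are folded in parallel. A minimal sketch with made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object AggregateDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregate-demo").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "rdd", "fold"))

    // Accumulator type (Int, the total length) differs from the element type (String).
    // seqOp folds elements into the accumulator within each partition;
    // combOp merges the per-partition accumulators, so the result must not depend on
    // a strict left-to-right order the way foldLeft/foldRight do.
    val totalLength = words.aggregate(0)(
      (acc, word) => acc + word.length,   // seqOp: (B, A) => B
      (a, b) => a + b                     // combOp: (B, B) => B
    )

    println(totalLength)   // 12
    sc.stop()
  }
}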

Why the Spark's repartition didn't balance data into partitions?

爱⌒轻易说出口 submitted on 2019-11-29 23:35:30
Question:
>>> rdd = sc.parallelize(range(10), 2)
>>> rdd.glom().collect()
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
>>> rdd.repartition(3).glom().collect()
[[], [0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
The first partition is empty? Why? I would really appreciate you telling me the reason.
Answer 1: That happens because Spark doesn't shuffle individual elements but rather blocks of data, with a minimum batch size equal to 10. So if you have fewer elements than that per partition, Spark won't split up the contents of a partition.

updateStateByKey

孤人 submitted on 2019-11-29 23:00:02
The updateStateByKey operation lets you maintain arbitrary state and keep updating it as new information arrives.
1. Define the state - the state can be of any data type.
2. Define the state update function - specify with a function how to update the state from the previous state and the new values coming in from the input stream.
The Spark documentation is vague about how to use this function, and material online is not very thorough either, so I went through the source code, summarized it, and put together a complete example.
updateStateByKey has 6 overloads:
1. Passing in only an update function, the simplest form. The update function takes two parameters, Seq[V] and Option[S]: the former is the collection of values newly added for each key, the latter is the currently saved state.
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
For example, for wordcount we can define the update function like this:
(values: Seq[Int], state: Option[Int]) => {
  // create a variable to record how many times the word has appeared
  var newValue = state.
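The snippet above is cut off mid-definition; below is a minimal, self-contained sketch of the same single-overload wordcount pattern. The socket source, port 9999, and the checkpoint path are illustrative placeholders, not taken from the original post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("update-state-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/spark-checkpoint")   // updateStateByKey requires a checkpoint directory

    // values: the counts newly arrived for this key in the current batch
    // state:  the running total saved so far
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val newValue = state.getOrElse(0) + values.sum
      Some(newValue)
    }

    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative input source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateFunc)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}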

spark

若如初见. submitted on 2019-11-29 20:03:50
http://spark.apache.org/
MR vs. Spark: roughly 1:100 in memory; 1:10 on disk. DAG.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources. Mesos
resilient UK [rɪ'zɪlɪənt] US [rɪ'zɪlɪənt] adj. springing back, elastic
Storm: stream processing - data keeps flowing in continuously and results keep being produced continuously.
Spark Streaming: stream processing. Spark Core: batch processing. Spark SQL: ad-hoc processing (SQL queries).
Official Spark configuration options: http://spark.apache.org/docs/latest/configuration.html
pair n. a pair, a couple, a set; vt. to form into pairs
RDD (Resilient Distributed Datasets) [1], resilient distributed dataset, an abstraction of distributed memory
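A minimal illustration of the RDD abstraction mentioned above, assuming local mode and made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-intro").setMaster("local[*]"))

    // An RDD is a collection partitioned across the cluster and operated on in parallel.
    val numbers = sc.parallelize(1 to 100, 4)   // 4 partitions
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    println(evenSquares.count())   // 50
    sc.stop()
  }
}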

【Spark operator example】

≯℡__Kan透↙ submitted on 2019-11-29 20:01:48
package spark_example01;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

/**
 */
public class PeopleInfoFileGenerator {
    public static void main(String[] args){
        File file = new File("/Users/xls/Desktop/code/bigdata/data/PeopleInfo.txt");

        try {
            Random random = new Random();                  // random number generator
            FileWriter fileWriter = new FileWriter(file);  // create a new file
            for (long i = 1; i <= 100000000; i++) {        // generate 100,000,000 records
                int height = random.nextInt(220);
                if (height < 50) {
                    height = height + 50;
                }
                String gender = getRandomGender

spark05

北城余情 submitted on 2019-11-29 19:05:18
def main(args: Array[String]): Unit = {
  // each user's favorite movie genre
  // view count, average rating
  val conf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("movie")
  val sc = new SparkContext(conf)

  val ratRDD: RDD[String] = sc.textFile("ratings.txt")
  val mRDD: RDD[String]   = sc.textFile("movies.txt")

  val ratRDD1: RDD[(String, String)] = ratRDD.map(t => {
    val strs = t.split(",")
    (strs(1), strs(0))   // mId, userId
  })

  val mRDD1: RDD[(String, String)] = mRDD.flatMap(t => {
    val strs = t.split(",")
    val mid = strs(0)
    val types = strs(strs.length - 1).split("\\|")
    val mtype = types.map(tp => {
      (mid, tp)
    })
    mtype
  })
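The code above stops before the two pair RDDs are combined. One possible way to finish the computation is sketched below; it only assumes the (mId, userId) and (mId, genre) shapes built above, and the sample data is made up.

import org.apache.spark.{SparkConf, SparkContext}

object FavoriteGenre {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("movie").setMaster("local[*]"))

    // Shaped like ratRDD1 (mId, userId) and mRDD1 (mId, genre) above; sample data is made up.
    val ratRDD1 = sc.parallelize(Seq(("1", "u1"), ("2", "u1"), ("3", "u2")))
    val mRDD1   = sc.parallelize(Seq(("1", "Comedy"), ("2", "Comedy"), ("3", "Drama")))

    val favorite = ratRDD1.join(mRDD1)                    // (mId, (userId, genre))
      .map { case (_, (user, genre)) => ((user, genre), 1) }
      .reduceByKey(_ + _)                                 // how often each user watched each genre
      .map { case ((user, genre), cnt) => (user, (genre, cnt)) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)  // keep the most-watched genre per user

    favorite.collect().foreach(println)   // e.g. (u1,(Comedy,2)), (u2,(Drama,1))
    sc.stop()
  }
}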

spark04

隐身守侯 submitted on 2019-11-29 19:04:16
join  leftOuterJoin  rightOuterJoin  cogroup

scala> var arr = Array(("zhangsan",200),("lisi",300),("wangwu",350))
arr: Array[(String, Int)] = Array((zhangsan,200), (lisi,300), (wangwu,350))

scala> var arr1 = Array(("zhangsan",10),("lisi",15),("zhaosi",20))
arr1: Array[(String, Int)] = Array((zhangsan,10), (lisi,15), (zhaosi,20))

scala> sc.makeRDD(arr,3)
res0: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> sc.makeRDD(arr1,3)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at makeRDD at <console>:27
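The session is cut off before the four operators are actually applied. Below is a sketch of what they would do on these two RDDs, written as a small program rather than a REPL transcript; element order in the collected output may vary.

import org.apache.spark.{SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-demo").setMaster("local[*]"))

    val salary = sc.makeRDD(Array(("zhangsan", 200), ("lisi", 300), ("wangwu", 350)), 3)
    val bonus  = sc.makeRDD(Array(("zhangsan", 10), ("lisi", 15), ("zhaosi", 20)), 3)

    // join: only keys present on both sides -> (zhangsan,(200,10)), (lisi,(300,15))
    salary.join(bonus).collect().foreach(println)

    // leftOuterJoin: every left key; missing right values become None -> (wangwu,(350,None)), ...
    salary.leftOuterJoin(bonus).collect().foreach(println)

    // rightOuterJoin: every right key; missing left values become None -> (zhaosi,(None,20)), ...
    salary.rightOuterJoin(bonus).collect().foreach(println)

    // cogroup: all keys from both sides, each with the full Iterable of values from each RDD
    salary.cogroup(bonus).collect().foreach(println)

    sc.stop()
  }
}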

spark02

浪子不回头ぞ submitted on 2019-11-29 19:00:39
Custom resource allocation

--executor-cores    --executor-memory    --total-executor-cores (the maximum total number of cores the application may use)

3 machines, each with 8 cores and 1G of memory:

--executor-cores   --executor-memory   --total-executor-cores   executors
       8                  1G                     -                  3
       4                  1G                     -                  3
       4                  1G                     4                  1
       4                 512M                    -                  6
       4                 512M                    8                  2
       6                 512M                    -                  3

Introduction to RDDs

At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are
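A minimal sketch of how those three settings map onto Spark configuration keys (spark.executor.cores, spark.executor.memory, spark.cores.max). The values copy the "4 / 512M / 8 -> 2 executors" row of the table; in practice they are usually passed to spark-submit as the command-line flags above rather than set in code, and the app name is a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

object ResourceAllocationDemo {
  def main(args: Array[String]): Unit = {
    // Equivalent to: --executor-cores 4 --executor-memory 512M --total-executor-cores 8
    val conf = new SparkConf()
      .setAppName("resource-allocation-demo")
      .set("spark.executor.cores", "4")      // cores per executor
      .set("spark.executor.memory", "512m")  // memory per executor
      .set("spark.cores.max", "8")           // cap on total cores (standalone mode), i.e. --total-executor-cores
    // The master URL is expected to come from spark-submit (e.g. --master spark://host:7077).

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}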