rdd

What does “Stage Skipped” mean in Apache Spark web UI?

与世无争的帅哥 submitted on 2019-11-26 13:01:53
From my Spark UI: what does "skipped" mean? Typically it means that the data has been fetched from cache and there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data: Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is
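
A minimal spark-shell sketch of the behaviour described above (the numbers and the modulo key are made up): running the same shuffle-based action twice lets the second run reuse the shuffle files, and the map-side stage is then reported as skipped in the UI.

val pairs  = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))
val counts = pairs.reduceByKey(_ + _)   // reduceByKey introduces a shuffle
counts.count()   // first action: both stages actually run
counts.count()   // second action: the pre-shuffle stage shows as "skipped",
                 // because its shuffle output from the first run is reused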

Why does partition parameter of SparkContext.textFile not take effect?

￣綄美尐妖づ submitted on 2019-11-26 12:46:27
Question:
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I've 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21
scala> p.partitions.size
res33: Int = 729
I was expecting 8 to be printed, and I see 729 tasks in the Spark UI. EDIT: After calling repartition() as suggested by @zero323:
scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count
I still see 729 tasks in the Spark UI even though the spark-shell prints 8. Answer 1:
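
The answer is cut off above. As a hedged aside (not the original answerer's wording): the second argument of textFile is only a minimum number of partitions; the Hadoop InputFormat decides the actual split count, which is where the 729 comes from, and a later repartition(8) does not change how many tasks read the file. A sketch reusing the poster's path:

val p   = sc.textFile("file:///c:/_home/so-posts.xml", 8)  // at least 8 partitions; the InputFormat produced 729 splits
val p8  = p.repartition(8)   // shuffles down to exactly 8 partitions
val p8b = p.coalesce(8)      // merges existing splits instead, avoiding a shuffle
p8.partitions.size           // 8
p8.count()                   // still shows ~729 tasks for the stage reading the 729 input splits, plus 8 for the shuffled stage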

Spark groupByKey alternative

主宰稳场 submitted on 2019-11-26 12:27:15
Question: According to Databricks best practices, Spark groupByKey should be avoided, because groupByKey works by first shuffling all the information across the workers and only then doing the processing. Explanation: So my question is, what are the alternatives to groupByKey that return the following in a distributed and fast way?
// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1
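
The excerpt stops mid-example; the usual suggestion (a sketch, not the original answer) is to pre-combine values on the map side with reduceByKey or aggregateByKey instead of shipping every raw value the way groupByKey does. If the goal really is to keep every value per key, the shuffle volume is similar either way; the big win comes when values can be reduced (sums, counts, and so on). The sample pairs below mirror the keys in the question:

val pairs = sc.parallelize(Seq(("key1", "1"), ("key1", "2"), ("key1", "3"),
                               ("key2", "55"), ("key2", "66")))
val grouped = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // fold each value into a per-partition list
  (a, b)   => a ::: b)    // merge the partial lists after the shuffle
grouped.collect()         // e.g. Array((key1, List(3, 2, 1)), (key2, List(66, 55)))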

spark

别来无恙 submitted on 2019-11-26 11:35:01
Paper: Spark: Cluster Computing with Working Sets
1. Background: Spark targets work that Hadoop handles poorly. It is not that Hadoop cannot do it, but that it does it badly and slowly, because every Hadoop MapReduce job writes its intermediate data to disk and then reads it back from disk, which costs a great deal of time. Hadoop struggles in particular with problems that need the same data to be used over and over, in two typical situations:
(1) Iterative jobs: repeatedly operating on one dataset, for example K-means or gradient-descent-style algorithms that iterate over the data. Hadoop breaks every iteration into a separate MapReduce job and reads the data from disk each time, which severely hurts performance.
(2) Interactive analytics: repeatedly running queries against the same dataset for analysis.
In short, Hadoop always reads its data from disk, which severely limits performance; it would be much better if the data could be kept in memory.
2. Solution: The Spark framework keeps Hadoop's strengths and solves these problems by putting data in memory. Spark is still built on the Hadoop ecosystem, and data is still stored in HDFS. How is this achieved? By introducing the RDD. What is an RDD: the RDD is Spark's abstract data structure type; inside Spark, all data is represented as RDDs, which are read-only, distributed collections of data. You can think of one roughly as an array or a table
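
A minimal spark-shell sketch of the point above (the HDFS path is made up): cache an RDD once and reuse it across several computations without re-reading from disk.

val lines = sc.textFile("hdfs:///data/app.log").cache()    // keep the working set in memory

val errors   = lines.filter(_.contains("ERROR")).count()   // first action reads HDFS and fills the cache
val warnings = lines.filter(_.contains("WARN")).count()    // later actions reuse the in-memory partitions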

The differences between RDD, DataFrame, and DataSet

≯℡__Kan透↙ submitted on 2019-11-26 10:33:39
● Structure
RDD[Person] takes Person as its type parameter, but the Spark framework itself knows nothing about the internal structure of the Person class.
DataFrame adds detailed structural information, a schema, so Spark SQL knows exactly which columns the dataset contains and the name and type of each column; it then looks just like a table.
DataSet[Person] carries not only the schema but also the type information.
● Data
1. Suppose two rows of data in the RDD look like this: RDD[Person]
2. Then the data in the DataFrame looks like this: DataFrame = DataSet[Row] = RDD[Person] - generics + Schema + SQL operations + optimization
3. Then the data in the Dataset looks like this (each row is an object): Dataset[Person] = DataFrame + generics, or like this: Dataset[Row]
Summary:
DataFrame = RDD - generics + Schema + SQL + optimization
DataSet = DataFrame + generics
DataSet = RDD + Schema + SQL + optimization
DataFrame = DataSet[Row]
Source: https://blog.csdn.net/qq_38483094/article/details/98787864
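
A short spark-shell sketch of the three abstractions (Person and its sample rows are made up):

case class Person(name: String, age: Int)
import spark.implicits._

val rdd = sc.parallelize(Seq(Person("Zhang San", 30), Person("Li Si", 25)))  // RDD[Person]: Spark knows nothing about the fields
val df  = rdd.toDF()   // DataFrame = Dataset[Row]: carries the schema (name: string, age: int) but no compile-time type
val ds  = rdd.toDS()   // Dataset[Person]: carries both the schema and the Person type
df.printSchema()       // Spark SQL can see every column and its type
ds.map(_.age + 1)      // typed operations are checked at compile time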

Spark Application, Driver, Job, stage, task

南楼画角 submitted on 2019-11-26 10:12:56
1. Application
An application is simply a program submitted with spark-submit. An application usually has three parts: form RDDs from a data source (for example HDFS), compute on them through RDD transformations and actions, and write the results to the console or to external storage.
2. Driver
The driver in Spark plays much the same role as the Application Master in YARN: it schedules tasks and coordinates with the executors and the cluster manager. There are two deploy modes, client and cluster. In client mode the driver runs on the machine the job was submitted from, while in cluster mode one of the cluster's machines is chosen to start the driver. Loosely speaking, the driver can be thought of as the user's own program: once we submit a Spark job with spark-submit, the job starts a corresponding Driver process.
The Driver process occupies a certain amount of memory and CPU cores according to the parameters we set. The first thing the Driver does is ask the cluster manager (commonly YARN) for the resources the Spark job needs, where "resources" means Executor processes. The YARN cluster manager then starts a number of Executor processes on the worker nodes according to the resource parameters we set for the job
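
A hedged sketch of the resource parameters the passage refers to, set on a SparkConf (they are more commonly passed to spark-submit; the values here are arbitrary examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-application")
  .set("spark.executor.memory", "4g")      // memory per Executor process
  .set("spark.executor.cores", "2")        // CPU cores per Executor
  .set("spark.executor.instances", "10")   // how many Executors to request from YARN
// spark.driver.memory usually has to be given to spark-submit itself,
// because the Driver JVM is already running by the time this code executes
val sc = new SparkContext(conf)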

Parsing multiline records in Scala

笑着哭i submitted on 2019-11-26 10:01:53
Question: Here is my RDD[String]:
M1 module1
PIP a Z A
PIP b Z B
PIP c Y n4
M2 module2
PIP a I n4
PIP b O D
PIP c O n5
and so on. Basically, I need an RDD of keys (containing the second word on line 1) and values of the subsequent PIP lines that can be iterated over. I've tried the following:
val usgPairRDD = usgRDD.map(x => (x.split("\\n")(0), x))
but this gives me the following output:
(,)
(M1 module1,M1 module1)
(PIP a Z A,PIP a Z A)
(PIP b Z B,PIP b Z B)
(PIP c Y n4,PIP c Y n4)
(,)
(M2 module2,M2
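
The excerpt ends before any answer. One hedged way to build (key, PIP-lines) pairs, assuming each input file is small enough to read whole (the directory path is made up, and this is not necessarily the original poster's solution):

val files = sc.wholeTextFiles("hdfs:///data/usage/").values   // one whole file per element

val usgPairRDD = files.flatMap { content =>
  // a new record starts at a line beginning with "M<digits> "
  content.split("\n(?=M\\d+\\s)").filter(_.trim.nonEmpty).map { record =>
    val lines = record.split("\n")
    val key   = lines.head.split("\\s+")(1)                    // second word of line 1, e.g. "module1"
    (key, lines.tail.filter(_.startsWith("PIP")).toSeq)        // the subsequent PIP lines
  }
}

wholeTextFiles keeps each file in a single task, so for very large files a custom record delimiter (Hadoop's textinputformat.record.delimiter) is the more scalable route.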

A list as a key for PySpark's reduceByKey

天涯浪子 submitted on 2019-11-26 09:59:18
Question: I am attempting to call the reduceByKey function of PySpark on data of the format (([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ... It seems PySpark will not accept an array as the key in a normal key/value reduction by simply applying .reduceByKey(add). I have already tried first converting the array to a string with .map((x,y): (str(x),y)), but this does not work because post-processing the strings back into arrays is too slow. Is there a way I can make PySpark use the array as a key, or use
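
The question is cut off. As a hedged note (not from this post): RDD keys need consistent hashing and equality, so in PySpark the common workaround is to map the list key to a tuple before calling reduceByKey. The closest Scala analog, sketched below with the keys from the question, is that Array keys also misbehave (arrays use reference equality) and are converted to a List first:

val data = sc.parallelize(Seq((Array("a", "b", "c"), 1),
                              (Array("a", "b", "c"), 1),
                              (Array("a", "d", "b", "e"), 1)))

val summed = data
  .map { case (k, v) => (k.toList, v) }   // List has value-based equals/hashCode, unlike Array
  .reduceByKey(_ + _)

summed.collect()   // e.g. Array((List(a, b, c), 2), (List(a, d, b, e), 1))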

Spark union of multiple RDDs

你说的曾经没有我的故事 submitted on 2019-11-26 09:08:39
Question: In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise:
first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# .... and so on
Is there a union operator that will let me operate on multiple RDDs at a time, e.g. union(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)? It is a matter of convenience. Answer 1: If these are RDDs you
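
The answer above is truncated; the usual fix (a sketch assuming rdd1 through rdd6 are RDDs of the same element type) is SparkContext.union, which takes a whole sequence of RDDs in one call:

val rdds = Seq(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)   // the six RDDs from the question
val allCombined = sc.union(rdds)                     // one call instead of a pairwise chain

// A fold over the binary operator also works, at the cost of nesting one UnionRDD per step:
val allCombinedAlt = rdds.reduce(_ union _)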

How DAG works under the covers in RDD?

岁酱吖の submitted on 2019-11-26 08:43:23
Question: The Spark research paper has prescribed a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially for Machine Learning. However, the material uncovering the internal mechanics of Resilient Distributed Datasets and the Directed Acyclic Graph seems lacking in this paper. Would it be better learned by investigating the source code? Answer 1: Even I have been looking on the web to learn about how Spark computes the
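
A hedged aside while the answer is cut off: the DAG that Spark builds for a job can be inspected from the shell with toDebugString, which prints the RDD lineage with indentation marking the stage (shuffle) boundaries. The file path below is made up.

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)   // prints textFile -> flatMap -> map, then a ShuffledRDD created by reduceByKey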