rdd

How to convert Spark RDD to pandas dataframe in ipython?

Submitted by 爷,独闯天下 on 2019-11-30 10:50:25
I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal DataFrame we can do df = rdd1.toDF(), but I want to convert the RDD to a pandas DataFrame, not a normal DataFrame. How can I do it?

You can use the function toPandas(): Returns the contents of this DataFrame as a pandas.DataFrame. This is only available if Pandas is installed and available.

>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob

RKD314: You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired pandas DataFrame. For example, let's say I have a text
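
As an illustration of the two-step conversion described above (RDD to Spark DataFrame to pandas DataFrame), here is a minimal PySpark sketch; the sample data and app name are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-pandas").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(2, "Alice"), (5, "Bob")])   # illustrative data
df = rdd.toDF(["age", "name"])                     # step 1: RDD -> Spark DataFrame
pdf = df.toPandas()                                # step 2: Spark DataFrame -> pandas DataFrame
print(pdf)

Note that toPandas() collects every row onto the driver, so it is only safe for data that fits in driver memory.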

Not able to declare String type accumulator

Submitted by 痴心易碎 on 2019-11-30 08:40:34
Question: I am trying to define an accumulator variable of type String in the Scala shell (driver), but I keep getting the following error:

scala> val myacc = sc.accumulator("Test")
<console>:21: error: could not find implicit value for parameter param: org.apache.spark.AccumulatorParam[String]
       val myacc = sc.accumulator("Test")
                      ^

This is not an issue for Int or Double accumulators. Thanks.

Answer 1: That's because Spark by default provides only accumulators of type Long, Double and Float. If you
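
The question is about the Scala shell, but the same idea, supplying an AccumulatorParam for a non-numeric type, can be sketched in PySpark (class and variable names here are illustrative):

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class StringAccumulatorParam(AccumulatorParam):
    def zero(self, initial_value):
        # identity element used for the per-task accumulators
        return ""
    def addInPlace(self, v1, v2):
        # how two partial values are merged
        return v1 + v2

sc = SparkContext.getOrCreate()
myacc = sc.accumulator("Test", StringAccumulatorParam())
sc.parallelize(["a", "b", "c"]).foreach(lambda x: myacc.add(x))
print(myacc.value)   # "Test" plus the worker contributions; merge order is not guaranteed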

Get the max value for each key in a Spark RDD

Submitted by 不羁的心 on 2019-11-30 07:23:07
What is the best way to return the max row (value) associated with each unique key in a Spark RDD? I'm using Python and I've tried Math max, mapping and reducing by keys, and aggregates. Is there an efficient way to do this? Possibly a UDF? I have, in RDD format: [(v, 3), (v, 1), (v, 1), (w, 7), (w, 1), (x, 3), (y, 1), (y, 1), (y, 2), (y, 3)] and I need to return: [(v, 3), (w, 7), (x, 3), (y, 3)]. Ties can return the first value or a random one.

Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:

(Scala) val grouped = rdd.reduceByKey(math.max(_, _))
(Python) grouped = rdd
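
A runnable PySpark version of the reduceByKey approach, using the pairs from the question (keys written as strings here):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([("v", 3), ("v", 1), ("v", 1), ("w", 7), ("w", 1),
                      ("x", 3), ("y", 1), ("y", 1), ("y", 2), ("y", 3)])

# reduceByKey keeps one value per key, combining pairs with Python's built-in max
result = rdd.reduceByKey(max).collect()
print(result)   # [('v', 3), ('w', 7), ('x', 3), ('y', 3)] (key order may differ)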

How Spark Works

Submitted by 岁酱吖の on 2019-11-30 07:03:16
Apache Spark is a big-data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at UC Berkeley's AMPLab and became one of Apache's open-source projects in 2010. Compared with other big-data and MapReduce technologies such as Hadoop and Storm, Spark has the following advantages:
1. Fast execution. Spark has a DAG execution engine and supports iterative computation on data in memory. Official figures indicate it is more than 10x faster than Hadoop MapReduce when data is read from disk, and up to 100x faster when data is read from memory.
2. Broad applicability: big-data analytics, real-time data processing, graph computation, and machine learning.
3. Ease of use: programs are simple to write, more than 80 high-level operators and multiple languages are supported, data sources are plentiful, and Spark can be deployed on many kinds of clusters.
4. High fault tolerance. Spark introduces the Resilient Distributed Dataset (RDD) abstraction, a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt from its "lineage" (the recorded derivation process). In addition, RDD computations can be made fault tolerant through checkpointing, which comes in two flavors, checkpointing the data and logging the updates, and the user can choose which one to use (a minimal checkpointing sketch follows this excerpt).
Spark's typical use cases. Current big-data processing scenarios fall into the following types: 1. Complex batch processing
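
As a minimal PySpark sketch of the checkpointing mentioned in point 4 (the "checkpoint data" flavor); the checkpoint directory is illustrative and should be reliable storage such as HDFS on a real cluster:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setCheckpointDir("/tmp/spark-checkpoints")     # illustrative path

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()   # marks the RDD; data is written and the lineage truncated at the next action
rdd.count()        # action that triggers the checkpoint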

How Can I Obtain an Element Position in Spark's RDD?

Submitted by 我只是一个虾纸丫 on 2019-11-30 06:53:51
I am new to Apache Spark, and I know that the core data structure is the RDD. Now I am writing some apps which require element positional information. For example, after converting an ArrayList into a (Java) RDD, for each integer in the RDD I need to know its (global) array subscript. Is it possible to do this? As far as I know, there is a take(int) function for RDD, so I believe the positional information is still maintained in the RDD.

Essentially, RDD's zipWithIndex() method seems to do this, but it won't preserve the original ordering of the data the RDD was created from. At least you'll get a stable ordering.
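
A short PySpark sketch of zipWithIndex(); the input values are illustrative:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([10, 20, 30, 40])

# zipWithIndex pairs each element with its position, based on partition order
indexed = rdd.zipWithIndex().collect()
print(indexed)   # [(10, 0), (20, 1), (30, 2), (40, 3)]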

How to convert an RDD[Row] back to DataFrame [duplicate]

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-30 06:49:26
Question (This question already has answers here: How to convert rdd object to dataframe in spark (10 answers). Closed 3 years ago.): I've been playing around with converting RDDs to DataFrames and back again. First, I had an RDD of type (Int, Int) called dataPair. Then I created a DataFrame object with column headers using: val dataFrame = dataPair.toDF(header(0), header(1)). Then I converted it from a DataFrame back to an RDD using: val testRDD = dataFrame.rdd, which returns an RDD of type org.apache
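
The question uses Scala, but the round trip can be sketched in PySpark; the data and column names are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])   # illustrative (Int, Int) data
rows = df.rdd                                              # DataFrame -> RDD of Row objects

# Back to a DataFrame: either reuse the original schema...
df_again = spark.createDataFrame(rows, df.schema)
# ...or map the Rows to tuples and name the columns again
df_again2 = rows.map(lambda r: (r.a, r.b)).toDF(["a", "b"])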

An Introduction to Spark

Submitted by 淺唱寂寞╮ on 2019-11-30 06:16:10
1. What is Spark? A fast, general-purpose cluster computing platform.
2. Spark's characteristics.
Fast: First, Spark is a cluster computing platform that builds on and optimizes MapReduce; it extends the MapReduce computation model. Second, Spark computes in memory. What does "in memory" mean? When we process data we rarely get the result in one step; several rounds of computation are usually needed to reach an accurate, precise value, and each round produces an intermediate result that later rounds still need, so it has to be kept somewhere temporarily. That place is either disk or memory. Keeping intermediate results on disk inevitably incurs extra writes and reads, whereas Spark keeps them in memory, which greatly reduces running time and increases efficiency: for the same data, MapReduce might take minutes to hours while Spark finishes in seconds to minutes.
General: Spark's design absorbs the functionality of many other distributed processing systems, such as batch processing (Hadoop), interactive queries (Hive), and stream processing (Storm); consolidating them greatly lowers cluster maintenance costs.
Open: As mentioned in the earlier introduction to Scala, Scala's popularity is closely tied to the rise of Spark, because Spark's core is written in Scala. But Spark is not limited to Scala; it also provides APIs for Java, Python, and other languages, as well as a rich set of built-in libraries. Moreover, Spark and Hadoop

How to calculate the best numberOfPartitions for coalesce?

Submitted by ⅰ亾dé卋堺 on 2019-11-30 05:51:23
Question: So, I understand that in general one should use coalesce() when the number of partitions decreases due to a filter or some other operation that may reduce the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset. I also understand that it is less expensive than repartition, as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (
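
There is no official formula for the argument; as a hedged sketch, one common heuristic is to aim for partitions of roughly 100-200 MB while keeping at least as many partitions as available cores. The dataset, size estimate, and target below are all assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 10_000_000)                    # illustrative dataset

target_partition_bytes = 128 * 1024 * 1024         # assumed ~128 MB per partition
estimated_bytes = 10_000_000 * 8                   # rough size guess for this toy data

num_partitions = max(
    spark.sparkContext.defaultParallelism,             # keep every core busy
    estimated_bytes // target_partition_bytes or 1     # ~128 MB per partition
)
# coalesce can only lower the partition count; it never raises it
coalesced = df.coalesce(num_partitions)
print(coalesced.rdd.getNumPartitions())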

how to interpret RDD.treeAggregate

Submitted by 大憨熊 on 2019-11-30 05:08:31
I ran into these lines in the Apache Spark source code:

val (gradientSum, lossSum, miniBatchSize) = data
  .sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    seqOp = (c, v) => {
      // c: (grad, loss, count), v: (label, features)
      val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
      (c._1, c._2 + l, c._3 + 1)
    },
    combOp = (c1, c2) => {
      // c: (grad, loss, count)
      (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    }
  )

I have trouble reading this. First, I can't find anything on the web that explains exactly how treeAggregate works, what
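
treeAggregate behaves like aggregate: seqOp folds each element into a partition-local accumulator and combOp merges accumulators, except that the merge phase happens in a multi-level tree of partial reductions rather than all at once on the driver. A small PySpark sketch with a made-up (sum, count) accumulator, loosely mirroring the (gradientSum, lossSum, count) triple above:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1, 101), 8)

def seq_op(acc, x):
    # fold one element into a partition-local (sum, count) accumulator
    return (acc[0] + x, acc[1] + 1)

def comb_op(a, b):
    # merge two accumulators; with treeAggregate this happens in a tree of partial merges
    return (a[0] + b[0], a[1] + b[1])

total, count = rdd.treeAggregate((0, 0), seq_op, comb_op, depth=2)
print(total / count)   # 50.5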

What happens if I cache the same RDD twice in Spark

Submitted by 我的梦境 on 2019-11-30 04:39:02
Question: I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD t1 = r... //Some calculations
    JavaRDD t2 = r... //Other calculations
    return t1.union(t2);
}

My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of
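
The original function is Java; as a hedged PySpark sketch of a defensive version of the same pattern, one can check whether a storage level has already been requested before calling cache() (the helper name is made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(["a", "b", "c"])

def cache_if_needed(r):
    # an uncached RDD reports a storage level that uses neither memory, disk, nor off-heap
    level = r.getStorageLevel()
    if not (level.useMemory or level.useDisk or level.useOffHeap):
        r.cache()
    return r

cache_if_needed(rdd)
cache_if_needed(rdd)   # second call skips cache() because a storage level is already set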