rdd

Convert RDD into DataFrame in PySpark

好久不见. Posted on 2019-12-10 23:36:17
Question: I am trying to convert my RDD into a DataFrame in PySpark. My RDD: [(['abc', '1,2'], 0), (['def', '4,6,7'], 1)] I want the RDD in the form of a DataFrame:
Index Name Number
0 abc [1,2]
1 def [4,6,7]
I tried:
rd2 = rd.map(lambda x,y: (y, x[0], x[1])).toDF(["Index", "Name", "Number"])
but I am getting an error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 62.0 failed 1 times, most
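A minimal sketch of one likely fix, assuming the RDD layout shown in the question: map passes each element as a single argument, so the (values, index) tuple has to be unpacked inside one lambda rather than as two parameters (names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df-sketch").getOrCreate()
rd = spark.sparkContext.parallelize([(['abc', '1,2'], 0), (['def', '4,6,7'], 1)])

# Unpack the (values, index) tuple with a single-argument lambda and split
# the number string into a list so it matches the desired output.
df = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(','))).toDF(["Index", "Name", "Number"])
df.show()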

How to filter RDDs based on a given partition?

99封情书. Posted on 2019-12-10 22:13:56
Question: Consider the following example:
JavaPairRDD<String, Row> R = input.textFile("test").mapToPair(new PairFunction<String, String, Row>() {
    public Tuple2<String, Row> call(String arg0) throws Exception {
        String[] parts = arg0.split(" ");
        Row r = RowFactory.create(parts[0], parts[1]);
        return new Tuple2<String, Row>(r.get(0).toString(), r);
    }}).partitionBy(new HashPartitioner(20));
The code above creates an RDD named R which is partitioned into 20 pieces by hashing on the first column of a txt file
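Expressed in PySpark for brevity, one way to restrict processing to a given partition after partitionBy is mapPartitionsWithIndex; a minimal sketch of the idea (the file name and the target partition number are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "partition-filter-sketch")

# Pair RDD hashed into 20 partitions on the first column, as in the question.
pairs = sc.textFile("test") \
          .map(lambda line: line.split(" ")) \
          .map(lambda parts: (parts[0], parts)) \
          .partitionBy(20)

# Keep only the elements that live in one chosen partition; idx is the
# partition number assigned by the hash partitioner above.
wanted = 5
subset = pairs.mapPartitionsWithIndex(lambda idx, it: it if idx == wanted else iter([]))
print(subset.count())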

Spark-Core: RDD Overview

拥有回忆. Posted on 2019-12-10 21:37:27
I. What is an RDD?
1. RDD (Resilient Distributed DataSet): a resilient distributed dataset.
2. It is the most basic data abstraction in Spark.
3. In code it is an abstract class; it represents a resilient, immutable, partitionable collection whose elements can be computed in parallel.
II. The 5 main properties of an RDD
1. A list of partitions
(1) An RDD has multiple partitions; a partition can be seen as the basic building block of the dataset.
(2) Each partition is processed by one compute task, so the partitions determine the granularity of parallelism.
(3) The user can specify the number of partitions when creating an RDD; if none is given, the default is the number of CPU cores allocated to the program (see the sketch below).
(4) The storage of each partition is handled by the BlockManager: each partition is logically mapped to a Block of the BlockManager, and that Block is computed by one Task.
2. A function for computing each split
(1) A function that computes each split (partition).
(2) RDD computation in Spark is done split by split; every RDD implements a compute function for this purpose.
3. A list of dependencies on other RDDs
(1) The dependencies on other RDDs.
(2) RDD
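As a small illustration of the first property above (the partition list and its default size), a PySpark sketch; the explicit count 8 is chosen arbitrarily:

from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-count-sketch")

# Without an explicit count, the number of partitions follows the cores
# allocated to the program (local[4] here).
default_rdd = sc.parallelize(range(100))
print(default_rdd.getNumPartitions())

# The partition count can also be set explicitly when the RDD is created.
explicit_rdd = sc.parallelize(range(100), 8)
print(explicit_rdd.getNumPartitions())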

Function input() in pyspark

我们两清. Posted on 2019-12-10 18:29:36
Question: My problem here is that when I enter the value of p, nothing happens; execution does not continue. Is there a way to fix it, please?
import sys
from pyspark import SparkContext
sc = SparkContext("local", "simple App")
p = input("Enter the word")
rdd1 = sc.textFile("monfichier")
rdd2 = rdd1.map(lambda l : l.split("\t"))
rdd3 = rdd2.map(lambda l: l[1])
print rdd3.take(6)
rdd5 = rdd3.filter(lambda l : p in l)
sc.stop()
Answer 1: You have to distinguish between two different cases: Script submitted with $SPARK
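The usual workaround when a script is launched through spark-submit (which gives it no interactive stdin) is to pass the word as a command-line argument instead of calling input(); a minimal sketch under that assumption, keeping the file name from the question:

import sys
from pyspark import SparkContext

sc = SparkContext("local", "simple App")

# Run as: spark-submit script.py <word> -- the word arrives via argv
# rather than through an interactive prompt.
p = sys.argv[1]

rdd1 = sc.textFile("monfichier")
rdd3 = rdd1.map(lambda l: l.split("\t")).map(lambda l: l[1])
print(rdd3.take(6))
print(rdd3.filter(lambda l: p in l).take(6))
sc.stop()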

Spark RDD find by key

浪子不回头ぞ. Posted on 2019-12-10 18:23:36
Question: I have an RDD transformed from HBase:
val hbaseRDD: RDD[(String, Array[String])]
where tuple._1 is the row key and the array holds the values from HBase:
4929101-ACTIVE, ["4929101","2015-05-20 10:02:44","dummy1","dummy2"]
4929102-ACTIVE, ["4929102","2015-05-20 10:02:44","dummy1","dummy2"]
4929103-ACTIVE, ["4929103","2015-05-20 10:02:44","dummy1","dummy2"]
I also have a SchemaRDD (id, date1, col1, col2, col3) transformed to val refDataRDD: RDD[(String, Array[String])] for which I will iterate over
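For a plain find-by-key on a pair RDD, lookup() returns every value stored under a key in one action; a small PySpark sketch with made-up rows shaped like the ones above:

from pyspark import SparkContext

sc = SparkContext("local", "lookup-sketch")

rows = [
    ("4929101-ACTIVE", ["4929101", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
    ("4929102-ACTIVE", ["4929102", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
]
hbase_rdd = sc.parallelize(rows)

# lookup() scans the RDD and returns the list of values for the key.
print(hbase_rdd.lookup("4929101-ACTIVE"))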

Spark: What is an RDD?

冷暖自知. Posted on 2019-12-10 17:49:36
The most basic data abstraction in Spark is the RDD. RDD: Resilient Distributed DataSet.
1. An RDD has three basic characteristics
The three characteristics are: partitioned, immutable, and supporting parallel operations.
a. Partitioned
The data contained in each RDD is stored on different nodes of the system. Logically we can think of an RDD as a large array in which every element represents a partition.
In physical storage, each partition points to a data block stored in memory or on disk; that block is the data computed by a task, and the blocks can be distributed across different nodes.
So an RDD is a data collection only in an abstract sense: a partition does not store the actual data, only its index within the RDD. The RDD's ID together with the partition index uniquely determines the block's identifier, and the data is then fetched and processed through the underlying storage layer's interface.
Within the cluster, the data blocks on each node are kept in memory whenever possible and spilled to disk only when memory runs out, which minimizes disk I/O overhead.
b. Immutable
Immutability means every RDD is read-only: the partition information it contains cannot change. Because existing RDDs are immutable, the only way to get a new RDD is to apply a transformation to an existing one, computing the result we want step by step.
This brings the following benefit: in an RDD, we
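A minimal PySpark sketch of the immutability point: a transformation never modifies its source RDD, it always yields a new one:

from pyspark import SparkContext

sc = SparkContext("local", "immutability-sketch")

base = sc.parallelize([1, 2, 3, 4])

# map() returns a new RDD; the original collection is left untouched.
doubled = base.map(lambda x: x * 2)

print(base.collect())     # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]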

Different floating point precision from RDD and DataFrame

最后都变了-. Posted on 2019-12-10 17:28:02
Question: I changed an RDD to a DataFrame and compared the results with another DataFrame which I imported using read.csv, but the floating point precision is not the same between the two approaches. I appreciate your help. The data I am using is from here.
from pyspark.sql import Row
from pyspark.sql.types import *
RDD way:
orders = sc.textFile("retail_db/orders")
order_items = sc.textFile('retail_db/order_items')
orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')
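Without the full post it is hard to say which side loses precision, but one common way to make the two paths comparable is to hold the numeric column as the same explicit type on both sides instead of relying on Python-side parsing or string formatting; a hedged sketch with made-up values:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("precision-sketch").getOrCreate()
sc = spark.sparkContext

# Path 1: parse in Python, then convert the RDD to a DataFrame.
rdd = sc.parallelize(["1,299.98", "2,129.99"]) \
        .map(lambda line: line.split(',')) \
        .map(lambda p: (int(p[0]), float(p[1])))
df_from_rdd = spark.createDataFrame(rdd, ["id", "subtotal"])

# Path 2: keep the values as strings and cast inside Spark SQL.
df_from_str = spark.createDataFrame([("1", "299.98"), ("2", "129.99")], ["id", "subtotal"]) \
                   .select(col("id").cast("int"), col("subtotal").cast("double"))

# With both columns held as doubles the two paths should agree exactly.
df_from_rdd.show()
df_from_str.show()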

Does cache() in spark change the state of the RDD or create a new one?

江枫思渺然. Posted on 2019-12-10 17:02:04
Question: This question is a follow-up to a previous question of mine, What happens if I cache the same RDD twice in Spark. When calling cache() on an RDD, is the state of the RDD changed (and the returned RDD is just this for ease of use), or is a new RDD created that wraps the existing one? What will happen in the following code:
// Init
JavaRDD<String> a = ... // some initialise and calculation functions.
JavaRDD<String> b = a.cache();
JavaRDD<String> c = b.cache();
// Case 1, will 'a' be calculated
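In PySpark, at least, cache() persists the RDD in place and returns the same object rather than a wrapper, which is easy to check directly; a small sketch of that check:

from pyspark import SparkContext

sc = SparkContext("local", "cache-sketch")

a = sc.parallelize(["x", "y", "z"])

# cache() sets the storage level on this RDD and returns the RDD itself.
b = a.cache()
c = b.cache()

print(a is b, b is c)        # True True: one RDD, three names
print(a.getStorageLevel())   # shows the memory storage level set by cache()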

finding min/max with pyspark in single pass over data

此生再无相见时. Posted on 2019-12-10 16:44:21
Question: I have an RDD with a huge list of numbers (the lengths of the lines from a file), and I want to know how to get the min/max in a single pass over the data. I know about the min and max functions, but that would require two passes.
Answer 1: Try this:
>>> from pyspark.statcounter import StatCounter
>>>
>>> rdd = sc.parallelize([9, -1, 0, 99, 0, -10])
>>> stats = rdd.aggregate(StatCounter(), StatCounter.merge, StatCounter.mergeStats)
>>> stats.minValue, stats.maxValue
(-10.0, 99.0)
Answer 2: Here's a working yet
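Along the same single-pass lines, a small sketch that carries a (min, max) pair through one reduce instead of using StatCounter:

from pyspark import SparkContext

sc = SparkContext("local", "minmax-sketch")

rdd = sc.parallelize([9, -1, 0, 99, 0, -10])

# Turn each element into a (min, max) candidate pair, then merge the
# pairs; the whole dataset is traversed exactly once.
lo, hi = rdd.map(lambda x: (x, x)).reduce(
    lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
print(lo, hi)   # -10 99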

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same key to the same machine?

让人想犯罪 __. Posted on 2019-12-10 16:36:27
Question: I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce the data shuffle, I wish I could add a pre-distribution phase so that partitions with the same key will be placed on the same machine; hopefully this could cut some shuffle time. I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself? I know that when I join two RDDs, one preprocessed with partitionBy, Spark is smart enough to
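A minimal PySpark sketch of the pre-distribution idea: give both RDDs the same partitioner and partition count before the join, so identical keys land in the same partition index on both sides (whether the join can then skip a shuffle entirely depends on the API and version, so treat this as the general pattern rather than a guarantee):

from pyspark import SparkContext

sc = SparkContext("local", "copartition-sketch")

num_parts = 8

# Both sides are hashed with the default partitioner into the same
# number of partitions, so equal keys share a partition index.
left = sc.parallelize([(i, "L%d" % i) for i in range(100)]).partitionBy(num_parts)
right = sc.parallelize([(i, "R%d" % i) for i in range(100)]).partitionBy(num_parts)

joined = left.join(right, num_parts)
print(joined.take(3))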