rdd

Convert RDD into DataFrame in PySpark

好久不见. Posted on 2019-12-10 23:36:17
Question: I am trying to convert my RDD into a DataFrame in PySpark. My RDD: [(['abc', '1,2'], 0), (['def', '4,6,7'], 1)] I want the RDD in the form of a DataFrame:
Index Name Number
0 abc [1,2]
1 def [4,6,7]
I tried:
rd2 = rd.map(lambda x,y: (y, x[0], x[1])).toDF(["Index", "Name", "Number"])
but I am getting an error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 62.0 failed 1 times, most
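A minimal sketch of one likely fix, assuming the RDD layout shown in the question: map passes each element as a single argument, so the (values, index) tuple has to be unpacked inside one lambda rather than as two parameters (names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df-sketch").getOrCreate()
rd = spark.sparkContext.parallelize([(['abc', '1,2'], 0), (['def', '4,6,7'], 1)])

# Unpack the (values, index) tuple with a single-argument lambda and split
# the number string into a list so it matches the desired output.
df = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(','))).toDF(["Index", "Name", "Number"])
df.show()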

How to filter RDDs based on a given partition?

99封情书. Posted on 2019-12-10 22:13:56
Question: Consider the following example:
JavaPairRDD<String, Row> R = input.textFile("test").mapToPair(new PairFunction<String, String, Row>() {
    public Tuple2<String, Row> call(String arg0) throws Exception {
        String[] parts = arg0.split(" ");
        Row r = RowFactory.create(parts[0], parts[1]);
        return new Tuple2<String, Row>(r.get(0).toString(), r);
    }}).partitionBy(new HashPartitioner(20));
The code above creates an RDD named R which is partitioned into 20 pieces by hashing on the first column of a txt file
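Expressed in PySpark for brevity, one way to restrict processing to a given partition after partitionBy is mapPartitionsWithIndex; a minimal sketch of the idea (the file name and the target partition number are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "partition-filter-sketch")

# Pair RDD hashed into 20 partitions on the first column, as in the question.
pairs = sc.textFile("test") \
          .map(lambda line: line.split(" ")) \
          .map(lambda parts: (parts[0], parts)) \
          .partitionBy(20)

# Keep only the elements that live in one chosen partition; idx is the
# partition number assigned by the hash partitioner above.
wanted = 5
subset = pairs.mapPartitionsWithIndex(lambda idx, it: it if idx == wanted else iter([]))
print(subset.count())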

Spark-Core: RDD Overview

拥有回忆. Posted on 2019-12-10 21:37:27
I. What is an RDD?
1. RDD (Resilient Distributed DataSet): a resilient distributed dataset.
2. It is the most basic data abstraction in Spark.
3. In code it is an abstract class; it represents a resilient, immutable, partitionable collection whose elements can be computed in parallel.
II. The 5 main properties of an RDD
1. A list of partitions
(1) An RDD has multiple partitions; a partition can be seen as the basic building block of the dataset.
(2) Each partition is processed by one compute task, so the partitions determine the granularity of parallelism.
(3) The user can specify the number of partitions when creating an RDD; if none is given, the default is the number of CPU cores allocated to the program (see the sketch below).
(4) The storage of each partition is handled by the BlockManager: each partition is logically mapped to a Block of the BlockManager, and that Block is computed by one Task.
2. A function for computing each split
(1) A function that computes each split (partition).
(2) RDD computation in Spark is done split by split; every RDD implements a compute function for this purpose.
3. A list of dependencies on other RDDs
(1) The dependencies on other RDDs.
(2) RDD
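As a small illustration of the first property above (the partition list and its default size), a PySpark sketch; the explicit count 8 is chosen arbitrarily:

from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-count-sketch")

# Without an explicit count, the number of partitions follows the cores
# allocated to the program (local[4] here).
default_rdd = sc.parallelize(range(100))
print(default_rdd.getNumPartitions())

# The partition count can also be set explicitly when the RDD is created.
explicit_rdd = sc.parallelize(range(100), 8)
print(explicit_rdd.getNumPartitions())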

Function input() in pyspark

我们两清. Posted on 2019-12-10 18:29:36
Question: My problem here is that when I enter the value of p, nothing happens; execution does not continue. Is there a way to fix it, please?
import sys
from pyspark import SparkContext
sc = SparkContext("local", "simple App")
p = input("Enter the word")
rdd1 = sc.textFile("monfichier")
rdd2 = rdd1.map(lambda l : l.split("\t"))
rdd3 = rdd2.map(lambda l: l[1])
print rdd3.take(6)
rdd5 = rdd3.filter(lambda l : p in l)
sc.stop()
Answer 1: You have to distinguish between two different cases: Script submitted with $SPARK
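The usual workaround when a script is launched through spark-submit (which gives it no interactive stdin) is to pass the word as a command-line argument instead of calling input(); a minimal sketch under that assumption, keeping the file name from the question:

import sys
from pyspark import SparkContext

sc = SparkContext("local", "simple App")

# Run as: spark-submit script.py <word> -- the word arrives via argv
# rather than through an interactive prompt.
p = sys.argv[1]

rdd1 = sc.textFile("monfichier")
rdd3 = rdd1.map(lambda l: l.split("\t")).map(lambda l: l[1])
print(rdd3.take(6))
print(rdd3.filter(lambda l: p in l).take(6))
sc.stop()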

Spark RDD find by key

浪子不回头ぞ. Posted on 2019-12-10 18:23:36
Question: I have an RDD transformed from HBase:
val hbaseRDD: RDD[(String, Array[String])]
where tuple._1 is the row key and the array holds the values from HBase:
4929101-ACTIVE, ["4929101","2015-05-20 10:02:44","dummy1","dummy2"]
4929102-ACTIVE, ["4929102","2015-05-20 10:02:44","dummy1","dummy2"]
4929103-ACTIVE, ["4929103","2015-05-20 10:02:44","dummy1","dummy2"]
I also have a SchemaRDD (id, date1, col1, col2, col3) transformed to val refDataRDD: RDD[(String, Array[String])] for which I will iterate over
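For a plain find-by-key on a pair RDD, lookup() returns every value stored under a key in one action; a small PySpark sketch with made-up rows shaped like the ones above:

from pyspark import SparkContext

sc = SparkContext("local", "lookup-sketch")

rows = [
    ("4929101-ACTIVE", ["4929101", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
    ("4929102-ACTIVE", ["4929102", "2015-05-20 10:02:44", "dummy1", "dummy2"]),
]
hbase_rdd = sc.parallelize(rows)

# lookup() scans the RDD and returns the list of values for the key.
print(hbase_rdd.lookup("4929101-ACTIVE"))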

Spark: What is an RDD?

冷暖自知. Posted on 2019-12-10 17:49:36
The most basic data abstraction in Spark is the RDD. RDD: Resilient Distributed DataSet.
1. An RDD has three basic characteristics
The three characteristics are: partitioned, immutable, and supporting parallel operations.
a. Partitioned
The data contained in each RDD is stored on different nodes of the system. Logically we can think of an RDD as a large array in which every element represents a partition.
In physical storage, each partition points to a data block stored in memory or on disk; that block is the data computed by a task, and the blocks can be distributed across different nodes.
So an RDD is a data collection only in an abstract sense: a partition does not store the actual data, only its index within the RDD. The RDD's ID together with the partition index uniquely determines the block's identifier, and the data is then fetched and processed through the underlying storage layer's interface.
Within the cluster, the data blocks on each node are kept in memory whenever possible and spilled to disk only when memory runs out, which minimizes disk I/O overhead.
b. Immutable
Immutability means every RDD is read-only: the partition information it contains cannot change. Because existing RDDs are immutable, the only way to get a new RDD is to apply a transformation to an existing one, computing the result we want step by step.
This brings the following benefit: in an RDD, we
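A minimal PySpark sketch of the immutability point: a transformation never modifies its source RDD, it always yields a new one:

from pyspark import SparkContext

sc = SparkContext("local", "immutability-sketch")

base = sc.parallelize([1, 2, 3, 4])

# map() returns a new RDD; the original collection is left untouched.
doubled = base.map(lambda x: x * 2)

print(base.collect())     # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]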

Different floating point precision from RDD and DataFrame

最后都变了-. Posted on 2019-12-10 17:28:02
Question: I changed an RDD to a DataFrame and compared the results with another DataFrame which I imported using read.csv, but the floating point precision is not the same between the two approaches. I appreciate your help. The data I am using is from here.
from pyspark.sql import Row
from pyspark.sql.types import *
RDD way:
orders = sc.textFile("retail_db/orders")
order_items = sc.textFile('retail_db/order_items')
orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')
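Without the full post it is hard to say which side loses precision, but one common way to make the two paths comparable is to hold the numeric column as the same explicit type on both sides instead of relying on Python-side parsing or string formatting; a hedged sketch with made-up values:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("precision-sketch").getOrCreate()
sc = spark.sparkContext

# Path 1: parse in Python, then convert the RDD to a DataFrame.
rdd = sc.parallelize(["1,299.98", "2,129.99"]) \
        .map(lambda line: line.split(',')) \
        .map(lambda p: (int(p[0]), float(p[1])))
df_from_rdd = spark.createDataFrame(rdd, ["id", "subtotal"])

# Path 2: keep the values as strings and cast inside Spark SQL.
df_from_str = spark.createDataFrame([("1", "299.98"), ("2", "129.99")], ["id", "subtotal"]) \
                   .select(col("id").cast("int"), col("subtotal").cast("double"))

# With both columns held as doubles the two paths should agree exactly.
df_from_rdd.show()
df_from_str.show()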

Does cache() in spark change the state of the RDD or create a new one?

江枫思渺然. Posted on 2019-12-10 17:02:04
Question: This question is a follow-up to a previous question of mine, What happens if I cache the same RDD twice in Spark. When calling cache() on an RDD, is the state of the RDD changed (and the returned RDD is just this for ease of use), or is a new RDD created that wraps the existing one? What will happen in the following code:
// Init
JavaRDD<String> a = ... // some initialise and calculation functions.
JavaRDD<String> b = a.cache();
JavaRDD<String> c = b.cache();
// Case 1, will 'a' be calculated
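In PySpark, at least, cache() persists the RDD in place and returns the same object rather than a wrapper, which is easy to check directly; a small sketch of that check:

from pyspark import SparkContext

sc = SparkContext("local", "cache-sketch")

a = sc.parallelize(["x", "y", "z"])

# cache() sets the storage level on this RDD and returns the RDD itself.
b = a.cache()
c = b.cache()

print(a is b, b is c)        # True True: one RDD, three names
print(a.getStorageLevel())   # shows the memory storage level set by cache()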

finding min/max with pyspark in single pass over data

此生再无相见时. Posted on 2019-12-10 16:44:21
Question: I have an RDD with a huge list of numbers (the lengths of the lines from a file), and I want to know how to get the min/max in a single pass over the data. I know about the min and max functions, but that would require two passes.
Answer 1: Try this:
>>> from pyspark.statcounter import StatCounter
>>>
>>> rdd = sc.parallelize([9, -1, 0, 99, 0, -10])
>>> stats = rdd.aggregate(StatCounter(), StatCounter.merge, StatCounter.mergeStats)
>>> stats.minValue, stats.maxValue
(-10.0, 99.0)
Answer 2: Here's a working yet
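Along the same single-pass lines, a small sketch that carries a (min, max) pair through one reduce instead of using StatCounter:

from pyspark import SparkContext

sc = SparkContext("local", "minmax-sketch")

rdd = sc.parallelize([9, -1, 0, 99, 0, -10])

# Turn each element into a (min, max) candidate pair, then merge the
# pairs; the whole dataset is traversed exactly once.
lo, hi = rdd.map(lambda x: (x, x)).reduce(
    lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
print(lo, hi)   # -10 99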

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same key to the same machine?

让人想犯罪 __. Posted on 2019-12-10 16:36:27
Question: I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce the data shuffle, I wish I could add a pre-distribution phase so that partitions with the same key will be placed on the same machine; hopefully this could cut some shuffle time. I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself? I know that when I join two RDDs, one preprocessed with partitionBy, Spark is smart enough to
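A minimal PySpark sketch of the pre-distribution idea: give both RDDs the same partitioner and partition count before the join, so identical keys land in the same partition index on both sides (whether the join can then skip a shuffle entirely depends on the API and version, so treat this as the general pattern rather than a guarantee):

from pyspark import SparkContext

sc = SparkContext("local", "copartition-sketch")

num_parts = 8

# Both sides are hashed with the default partitioner into the same
# number of partitions, so equal keys share a partition index.
left = sc.parallelize([(i, "L%d" % i) for i in range(100)]).partitionBy(num_parts)
right = sc.parallelize([(i, "R%d" % i) for i in range(100)]).partitionBy(num_parts)

joined = left.join(right, num_parts)
print(joined.take(3))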