rdd

How to share Spark RDD between 2 Spark contexts?

守給你的承諾、 submitted on 2019-11-28 09:21:57
I have an RMI cluster. Each RMI server has a Spark context. Is there any way to share an RDD between different Spark contexts? As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (SparkContext in case of an RDD, SQLContext in case of a DataFrame/Dataset). If you want to share objects between applications you have to use shared contexts (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share. Sharing data is…
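
A minimal sketch of the usual "share the data, not the RDD" workaround: one application persists the data to shared storage and the other rebuilds an equivalent RDD inside its own SparkContext (the path and element type below are hypothetical):

    // Application A: write the RDD out to shared storage (hypothetical HDFS path).
    rdd.saveAsObjectFile("hdfs:///shared/my-rdd")

    // Application B: rebuild an equivalent RDD inside its own SparkContext.
    val shared = sc.objectFile[(String, Int)]("hdfs:///shared/my-rdd")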

How does Spark's RDD.randomSplit actually split the RDD

一笑奈何 submitted on 2019-11-28 09:05:11
So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions. When calling RDD.randomSplit(0.8, 0.2), does the function also shuffle the RDD? Or does the splitting simply sample 20% of the RDD contiguously? Or does it select 20% of the partitions randomly? Ideally, does the resulting split have the same class distribution as the original RDD (i.e. 2:1)? Thanks. For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation which preserves partitioning. Each…
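
A small sketch of the behaviour described in the answer; the seed is illustrative and countByValue is only used to inspect the class balance:

    // 3000 rows: 2000 of class 1 followed by 1000 of class 2, in 100 partitions.
    val rdd = sc.parallelize(Seq.fill(2000)(1) ++ Seq.fill(1000)(2), 100)

    // randomSplit samples each partition element by element (one pass per weight range);
    // it neither shuffles the data nor picks whole partitions.
    val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Because every element is sampled independently, both splits keep roughly
    // the original 2:1 class distribution.
    train.countByValue()
    test.countByValue()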

Apache Spark's RDD splitting according to the particular size

懵懂的女人 submitted on 2019-11-28 08:52:42
Question: I am trying to read strings from a text file, but I want to limit each line according to a particular size. For example, here is how the file looks: aaaaa\nbbb\nccccc When reading this file with sc.textFile, the RDD looks like this: scala> val rdd = sc.textFile("textFile") scala> rdd.collect res1: Array[String] = Array(aaaaa, bbb, ccccc) But I want to limit the length of each element of this RDD. For example, if the limit is 3, then I should get something like this: Array[String] = Array(aaa, aab, bbc,…
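
The excerpt is cut off, so as one hedged sketch (not necessarily the original answer): read each file whole, strip the newlines and regroup the characters into fixed-size chunks. Note that wholeTextFiles loads each file into a single record, so a very large file loses parallelism; the path and limit are the ones from the question:

    val limit = 3

    // "aaaaa\nbbb\nccccc" -> "aaaaabbbccccc" -> Array(aaa, aab, bbc, ccc, c)
    val chunked = sc.wholeTextFiles("textFile")
      .flatMap { case (_, content) =>
        content.replace("\n", "").grouped(limit).toSeq
      }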

The difference between coalesce and repartition

不问归期 submitted on 2019-11-28 07:17:40
Contents: 1. Understanding Spark partitions; 2. The difference between coalesce and repartition (the coalesce discussed below assumes the default shuffle = false); 3. Examples: 1. coalesce, 2. repartition; 4. Summary

1. Understanding Spark partitions
Spark schedules tasks at the vcore level. When reading from HDFS, there are as many partitions as there are blocks. For example, if Spark SQL reads table T and table T consists of 10,000 small files, there are 10,000 partitions, and reading is inefficient. Suppose the resources are set to --executor-memory 2g --executor-cores 2 --num-executors 5. The first 10 small files (that is, 10 partitions) are handed to the 5 executors to read (Spark schedules per vcore, so in practice the 5 executors run 10 tasks reading 10 partitions). If the 5 executors ran at exactly the same speed, files 11-20 would be handed to them next, and so on; in reality the speeds differ, so whichever task finishes first picks up the next partition, and so on. As a result, the time spent scheduling the reads often exceeds the time spent actually reading the files, and file handles are opened and closed constantly, wasting relatively precious I/O resources.
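
A short sketch of the distinction under the scenario above (the HDFS path and partition counts are illustrative):

    val rdd = sc.textFile("hdfs:///warehouse/table_T")  // one partition per block / small file
    rdd.getNumPartitions                                // e.g. 10000 for 10,000 small files

    // coalesce with the default shuffle = false only merges existing partitions,
    // so it avoids a shuffle but cannot increase the partition count.
    val merged = rdd.coalesce(200)

    // repartition(n) is coalesce(n, shuffle = true): it redistributes the data evenly
    // across n partitions at the cost of a full shuffle, and can also increase the count.
    val rebalanced = rdd.repartition(200)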

How do you perform basic joins of two RDD tables in Spark using Python?

时间秒杀一切 submitted on 2019-11-28 06:00:00
How would you perform basic joins in Spark using Python? In R you could use merge() to do this. What is the syntax using Python on Spark for: Inner Join, Left Outer Join, Cross Join, with two tables (RDDs), each with a single column that has a common key: RDD(1): (key, U), RDD(2): (key, V). I think an inner join is something like this: rdd1.join(rdd2).map(case (key, u, v) => (key, ls ++ rs)); Is that right? I have searched the internet and can't find a good example of joins. Thanks in advance. It can be done either using PairRDDFunctions or Spark DataFrames. Since data frame operations benefit from…
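
A minimal pair-RDD sketch with toy data (shown in Scala to match the rest of this page; PySpark's rdd.join, rdd.leftOuterJoin and rdd.cartesian have the same names and semantics):

    // Hypothetical pair RDDs keyed on a common key.
    val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val rdd2 = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    rdd1.join(rdd2)           // inner join:      (key, (u, v)) for keys present in both
    rdd1.leftOuterJoin(rdd2)  // left outer join: (key, (u, Option[v]))
    rdd1.cartesian(rdd2)      // cross join: every pairing of one record from each RDD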

Split Time Series pySpark data frame into test & train without using random split

扶醉桌前 submitted on 2019-11-28 05:34:20
Question: I have a Spark time-series DataFrame. I would like to split it 80-20 (train-test). As this is time-series data, I don't want to do a random split. How do I do this so that the first part of the data goes to train and the second to test? Answer 1: You can use pyspark.sql.functions.percent_rank() to get the percentile ranking of your DataFrame ordered by the timestamp/date column. Then pick all the rows with a rank <= 0.8 as your training set and the rest as your test set. For…
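
A sketch of that approach (written in Scala to match the rest of this page; the DataFrame df and its timestamp column "ts" are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.percent_rank

    // Note: a window with no partitionBy pulls all rows into one partition, which is
    // acceptable here only because a global time ordering is required.
    val ranked = df.withColumn("rank", percent_rank().over(Window.orderBy("ts")))

    val train = ranked.where("rank <= 0.8").drop("rank")
    val test  = ranked.where("rank > 0.8").drop("rank")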

How to read PDF files and XML files in Apache Spark (Scala)?

谁说胖子不能爱 submitted on 2019-11-28 05:20:44
Question: My sample code for reading a text file is:

    val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
    var rddwithPath = text.asInstanceOf[HadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit { (inputSplit, iterator) ⇒
      val file = inputSplit.asInstanceOf[FileSplit]
      iterator.map { tpl ⇒ (file.getPath.toString, tpl._2.toString) }
    }.reduceByKey((a, b) => a)

How can I read PDF and XML files in the same way? Answer 1: PDF & XML can be parsed using…
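
The answer is cut off above; as a hedged sketch of one common approach (not necessarily the original answer), PDFs can be read as binary files and parsed with Apache PDFBox, and XML with the spark-xml package. The paths, the rowTag value, and the assumption that a SparkSession named spark exists are all illustrative:

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper

    // PDF: read raw bytes per file and extract the text with PDFBox (jar must be on the executors' classpath).
    val pdfText = sc.binaryFiles("hdfs:///data/pdfs/*.pdf").map { case (path, stream) =>
      val doc = PDDocument.load(stream.open())
      try (path, new PDFTextStripper().getText(doc)) finally doc.close()
    }

    // XML: the spark-xml package produces one row per element matching rowTag.
    val xmlDf = spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("hdfs:///data/files.xml")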

RDD to LabeledPoint conversion

痞子三分冷 submitted on 2019-11-28 04:19:59
Question: I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me that my targeted dependent variable is at column number 77. But I don't have enough knowledge of how to select the desired (partial) columns as features (say I want columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like this: val data = rdd.map(col => new LabeledPoint( col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray)) Any suggestions or…
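
A sketch of one way to fill in the ?? placeholder, assuming each record is an indexed sequence of string values (the column ranges are the ones from the question):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Indices of the feature columns; column 77 is the label.
    val featureIdx = ((23 to 59) ++ (111 to 357) ++ (399 to 489)).toArray

    val data = rdd.map { row =>
      LabeledPoint(
        row(77).toDouble,
        Vectors.dense(featureIdx.map(i => row(i).toDouble))
      )
    }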

Spark broadcast error: exceeds spark.akka.frameSize; consider using broadcast

亡梦爱人 submitted on 2019-11-28 04:13:32
Question: I have a large dataset called "edges": org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52 When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error: edges.count ... Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for large values. The same happens with .saveAsTextFile("edges"). This is from the spark…
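
The excerpt is cut off; a hedged sketch of the two usual remedies: raising the frame size (spark.akka.frameSize, named in the error, exists only in older Spark versions and is given in MB), or broadcasting any large driver-side value captured by a closure. The map someLargeDriverSideMap is hypothetical:

    import org.apache.spark.SparkConf

    // Remedy 1: raise the frame size when building the SparkContext (illustrative value, in MB).
    val conf = new SparkConf().set("spark.akka.frameSize", "64")

    // Remedy 2: if a large local object is captured by a transformation's closure, broadcast it
    // so it ships once per executor rather than inside every serialized task.
    val bigLookup = sc.broadcast(someLargeDriverSideMap)  // hypothetical driver-side Map[String, Int]
    edges.filter(e => bigLookup.value.contains(e.attr._1))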

Big Data in Practice, Lesson 17 (Part 1) - Spark-Core05

血红的双手。 submitted on 2019-11-28 03:44:39
Contents: 1. Review of the previous lesson; 2. Map and MapPartition, 2.1 foreachPartition; 3. A look at the sc.textFile source code, 3.1 understanding the spark-shell startup flow

1. Review of the previous lesson: Big Data in Practice, Lesson 16 (Part 2) - Spark-Core04, https://blog.csdn.net/zhikanjiani/article/details/99731015

2. MapPartition
In a higher-order function, a map applies a function to every element: y = f(x).

1. The definition of map in RDD.scala: "Return a new RDD by applying a function to all elements of this RDD." // Returns a new RDD; the function is applied to every element of this RDD

    def map[U: ClassTag](f: T => U): RDD[U] = withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
    }

2. The definition of mapPartitions in RDD.scala: "Return a new RDD by applying a function to each partition…"
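
The excerpt is cut off above; a small usage sketch of the difference between the two operators, with toy data:

    val rdd = sc.parallelize(1 to 10, 2)

    // map: the function runs once per element.
    val viaMap = rdd.map(_ * 2)

    // mapPartitions: the function runs once per partition and receives the whole iterator,
    // so expensive setup (e.g. opening a connection) can be done once per partition
    // instead of once per element.
    val viaMapPartitions = rdd.mapPartitions { iter =>
      // per-partition setup would go here
      iter.map(_ * 2)
    }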