rdd

efficiently using union in spark

隐身守侯 posted on 2019-12-25 15:16:01
Question: I am new to Scala and Spark. I have two RDDs, A = [(1,2),(2,3)] and B = [(4,5),(5,6)], and I want an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10 GB. I use sc.union(A,B), but it is slow, and the Spark UI shows 28308 tasks in this stage. Is there a more efficient way to do this? Answer 1: Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy, you just need to import sqlContext
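The answer is cut off mid-sentence, but a minimal sketch of the DataFrame union it suggests, assuming Spark 2.x with a SparkSession (names and column labels are placeholders), could look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("UnionExample").getOrCreate()
import spark.implicits._  // enables toDF on RDDs of tuples

// Stand-ins for the A and B pair RDDs from the question
val a = spark.sparkContext.parallelize(Seq((1, 2), (2, 3)))
val b = spark.sparkContext.parallelize(Seq((4, 5), (5, 6)))

// union on DataFrames simply concatenates the inputs; it does not shuffle data
val unioned = a.toDF("k", "v").union(b.toDF("k", "v"))
unioned.show()
```

Note that union (on RDDs or DataFrames alike) just concatenates the partitions of its inputs, so a very large task count usually reflects the combined partition counts of A and B; coalescing the inputs or the result is one way to bring it down.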

Spark 1.1.1 Programming Guide

梦想与她 posted on 2019-12-25 14:21:08
Spark Programming Guide contents: Overview; Linking with Spark; Initializing Spark; Using the Shell; Resilient Distributed Datasets (RDDs): Parallelized Collections, External Datasets, RDD Operations (Basics, Passing Functions to Spark, Working with Key-Value Pairs, Transformations, Actions), RDD Persistence (Which Storage Level to Choose?, Removing Data); Shared Variables: Broadcast Variables, Accumulators; Deploying to a Cluster; Unit Testing; Migrating from pre-1.0 Versions of Spark; Where to Go from Here. Overview: At a high level, every Spark application consists of a driver program that runs the user's main function
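To illustrate the driver-program model the overview describes, here is a minimal sketch of a Spark 1.x driver; the application name, master URL, and data are placeholders, not part of the guide:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates a SparkContext and issues parallel operations on it
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A parallelized collection is the simplest way to create an RDD
    val data = sc.parallelize(1 to 1000)
    println(s"sum = ${data.reduce(_ + _)}")

    sc.stop()
  }
}
```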

Scala: How do I split a DataFrame into multiple CSV files based on the number of rows

风格不统一 posted on 2019-12-25 08:40:04
Question: I have a DataFrame, say df1, with 10M rows. I want to split it into multiple CSV files of 1M rows each. Any suggestions on how to do this in Scala? Answer 1: You can use the randomSplit method on DataFrames:

import scala.util.Random
val df = List(0,1,2,3,4,5,6,7,8,9).toDF
val splitted = df.randomSplit(Array(1,1,1,1,1))
splitted foreach { a => a.write.format("csv").save("path" + Random.nextInt) }

I used Random.nextInt to get a unique output name; you can add some other logic there if necessary.
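Note that randomSplit divides rows only approximately according to the weights, so the pieces will not contain exactly 1M rows each. A sketch of an alternative, assuming the 10M-row df1 from the question and a placeholder output path, is to repartition to the desired number of chunks and let each partition become one CSV part-file:

```scala
val rowsPerFile = 1000000L
val numFiles = math.ceil(df1.count().toDouble / rowsPerFile).toInt

// One partition per output file; Spark writes one part-file per partition
df1.repartition(numFiles)
  .write
  .format("csv")
  .save("/tmp/df1_split")
```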

Apache Spark's RDD[Vector] Immutability issue

别来无恙 posted on 2019-12-25 06:43:56
Question: I know that RDDs are immutable and therefore their values cannot be changed, but I see the following behaviour: I wrote an implementation of the FuzzyCMeans algorithm (https://github.com/salexln/FinalProject_FCM) and now I'm testing it, so I run the following example: import org.apache.spark.mllib.clustering.FuzzyCMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt") val parsedData = data.map(s => Vectors.dense(s.split(' '

Spark: Work around nested RDD

被刻印的时光 ゝ posted on 2019-12-25 06:31:04
Question: There are two tables. The first table has records with two fields, book1 and book2; these are the IDs of books that are usually read together, in pairs. The second table has columns books and readers, where books and readers are book and reader IDs, respectively. For every reader in the second table I need to find the corresponding books in the pairs table. For example, if a reader read books 1, 2, 3 and we have the pairs (1,7), (6,2), (4,10), the resulting list for this reader should contain books 7 and 6. I
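Nested RDDs are not supported in Spark, so the usual workaround is to express this as joins. A minimal sketch under assumed shapes, with pairs as an RDD[(Int, Int)] of (book1, book2) and readers as an RDD[(Int, Int)] keyed by book ID, using the example data from the question:

```scala
// Hypothetical data matching the example: reader 100 read books 1, 2 and 3
val pairs   = sc.parallelize(Seq((1, 7), (6, 2), (4, 10)))       // (book1, book2)
val readers = sc.parallelize(Seq((1, 100), (2, 100), (3, 100)))  // (book, reader)

// A book can appear on either side of a pair, so join on both sides
// and keep the other book of the pair for each match
val leftMatches  = readers.join(pairs).map { case (_, (reader, other)) => (reader, other) }
val rightMatches = readers.join(pairs.map(_.swap)).map { case (_, (reader, other)) => (reader, other) }

// Group the companion books per reader: here reader 100 -> 7 and 6
val booksPerReader = (leftMatches ++ rightMatches).groupByKey()
booksPerReader.collect().foreach(println)
```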

perform RDD operations on DataFrames

断了今生、忘了曾经 posted on 2019-12-25 04:23:09
Question: I have a dataset with 10 fields. I need to perform RDD operations on this DataFrame. Is it possible to perform RDD operations like map, flatMap, etc.? Here is my sample code: df.select("COUNTY","VEHICLES").show(); This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations to the new RDD. Here is how I converted the DataFrame to an RDD: RDD<Row> java = df.select("COUNTY","VEHICLES").rdd(); After converting to an RDD, I am not able to see the RDD results, i
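A minimal Scala sketch of the same conversion; df and the COUNTY/VEHICLES column names are taken from the question, everything else is illustrative:

```scala
import org.apache.spark.sql.Row

// .rdd on a DataFrame yields an RDD[Row]
val rowRdd = df.select("COUNTY", "VEHICLES").rdd

// map / flatMap work as on any RDD; fields are read positionally or by name
val pairs = rowRdd.map(row => (row.getAs[Any]("COUNTY"), row.getAs[Any]("VEHICLES")))

// Transformations are lazy, so an action (take, collect, foreach, ...) is
// needed before any results become visible
pairs.take(10).foreach(println)
```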

Parsing Data in Apache Spark Scala org.apache.spark.SparkException: Task not serializable error when trying to use textinputformat.record.delimiter

只谈情不闲聊 posted on 2019-12-25 03:28:08
Question: Input file (each record starts with a ___DATE___ marker):
___DATE___ 2018-11-16T06:3937 Linux hortonworks 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 06:39:37 up 100 days, 1:04, 2 users, load average: 9.01, 8.30, 8.48 06:30:01 AM all 6.08 0.00 2.83 0.04 0.00 91.06
___DATE___ 2018-11-16T06:4037 Linux cloudera 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 06:40:37 up 100 days, 1:05, 28 users, load average: 8.39, 8.26, 8.45 06:40:01 AM all 6.92
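The excerpt stops before the code, but a common sketch for reading records with a custom delimiter looks like the following; the input path is a placeholder and the delimiter value is assumed from the sample data. Converting the Hadoop Text values to String right away matters, because Hadoop Writables are not Java-serializable, and referencing non-serializable objects (a Configuration, the enclosing class, a Writable) inside a closure is a frequent cause of the Task not serializable error:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Set the custom record delimiter on a copy of the Hadoop configuration
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "___DATE___")

// One record per ___DATE___ block; map Text to String immediately so that
// later stages only ship plain Strings
val records = sc
  .newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty)

records.take(2).foreach(println)
```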

RDD Collect Issue

▼魔方 西西 posted on 2019-12-25 02:19:13
Question: I configured a new system with Spark 2.3.0 and Python 3.6.0; DataFrame reads and other operations work as expected, but RDD collect fails:

distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
distFile.collect()

Error: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. Traceback: Traceback (most recent call last): File "/Users/aakash

Homemade DataFrame aggregation/dropDuplicates Spark

时光总嘲笑我的痴心妄想 posted on 2019-12-25 01:46:19
Question: I want to perform a transformation on my DataFrame df so that each key appears once and only once in the final DataFrame. For machine learning purposes, I don't want to have a bias in my dataset. This should never occur, but the data I get from my data source contains this "weirdness". So if I have lines with the same key, I want to be able to choose either a combination of the two (like the mean value), a string concatenation (for labels, for example), or a random set of values. Say my DataFrame df
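The excerpt is cut off before the example DataFrame, but a sketch of the general approach, assuming a key column "key", a numeric column "value", and a string column "label" (all hypothetical names), is to group by the key and pick one aggregate per column:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // assuming a SparkSession named spark

// Hypothetical input containing duplicated keys
val df = Seq(
  ("a", 1.0, "cat"),
  ("a", 3.0, "dog"),
  ("b", 5.0, "bird")
).toDF("key", "value", "label")

// One row per key: mean for the numeric column, concatenated labels for the
// string column; first(...) would instead keep an arbitrary row's value
val deduped = df.groupBy("key").agg(
  avg("value").as("value"),
  concat_ws(",", collect_list("label")).as("label")
)

deduped.show()
```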

Convert an RDD to a DataFrame in Spark using Scala

Deadly posted on 2019-12-24 16:19:13
Question: I have textRDD: org.apache.spark.rdd.RDD[(String, String)]. I would like to convert it to a DataFrame; the columns correspond to the title and content of each page (row). Answer 1: Use toDF(), providing the column names if you have them:

val textDF = textRDD.toDF("title": String, "content": String)
textDF: org.apache.spark.sql.DataFrame = [title: string, content: string]

or

val textDF = textRDD.toDF()
textDF: org.apache.spark.sql.DataFrame = [_1: string, _2: string]

The shell auto-imports (I am using
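The answer breaks off where it mentions the shell's auto-imports. Outside the shell, the implicits that provide toDF must be imported explicitly; a self-contained sketch assuming Spark 2.x with a SparkSession (the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddToDf").getOrCreate()
import spark.implicits._  // provides toDF on RDDs of tuples

val textRDD = spark.sparkContext.parallelize(Seq(
  ("Page 1", "some content"),
  ("Page 2", "more content")
))

val textDF = textRDD.toDF("title", "content")
textDF.printSchema()  // columns: title, content
```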