rdd

efficiently using union in spark

隐身守侯 posted on 2019-12-25 15:16:01
Question: I am new to Scala and Spark. I have two RDDs, A = [(1,2),(2,3)] and B = [(4,5),(5,6)], and I want an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10 GB. I use sc.union(A,B), but it is slow, and the Spark UI shows 28308 tasks in this stage. Is there a more efficient way to do this? Answer 1: Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy, you just need to import sqlContext
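The answer is cut off mid-sentence, but a minimal sketch of the DataFrame union it suggests, assuming Spark 2.x with a SparkSession (names and column labels are placeholders), could look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("UnionExample").getOrCreate()
import spark.implicits._  // enables toDF on RDDs of tuples

// Stand-ins for the A and B pair RDDs from the question
val a = spark.sparkContext.parallelize(Seq((1, 2), (2, 3)))
val b = spark.sparkContext.parallelize(Seq((4, 5), (5, 6)))

// union on DataFrames simply concatenates the inputs; it does not shuffle data
val unioned = a.toDF("k", "v").union(b.toDF("k", "v"))
unioned.show()
```

Note that union (on RDDs or DataFrames alike) just concatenates the partitions of its inputs, so a very large task count usually reflects the combined partition counts of A and B; coalescing the inputs or the result is one way to bring it down.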

Spark 1.1.1 Programming Guide

梦想与她 posted on 2019-12-25 14:21:08
Spark Programming Guide contents: Overview; Linking with Spark; Initializing Spark; Using the Shell; Resilient Distributed Datasets (RDDs): Parallelized Collections, External Datasets, RDD Operations (Basics, Passing Functions to Spark, Working with Key-Value Pairs, Transformations, Actions), RDD Persistence (Which Storage Level to Choose?, Removing Data); Shared Variables: Broadcast Variables, Accumulators; Deploying to a Cluster; Unit Testing; Migrating from pre-1.0 Versions of Spark; Where to Go from Here. Overview: At a high level, every Spark application consists of a driver program that runs the user's main function
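To illustrate the driver-program model the overview describes, here is a minimal sketch of a Spark 1.x driver; the application name, master URL, and data are placeholders, not part of the guide:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates a SparkContext and issues parallel operations on it
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A parallelized collection is the simplest way to create an RDD
    val data = sc.parallelize(1 to 1000)
    println(s"sum = ${data.reduce(_ + _)}")

    sc.stop()
  }
}
```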

Scala: How do I split a DataFrame into multiple CSV files based on the number of rows

风格不统一 posted on 2019-12-25 08:40:04
Question: I have a DataFrame, say df1, with 10M rows. I want to split it into multiple CSV files of 1M rows each. Any suggestions on how to do this in Scala? Answer 1: You can use the randomSplit method on DataFrames:

import scala.util.Random
val df = List(0,1,2,3,4,5,6,7,8,9).toDF
val splitted = df.randomSplit(Array(1,1,1,1,1))
splitted foreach { a => a.write.format("csv").save("path" + Random.nextInt) }

I used Random.nextInt to get a unique output name; you can add some other logic there if necessary.
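Note that randomSplit divides rows only approximately according to the weights, so the pieces will not contain exactly 1M rows each. A sketch of an alternative, assuming the 10M-row df1 from the question and a placeholder output path, is to repartition to the desired number of chunks and let each partition become one CSV part-file:

```scala
val rowsPerFile = 1000000L
val numFiles = math.ceil(df1.count().toDouble / rowsPerFile).toInt

// One partition per output file; Spark writes one part-file per partition
df1.repartition(numFiles)
  .write
  .format("csv")
  .save("/tmp/df1_split")
```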

Apache Spark's RDD[Vector] Immutability issue

别来无恙 posted on 2019-12-25 06:43:56
Question: I know that RDDs are immutable and therefore their values cannot be changed, but I see the following behaviour: I wrote an implementation of the FuzzyCMeans algorithm (https://github.com/salexln/FinalProject_FCM) and now I'm testing it, so I run the following example: import org.apache.spark.mllib.clustering.FuzzyCMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt") val parsedData = data.map(s => Vectors.dense(s.split(' '

Spark: Work around nested RDD

被刻印的时光 ゝ posted on 2019-12-25 06:31:04
Question: There are two tables. The first table has records with two fields, book1 and book2; these are the IDs of books that are usually read together, in pairs. The second table has columns books and readers, where books and readers are book and reader IDs, respectively. For every reader in the second table I need to find the corresponding books in the pairs table. For example, if a reader read books 1, 2, 3 and we have the pairs (1,7), (6,2), (4,10), the resulting list for this reader should contain books 7 and 6. I
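Nested RDDs are not supported in Spark, so the usual workaround is to express this as joins. A minimal sketch under assumed shapes, with pairs as an RDD[(Int, Int)] of (book1, book2) and readers as an RDD[(Int, Int)] keyed by book ID, using the example data from the question:

```scala
// Hypothetical data matching the example: reader 100 read books 1, 2 and 3
val pairs   = sc.parallelize(Seq((1, 7), (6, 2), (4, 10)))       // (book1, book2)
val readers = sc.parallelize(Seq((1, 100), (2, 100), (3, 100)))  // (book, reader)

// A book can appear on either side of a pair, so join on both sides
// and keep the other book of the pair for each match
val leftMatches  = readers.join(pairs).map { case (_, (reader, other)) => (reader, other) }
val rightMatches = readers.join(pairs.map(_.swap)).map { case (_, (reader, other)) => (reader, other) }

// Group the companion books per reader: here reader 100 -> 7 and 6
val booksPerReader = (leftMatches ++ rightMatches).groupByKey()
booksPerReader.collect().foreach(println)
```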

perform RDD operations on DataFrames

断了今生、忘了曾经 posted on 2019-12-25 04:23:09
Question: I have a dataset with 10 fields. I need to perform RDD operations on this DataFrame. Is it possible to perform RDD operations like map, flatMap, etc.? Here is my sample code: df.select("COUNTY","VEHICLES").show(); This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations to the new RDD. Here is how I converted the DataFrame to an RDD: RDD<Row> java = df.select("COUNTY","VEHICLES").rdd(); After converting to an RDD, I am not able to see the RDD results, i
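A minimal Scala sketch of the same conversion; df and the COUNTY/VEHICLES column names are taken from the question, everything else is illustrative:

```scala
import org.apache.spark.sql.Row

// .rdd on a DataFrame yields an RDD[Row]
val rowRdd = df.select("COUNTY", "VEHICLES").rdd

// map / flatMap work as on any RDD; fields are read positionally or by name
val pairs = rowRdd.map(row => (row.getAs[Any]("COUNTY"), row.getAs[Any]("VEHICLES")))

// Transformations are lazy, so an action (take, collect, foreach, ...) is
// needed before any results become visible
pairs.take(10).foreach(println)
```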

Parsing Data in Apache Spark Scala org.apache.spark.SparkException: Task not serializable error when trying to use textinputformat.record.delimiter

只谈情不闲聊 posted on 2019-12-25 03:28:08
Question: Input file (each record starts with a ___DATE___ marker):
___DATE___ 2018-11-16T06:3937 Linux hortonworks 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 06:39:37 up 100 days, 1:04, 2 users, load average: 9.01, 8.30, 8.48 06:30:01 AM all 6.08 0.00 2.83 0.04 0.00 91.06
___DATE___ 2018-11-16T06:4037 Linux cloudera 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 06:40:37 up 100 days, 1:05, 28 users, load average: 8.39, 8.26, 8.45 06:40:01 AM all 6.92
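The excerpt stops before the code, but a common sketch for reading records with a custom delimiter looks like the following; the input path is a placeholder and the delimiter value is assumed from the sample data. Converting the Hadoop Text values to String right away matters, because Hadoop Writables are not Java-serializable, and referencing non-serializable objects (a Configuration, the enclosing class, a Writable) inside a closure is a frequent cause of the Task not serializable error:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Set the custom record delimiter on a copy of the Hadoop configuration
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "___DATE___")

// One record per ___DATE___ block; map Text to String immediately so that
// later stages only ship plain Strings
val records = sc
  .newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty)

records.take(2).foreach(println)
```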

RDD Collect Issue

▼魔方 西西 posted on 2019-12-25 02:19:13
Question: I configured a new system with Spark 2.3.0 and Python 3.6.0; DataFrame reads and other operations work as expected, but RDD collect fails:

distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
distFile.collect()

Error: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. Traceback: Traceback (most recent call last): File "/Users/aakash

Homemade DataFrame aggregation/dropDuplicates Spark

时光总嘲笑我的痴心妄想 posted on 2019-12-25 01:46:19
Question: I want to perform a transformation on my DataFrame df so that each key appears once and only once in the final DataFrame. For machine learning purposes, I don't want to have a bias in my dataset. This should never occur, but the data I get from my data source contains this "weirdness". So if I have lines with the same key, I want to be able to choose either a combination of the two (like the mean value), a string concatenation (for labels, for example), or a random set of values. Say my DataFrame df
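The excerpt is cut off before the example DataFrame, but a sketch of the general approach, assuming a key column "key", a numeric column "value", and a string column "label" (all hypothetical names), is to group by the key and pick one aggregate per column:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // assuming a SparkSession named spark

// Hypothetical input containing duplicated keys
val df = Seq(
  ("a", 1.0, "cat"),
  ("a", 3.0, "dog"),
  ("b", 5.0, "bird")
).toDF("key", "value", "label")

// One row per key: mean for the numeric column, concatenated labels for the
// string column; first(...) would instead keep an arbitrary row's value
val deduped = df.groupBy("key").agg(
  avg("value").as("value"),
  concat_ws(",", collect_list("label")).as("label")
)

deduped.show()
```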

Convert an RDD to a DataFrame in Spark using Scala

Deadly posted on 2019-12-24 16:19:13
Question: I have textRDD: org.apache.spark.rdd.RDD[(String, String)]. I would like to convert it to a DataFrame; the columns correspond to the title and content of each page (row). Answer 1: Use toDF(), providing the column names if you have them:

val textDF = textRDD.toDF("title": String, "content": String)
textDF: org.apache.spark.sql.DataFrame = [title: string, content: string]

or

val textDF = textRDD.toDF()
textDF: org.apache.spark.sql.DataFrame = [_1: string, _2: string]

The shell auto-imports (I am using
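The answer breaks off where it mentions the shell's auto-imports. Outside the shell, the implicits that provide toDF must be imported explicitly; a self-contained sketch assuming Spark 2.x with a SparkSession (the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddToDf").getOrCreate()
import spark.implicits._  // provides toDF on RDDs of tuples

val textRDD = spark.sparkContext.parallelize(Seq(
  ("Page 1", "some content"),
  ("Page 2", "more content")
))

val textDF = textRDD.toDF("title", "content")
textDF.printSchema()  // columns: title, content
```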