spark-dataframe

Spark filter DataFrames based on common values

Submitted by 断了今生、忘了曾经 on 2019-12-12 06:55:56
Question: I have DF1 and DF2. The first has a column "new_id", the second a column "db_id". I need to filter out all the rows from the first DataFrame where the value of new_id is not in db_id.

    val new_id = Seq(1, 2, 3, 4)
    val db_id = Seq(1, 4, 5, 6, 10)

Then I need the rows with new_id == 1 and 4 to stay in df1, and the rows with new_id == 2 and 3 to be deleted, since 2 and 3 are not in db_id. There are a ton of questions on DataFrames here; I might have missed this one. Sorry if this is a duplicate. p.s. I …
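
A minimal sketch of one common approach (not from the original post; df1, df2 and the column names come from the question): a left semi join keeps exactly those rows of df1 whose new_id also appears in df2's db_id.

    // Keep only the rows of df1 whose "new_id" has a match in df2's "db_id".
    val kept = df1.join(df2, df1("new_id") === df2("db_id"), "left_semi")
    kept.show()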

Pyspark - create new column from operations of DataFrame columns gives error “Column is not iterable”

Submitted by 荒凉一梦 on 2019-12-12 06:24:42
Question: I have a PySpark DataFrame, and I have tried many examples showing how to create a new column based on operations with existing columns, but none of them seem to work. So I have one question: why doesn't this code work?

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext
    import pyspark.sql.functions as F

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    a = sqlContext.createDataFrame([(5, 5, 3)], ['A', 'B', 'C'])
    a.withColumn('my_sum', F.sum(a[col] for col …
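
The likely cause is that functions.sum is an aggregate over rows, not a row-wise sum of columns; a row-wise total is built by adding the Column expressions themselves. A minimal sketch of that idea, shown in Scala for consistency with the other snippets here (the PySpark version is analogous):

    import org.apache.spark.sql.functions.col
    import spark.implicits._   // assumes `spark` is the active SparkSession

    val a = Seq((5, 5, 3)).toDF("A", "B", "C")
    // Add the Column objects together instead of calling the aggregate sum().
    val withSum = a.withColumn("my_sum", a.columns.map(col).reduce(_ + _))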

How to create an EdgeRDD from a data frame in Spark

Submitted by こ雲淡風輕ζ on 2019-12-12 05:25:12
Question: I have a DataFrame in Spark. Each row represents a person, and I want to retrieve the possible connections among them. The rule for a link is: for each possible pair, if they have the same prop1: String and the absolute difference of their prop2: Int is < 5, then the link exists. I am trying to understand the best way to accomplish this task working with DataFrames. I am trying to retrieve indexed RDDs:

    val idusers = people.select("ID")
      .rdd
      .map(r => r(0).asInstanceOf[Int])
      .zipWithIndex
    val …
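
A sketch of one way to build the edges without hand-rolled indexing (not from the original post; it assumes people has columns "ID": Int, "prop1": String and "prop2": Int): self-join on prop1, keep pairs whose prop2 values differ by less than 5, and map each pair to a GraphX Edge.

    import org.apache.spark.graphx.Edge
    import org.apache.spark.sql.functions.{abs, col}

    val a = people.select(col("ID").as("srcId"), col("prop1"), col("prop2").as("p2a"))
    val b = people.select(col("ID").as("dstId"), col("prop1"), col("prop2").as("p2b"))

    val edges = a.join(b, "prop1")                                            // same prop1
      .where(col("srcId") < col("dstId") && abs(col("p2a") - col("p2b")) < 5) // close prop2, no self/duplicate pairs
      .select("srcId", "dstId")
      .rdd
      .map(r => Edge(r.getInt(0).toLong, r.getInt(1).toLong, ()))

The resulting RDD of Edge objects can then be fed to Graph.fromEdges, which builds the EdgeRDD internally.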

Convert ArrayType(FloatType,false) to VectorUDT

Submitted by 谁说我不能喝 on 2019-12-12 04:47:15
Question: I want to perform cluster analysis using K-Means on the itemFactors produced by ALS. Although the itemFactors of an ALSModel is a DataFrame that contains the id and the features of the item factors, this data structure seems to be unsuitable for K-Means. Here's the code for collaborative filtering using ALS:

    val als = new ALS()
      .setRegParam(0.01)
      .setNonnegative(false)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
    val model = als.fit(training)
    val predictions = model …
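
One common fix, sketched here rather than taken from the post, is to wrap the array<float> features column of itemFactors in an ml Vector with a small UDF before handing it to KMeans:

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.{col, udf}

    // model.itemFactors has schema (id: int, features: array<float>).
    val toVector = udf((xs: Seq[Float]) => Vectors.dense(xs.map(_.toDouble).toArray))
    val kmeansInput = model.itemFactors.select(col("id"), toVector(col("features")).as("features"))

A KMeans estimator with setFeaturesCol("features") can then be fit on kmeansInput.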

Spark DataFrame write to JDBC - Can't get JDBC type for array<array<int>>

Submitted by 此生再无相见时 on 2019-12-12 04:37:28
Question: I'm trying to save a DataFrame via JDBC (to Postgres). One of the fields is of type Array[Array[Int]]. Without any casting, it fails with:

    Exception in thread "main" java.lang.IllegalArgumentException:
        Can't get JDBC type for array<array<int>> at ... (JdbcUtils.scala:148)

I added explicit casting to the array datatype to guide the transformation:

    val df = readings
      .map { case ((a, b), (_, d, e, arrayArrayInt)) => (a, b, d, e, arrayArrayInt) }
      .toDF("A", "B", "D", "E", "arrays")
    edgesDF …
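
A sketch of one workaround (not from the original post): Spark's JDBC writer has no mapping for nested array types, so one option is to serialize the column to a JSON string first (to_json accepts array columns in recent Spark versions) and store it in a text or jsonb column on the Postgres side. The connection options below are placeholders.

    import org.apache.spark.sql.functions.{col, to_json}

    val writable = df.withColumn("arrays", to_json(col("arrays")))   // array<array<int>> -> JSON string

    writable.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")   // placeholder
      .option("dbtable", "readings")                            // placeholder
      .option("user", "user")                                   // placeholder
      .option("password", "password")                           // placeholder
      .save()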

Spark DataFrame columns transform to Map type and List of Map Type [duplicate]

Submitted by 只谈情不闲聊 on 2019-12-12 04:34:11
Question: (This question already has an answer here: Converting multiple different columns to Map column with Spark Dataframe scala.) I have a DataFrame as below, and I would appreciate it if someone could help me get the output in a different format. Input:

    |customerId|transHeader|transLine|
    |1001      |1001aa     |1001aa1  |
    |1001      |1001aa     |1001aa2  |
    |1001      |1001aa     |1001aa3  |
    |1001      |1001aa     |1001aa4  |
    |1002      |1002bb     |1002bb1  |
    |1002      |1002bb     |1002bb2  |
    |1002      |1002bb     |1002bb3  |
    |1002      |1002bb     …
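
The desired output format is cut off above, but a sketch of the usual pattern (the input DataFrame is called input here) builds one map per row from the transaction columns and then collects a list of those maps per customer:

    import org.apache.spark.sql.functions.{col, collect_list, lit, map}

    // One map<string,string> per row, built from the two transaction columns.
    val withMap = input.select(
      col("customerId"),
      map(lit("transHeader"), col("transHeader"),
          lit("transLine"),   col("transLine")).as("trans"))

    // A list of those maps per customer.
    val grouped = withMap.groupBy("customerId").agg(collect_list(col("trans")).as("transactions"))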

Iterate across columns in spark dataframe and calculate min max value

Submitted by 痴心易碎 on 2019-12-12 04:12:15
Question: I want to iterate across the columns of a DataFrame in my Spark program and calculate the min and max values. I'm new to Spark and Scala and am not able to iterate over the columns once I have fetched them into a DataFrame. I have tried running the code below, but it needs the column number to be passed to it; the question is how do I fetch the columns from the DataFrame, pass them dynamically, and store the results in a collection?

    val parquetRDD = spark.read.parquet("filename.parquet")
    parquetRDD.collect.foreach({ i => parquetRDD …
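
A minimal sketch of one way to do this without collecting the data (not the poster's code): build min/max expressions from df.columns and evaluate them all in a single aggregation.

    import org.apache.spark.sql.functions.{col, max, min}

    val df = spark.read.parquet("filename.parquet")
    val aggExprs = df.columns.flatMap(c => Seq(min(col(c)).as(s"min_$c"), max(col(c)).as(s"max_$c")))
    val minMax = df.agg(aggExprs.head, aggExprs.tail: _*).first()   // one Row holding min/max for every column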

What is the equivalent of OFFSET in Spark SQL?

Submitted by 送分小仙女□ on 2019-12-12 03:55:49
Question: I got a result set of 100 rows using Spark SQL, and I want the final result starting from row number 6 through 15. In SQL we use OFFSET to skip rows; for example, OFFSET 5 LIMIT 10 returns rows 6 to 15. How can I achieve the same in Spark SQL?

Answer 1: I believe Spark SQL does not support OFFSET, so I use an id as the filter condition and retrieve only N rows at a time. The following is my sample code:

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format('com.databricks …
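
For completeness, a sketch of a window-based OFFSET/LIMIT pattern (shown in Scala; an ordering column "id" is assumed): number the rows, then keep rows 6 through 15.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.orderBy(col("id"))                    // a global ordering column is required
    val page = df.withColumn("rn", row_number().over(w))
      .where(col("rn") > 5 && col("rn") <= 15)           // OFFSET 5 LIMIT 10
      .drop("rn")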

How to split a spark dataframe with equal records

Submitted by 左心房为你撑大大i on 2019-12-12 03:47:23
Question: I am using df.randomSplit(), but it does not split into an equal number of rows. Is there any other way I can achieve this?

Answer 1: In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment. For that you usually:
1. Randomize the dataset.
2. Apply a modulus operation to assign each element to a fold (partition).
After this step you will have to extract each partition using filter; as far as I know, there is still no transformation that separates a single RDD into many. Here is …
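
The answer's own code is cut off above; a sketch of the recipe it describes (assuming `spark` is the active SparkSession and k folds are wanted): shuffle the rows, index them with zipWithIndex, assign each row to a fold by modulus, and pull each fold out with a filter.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, rand}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val k = 4
    val shuffled = df.orderBy(rand())
    val indexed = spark.createDataFrame(
      shuffled.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx % k)) },
      StructType(shuffled.schema.fields :+ StructField("fold", LongType)))

    // Each fold holds roughly count/k rows; extract them with a filter.
    val folds = (0 until k).map(f => indexed.where(col("fold") === f).drop("fold"))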

Why is pyspark so much slower in finding the max of a column?

Submitted by 断了今生、忘了曾经 on 2019-12-12 03:39:28
Question: Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column? I imported the Kaggle Quora training set (over 400,000 rows), and I like what Spark does when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of the column and divide by that value. I tried the solutions from "Best way to get the max value in a Spark dataframe column" and https://databricks.com/blog/2015/06/02/statistical-and …
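
A minimal Scala sketch of the max-then-scale step itself (the PySpark version is analogous; the column name "feature" and its double type are assumptions):

    import org.apache.spark.sql.functions.{col, max}

    // Compute the maximum once as a single aggregation, then divide the column by it.
    val maxVal = df.agg(max(col("feature"))).first().getDouble(0)
    val scaled = df.withColumn("feature_scaled", col("feature") / maxVal)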