spark-dataframe

Filter dataframe by value NOT present in column of other dataframe [duplicate]

Submitted by 徘徊边缘 on 2019-12-02 01:28:34
Question: This question already has answers here: Filter Spark DataFrame based on another DataFrame that specifies blacklist criteria (2 answers). Closed 3 years ago. Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of the other dataframe. I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains, or the "isin" keyword, or one of the join methods.
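One route the question itself mentions is isin: collect the column of values to exclude to the driver and negate the membership test. A minimal Scala sketch, assuming the relevant column is called "city" in both frames (the column names are not given in the excerpt) and that the exclusion list is small enough to collect:

import org.apache.spark.sql.functions.col

// Collect the values to exclude to the driver (only sensible for a small list)
val excluded = df2.select("city").collect().map(_.getString(0))

// Keep only the rows of df1 whose city is NOT in that list
val filtered = df1.filter(!col("city").isin(excluded: _*))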

Spark ML StringIndexer Different Labels Training/Testing

Submitted by 别等时光非礼了梦想. on 2019-12-02 01:08:05
I'm using Scala and StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category. The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly. I am processing the training/testing data in exactly the same way, and don't save the model. I have tried manually creating labels (by getting the index of the category), but get this error java.lang
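A sketch of the usual fix: fit the StringIndexer once on the training data and reuse the fitted StringIndexerModel to transform the test data, so both sets share a single label-to-index mapping. The column names and the trainDF/testDF variables below are assumptions, not taken from the question:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")        // assumed input column name
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")       // drop rows with categories never seen during training

val indexerModel = indexer.fit(trainDF)            // fit on the training data only
val indexedTrain = indexerModel.transform(trainDF)
val indexedTest  = indexerModel.transform(testDF)  // same mapping applied to the test set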

Pyspark Merge WrappedArrays Within a Dataframe

Submitted by 青春壹個敷衍的年華 on 2019-12-02 00:39:18
The current Pyspark dataframe has this structure (a list of WrappedArrays for col2):

+---+--------------------------------------------------+
|id |col2                                              |
+---+--------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)] |
|b  |[WrappedArray(code5), WrappedArray(code6, code8)] |
+---+--------------------------------------------------+

This is the structure I would like to have (a flattened list for col2):

+---+----------
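The question is in PySpark, but the idea is the same in the Scala DataFrame API used elsewhere on this page: flatten the array-of-arrays column. A sketch assuming Spark 2.4+, where a built-in flatten function exists (on older versions a small UDF calling Seq.flatten does the same job); df is the input dataframe shown above:

import org.apache.spark.sql.functions.{col, flatten, udf}

// Spark 2.4+: built-in flatten for an array-of-arrays column
val merged = df.withColumn("col2", flatten(col("col2")))

// Pre-2.4 alternative: a tiny UDF
val flattenUdf = udf((xs: Seq[Seq[String]]) => xs.flatten)
val mergedOld = df.withColumn("col2", flattenUdf(col("col2")))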

Manipulating a dataframe within a Spark UDF

Submitted by 家住魔仙堡 on 2019-12-02 00:19:36
Question: I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below. Suppose I have a dataframe df1 that has columns named ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want to sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:

val y_list = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID:
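The root cause is that a DataFrame (and the SparkContext behind it) cannot be captured inside a UDF, which is what triggers the "object not serializable" error. A sketch of the usual workaround: join df2 onto df1 first and do the arithmetic with column expressions. The exact "subset of Y columns driven by Value" logic is cut off in the excerpt, so summing all Y columns here is only an assumption:

import org.apache.spark.sql.functions.col

val yCols = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(col)

// Bring df2's "Value" alongside each row of df1 instead of reading df2 inside a UDF
val joined = df1.join(df2, Seq("ID"))

// Plain column arithmetic runs on the executors without serializing driver-side objects
val withSum = joined.withColumn("ySum", yCols.reduce(_ + _))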

Join two DataFrames where the join key is different and only select some columns

Submitted by 筅森魡賤 on 2019-12-02 00:10:41
Question: What I would like to do is: join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B. I tried something like what I put below with different quotation marks, but it's still not working. I feel like in pyspark there should be a simple way to do this.

A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)

I know you could write

A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this
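A sketch of one way to express this without falling back to SQL: alias both frames so that "A.*" is unambiguous after the join. Shown with the Scala DataFrame API for consistency with the rest of this page; the PySpark calls are analogous. The a_id/b_id names come from the question:

import org.apache.spark.sql.functions.col

val joined = A.alias("A")
  .join(B.alias("B"), col("A.a_id") === col("B.b_id"))
  .select(col("A.*"), col("B.b1"), col("B.b2"))   // all of A plus two columns of B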

java.lang.UnsupportedOperationException: Schema for type MyClass is not supported

Submitted by 白昼怎懂夜的黑 on 2019-12-02 00:05:28
Question: I am using Spark 1.5.0 and I have an issue while creating a dataframe from my RDD. Here is the code:

case class MyC(myclass: MyClass)
val df = rdd.map { t => MyC(t) }.toDF("cust")
df.show()

Here is the error message:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type MyClass is not supported

Any help with this will be greatly appreciated.

Answer 1: Spark uses reflection to infer the dataframe schema, but cannot do so for arbitrary classes. I'm not sure if I can state an
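Continuing the answer's point: reflection-based schema inference only handles Product types (case classes) whose fields are themselves supported types. A sketch of the kind of change that makes it work, with hypothetical fields since MyClass's definition is not shown in the question:

// If MyClass is a plain class, Spark cannot derive a schema for it.
// Declaring it as a case class with supported field types fixes that:
case class MyClass(id: Long, name: String)   // hypothetical fields
case class MyC(myclass: MyClass)

import sqlContext.implicits._                // Spark 1.5-style implicits for toDF
val df = rdd.map(t => MyC(t)).toDF("cust")   // nested case classes become struct columns
df.show()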

How does Spark keep track of the splits in randomSplit?

Submitted by 纵然是瞬间 on 2019-12-01 22:37:37
This question explains how Spark's random split works (How does Sparks RDD.randomSplit actually split the RDD), but I don't understand how Spark keeps track of what values went to one split so that those same values don't also go to the second split. If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly
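For context, this is what the caller sees: the implementation comment above hints that the rows are put into a deterministic order before sampling, and with one weights array and one seed each split re-runs the same sampling over those same rows, so the pieces come out complementary and reproducible rather than being "tracked" against each other. A minimal usage sketch, assuming a dataframe df:

// Same weights + same seed => the same deterministic sampling each time the
// splits are materialized, so train and test do not overlap
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

println(s"train: ${train.count()}, test: ${test.count()}")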

Filter dataframe by value NOT present in column of other dataframe [duplicate]

Submitted by 一曲冷凌霜 on 2019-12-01 22:37:01
This question already has an answer here: Filter Spark DataFrame based on another DataFrame that specifies blacklist criteria (2 answers). Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of the other dataframe. I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains, or the "isin" keyword, or one of the join methods.

val df1 = Seq(("Hampstead", "London"), ("Spui", "Amsterdam"), ("Chittagong", "Chennai"))
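A common approach that needs no collect at all is a left anti-join. A sketch continuing from the question's df1, assuming Spark 2.0+ (where the left_anti join type exists), a SparkSession named spark, assumed column names "location"/"city", and a hypothetical df2 holding the values to exclude:

import spark.implicits._

val df1 = Seq(("Hampstead", "London"), ("Spui", "Amsterdam"), ("Chittagong", "Chennai"))
  .toDF("location", "city")                                  // column names assumed
val df2 = Seq("London", "Amsterdam").toDF("city")            // hypothetical exclusion list

// left_anti keeps the rows of df1 whose join key has NO match in df2
val filtered = df1.join(df2, Seq("city"), "left_anti")
filtered.show()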

Spark dataframe: Pivot and Group based on columns

Submitted by 巧了我就是萌 on 2019-12-01 22:01:04
Question: I have an input dataframe as below with id, app, and customer.

Input dataframe
+--------------------+-----+---------+
|id                  |app  |customer |
+--------------------+-----+---------+
|id1                 |fw   |WM       |
|id1                 |fw   |CS       |
|id2                 |fw   |CS       |
|id1                 |fe   |WM       |
|id3                 |bc   |TR       |
|id3                 |bc   |WM       |
+--------------------+-----+---------+

Expected output: using pivot and aggregate, make the app values column names and put the aggregated customer names as a list in the dataframe.

Expected dataframe
+--------------------+----------+--
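A sketch of the pivot the question describes: group by id, pivot on app, and collect the customer values into a list per cell. Assuming the input dataframe is called df:

import org.apache.spark.sql.functions.collect_list

val result = df.groupBy("id")
  .pivot("app")                     // one output column per distinct app value
  .agg(collect_list("customer"))    // aggregated customer names as a list

result.show(truncate = false)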

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among “ ,;{}()\\n\\t=”. Please use alias to rename it [duplicate]

Submitted by 雨燕双飞 on 2019-12-01 21:28:19
This question already has an answer here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). I'm trying to load Parquet data into PySpark, where a column has a space in the name:

df = spark.read.parquet('my_parquet_dump')
df.select(df['Foo Bar'].alias('foobar'))

Even though I have aliased the column, I'm still getting this error, with the error propagating from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load this parquet file into PySpark without pre-processing the data in Scala and without modifying the source parquet file?
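The linked duplicate covers the write side of this check, where the usual pattern is to sanitize all column names in one rename pass. Shown in Scala for consistency with the rest of this page; PySpark's DataFrame.toDF accepts the same list of new names. Whether a rename applied right after spark.read.parquet is enough to avoid the read-side error can depend on the Spark version, so treat this as a sketch rather than a guaranteed fix; the output path is hypothetical:

// Replace the characters the error message lists with underscores
def sanitize(name: String): String = name.replaceAll("[ ,;{}()\\n\\t=]", "_")

val df = spark.read.parquet("my_parquet_dump")
val cleaned = df.toDF(df.columns.map(sanitize): _*)   // "Foo Bar" becomes "Foo_Bar"

cleaned.write.parquet("my_parquet_dump_clean")        // hypothetical output path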