spark-dataframe

Filter dataframe by value NOT present in column of other dataframe [duplicate]

Submitted by 徘徊边缘 on 2019-12-02 01:28:34
Question: This question already has answers here: Filter Spark DataFrame based on another DataFrame that specifies blacklist criteria (2 answers). Closed 3 years ago. Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of the other dataframe. I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains, or the "isin" keyword, or one of the join methods.
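One route the question itself mentions is isin: collect the column of values to exclude to the driver and negate the membership test. A minimal Scala sketch, assuming the relevant column is called "city" in both frames (the column names are not given in the excerpt) and that the exclusion list is small enough to collect:

import org.apache.spark.sql.functions.col

// Collect the values to exclude to the driver (only sensible for a small list)
val excluded = df2.select("city").collect().map(_.getString(0))

// Keep only the rows of df1 whose city is NOT in that list
val filtered = df1.filter(!col("city").isin(excluded: _*))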

Spark ML StringIndexer Different Labels Training/Testing

Submitted by 别等时光非礼了梦想. on 2019-12-02 01:08:05
I'm using Scala and StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category. The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly. I am processing the training/testing data in exactly the same way, and don't save the model. I have tried manually creating labels (by getting the index of the category), but get this error java.lang
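A sketch of the usual fix: fit the StringIndexer once on the training data and reuse the fitted StringIndexerModel to transform the test data, so both sets share a single label-to-index mapping. The column names and the trainDF/testDF variables below are assumptions, not taken from the question:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")        // assumed input column name
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")       // drop rows with categories never seen during training

val indexerModel = indexer.fit(trainDF)            // fit on the training data only
val indexedTrain = indexerModel.transform(trainDF)
val indexedTest  = indexerModel.transform(testDF)  // same mapping applied to the test set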

Pyspark Merge WrappedArrays Within a Dataframe

Submitted by 青春壹個敷衍的年華 on 2019-12-02 00:39:18
The current Pyspark dataframe has this structure (a list of WrappedArrays for col2):

+---+--------------------------------------------------+
|id |col2                                              |
+---+--------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)] |
|b  |[WrappedArray(code5), WrappedArray(code6, code8)] |
+---+--------------------------------------------------+

This is the structure I would like to have (a flattened list for col2):

+---+----------
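The question is in PySpark, but the idea is the same in the Scala DataFrame API used elsewhere on this page: flatten the array-of-arrays column. A sketch assuming Spark 2.4+, where a built-in flatten function exists (on older versions a small UDF calling Seq.flatten does the same job); df is the input dataframe shown above:

import org.apache.spark.sql.functions.{col, flatten, udf}

// Spark 2.4+: built-in flatten for an array-of-arrays column
val merged = df.withColumn("col2", flatten(col("col2")))

// Pre-2.4 alternative: a tiny UDF
val flattenUdf = udf((xs: Seq[Seq[String]]) => xs.flatten)
val mergedOld = df.withColumn("col2", flattenUdf(col("col2")))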

Manipulating a dataframe within a Spark UDF

Submitted by 家住魔仙堡 on 2019-12-02 00:19:36
Question: I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below. Suppose I have a dataframe df1 that has columns named ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want to sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:

val y_list = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID:
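The root cause is that a DataFrame (and the SparkContext behind it) cannot be captured inside a UDF, which is what triggers the "object not serializable" error. A sketch of the usual workaround: join df2 onto df1 first and do the arithmetic with column expressions. The exact "subset of Y columns driven by Value" logic is cut off in the excerpt, so summing all Y columns here is only an assumption:

import org.apache.spark.sql.functions.col

val yCols = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(col)

// Bring df2's "Value" alongside each row of df1 instead of reading df2 inside a UDF
val joined = df1.join(df2, Seq("ID"))

// Plain column arithmetic runs on the executors without serializing driver-side objects
val withSum = joined.withColumn("ySum", yCols.reduce(_ + _))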

Join two DataFrames where the join key is different and only select some columns

Submitted by 筅森魡賤 on 2019-12-02 00:10:41
Question: What I would like to do is: join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B. I tried something like what I put below with different quotation marks, but it's still not working. I feel like in pyspark there should be a simple way to do this.

A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)

I know you could write

A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this
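A sketch of one way to express this without falling back to SQL: alias both frames so that "A.*" is unambiguous after the join. Shown with the Scala DataFrame API for consistency with the rest of this page; the PySpark calls are analogous. The a_id/b_id names come from the question:

import org.apache.spark.sql.functions.col

val joined = A.alias("A")
  .join(B.alias("B"), col("A.a_id") === col("B.b_id"))
  .select(col("A.*"), col("B.b1"), col("B.b2"))   // all of A plus two columns of B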

java.lang.UnsupportedOperationException: Schema for type MyClass is not supported

Submitted by 白昼怎懂夜的黑 on 2019-12-02 00:05:28
Question: I am using Spark 1.5.0 and I have an issue while creating a dataframe from my RDD. Here is the code:

case class MyC(myclass: MyClass)
val df = rdd.map { t => MyC(t) }.toDF("cust")
df.show()

Here is the error message:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type MyClass is not supported

Any help with this will be greatly appreciated.

Answer 1: Spark uses reflection to infer the dataframe schema, but cannot do so for arbitrary classes. I'm not sure if I can state an
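Continuing the answer's point: reflection-based schema inference only handles Product types (case classes) whose fields are themselves supported types. A sketch of the kind of change that makes it work, with hypothetical fields since MyClass's definition is not shown in the question:

// If MyClass is a plain class, Spark cannot derive a schema for it.
// Declaring it as a case class with supported field types fixes that:
case class MyClass(id: Long, name: String)   // hypothetical fields
case class MyC(myclass: MyClass)

import sqlContext.implicits._                // Spark 1.5-style implicits for toDF
val df = rdd.map(t => MyC(t)).toDF("cust")   // nested case classes become struct columns
df.show()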

How does Spark keep track of the splits in randomSplit?

Submitted by 纵然是瞬间 on 2019-12-01 22:37:37
This question explains how Spark's random split works (How does Sparks RDD.randomSplit actually split the RDD), but I don't understand how Spark keeps track of what values went to one split so that those same values don't also go to the second split. If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly
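For context, this is what the caller sees: the implementation comment above hints that the rows are put into a deterministic order before sampling, and with one weights array and one seed each split re-runs the same sampling over those same rows, so the pieces come out complementary and reproducible rather than being "tracked" against each other. A minimal usage sketch, assuming a dataframe df:

// Same weights + same seed => the same deterministic sampling each time the
// splits are materialized, so train and test do not overlap
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

println(s"train: ${train.count()}, test: ${test.count()}")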

Filter dataframe by value NOT present in column of other dataframe [duplicate]

Submitted by 一曲冷凌霜 on 2019-12-01 22:37:01
This question already has an answer here: Filter Spark DataFrame based on another DataFrame that specifies blacklist criteria (2 answers). Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of the other dataframe. I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains, or the "isin" keyword, or one of the join methods.

val df1 = Seq(("Hampstead", "London"), ("Spui", "Amsterdam"), ("Chittagong", "Chennai"))
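A common approach that needs no collect at all is a left anti-join. A sketch continuing from the question's df1, assuming Spark 2.0+ (where the left_anti join type exists), a SparkSession named spark, assumed column names "location"/"city", and a hypothetical df2 holding the values to exclude:

import spark.implicits._

val df1 = Seq(("Hampstead", "London"), ("Spui", "Amsterdam"), ("Chittagong", "Chennai"))
  .toDF("location", "city")                                  // column names assumed
val df2 = Seq("London", "Amsterdam").toDF("city")            // hypothetical exclusion list

// left_anti keeps the rows of df1 whose join key has NO match in df2
val filtered = df1.join(df2, Seq("city"), "left_anti")
filtered.show()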

Spark dataframe: Pivot and Group based on columns

Submitted by 巧了我就是萌 on 2019-12-01 22:01:04
Question: I have an input dataframe as below with id, app, and customer.

Input dataframe
+--------------------+-----+---------+
|id                  |app  |customer |
+--------------------+-----+---------+
|id1                 |fw   |WM       |
|id1                 |fw   |CS       |
|id2                 |fw   |CS       |
|id1                 |fe   |WM       |
|id3                 |bc   |TR       |
|id3                 |bc   |WM       |
+--------------------+-----+---------+

Expected output: using pivot and aggregate, make the app values column names and put the aggregated customer names as a list in the dataframe.

Expected dataframe
+--------------------+----------+--
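A sketch of the pivot the question describes: group by id, pivot on app, and collect the customer values into a list per cell. Assuming the input dataframe is called df:

import org.apache.spark.sql.functions.collect_list

val result = df.groupBy("id")
  .pivot("app")                     // one output column per distinct app value
  .agg(collect_list("customer"))    // aggregated customer names as a list

result.show(truncate = false)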

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among “ ,;{}()\\n\\t=”. Please use alias to rename it [duplicate]

Submitted by 雨燕双飞 on 2019-12-01 21:28:19
This question already has an answer here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). I'm trying to load Parquet data into PySpark, where a column has a space in the name:

df = spark.read.parquet('my_parquet_dump')
df.select(df['Foo Bar'].alias('foobar'))

Even though I have aliased the column, I'm still getting this error, with the error propagating from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load this parquet file into PySpark without pre-processing the data in Scala and without modifying the source parquet file?
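The linked duplicate covers the write side of this check, where the usual pattern is to sanitize all column names in one rename pass. Shown in Scala for consistency with the rest of this page; PySpark's DataFrame.toDF accepts the same list of new names. Whether a rename applied right after spark.read.parquet is enough to avoid the read-side error can depend on the Spark version, so treat this as a sketch rather than a guaranteed fix; the output path is hypothetical:

// Replace the characters the error message lists with underscores
def sanitize(name: String): String = name.replaceAll("[ ,;{}()\\n\\t=]", "_")

val df = spark.read.parquet("my_parquet_dump")
val cleaned = df.toDF(df.columns.map(sanitize): _*)   // "Foo Bar" becomes "Foo_Bar"

cleaned.write.parquet("my_parquet_dump_clean")        // hypothetical output path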