spark-dataframe

Spark 2.0, DataFrame, filter a string column, unequal operator (!==) is deprecated

大憨熊 submitted on 2019-12-03 23:30:32
I am trying to filter a DataFrame by keeping only those rows that have a certain string column non-empty. The operation is the following:

    df.filter($"stringColumn" !== "")

My compiler shows that !== is deprecated since I moved to Spark 2.0.1. How can I filter on a string column not being empty in Spark > 2.0?

Answer: Use =!= as a replacement:

    df.filter($"stringColumn" =!= "")

Source: https://stackoverflow.com/questions/40154104/spark-2-0-dataframe-filter-a-string-column-unequal-operator-is-deprecat
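A minimal, self-contained illustration of the accepted fix (the column name and sample rows are made up for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local").appName("neq-example").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "x"), ("b", ""), ("c", "y")).toDF("id", "stringColumn")

    // =!= is the Column inequality operator that replaces the deprecated !==
    df.filter($"stringColumn" =!= "").show()
    // keeps only the rows with id "a" and "c"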

Pyspark read delta/upsert dataset from csv files

百般思念 submitted on 2019-12-03 23:18:04
I have a dataset that is updated periodically, which I receive as a series of CSV files giving the changes. I'd like a DataFrame that contains only the latest version of each row. Is there a way to load the whole dataset in Spark/pyspark that allows for parallelism?

Example:

    File 1 (Key, Value)
    1,ABC
    2,DEF
    3,GHI

    File 2 (Key, Value)
    2,XYZ
    4,UVW

    File 3 (Key, Value)
    3,JKL
    4,MNO

Should result in:

    1,ABC
    2,XYZ
    3,JKL
    4,MNO

I know I could do this by loading each file sequentially and then using an anti join (to kick out the old values being replaced) and a union, but that doesn't let the workload be parallelized.
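One possible approach, sketched in Scala like the other examples on this page (the PySpark functions have the same names). It assumes each change file has a Key,Value header row and that the files sort by recency when their paths are compared (file1, file2, ...), so input_file_name() can serve as a version tag; any other per-file version column would work the same way.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, input_file_name, row_number}

    // Read every change file at once so Spark can parallelise the scan.
    val changes = spark.read
      .option("header", "true")
      .csv("/path/to/changes/*.csv")                  // hypothetical location
      .withColumn("source_file", input_file_name())

    // For each Key, keep only the row coming from the most recent file.
    val w = Window.partitionBy("Key").orderBy(col("source_file").desc)
    val latest = changes
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .drop("rn", "source_file")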

How to use orderby() with descending order in Spark window functions?

老子叫甜甜 submitted on 2019-12-03 22:39:21
I need a window function that partitions by some keys (= column names), orders by another column name, and returns the rows with the top x ranks. This works fine for ascending order:

    def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
      val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
      val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
        .orderBy(top_value)
      val rankCondition = "rn < " + top_x.toString
      val dfTop = df.withColumn("rn", row_number().over(w))
        .where(rankCondition).drop("rn")
      return dfTop
    }

But when I try to change it to descending order, I cannot get orderBy to accept it.
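A hedged sketch of the descending variant: orderBy accepts Column expressions, so building the ordering with .desc (or the desc() function) is enough. The names below mirror the question's code but are simplified.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    def getTopX(df: DataFrame, topX: Int, topKey: String, topValue: String): DataFrame = {
      val topKeys = topKey.split(",").map(_.trim).toList
      val w = Window.partitionBy(topKeys.head, topKeys.tail: _*)
        .orderBy(col(topValue).desc)                  // descending instead of ascending
      df.withColumn("rn", row_number().over(w))
        .where(col("rn") < topX)
        .drop("rn")
    }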

Apache Spark Window function with nested column

点点圈 submitted on 2019-12-03 21:15:19
I'm not sure whether this is a bug or just incorrect syntax. I searched around and didn't see it mentioned elsewhere, so I'm asking here before filing a bug report. I'm trying to use a Window function partitioned on a nested column. I've created a small example below demonstrating the problem.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
      .withColumn("Data", struct("A", "B", "C")).drop("A").drop("B").drop("C")
    val winSpec = Window.partitionBy("Data
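The excerpt cuts off mid-definition, but a common workaround for partitioning a window on a field inside the Data struct is to project that field to a top-level column first. A hedged sketch, reusing the data DataFrame from the question (whether partitionBy("Data.A") works directly depends on the Spark version):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max}

    // Pull the nested field out, then partition on the flat column.
    val withKey = data.withColumn("Data_A", col("Data.A"))
    val winSpec = Window.partitionBy("Data_A")
    val result  = withKey.withColumn("max_num", max(col("num")).over(winSpec))
    result.show()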

Flatten a DataFrame in Scala with different DataTypes inside

纵然是瞬间 submitted on 2019-12-03 21:06:13
As you may know, a DataFrame can contain fields of complex types, like structures (StructType) or arrays (ArrayType). You may need, as in my case, to map all the DataFrame data to a Hive table with simple-type fields (String, Integer, ...). I've been struggling with this issue for a long time, and I've finally found a solution I want to share. I'm sure it could be improved, so feel free to reply with your own suggestions. It's based on this thread, but also works for ArrayType elements, not only StructType ones. It is a tail recursive function which receives a DataFrame and returns it with every complex field flattened into simple ones.
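The author's code is not included in this excerpt; the sketch below is a simplified (and not tail-recursive) illustration of the idea being described: expand StructType fields into top-level columns and explode ArrayType fields, repeating until only simple types remain.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, explode}
    import org.apache.spark.sql.types.{ArrayType, StructType}

    def flatten(df: DataFrame): DataFrame =
      df.schema.fields.find(f =>
        f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]) match {
        case None => df                                   // only simple types left
        case Some(field) => field.dataType match {
          case st: StructType =>
            // Promote each struct field to a top-level column, e.g. Data.A -> Data_A.
            val expanded = st.fieldNames.map(n => col(s"${field.name}.$n").as(s"${field.name}_$n"))
            val others   = df.columns.filter(_ != field.name).map(col)
            flatten(df.select(others ++ expanded: _*))
          case _: ArrayType =>
            // One row per array element; elements may themselves be structs.
            flatten(df.withColumn(field.name, explode(col(field.name))))
        }
      }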

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

≡放荡痞女 submitted on 2019-12-03 20:16:59
I read a parquet file from an HDFS system:

    path <- "hdfs://part_2015"
    AppDF <- parquetFile(sqlContext, path)
    printSchema(AppDF)
    root
     |-- app: binary (nullable = true)
     |-- category: binary (nullable = true)
     |-- date: binary (nullable = true)
     |-- user: binary (nullable = true)

    class(AppDF)
    [1] "DataFrame"
    attr(,"package")
    [1] "SparkR"

    collect(AppDF)
    ..... error: arguments imply differing number of rows: 46021, 39175, 62744, 27137

    head(AppDF)
    ..... error: arguments imply differing number of rows: 36, 30, 48

I've read some threads about this problem, but they don't match my case. In fact, I just read a table from HDFS.

How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?

妖精的绣舞 submitted on 2019-12-03 18:14:20
Question

I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped string. Example:

    {
      "id": 1,
      "name": "some name",
      "problem_field": "{\"height\":180,\"weight\":80,}",
    }

As expected, when using sqlContext.read.json it creates a DataFrame with the 3 columns id, name and problem_field, where problem_field is a String. I have no control over the input files, and I'd prefer to be able to solve this problem within Spark.
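One common approach (hedged: it requires Spark 2.1+, a SparkSession named spark, and an explicitly declared schema for the inner object rather than true inference) is to parse the escaped string with from_json:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{IntegerType, StructType}

    // Schema of the escaped inner JSON object, assumed from the example above.
    val innerSchema = new StructType()
      .add("height", IntegerType)
      .add("weight", IntegerType)

    val parsed = spark.read.json("/path/to/input")     // hypothetical path
      .withColumn("problem_field", from_json(col("problem_field"), innerSchema))

    parsed.printSchema()   // problem_field is now struct<height:int,weight:int>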

Trying to read and write parquet files from s3 with local spark

拟墨画扇 submitted on 2019-12-03 16:54:26
I'm trying to read and write parquet files from my local machine to S3 using Spark, but I can't seem to configure my Spark session properly to do so. Obviously there are configurations to be made, but I could not find a clear reference on how to do it. Currently my Spark session reads local parquet mocks and is defined as such:

    val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

Answer: I'm going to have to correct the post by himanshuIIITian slightly (sorry). Use the s3a connector, not the older, obsolete, unmaintained s3n. S3A is faster and is the connector that is actively maintained.
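A hedged sketch of the kind of configuration the answer points to (the bucket name and credential source are placeholders, and it assumes the matching hadoop-aws module and AWS SDK jars are on the classpath):

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder
      .master("local")
      .appName("spark session example")
      .getOrCreate()

    // Point the s3a connector at your credentials; here they are read from
    // environment variables, but any Hadoop credential provider works.
    val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Read and write with s3a:// URLs (hypothetical bucket and paths).
    val df = sparkSession.read.parquet("s3a://some-bucket/input/")
    df.write.parquet("s3a://some-bucket/output/")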

pyspark: isin vs join

泄露秘密 submitted on 2019-12-03 16:45:00
Question

What are general best practices for filtering a DataFrame in pyspark by a given list of values? Specifically: depending on the size of the given list of values, when is it best, with respect to runtime, to use isin vs inner join vs broadcast?

This question is the Spark analogue of the following question in Pig: Pig: efficient filtering by loaded list

Additional context: Pyspark isin function

Answer 1:

Considering

    import pyspark.sql.functions as psf

there are two types of broadcasting: sc.broadcast, which ships a plain Python object to every executor, and the broadcast join hint (psf.broadcast), which marks a DataFrame for replication during a join.
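A hedged sketch of the two patterns being compared, written in Scala like the rest of this page (the PySpark equivalents are Column.isin and pyspark.sql.functions.broadcast); the data is made up for the example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{broadcast, col}

    val spark = SparkSession.builder.master("local").appName("isin-vs-join").getOrCreate()
    import spark.implicits._

    val df     = Seq(("a", 1), ("b", 2), ("z", 3)).toDF("key", "value")
    val keysDf = Seq("a", "b", "c").toDF("key")

    // Small, driver-side list: isin turns into an IN predicate.
    val smallList = Seq("a", "b", "c")
    df.filter(col("key").isin(smallList: _*)).show()

    // Larger list already in a DataFrame: an inner join, here with a broadcast
    // hint so the lookup side is shipped to every executor.
    df.join(broadcast(keysDf), Seq("key")).show()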

join in a dataframe spark java

和自甴很熟 submitted on 2019-12-03 16:05:32
First of all, thank you for taking the time to read my question. My question is the following: in Spark with Java, I load the data of two CSV files into two DataFrames. These DataFrames hold the following information.

DataFrame airport:

    Id | Name    | City
    -----------------------
    1  | Barajas | Madrid

DataFrame airport_city_state:

    City   | state
    ----------------
    Madrid | España

I want to join these two DataFrames so that the result looks like this:

DataFrame result:

    Id | Name    | City   | state
    ------------------------------
    1  | Barajas | Madrid | España

where dfairport.city = dfairport_city_state.city. But I cannot work out how to express this join.
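A hedged sketch of the join (shown in Scala to match the rest of this page; in Java the condition would be written with dfairport.col("City").equalTo(dfairport_city_state.col("City"))). The sample rows mirror the question's tables:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local").appName("airport-join").getOrCreate()
    import spark.implicits._

    val dfairport            = Seq((1, "Barajas", "Madrid")).toDF("Id", "Name", "City")
    val dfairport_city_state = Seq(("Madrid", "España")).toDF("City", "state")

    // Inner join on the City column, keeping one copy of it in the result.
    val result = dfairport
      .join(dfairport_city_state, dfairport("City") === dfairport_city_state("City"))
      .select(dfairport("Id"), dfairport("Name"), dfairport("City"), dfairport_city_state("state"))

    result.show()   // 1 | Barajas | Madrid | España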