apache-spark-sql

Spark scala dataframe: Merging multiple columns into single column

Submitted by 烈酒焚心 on 2020-06-23 01:42:47
Question: I have a Spark dataframe which looks something like below:

+---+------+----+
| id|animal|talk|
+---+------+----+
|  1|   bat|done|
|  2| mouse|mone|
|  3| horse| gun|
|  4| horse|some|
+---+------+----+

I want to generate a new column, say merged, which would look something like

+---+-----------------------------------------------------------+
| id| merged columns                                             |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}]    |
| 2| [
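
The excerpt is cut off, but the goal appears to be collapsing the non-id columns into an array of name/value structs. The question asks for Scala, but the DataFrame API is analogous across languages; below is a minimal PySpark sketch of one way to do it, with the column names taken from the excerpt and everything else assumed.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "bat", "done"), (2, "mouse", "mone"), (3, "horse", "gun"), (4, "horse", "some")],
    ["id", "animal", "talk"],
)

# Build one {name, value} struct per source column, then wrap them all in an array.
merged = df.select(
    "id",
    F.array(*[
        F.struct(F.lit(c).alias("name"), F.col(c).alias("value"))
        for c in ["animal", "talk"]
    ]).alias("merged"),
)
merged.show(truncate=False)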

Spark - Reading JSON from Partitioned Folders using Firehose

Submitted by 二次信任 on 2020-06-22 11:50:52
Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)...great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' to the dataframe reader? My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming dataframe
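
The excerpt ends before any answer, but one common approach is to point the JSON reader at a glob over the YYYY/MM/DD/HH layers; the streaming variant then reuses the inferred schema. A rough PySpark sketch follows; the bucket and prefix are placeholders, not paths from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "s3a://my-firehose-bucket/events"  # placeholder location

# Static DataFrame over every leaf JSON file under the hourly partitions.
static_df = spark.read.json(base + "/*/*/*/*/")

# Streaming variant over the same layout; file-source streams need an explicit schema.
streaming_df = (
    spark.readStream
    .schema(static_df.schema)
    .json(base + "/*/*/*/*/")
)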

Create PySpark dataframe: sequence of months with year

Submitted by ☆樱花仙子☆ on 2020-06-17 13:02:06
Question: Complete newbie here. I would like to create a dataframe using PySpark that will list month and year, taking the current date and listing x number of lines. If I decide x=5, the dataframe should look as below:

Calendar_Entry
August 2019
September 2019
October 2019
November 2019
December 2019

Answer 1: Spark is not a tool for generating rows in a distributed way but rather for processing data that is already distributed. Since your data is small anyway, the best solution is probably to create the data
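
Following the answer's suggestion to build the small dataset locally and hand it to Spark, a minimal PySpark sketch might look like the following (the x=5 and the output formatting come from the question; the month arithmetic is my own).

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

x = 5  # number of calendar entries to generate
today = date.today()

rows = []
for i in range(x):
    m = today.month - 1 + i                      # zero-based month offset from today
    entry = date(today.year + m // 12, m % 12 + 1, 1)
    rows.append((entry.strftime("%B %Y"),))      # e.g. "August 2019"

calendar_df = spark.createDataFrame(rows, ["Calendar_Entry"])
calendar_df.show(truncate=False)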

pyspark sql Add different Qtr start_date, End_date for exploded rows

Submitted by 流过昼夜 on 2020-06-17 09:41:51
Question: I have a dataframe which has start_date, end_date, sales_target. I have added code to identify the number of quarters between the date range, and am accordingly able to split the sales_target across that number of quarters using a UDF.

df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])

+----------+----------+------------+
|start_date| end_date |sales_target|
+---------
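
The UDF mentioned in the excerpt is not shown; as a point of comparison, the quarter count and an even split of the target can also be derived with built-in functions. This is only a sketch under that assumption, reusing the sample data from the question; target_per_qtr is a name introduced here for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2020-01-01", "2020-12-31", "15"),
     ("2020-04-01", "2020-12-31", "11"),
     ("2020-07-01", "2020-12-31", "3")],
    ["start_date", "end_date", "sales_target"],
)

# Count the quarters spanned by the range, then split the target evenly across them.
result = (
    df.withColumn("start_date", F.to_date("start_date"))
      .withColumn("end_date", F.to_date("end_date"))
      .withColumn("noq",
                  (F.year("end_date") - F.year("start_date")) * 4
                  + F.quarter("end_date") - F.quarter("start_date") + 1)
      .withColumn("target_per_qtr",
                  F.round(F.col("sales_target").cast("double") / F.col("noq"), 2))
)
result.show()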

How to explode an array without duplicate records

Submitted by 為{幸葍}努か on 2020-06-17 09:38:06
Question: This is a continuation of the question in pyspark sql Add different Qtr start_date, End_date for exploded rows. Thanks. I have the following dataframe which has an array list as a column.

+---------------+------------+----------+----------+---+---------+-----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt    |new_edate |
+---------------+------------+----------+----------+---+---------+-----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4
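
The dataframe in the excerpt is cut off, but a usual way to explode several parallel arrays without producing a cross product of duplicates is to zip them first and explode once (arrays_zip, Spark 2.4+). The sketch below assumes that shape; the sample row and the cf_value/qtr_start_date output names are illustrative, not taken from the original.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A011021", 15, 4,
      [4, 4, 4, 3],
      ["2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01"])],
    ["customer_number", "sales_target", "noq", "cf_values", "new_sdt"],
)

# Zip the parallel arrays and explode the zipped column once:
# one output row per array position instead of one per combination.
exploded = (
    df.withColumn("z", F.explode(F.arrays_zip("cf_values", "new_sdt")))
      .select("customer_number", "sales_target", "noq",
              F.col("z.cf_values").alias("cf_value"),
              F.col("z.new_sdt").alias("qtr_start_date"))
)
exploded.show(truncate=False)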

How to handle missing nested fields in spark?

Submitted by 拥有回忆 on 2020-06-17 09:36:49
Question: Given the two case classes:

case class Response(
  responseField: String
  ...
  items: List[Item])

case class Item(
  itemField: String
  ...)

I am creating a Response dataset:

val dataset = spark.read.format("parquet")
  .load(inputPath)
  .as[Response]
  .map(x => x)

The issue arises when itemField is not present in any of the rows, and Spark raises this error: org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField was not nested I could handle it by doing dataset
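
The excerpt is truncated before any answer. One way this is commonly handled for Parquet is to pass the full expected schema to the reader, so that fields absent from the files come back as nulls instead of raising the AnalysisException; the case classes would then see null values. A PySpark sketch of that idea, with the input path as a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare the complete schema up front, including the optional nested field.
schema = StructType([
    StructField("responseField", StringType(), True),
    StructField("items", ArrayType(StructType([
        StructField("itemField", StringType(), True),
    ])), True),
])

# Fields listed here but missing from the Parquet files are returned as nulls.
dataset = spark.read.format("parquet").schema(schema).load("/path/to/input")
dataset.printSchema()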

Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21
Question: Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE into <table> using (
  select * from <table1>
  when matched then update...
  DELETE WHERE...
  when not matched then insert...
)

Answer 1: It does, with Delta Lake as the storage format: df.write.format("delta").save("/data/events").

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
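
The quoted answer uses the Scala DeltaTable API; Delta Lake also exposes the same operation as a SQL MERGE INTO statement once its extensions are enabled on the session. A small self-contained PySpark sketch of that route is below; the /tmp/events path and the sample rows are placeholders, and it assumes the Delta Lake package is on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Target table written as Delta, plus an "updates" view for the USING clause.
spark.createDataFrame([(1, "old")], ["eventId", "data"]) \
    .write.format("delta").mode("overwrite").save("/tmp/events")
spark.createDataFrame([(1, "new"), (2, "brand new")], ["eventId", "data"]) \
    .createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO delta.`/tmp/events` AS events
    USING updates
    ON events.eventId = updates.eventId
    WHEN MATCHED THEN UPDATE SET events.data = updates.data
    WHEN NOT MATCHED THEN INSERT *
""")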

How to pass variables in spark SQL, using python?

Submitted by 妖精的绣舞 on 2020-06-11 17:14:33
Question: I am writing Spark code in Python. How do I pass a variable in a spark.sql query?

q25 = 500
Q1 = spark.sql("SELECT col1 from table where col2>500 limit $q25 , 1")

Currently the above code does not work. How do we pass variables? I have also tried:

Q1 = spark.sql("SELECT col1 from table where col2>500 limit q25='{}' , 1".format(q25))

Answer 1: You need to remove the single quotes and q25 from the string formatting, like this:

Q1 = spark.sql("SELECT col1 from table where col2>500 limit {}, 1".format(q25))
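
For completeness, a self-contained sketch of the accepted fix; the my_table view and its columns are stand-ins for the question's table, which is not shown.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the question's table.
spark.range(1000).selectExpr("id AS col1", "id AS col2").createOrReplaceTempView("my_table")

q25 = 500
# The variable is interpolated into the SQL text before Spark parses it;
# str.format (as in the answer) or an f-string both work.
Q1 = spark.sql("SELECT col1 FROM my_table WHERE col2 > 500 LIMIT {}".format(q25))
Q1.show()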

How to detect when a pattern changes in a pyspark dataframe column

Submitted by 血红的双手。 on 2020-06-11 10:39:08
Question: I have a dataframe like below:

+-------------------+--------+-----------+
|DateTime           |UID.    |result     |
+-------------------+--------+-----------+
|2020-02-29 11:42:34|0000111D|30         |
|2020-02-30 11:47:34|0000111D|30         |
|2020-02-30 11:48:34|0000111D|30         |
|2020-02-30 11:49:34|0000111D|30         |
|2020-02-30 11:50:34|0000111D|30         |
|2020-02-25 11:50:34|0000111D|29         |
|2020-02-25 11:50:35|0000111D|29         |
|2020-02-26 11:52:35|0000111D|29         |
|2020-02-27 11:52:35|0000111D|29         |
|2020-02-28 11:52:35|0000111D|29         |
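
The excerpt stops at the sample data, but a typical way to flag the rows where result changes is a window with lag over each UID, ordered by time. A minimal sketch under that assumption, with the sample rows abbreviated:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2020-02-25 11:50:34", "0000111D", 29),
     ("2020-02-26 11:52:35", "0000111D", 29),
     ("2020-02-28 11:52:35", "0000111D", 29),
     ("2020-02-29 11:42:34", "0000111D", 30),
     ("2020-03-01 11:47:34", "0000111D", 30)],
    ["DateTime", "UID", "result"],
)

# Compare each row's result with the previous one per UID, ordered by time;
# a non-null, different previous value marks the point where the pattern changes.
w = Window.partitionBy("UID").orderBy("DateTime")
flagged = (
    df.withColumn("prev_result", F.lag("result").over(w))
      .withColumn("changed",
                  F.col("prev_result").isNotNull()
                  & (F.col("result") != F.col("prev_result")))
)
flagged.show(truncate=False)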