apache-spark-sql

Spark scala dataframe: Merging multiple columns into single column

Submitted by 烈酒焚心 on 2020-06-23 01:42:47
Question: I have a Spark dataframe which looks something like below:

+---+------+----+
| id|animal|talk|
+---+------+----+
|  1|   bat|done|
|  2| mouse|mone|
|  3| horse| gun|
|  4| horse|some|
+---+------+----+

I want to generate a new column, say merged, which would look something like

+---+-----------------------------------------------------------+
| id| merged columns                                             |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}]    |
| 2| [
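
The excerpt is cut off, but the goal appears to be collapsing the non-id columns into an array of name/value structs. The question asks for Scala, but the DataFrame API is analogous across languages; below is a minimal PySpark sketch of one way to do it, with the column names taken from the excerpt and everything else assumed.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "bat", "done"), (2, "mouse", "mone"), (3, "horse", "gun"), (4, "horse", "some")],
    ["id", "animal", "talk"],
)

# Build one {name, value} struct per source column, then wrap them all in an array.
merged = df.select(
    "id",
    F.array(*[
        F.struct(F.lit(c).alias("name"), F.col(c).alias("value"))
        for c in ["animal", "talk"]
    ]).alias("merged"),
)
merged.show(truncate=False)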

Spark - Reading JSON from Partitioned Folders using Firehose

Submitted by 二次信任 on 2020-06-22 11:50:52
Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)...great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' to the dataframe reader? My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming dataframe
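
The excerpt ends before any answer, but one common approach is to point the JSON reader at a glob over the YYYY/MM/DD/HH layers; the streaming variant then reuses the inferred schema. A rough PySpark sketch follows; the bucket and prefix are placeholders, not paths from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "s3a://my-firehose-bucket/events"  # placeholder location

# Static DataFrame over every leaf JSON file under the hourly partitions.
static_df = spark.read.json(base + "/*/*/*/*/")

# Streaming variant over the same layout; file-source streams need an explicit schema.
streaming_df = (
    spark.readStream
    .schema(static_df.schema)
    .json(base + "/*/*/*/*/")
)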

Create PySpark dataframe: sequence of months with year

Submitted by ☆樱花仙子☆ on 2020-06-17 13:02:06
Question: Complete newbie here. I would like to create a dataframe using PySpark that will list month and year, taking the current date and listing x number of lines. If I decide x=5, the dataframe should look as below:

Calendar_Entry
August 2019
September 2019
October 2019
November 2019
December 2019

Answer 1: Spark is not a tool for generating rows in a distributed way but rather for processing data that is already distributed. Since your data is small anyway, the best solution is probably to create the data
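
Following the answer's suggestion to build the small dataset locally and hand it to Spark, a minimal PySpark sketch might look like the following (the x=5 and the output formatting come from the question; the month arithmetic is my own).

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

x = 5  # number of calendar entries to generate
today = date.today()

rows = []
for i in range(x):
    m = today.month - 1 + i                      # zero-based month offset from today
    entry = date(today.year + m // 12, m % 12 + 1, 1)
    rows.append((entry.strftime("%B %Y"),))      # e.g. "August 2019"

calendar_df = spark.createDataFrame(rows, ["Calendar_Entry"])
calendar_df.show(truncate=False)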

pyspark sql Add different Qtr start_date, End_date for exploded rows

Submitted by 流过昼夜 on 2020-06-17 09:41:51
Question: I have a dataframe which has start_date, end_date, sales_target. I have added code to identify the number of quarters between the date range, and am accordingly able to split the sales_target across that number of quarters using a UDF.

df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])

+----------+----------+------------+
|start_date| end_date |sales_target|
+---------
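
The UDF mentioned in the excerpt is not shown; as a point of comparison, the quarter count and an even split of the target can also be derived with built-in functions. This is only a sketch under that assumption, reusing the sample data from the question; target_per_qtr is a name introduced here for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2020-01-01", "2020-12-31", "15"),
     ("2020-04-01", "2020-12-31", "11"),
     ("2020-07-01", "2020-12-31", "3")],
    ["start_date", "end_date", "sales_target"],
)

# Count the quarters spanned by the range, then split the target evenly across them.
result = (
    df.withColumn("start_date", F.to_date("start_date"))
      .withColumn("end_date", F.to_date("end_date"))
      .withColumn("noq",
                  (F.year("end_date") - F.year("start_date")) * 4
                  + F.quarter("end_date") - F.quarter("start_date") + 1)
      .withColumn("target_per_qtr",
                  F.round(F.col("sales_target").cast("double") / F.col("noq"), 2))
)
result.show()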

How to explode an array without duplicate records

Submitted by 為{幸葍}努か on 2020-06-17 09:38:06
Question: This is a continuation of the question in pyspark sql Add different Qtr start_date, End_date for exploded rows. Thanks. I have the following dataframe which has an array list as a column.

+---------------+------------+----------+----------+---+---------+-----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt    |new_edate |
+---------------+------------+----------+----------+---+---------+-----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4
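
The dataframe in the excerpt is cut off, but a usual way to explode several parallel arrays without producing a cross product of duplicates is to zip them first and explode once (arrays_zip, Spark 2.4+). The sketch below assumes that shape; the sample row and the cf_value/qtr_start_date output names are illustrative, not taken from the original.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A011021", 15, 4,
      [4, 4, 4, 3],
      ["2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01"])],
    ["customer_number", "sales_target", "noq", "cf_values", "new_sdt"],
)

# Zip the parallel arrays and explode the zipped column once:
# one output row per array position instead of one per combination.
exploded = (
    df.withColumn("z", F.explode(F.arrays_zip("cf_values", "new_sdt")))
      .select("customer_number", "sales_target", "noq",
              F.col("z.cf_values").alias("cf_value"),
              F.col("z.new_sdt").alias("qtr_start_date"))
)
exploded.show(truncate=False)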

How to handle missing nested fields in spark?

Submitted by 拥有回忆 on 2020-06-17 09:36:49
Question: Given the two case classes:

case class Response(
  responseField: String
  ...
  items: List[Item])

case class Item(
  itemField: String
  ...)

I am creating a Response dataset:

val dataset = spark.read.format("parquet")
  .load(inputPath)
  .as[Response]
  .map(x => x)

The issue arises when itemField is not present in any of the rows, and Spark raises this error: org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField was not nested I could handle it by doing dataset
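
The excerpt is truncated before any answer. One way this is commonly handled for Parquet is to pass the full expected schema to the reader, so that fields absent from the files come back as nulls instead of raising the AnalysisException; the case classes would then see null values. A PySpark sketch of that idea, with the input path as a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare the complete schema up front, including the optional nested field.
schema = StructType([
    StructField("responseField", StringType(), True),
    StructField("items", ArrayType(StructType([
        StructField("itemField", StringType(), True),
    ])), True),
])

# Fields listed here but missing from the Parquet files are returned as nulls.
dataset = spark.read.format("parquet").schema(schema).load("/path/to/input")
dataset.printSchema()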

Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21
Question: Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE into <table> using (
  select * from <table1>
  when matched then update...
  DELETE WHERE...
  when not matched then insert...
)

Answer 1: It does, with Delta Lake as the storage format: df.write.format("delta").save("/data/events").

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
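
The quoted answer uses the Scala DeltaTable API; Delta Lake also exposes the same operation as a SQL MERGE INTO statement once its extensions are enabled on the session. A small self-contained PySpark sketch of that route is below; the /tmp/events path and the sample rows are placeholders, and it assumes the Delta Lake package is on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Target table written as Delta, plus an "updates" view for the USING clause.
spark.createDataFrame([(1, "old")], ["eventId", "data"]) \
    .write.format("delta").mode("overwrite").save("/tmp/events")
spark.createDataFrame([(1, "new"), (2, "brand new")], ["eventId", "data"]) \
    .createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO delta.`/tmp/events` AS events
    USING updates
    ON events.eventId = updates.eventId
    WHEN MATCHED THEN UPDATE SET events.data = updates.data
    WHEN NOT MATCHED THEN INSERT *
""")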

How to pass variables in spark SQL, using python?

Submitted by 妖精的绣舞 on 2020-06-11 17:14:33
Question: I am writing Spark code in Python. How do I pass a variable in a spark.sql query?

q25 = 500
Q1 = spark.sql("SELECT col1 from table where col2>500 limit $q25 , 1")

Currently the above code does not work. How do we pass variables? I have also tried:

Q1 = spark.sql("SELECT col1 from table where col2>500 limit q25='{}' , 1".format(q25))

Answer 1: You need to remove the single quotes and q25 from the string formatting, like this:

Q1 = spark.sql("SELECT col1 from table where col2>500 limit {}, 1".format(q25))
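
For completeness, a self-contained sketch of the accepted fix; the my_table view and its columns are stand-ins for the question's table, which is not shown.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the question's table.
spark.range(1000).selectExpr("id AS col1", "id AS col2").createOrReplaceTempView("my_table")

q25 = 500
# The variable is interpolated into the SQL text before Spark parses it;
# str.format (as in the answer) or an f-string both work.
Q1 = spark.sql("SELECT col1 FROM my_table WHERE col2 > 500 LIMIT {}".format(q25))
Q1.show()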

How to detect when a pattern changes in a pyspark dataframe column

Submitted by 血红的双手。 on 2020-06-11 10:39:08
Question: I have a dataframe like below:

+-------------------+--------+-----------+
|DateTime           |UID.    |result     |
+-------------------+--------+-----------+
|2020-02-29 11:42:34|0000111D|30         |
|2020-02-30 11:47:34|0000111D|30         |
|2020-02-30 11:48:34|0000111D|30         |
|2020-02-30 11:49:34|0000111D|30         |
|2020-02-30 11:50:34|0000111D|30         |
|2020-02-25 11:50:34|0000111D|29         |
|2020-02-25 11:50:35|0000111D|29         |
|2020-02-26 11:52:35|0000111D|29         |
|2020-02-27 11:52:35|0000111D|29         |
|2020-02-28 11:52:35|0000111D|29         |
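
The excerpt stops at the sample data, but a typical way to flag the rows where result changes is a window with lag over each UID, ordered by time. A minimal sketch under that assumption, with the sample rows abbreviated:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2020-02-25 11:50:34", "0000111D", 29),
     ("2020-02-26 11:52:35", "0000111D", 29),
     ("2020-02-28 11:52:35", "0000111D", 29),
     ("2020-02-29 11:42:34", "0000111D", 30),
     ("2020-03-01 11:47:34", "0000111D", 30)],
    ["DateTime", "UID", "result"],
)

# Compare each row's result with the previous one per UID, ordered by time;
# a non-null, different previous value marks the point where the pattern changes.
w = Window.partitionBy("UID").orderBy("DateTime")
flagged = (
    df.withColumn("prev_result", F.lag("result").over(w))
      .withColumn("changed",
                  F.col("prev_result").isNotNull()
                  & (F.col("result") != F.col("prev_result")))
)
flagged.show(truncate=False)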