apache-spark

How to extract column name and column type from SQL in pyspark

孤街醉人 submitted on 2021-02-08 10:01:53
Question: The Spark SQL syntax for a CREATE query is:

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. I want the output as a tuple of
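A short sketch of one way to get those (column_name, column_type) tuples: pull the column-definition list out of the CREATE TABLE text with a regular expression. This is an illustration rather than the thread's answer, and it assumes the definitions sit in the first parenthesised group with no nested parentheses such as DECIMAL(10, 2).

import re

def extract_columns(create_sql):
    # grab everything between the first "(" and the ") USING" that closes
    # the column list, then split into "name type [COMMENT ...]" pieces
    match = re.search(r"\((.*?)\)\s*USING", create_sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    columns = []
    for col_def in match.group(1).split(","):
        parts = col_def.strip().split()
        if len(parts) >= 2:
            columns.append((parts[0], parts[1]))
    return columns

ddl = "CREATE TABLE IF NOT EXISTS db.events (id INT, name STRING COMMENT 'x') USING parquet"
print(extract_columns(ddl))   # [('id', 'INT'), ('name', 'STRING')]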

Calculate UDF once

匆匆过客 submitted on 2021-02-08 10:00:12
Question: I want to have a UUID column in a PySpark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column. Here's what I'm trying to do:

>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50
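A hedged sketch of one common workaround (not necessarily this thread's accepted answer): persist the dataframe right after the non-deterministic UDF runs, so later selections reuse the materialized UUIDs instead of re-evaluating the lambda. For a stronger guarantee than cache(), checkpointing or writing the dataframe out and reading it back works the same way.

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())

a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
a = a.withColumn('id', uuid_udf()).cache()  # keep the generated UUIDs around
a.count()                                   # force evaluation so the cache is populated

b = a.select('col1', 'id')                  # reads the cached rows, same UUIDs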

Serialize table to nested JSON using Apache Spark

白昼怎懂夜的黑 submitted on 2021-02-08 09:47:09
Question: I have a set of records like the following sample:

+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|

I want to parse it into JSON, and it should look like this:

{
  "ACCOUNTNO": 10003014,
  "VEHICLE": [
    { "VEHICLENUMBER": "MH43AJ411", "CUSTOMERID": 20000000 },
    { "VEHICLENUMBER": "MH43AJ411", "CUSTOMERID": 20000001 }
  ],
  "ACCOUNTNO": 10003015,
  "VEHICLE": [
    { "VEHICLENUMBER": "MH12GZ3392"
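A minimal PySpark sketch of the usual approach to this shape (assumed here, since the excerpt is cut off): group by ACCOUNTNO, collect the per-account rows into an array of structs, and serialize each grouped row with toJSON.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(10003014, "MH43AJ411", 20000000),
     (10003014, "MH43AJ411", 20000001),
     (10003015, "MH12GZ3392", 20000002)],
    ["ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID"])

nested = (df.groupBy("ACCOUNTNO")
            .agg(F.collect_list(F.struct("VEHICLENUMBER", "CUSTOMERID")).alias("VEHICLE")))

for line in nested.toJSON().collect():
    print(line)
# {"ACCOUNTNO":10003014,"VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000}, ...]}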

duplicating records between date gaps within a selected time interval in a PySpark dataframe

守給你的承諾、 submitted on 2021-02-08 09:45:10
Question: I have a PySpark dataframe that keeps track of changes that occur in a product's price and status over months. This means that a new row is created only when a change occurred (in either status or price) compared to the previous month, like in the dummy data below:

----------------------------------------
|product_id| status    | price| month  |
----------------------------------------
|1         | available | 5    | 2019-10|
----------------------------------------
|1         | available | 8    | 2020-08|
------------
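A hedged sketch of one way to fill those gaps (an assumed approach; the excerpt ends before the asker's attempt): look ahead to the next change month with a window, build the run of months in between with sequence() (Spark 2.4+), and explode it so every intermediate month gets a copy of the row.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "available", 5, "2019-10"),
     (1, "available", 8, "2020-08")],
    ["product_id", "status", "price", "month"]
).withColumn("month", F.to_date("month", "yyyy-MM"))

w = Window.partitionBy("product_id").orderBy("month")

filled = (df
    .withColumn("next_change", F.lead("month").over(w))
    # last month to duplicate up to: the month before the next change,
    # or the row's own month when there is no later change
    .withColumn("fill_until", F.coalesce(F.add_months("next_change", -1), F.col("month")))
    .withColumn("month", F.explode(F.expr("sequence(month, fill_until, interval 1 month)")))
    .drop("next_change", "fill_until"))

filled.orderBy("product_id", "month").show()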

INSERT & UPDATE MySql table using PySpark DataFrames and JDBC

故事扮演 submitted on 2021-02-08 09:36:06
Question: I'm trying to insert and update some data in MySQL using PySpark SQL DataFrames and a JDBC connection. I've succeeded in inserting new data using SaveMode.Append. Is there a way to update the existing data and insert new data into a MySQL table from PySpark SQL? My code to insert is:

myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)

If I change to SaveMode.Overwrite it deletes the full table and creates a new one; I'm looking for something like the "ON DUPLICATE
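A hedged sketch of an upsert path that the built-in JDBC writer does not provide: push each partition's rows through a plain MySQL connection and let INSERT ... ON DUPLICATE KEY UPDATE do the merge. The host, credentials, table, and column names below are placeholders, pymysql is just one driver choice, and the driver has to be available on the executors.

import pymysql

def upsert_partition(rows):
    conn = pymysql.connect(host="mysql-host", user="user",
                           password="secret", database="mydb")
    sql = ("INSERT INTO my_table (id, value) VALUES (%s, %s) "
           "ON DUPLICATE KEY UPDATE value = VALUES(value)")
    try:
        with conn.cursor() as cur:
            cur.executemany(sql, [(row["id"], row["value"]) for row in rows])
        conn.commit()
    finally:
        conn.close()

# myDataFrame is the dataframe from the question above
myDataFrame.foreachPartition(upsert_partition)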

Spark Dataframe stat throwing Task not serializable

大城市里の小女人 submitted on 2021-02-08 09:24:06
Question: What am I trying to do? (Context) I'm trying to calculate some stats for a dataframe/dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See docu

import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession

object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
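The excerpt above is Scala; for reference, here is a minimal PySpark sketch of the same stat call, DataFrame.approxQuantile reached through the stat functions, run on a made-up in-memory column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

flights = spark.createDataFrame([(120,), (250,), (90,), (410,)], ["air_time"])

# median and 90th percentile with 1% relative error
quantiles = flights.stat.approxQuantile("air_time", [0.5, 0.9], 0.01)
print(quantiles)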

How to compute the numerical difference between columns of different dataframes?

你。 submitted on 2021-02-08 09:16:09
Question: Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it in another dataframe (or, optionally, another data structure). For instance, let us have the following datasets:

DataFrame A:
+----+---+
| A  | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:
+----+---+
| A  | B |
+----+---+
|  1 | 0 |
|  0 | 0 |
+----+---+

How to obtain B-A, i.e.

+----+---+
| c1 | c2|
+----+---+
|  0 | 0 |
| -1 | 0 |
+----+
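A hedged sketch of one way to line the two frames up (the excerpt stops before any answer): give each dataframe a positional index, join on it, and subtract column by column. Ordering by monotonically_increasing_id only tracks the original row order reliably for small or single-partition data, so treat that indexing step as an assumption.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame([(1, 0), (1, 0)], ["A", "B"])
df_b = spark.createDataFrame([(1, 0), (0, 0)], ["A", "B"])

def with_index(df):
    # attach a 1-based row position so the two frames can be joined row-for-row
    w = Window.orderBy(F.monotonically_increasing_id())
    return df.withColumn("_idx", F.row_number().over(w))

diff = (with_index(df_b).alias("b")
        .join(with_index(df_a).alias("a"), "_idx")
        .select((F.col("b.A") - F.col("a.A")).alias("c1"),
                (F.col("b.B") - F.col("a.B")).alias("c2")))

diff.show()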

How to add a Map column to Spark dataset?

做~自己de王妃 submitted on 2021-02-08 09:15:43
Question: I have a Java Map variable, say Map<String, String> singleColMap. I want to add this Map variable to a dataset as a new column value in Spark 2.2 (Java 1.8). I tried the code below but it is not working:

ds.withColumn("cMap", lit(singleColMap).cast(MapType(StringType, StringType)))

Can someone help with this?

Answer 1: You can use typedLit, which was introduced in Spark 2.2.0; from the documentation: The difference between this function and lit is that this function can handle parameterized scala
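typedLit in the answer above is the Scala/Java API; for completeness, here is a hedged PySpark sketch of the equivalent idea, building a literal map column from a plain Python dict with create_map (the dict contents and column names are made up).

from itertools import chain

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

single_col_map = {"Your Phone": "XXXX", "Your Email": "xxx@example.com"}

ds = spark.createDataFrame([(1,)], ["id"])
ds = ds.withColumn(
    "cMap",
    # flatten the dict into alternating key/value literals for create_map
    F.create_map(*[F.lit(x) for x in chain(*single_col_map.items())]))

ds.show(truncate=False)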
