apache-spark

How to extract column name and column type from SQL in pyspark

孤街醉人 submitted on 2021-02-08 10:01:53
Question: The Spark SQL syntax for a CREATE query is:

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. I want the output as a tuple of
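A short sketch of one way to get those (column_name, column_type) tuples: pull the column-definition list out of the CREATE TABLE text with a regular expression. This is an illustration rather than the thread's answer, and it assumes the definitions sit in the first parenthesised group with no nested parentheses such as DECIMAL(10, 2).

import re

def extract_columns(create_sql):
    # grab everything between the first "(" and the ") USING" that closes
    # the column list, then split into "name type [COMMENT ...]" pieces
    match = re.search(r"\((.*?)\)\s*USING", create_sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    columns = []
    for col_def in match.group(1).split(","):
        parts = col_def.strip().split()
        if len(parts) >= 2:
            columns.append((parts[0], parts[1]))
    return columns

ddl = "CREATE TABLE IF NOT EXISTS db.events (id INT, name STRING COMMENT 'x') USING parquet"
print(extract_columns(ddl))   # [('id', 'INT'), ('name', 'STRING')]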

Calculate UDF once

匆匆过客 submitted on 2021-02-08 10:00:12
Question: I want to have a UUID column in a PySpark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column. Here's what I'm trying to do:

>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50
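A hedged sketch of one common workaround (not necessarily this thread's accepted answer): persist the dataframe right after the non-deterministic UDF runs, so later selections reuse the materialized UUIDs instead of re-evaluating the lambda. For a stronger guarantee than cache(), checkpointing or writing the dataframe out and reading it back works the same way.

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())

a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
a = a.withColumn('id', uuid_udf()).cache()  # keep the generated UUIDs around
a.count()                                   # force evaluation so the cache is populated

b = a.select('col1', 'id')                  # reads the cached rows, same UUIDs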

Serialize table to nested JSON using Apache Spark

白昼怎懂夜的黑 submitted on 2021-02-08 09:47:09
Question: I have a set of records like the following sample:

+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|

I want to parse it into JSON, and it should look like this:

{
  "ACCOUNTNO": 10003014,
  "VEHICLE": [
    { "VEHICLENUMBER": "MH43AJ411", "CUSTOMERID": 20000000 },
    { "VEHICLENUMBER": "MH43AJ411", "CUSTOMERID": 20000001 }
  ],
  "ACCOUNTNO": 10003015,
  "VEHICLE": [
    { "VEHICLENUMBER": "MH12GZ3392"
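A minimal PySpark sketch of the usual approach to this shape (assumed here, since the excerpt is cut off): group by ACCOUNTNO, collect the per-account rows into an array of structs, and serialize each grouped row with toJSON.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(10003014, "MH43AJ411", 20000000),
     (10003014, "MH43AJ411", 20000001),
     (10003015, "MH12GZ3392", 20000002)],
    ["ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID"])

nested = (df.groupBy("ACCOUNTNO")
            .agg(F.collect_list(F.struct("VEHICLENUMBER", "CUSTOMERID")).alias("VEHICLE")))

for line in nested.toJSON().collect():
    print(line)
# {"ACCOUNTNO":10003014,"VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000}, ...]}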

duplicating records between date gaps within a selected time interval in a PySpark dataframe

守給你的承諾、 submitted on 2021-02-08 09:45:10
Question: I have a PySpark dataframe that keeps track of changes that occur in a product's price and status over months. This means that a new row is created only when a change occurred (in either status or price) compared to the previous month, like in the dummy data below:

----------------------------------------
|product_id| status    | price| month  |
----------------------------------------
|1         | available | 5    | 2019-10|
----------------------------------------
|1         | available | 8    | 2020-08|
------------
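A hedged sketch of one way to fill those gaps (an assumed approach; the excerpt ends before the asker's attempt): look ahead to the next change month with a window, build the run of months in between with sequence() (Spark 2.4+), and explode it so every intermediate month gets a copy of the row.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "available", 5, "2019-10"),
     (1, "available", 8, "2020-08")],
    ["product_id", "status", "price", "month"]
).withColumn("month", F.to_date("month", "yyyy-MM"))

w = Window.partitionBy("product_id").orderBy("month")

filled = (df
    .withColumn("next_change", F.lead("month").over(w))
    # last month to duplicate up to: the month before the next change,
    # or the row's own month when there is no later change
    .withColumn("fill_until", F.coalesce(F.add_months("next_change", -1), F.col("month")))
    .withColumn("month", F.explode(F.expr("sequence(month, fill_until, interval 1 month)")))
    .drop("next_change", "fill_until"))

filled.orderBy("product_id", "month").show()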

INSERT & UPDATE MySql table using PySpark DataFrames and JDBC

故事扮演 submitted on 2021-02-08 09:36:06
Question: I'm trying to insert and update some data in MySQL using PySpark SQL DataFrames and a JDBC connection. I've succeeded in inserting new data using SaveMode.Append. Is there a way to update the existing data and insert new data into a MySQL table from PySpark SQL? My code to insert is:

myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)

If I change to SaveMode.Overwrite it deletes the full table and creates a new one; I'm looking for something like the "ON DUPLICATE
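A hedged sketch of an upsert path that the built-in JDBC writer does not provide: push each partition's rows through a plain MySQL connection and let INSERT ... ON DUPLICATE KEY UPDATE do the merge. The host, credentials, table, and column names below are placeholders, pymysql is just one driver choice, and the driver has to be available on the executors.

import pymysql

def upsert_partition(rows):
    conn = pymysql.connect(host="mysql-host", user="user",
                           password="secret", database="mydb")
    sql = ("INSERT INTO my_table (id, value) VALUES (%s, %s) "
           "ON DUPLICATE KEY UPDATE value = VALUES(value)")
    try:
        with conn.cursor() as cur:
            cur.executemany(sql, [(row["id"], row["value"]) for row in rows])
        conn.commit()
    finally:
        conn.close()

# myDataFrame is the dataframe from the question above
myDataFrame.foreachPartition(upsert_partition)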

Spark Dataframe stat throwing Task not serializable

大城市里の小女人 submitted on 2021-02-08 09:24:06
Question: What am I trying to do? (Context) I'm trying to calculate some stats for a dataframe/dataset in Spark that is read from a directory of .parquet files about US flights between 2013 and 2015. To be more specific, I'm using the approxQuantile method in DataFrameStatFunctions, which can be accessed by calling the stat method on a Dataset. See docu

import airportCaseStudy.model.Flight
import org.apache.spark.sql.SparkSession

object CaseStudy {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
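The excerpt above is Scala; for reference, here is a minimal PySpark sketch of the same stat call, DataFrame.approxQuantile reached through the stat functions, run on a made-up in-memory column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

flights = spark.createDataFrame([(120,), (250,), (90,), (410,)], ["air_time"])

# median and 90th percentile with 1% relative error
quantiles = flights.stat.approxQuantile("air_time", [0.5, 0.9], 0.01)
print(quantiles)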

How to compute the numerical difference between columns of different dataframes?

你。 submitted on 2021-02-08 09:16:09
Question: Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it in another dataframe (or, optionally, another data structure). For instance, let us have the following datasets:

DataFrame A:
+----+---+
| A  | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:
+----+---+
| A  | B |
+----+---+
|  1 | 0 |
|  0 | 0 |
+----+---+

How to obtain B-A, i.e.

+----+---+
| c1 | c2|
+----+---+
|  0 | 0 |
| -1 | 0 |
+----+
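A hedged sketch of one way to line the two frames up (the excerpt stops before any answer): give each dataframe a positional index, join on it, and subtract column by column. Ordering by monotonically_increasing_id only tracks the original row order reliably for small or single-partition data, so treat that indexing step as an assumption.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame([(1, 0), (1, 0)], ["A", "B"])
df_b = spark.createDataFrame([(1, 0), (0, 0)], ["A", "B"])

def with_index(df):
    # attach a 1-based row position so the two frames can be joined row-for-row
    w = Window.orderBy(F.monotonically_increasing_id())
    return df.withColumn("_idx", F.row_number().over(w))

diff = (with_index(df_b).alias("b")
        .join(with_index(df_a).alias("a"), "_idx")
        .select((F.col("b.A") - F.col("a.A")).alias("c1"),
                (F.col("b.B") - F.col("a.B")).alias("c2")))

diff.show()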

How to add a Map column to Spark dataset?

做~自己de王妃 submitted on 2021-02-08 09:15:43
Question: I have a Java Map variable, say Map<String, String> singleColMap. I want to add this Map variable to a dataset as a new column value in Spark 2.2 (Java 1.8). I tried the code below but it is not working:

ds.withColumn("cMap", lit(singleColMap).cast(MapType(StringType, StringType)))

Can someone help with this?

Answer 1: You can use typedLit, which was introduced in Spark 2.2.0; from the documentation: The difference between this function and lit is that this function can handle parameterized scala
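typedLit in the answer above is the Scala/Java API; for completeness, here is a hedged PySpark sketch of the equivalent idea, building a literal map column from a plain Python dict with create_map (the dict contents and column names are made up).

from itertools import chain

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

single_col_map = {"Your Phone": "XXXX", "Your Email": "xxx@example.com"}

ds = spark.createDataFrame([(1,)], ["id"])
ds = ds.withColumn(
    "cMap",
    # flatten the dict into alternating key/value literals for create_map
    F.create_map(*[F.lit(x) for x in chain(*single_col_map.items())]))

ds.show(truncate=False)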
