Save Spark dataframe as dynamically partitioned table in Hive

I have a sample application working to read from CSV files into a dataframe. The dataframe can be saved to a Hive table in parquet format using the method df.saveAsTable(tablename, mode). How can it be saved as a dynamically partitioned table instead?
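
For reference, a minimal sketch of the write path the question describes (the file name, partition column, and table name below are illustrative placeholders, not from the question):

    # Hedged sketch: read a CSV into a dataframe and save it as a
    # partitioned Hive parquet table. All names here are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()

    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # partitionBy() writes one directory per distinct value of partition_col;
    # mode("append") adds new partitions, while mode("overwrite") typically
    # replaces the whole table rather than just the touched partitions.
    df.write.mode("append") \
        .format("parquet") \
        .partitionBy("partition_col") \
        .saveAsTable("mytable")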

6 Answers
  •  一向 (OP)
    2020-12-02 09:56

    This worked for me using Python and Spark 2.1.0.

    Not sure if it's the best way to do this, but it works:

    # WRITE DATA INTO A HIVE TABLE
    import pyspark
    from pyspark.sql import SparkSession
    
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()
    
    ### CREATE HIVE TABLE (with one row)
    spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_df (col1 INT, col2 STRING, partition_bin INT)
    USING HIVE OPTIONS(fileFormat 'PARQUET')
    PARTITIONED BY (partition_bin)
    LOCATION 'hive_df'
    """)
    spark.sql("""
    INSERT INTO hive_df PARTITION (partition_bin = 0)
    VALUES (0, 'init_record')
    """)
    ###
    
    ### CREATE NON HIVE TABLE (with one row)
    spark.sql("""
    CREATE TABLE IF NOT EXISTS non_hive_df (col1 INT, col2 STRING, partition_bin INT)
    USING PARQUET
    PARTITIONED BY (partition_bin)
    LOCATION 'non_hive_df'
    """)
    spark.sql("""
    INSERT INTO non_hive_df PARTITION (partition_bin = 0)
    VALUES (0, 'init_record')
    """)
    ###
    
    ### ATTEMPT DYNAMIC OVERWRITE WITH EACH TABLE
    spark.sql("""
    INSERT OVERWRITE TABLE hive_df PARTITION (partition_bin)
    VALUES (0, 'new_record', 1)
    """)
    spark.sql("""
    INSERT OVERWRITE TABLE non_hive_df PARTITION (partition_bin)
    VALUES (0, 'new_record', 1)
    """)
    
    spark.sql("SELECT * FROM hive_df").show() # 2 row dynamic overwrite
    spark.sql("SELECT * FROM non_hive_df").show() # 1 row full table overwrite
    
