Save Spark dataframe as dynamically partitioned table in Hive

I have a sample application working to read from CSV files into a dataframe. The dataframe can be saved to a Hive table in parquet format using the method df.saveAsTable(tablename, mode). How can it be saved as a dynamically partitioned table instead?
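
For reference, a minimal sketch of the write path the question describes (the file name, partition column, and table name below are illustrative placeholders, not from the question):

    # Hedged sketch: read a CSV into a dataframe and save it as a
    # partitioned Hive parquet table. All names here are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()

    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # partitionBy() writes one directory per distinct value of partition_col;
    # mode("append") adds new partitions, while mode("overwrite") typically
    # replaces the whole table rather than just the touched partitions.
    df.write.mode("append") \
        .format("parquet") \
        .partitionBy("partition_col") \
        .saveAsTable("mytable")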

6 Answers
  •  一向 (OP)
    2020-12-02 09:56

    This worked for me using Python and Spark 2.1.0.

    Not sure if it's the best way to do this, but it works:

    # WRITE DATA INTO A HIVE TABLE
    import pyspark
    from pyspark.sql import SparkSession
    
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()
    
    ### CREATE HIVE TABLE (with one row)
    spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_df (col1 INT, col2 STRING, partition_bin INT)
    USING HIVE OPTIONS(fileFormat 'PARQUET')
    PARTITIONED BY (partition_bin)
    LOCATION 'hive_df'
    """)
    spark.sql("""
    INSERT INTO hive_df PARTITION (partition_bin = 0)
    VALUES (0, 'init_record')
    """)
    ###
    
    ### CREATE NON HIVE TABLE (with one row)
    spark.sql("""
    CREATE TABLE IF NOT EXISTS non_hive_df (col1 INT, col2 STRING, partition_bin INT)
    USING PARQUET
    PARTITIONED BY (partition_bin)
    LOCATION 'non_hive_df'
    """)
    spark.sql("""
    INSERT INTO non_hive_df PARTITION (partition_bin = 0)
    VALUES (0, 'init_record')
    """)
    ###
    
    ### ATTEMPT DYNAMIC OVERWRITE WITH EACH TABLE
    spark.sql("""
    INSERT OVERWRITE TABLE hive_df PARTITION (partition_bin)
    VALUES (0, 'new_record', 1)
    """)
    spark.sql("""
    INSERT OVERWRITE TABLE non_hive_df PARTITION (partition_bin)
    VALUES (0, 'new_record', 1)
    """)
    
    spark.sql("SELECT * FROM hive_df").show() # 2 row dynamic overwrite
    spark.sql("SELECT * FROM non_hive_df").show() # 1 row full table overwrite
    
