Question
I have a large (~10 million lines) .tsv file with two columns, 'id' and 'group'. The 'group' column is actually a list of all the groups a given id belongs to, so the file looks like this:
id1 group1,group2
id2 group2,group3,group4
id3 group1
...
I need to upload it to a Hive table using pyspark, but I want to split the group column so that there is only one group per row, so the resulting table looks like this:
id1 group1
id1 group2
id2 group2
id2 group3
id2 group4
id3 group1
I have tried reading the lines one by one, using Python's split() to split the columns, creating a Spark DataFrame for each row, and merging it into the result on every iteration. My code works, but it is extremely inefficient, as it takes 2 minutes to process 1000 lines. My code is below:
from pyspark.sql import DataFrameWriter
from pyspark.sql.types import StructField, StructType, StringType

fields = [StructField('user_id', StringType(), True), StructField('group_id', StringType(), True)]
membership_schema = StructType(fields)
result_df = sqlContext.createDataFrame(sc.emptyRDD(), membership_schema)

with open('file.tsv', 'r') as f:
    for line in f:
        parts = line.split()
        id_part = parts[0]
        audience_parts = parts[1].split(',')
        for item in audience_parts:
            newRow = sqlContext.createDataFrame([(id_part, item)], membership_schema)
            result_df = result_df.union(newRow)

df_writer = DataFrameWriter(result_df)
df_writer.insertInto("my_table_in_hive")
Is there an easier and more efficient way to upload the whole file into the table without iterating through the lines?
Thanks for the help.
Answer 1:
I had a look at the plan for the above code, and it seems it scans a lot and gives you no parallelism with Spark. You can use Spark's native methods to read the file data into more partitions and control how the data is distributed uniformly across them.
from pyspark.sql.functions import explode

df = sc.textFile(file_path, 10) \
    .map(lambda x: x.split()) \
    .map(lambda x: (x[0], x[1].split(","))) \
    .toDF(['id', 'group'])

newdf = df.withColumn("group", explode(df.group))
newdf.write.format("orc").option("header", "true").mode("overwrite").saveAsTable('db.yourHivetable')
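If you would rather stay in the DataFrame API than go through an RDD, the same pipeline can be expressed with spark.read plus split and explode. This is only a minimal sketch, assuming a SparkSession named spark (Spark 2.x) and reusing the file_path and db.yourHivetable names from above:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import split, explode, col

# two string columns: the id and the comma-joined group list
schema = StructType([StructField("id", StringType(), True),
                     StructField("groups", StringType(), True)])

# read the whole TSV in parallel instead of looping over lines in Python
raw = spark.read.csv(file_path, sep="\t", schema=schema)

# split the group list on commas and emit one row per (id, group) pair
exploded = raw.select("id", explode(split(col("groups"), ",")).alias("group"))

exploded.write.format("orc").mode("overwrite").saveAsTable("db.yourHivetable")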
Further, you can increase or decrease the size of the partitions going into the explode, or control your shuffle partitions:
spark.conf.set("spark.sql.files.maxPartitionBytes","30")
spark.conf.set("spark.sql.shuffle.partitions", "100")
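If you want tighter control over how many tasks actually perform the write, you can also repartition the exploded DataFrame explicitly before saving; the partition count below is just a placeholder to illustrate the call, not a tuned value:

# check how many partitions the exploded DataFrame currently has
print(newdf.rdd.getNumPartitions())

# repartition explicitly so the write runs with a known degree of parallelism
newdf.repartition(100).write.format("orc").mode("overwrite").saveAsTable('db.yourHivetable')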
Source: https://stackoverflow.com/questions/57411831/how-to-efficiently-upload-a-large-tsv-file-to-a-hive-table-with-split-columns-i