How to efficiently upload a large .tsv file to a Hive table with split columns in pyspark?

Submitted by 我怕爱的太早我们不能终老 on 2020-01-15 07:59:29

Question


I have a large (~10 million line) .tsv file with two columns, 'id' and 'group'. The 'group' column is actually a list of all the groups a given id belongs to, so the file looks like this:

id1     group1,group2
id2     group2,group3,group4
id3     group1
...

I need to upload it to a Hive table using pyspark, but I want to split the group column so that there is only one group per row, so the resulting table looks like this:

id1    group1
id1    group2
id2    group2
id2    group3
id2    group4
id3    group1

I have tried reading the lines one by one, using Python's split() to split the columns, creating a Spark dataframe for each row, and merging it on every iteration. My code works, but it is extremely inefficient: it takes 2 minutes to process 1000 lines. My code is below:

from pyspark.sql import DataFrameWriter
from pyspark.sql.types import StructType, StructField, StringType

# schema for the target table: one (user_id, group_id) pair per row
fields = [StructField('user_id', StringType(), True), StructField('group_id', StringType(), True)]
membership_schema = StructType(fields)

# start from an empty dataframe and union one new row onto it per group
result_df = sqlContext.createDataFrame(sc.emptyRDD(), membership_schema)

with open('file.tsv', 'r') as f:
    for line in f:
        parts = line.split()
        id_part = parts[0]
        audience_parts = parts[1].split(',')
        for item in audience_parts:
            # build a one-row dataframe and append it to the result
            newRow = sqlContext.createDataFrame([(id_part, item)], membership_schema)
            result_df = result_df.union(newRow)

df_writer = DataFrameWriter(result_df)
df_writer.insertInto("my_table_in_hive")

Is there an easier and more efficient way to upload the whole file into the table without iterating through the lines?

Thanks for the help.


Answer 1:


I had a look at the plan for the above code: it does a lot of scanning and gives you no parallelism with Spark. Instead, you can use Spark's native methods to read the file into multiple partitions and control how the data is distributed across them.

from pyspark.sql.functions import explode

# read the file into 10 partitions; split each line into (id, [group1, group2, ...])
df = sc.textFile(file_path, 10).map(lambda x: x.split()).map(lambda x: (x[0], x[1].split(","))).toDF(['id', 'group'])

# explode produces one output row per (id, group) pair
newdf = df.withColumn("group", explode(df.group))

newdf.write.format("orc").option("header", "true").mode("overwrite").saveAsTable('db.yourHivetable')

Further, you can increase or decrease the size of the partitions going into the explode, and control the number of shuffle partitions:

spark.conf.set("spark.sql.files.maxPartitionBytes","30")
spark.conf.set("spark.sql.shuffle.partitions", "100")


Source: https://stackoverflow.com/questions/57411831/how-to-efficiently-upload-a-large-tsv-file-to-a-hive-table-with-split-columns-i
