Updating values in an Apache Parquet file

醉话见心 2021-02-07 08:54

I have a fairly large Parquet file in which I need to change the values of one column. One way to do this would be to update those values in the source text files and recreate parqu

4 answers

南旧 (original poster)

2021-02-07 09:11

This blog post addresses your question and describes a way to perform updates using Spark (Scala):

http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

Quoted from the blog:

The problem arises when we need to edit data in our data structures (Parquet), which are immutable.

You can add partitions to Parquet files, but you can’t edit the data in place.

But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.

If you want to incrementally append data to Parquet (you didn't ask about this, but it may be useful to other readers), refer to this well-written blog post:

http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

Disclaimer: I didn't write these blog posts; I just read them and thought they might be useful to others.
