Updating values in an Apache Parquet file

醉话见心 2021-02-07 08:54

I have quite a hefty Parquet file in which I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate parqu

4 answers
  •  南旧 (original poster)
    2021-02-07 09:11

    Take a look at this blog post, which answers your question and describes a method for performing updates using Spark (Scala):

    http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

    Copied from the blog:

    When we need to edit the data, our data structures (Parquet) are immutable.

    You can add partitions to Parquet files, but you can’t edit the data in place.

    But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
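
    The rewrite-instead-of-edit approach quoted above could be sketched in Spark (Scala) roughly as follows. This is only an illustration under assumptions: the paths, the column name `country`, and the correction rule (trim and upper-case) are hypothetical placeholders, not taken from the blog.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object RewriteParquetColumn {
  // Hypothetical correction rule: trim whitespace and upper-case the value.
  // Nulls are passed through unchanged.
  private val fixValue = udf((s: String) => if (s == null) null else s.trim.toUpperCase)

  /** Returns a copy of `df` in which `column` holds the corrected values. */
  def correctColumn(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, fixValue(col(column)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-parquet-column")
      .master("local[*]")
      .getOrCreate()

    // Assumed locations; replace with real paths.
    val inputPath  = "/data/events.parquet"
    val outputPath = "/data/events_fixed.parquet"

    val original  = spark.read.parquet(inputPath)
    val corrected = correctColumn(original, "country")

    // Write to a new location: Spark refuses to overwrite a path
    // that the same query is still reading from.
    corrected.write.mode("overwrite").parquet(outputPath)

    spark.stop()
  }
}
```

    Once the new files have been written and checked, the old directory can be replaced by the new one. Nothing is modified in place at any point.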

    If you want to incrementally append data to Parquet (you didn't ask about this, but it may be useful for other readers), refer to this well-written blog post:

    http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
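
    For the append case, a minimal sketch in Spark (Scala) might look like the following. The object and method names are hypothetical; the only Spark feature relied on is the `Append` save mode, which adds new files to an existing Parquet directory and leaves the existing files untouched.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object AppendParquetBatch {
  /** Appends `batch` to the Parquet dataset stored under `datasetPath`.
    * The batch is expected to have a schema compatible with the existing data.
    */
  def appendBatch(batch: DataFrame, datasetPath: String): Unit =
    batch.write.mode(SaveMode.Append).parquet(datasetPath)
}
```

    Reading the directory back with `spark.read.parquet(datasetPath)` then returns the rows of every batch written so far.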

    Disclaimer: I didn't write those blog posts; I just read them and thought they might be useful to others.
