I have a quite hefty Parquet file where I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate the Parquet file.
Take a look at this blog post, which answers your question and shows a method to perform updates using Spark (Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Quoted from the blog:

There is a problem when we need to edit the data in our data structures (Parquet), which are immutable. You can add partitions to Parquet files, but you can't edit the data in place.

Ultimately, we can mutate the data; we just need to accept that we won't be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
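As a minimal Spark (Scala) sketch of that recreate-the-files approach: read the original Parquet data, apply a UDF to the affected column, and write new files to a different path. The paths, the `status` column, and the `fixValue` rule below are all placeholders for illustration, not details from the question or the blog.

```scala
// Sketch: recreate Parquet files with one corrected column.
// Paths, column name, and correction logic are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object FixParquetColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fix-parquet-column")
      .getOrCreate()

    // UDF that corrects the bad values; replace with your own rule.
    val fixValue = udf((v: String) => if (v == "N/A") null else v)

    spark.read.parquet("/data/events")                  // original, immutable files
      .withColumn("status", fixValue(col("status")))    // corrected column
      .write
      .mode("overwrite")
      .parquet("/data/events_fixed")                    // new Parquet files

    spark.stop()
  }
}
```

Note that the sketch writes to a new location rather than back over the source path: Spark reads lazily, so overwriting the directory you are still reading from can destroy the input mid-job. Write to a fresh path, verify the output, then swap or delete the old files.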
If you want to incrementally append data to Parquet (you didn't ask about this, but it may be useful for other readers), the same blog post covers that as well.
Disclaimer: I didn't write that blog; I just read it and thought it might be useful for others.