Question
I'm using the following code to create a ParquetWriter and to write records to it:
ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<GenericRecord>(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
final GenericRecord record = new GenericData.Record(avroSchema);
parquetWriter.write(record);
But this only creates new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the ParquetWriter is not feasible in my case.
Answer 1:
There is a Spark SaveMode called Append: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html which I believe solves your problem.
Example of use:
df.write.mode('append').parquet('parquet_data_file')
Answer 2:
Parquet is a columnar file format; it is optimized for writing all columns together. Any edit requires rewriting the whole file.
From Wikipedia:
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
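The layout above can be sketched in plain Python (a minimal illustration, not Parquet's actual on-disk format): serializing the example table column by column reproduces those four lines, and splicing in a single new row changes every serialized column, which is why a columnar file cannot simply be appended to in place.

```python
# Rows from the Wikipedia example table: (RowId, EmpId, Lastname, Firstname, Salary)
rows = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]

def serialize_column_oriented(rows):
    """Serialize all values of one column together, then the next column, and so on."""
    row_ids = [r[0] for r in rows]
    columns = zip(*[r[1:] for r in rows])  # transpose rows into columns
    return [
        ",".join(f"{value}:{rid}" for value, rid in zip(col, row_ids)) + ";"
        for col in columns
    ]

def append_row(serialized, new_row):
    """Add one row: note that EVERY serialized column line must be rewritten."""
    rid = new_row[0]
    return [
        line[:-1] + f",{value}:{rid};"
        for line, value in zip(serialized, new_row[1:])
    ]

for line in serialize_column_oriented(rows):
    print(line)
```

Appending `("005", 33, "Brown", "Ann", 60000)` with `append_row` touches all four column lines, whereas a row-oriented layout would only need one new line at the end of the file.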
Some links
https://en.wikipedia.org/wiki/Column-oriented_DBMS
https://parquet.apache.org/
Source: https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file