How to read and write Parquet files efficiently?


Question


I am working on a utility that reads multiple Parquet files at a time and writes them into one single output file. The implementation is very straightforward: the utility reads the Parquet files from a directory, reads the Groups from all of the files and puts them into a list, then uses ParquetWriter to write all of these Groups into a single file.

After reading about 600 MB it throws an out-of-memory error for Java heap space. It also takes 15-20 minutes to read and write 500 MB of data.

Is there a way to make this operation more efficient?

The read method looks like this:

ParquetFileReader reader = new ParquetFileReader(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetMetadata readFooter = reader.getFooter();
MessageType schema = readFooter.getFileMetaData().getSchema();
ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
reader.close();
PageReadStore pages = null;
try {
    // Read the file one row group at a time and materialize every record as a Group.
    while (null != (pages = r.readNextRowGroup())) {
        long rows = pages.getRowCount();
        System.out.println("Number of rows: " + rows);

        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        for (int i = 0; i < rows; i++) {
            Group g = recordReader.read();
            //printGroup(g);
            groups.add(g); // every record from every file is accumulated in memory
        }
    }
} finally {
    System.out.println("close the reader");
    r.close();
}

The write method looks like this:

for (Path file : files) {
    groups.addAll(readData(file));
}

System.out.println("Number of groups from the parquet files " + groups.size());

Configuration configuration = new Configuration();
Map<String, String> meta = new HashMap<String, String>();
meta.put("startkey", "1");
meta.put("endkey", "2");
GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path(outputFile),
        new GroupWriteSupport(),
        CompressionCodecName.SNAPPY,
        2147483647,   // block (row group) size: ~2 GB
        268435456,    // page size: 256 MB
        134217728,    // dictionary page size: 128 MB
        true,         // enable dictionary encoding
        false,        // disable validation
        ParquetProperties.WriterVersion.PARQUET_2_0,
        configuration);
System.out.println("Number of groups to write: " + groups.size());
for (Group g : groups) {
    writer.write(g);
}
writer.close();

Answer 1:


I use these functions to merge Parquet files, but they are in Scala. Anyway, they may give you a good starting point.

import java.util

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageType

import scala.collection.JavaConverters._

object ParquetFileMerger {
    // Merge complete files: the footer metadata is merged and the row groups are appended as-is.
    def mergeFiles(inputFiles: Seq[Path], outputFile: Path): Unit = {
        val conf = new Configuration()
        val mergedMeta = ParquetFileWriter.mergeMetadataFiles(inputFiles.asJava, conf).getFileMetaData
        val writer = new ParquetFileWriter(conf, mergedMeta.getSchema, outputFile, ParquetFileWriter.Mode.OVERWRITE)

        writer.start()
        inputFiles.foreach(input => writer.appendFile(HadoopInputFile.fromPath(input, conf)))
        writer.end(mergedMeta.getKeyValueMetaData)
    }

    // Merge at the row-group (block) level; the schemas of the inputs are unioned.
    def mergeBlocks(inputFiles: Seq[Path], outputFile: Path): Unit = {
        val conf = new Configuration()
        val parquetFileReaders = inputFiles.map(getParquetFileReader)
        val mergedSchema: MessageType =
            parquetFileReaders.
              map(_.getFooter.getFileMetaData.getSchema).
              reduce((a, b) => a.union(b))

        val writer = new ParquetFileWriter(HadoopOutputFile.fromPath(outputFile, conf), mergedSchema, ParquetFileWriter.Mode.OVERWRITE, 64*1024*1024, 8388608)

        writer.start()
        parquetFileReaders.foreach(_.appendTo(writer))
        writer.end(new util.HashMap[String, String]())
    }

    def getParquetFileReader(file: Path): ParquetFileReader = {
        ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
    }
}
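
Since the question's code is in Java, the first of these functions translates roughly to the sketch below (untested, with illustrative class and method names; instead of merging metadata files it simply takes the schema from the first input and assumes all inputs share it). appendFile copies row groups as-is, without decoding individual records:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

public class JavaParquetFileMerger {

    // Appends whole files: row groups are copied without decoding records.
    // Assumes all input files share the same schema (taken from the first file).
    public static void mergeFiles(List<Path> inputFiles, Path outputFile) throws IOException {
        Configuration conf = new Configuration();
        MessageType schema;
        try (ParquetFileReader first =
                 ParquetFileReader.open(HadoopInputFile.fromPath(inputFiles.get(0), conf))) {
            schema = first.getFooter().getFileMetaData().getSchema();
        }

        ParquetFileWriter writer = new ParquetFileWriter(
                HadoopOutputFile.fromPath(outputFile, conf),
                schema,
                ParquetFileWriter.Mode.OVERWRITE,
                64 * 1024 * 1024,   // row group size hint
                8 * 1024 * 1024);   // max padding

        writer.start();
        for (Path input : inputFiles) {
            writer.appendFile(HadoopInputFile.fromPath(input, conf));
        }
        writer.end(Collections.<String, String>emptyMap());
    }
}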




Answer 2:


What you are trying to achieve is already possible using the merge command of parquet-tools. However, it is not recommended for merging small files, since it doesn't actually merge the row groups, it only places them one after another (exactly as you describe in your question). The resulting file will probably have bad performance characteristics.

If you would like to implement it yourself nevertheless, you can either increase the heap size, or modify the code so that it does not read all of the files into memory before writing the new file, but instead reads them one by one (or, even better, row group by row group) and immediately writes them to the new file. This way you only need to keep a single file or row group in memory; a minimal sketch of this approach follows below.
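
A minimal sketch of that streaming approach, reusing the same classes and the same deprecated constructors the question already uses (untested, with illustrative names; it assumes every input file shares one schema, which is passed in by the caller):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class StreamingParquetMerger {

    // Copies records file by file, row group by row group, writing each record
    // immediately instead of buffering everything in a list. At most one row
    // group of Groups is alive at any time.
    public static void merge(List<Path> files, Path outputFile, MessageType schema,
                             Configuration conf) throws IOException {
        GroupWriteSupport.setSchema(schema, conf);
        ParquetWriter<Group> writer = new ParquetWriter<Group>(
                outputFile,
                new GroupWriteSupport(),
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                true,
                false,
                ParquetProperties.WriterVersion.PARQUET_2_0,
                conf);
        try {
            for (Path file : files) {
                ParquetFileReader reader =
                        new ParquetFileReader(conf, file, ParquetMetadataConverter.NO_FILTER);
                try {
                    MessageType fileSchema = reader.getFooter().getFileMetaData().getSchema();
                    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(fileSchema);
                    PageReadStore pages;
                    while ((pages = reader.readNextRowGroup()) != null) {
                        RecordReader<Group> recordReader =
                                columnIO.getRecordReader(pages, new GroupRecordConverter(fileSchema));
                        long rows = pages.getRowCount();
                        for (long i = 0; i < rows; i++) {
                            writer.write(recordReader.read()); // no intermediate list
                        }
                    }
                } finally {
                    reader.close();
                }
            }
        } finally {
            writer.close();
        }
    }
}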




Answer 3:


I faced the very same problem. With files that are not very big (up to 100 megabytes), the writing time could be up to 20 minutes. Try using the kite-sdk API. I know it looks abandoned, but some things in it are done very efficiently. Also, if you like Spring, you can try spring-data-hadoop (which is a kind of wrapper over the kite-sdk API). In my case, using these libraries reduced the writing time to 2 minutes.

For example, you can write to Parquet (using spring-data-hadoop, but writing with the kite-sdk API looks quite similar) in this manner:

final DatasetRepositoryFactory repositoryFactory = new DatasetRepositoryFactory();
repositoryFactory.setBasePath(basePath);
repositoryFactory.setConf(configuration);
repositoryFactory.setNamespace("my-parquet-file");

DatasetDefinition datasetDefinition = new DatasetDefinition(targetClass, true, Formats.PARQUET.getName());
try (DataStoreWriter<T> writer = new ParquetDatasetStoreWriter<>(targetClass, repositoryFactory, datasetDefinition)) {
    for (T record : records) {
        writer.write(record);
    }
    writer.flush();
}

Of course, you will need to add some dependencies to your project (in my case, spring-data-hadoop):

    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop-boot</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop-config</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop-store</artifactId>
        <version>${spring.hadoop.version}</version>
    </dependency>

If you absolutely want to do it using only the native Hadoop API, it will in any case be useful to take a look at the source code of these libraries to see how to write Parquet files efficiently.




Answer 4:


I have implemented a solution using Spark with a PySpark script. Below is sample code for the same: it loads multiple Parquet files from a directory location, and if the files' schemas differ slightly, it merges them as well.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("App_name") \
        .getOrCreate()

# mergeSchema reconciles small schema differences between the input files
dataset_DF = spark.read.option("mergeSchema", "true").load("/dir/parquet_files/")

dataset_DF.write.parquet("file_name.parquet")

I hope this short solution helps.



Source: https://stackoverflow.com/questions/51328393/how-to-read-and-write-parquet-files-efficiently
