Apache Spark on HDFS: read 10k-100k of small files at once


Question


I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:

// return a list of paths to small files
List<String> paths = getAllPaths(); 
// read up to 100000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);

Problem

The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files. It takes 38 seconds to read 490 small files, and 266 seconds to read 3420 files. I suppose it would take a very long time to read 100,000 files.

Questions

Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?

Will HAR or sequence files slow down persisting those small files? Why?

P.S.

Batch reading is the only operation required for these small files; I don't need to read them by id or anything else.


Answer 1:


From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?

wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file


Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html

RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.


Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala

A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
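A minimal sketch of what that looks like with the Java API; the HDFS path and the partition hint below are placeholder assumptions, not values from the question:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class WholeTextFilesSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("small-files").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Each record is (filePath, entireFileContent); the underlying
        // CombineFileInputFormat packs many small files into each partition.
        // The second argument is only a hint for the minimum number of partitions.
        JavaPairRDD<String, String> files =
            jsc.wholeTextFiles("hdfs:///path/to/small-files-dir", 32);

        System.out.println("files read: " + files.count());
        spark.stop();
    }
}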


For the record, Hadoop CombineInputFormat is the standard way to pack multiple small files into a single Mapper; it can be used in Hive via the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.

Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory; you can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required (a sketch follows below)
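For drawback (b), the post-processing is typically a flatMap over the (path, content) pairs. A hedged sketch, continuing from the wholeTextFiles example above and assuming each small file holds newline-separated records (that record format is an assumption):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Split every file's content into individual records.
// Uses the Spark 2.x flatMap signature (the lambda returns an Iterator);
// in Spark 1.6 the function returns an Iterable instead.
JavaRDD<String> records = files.flatMap(
    pathAndContent -> Arrays.asList(pathAndContent._2().split("\n")).iterator());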

That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing


Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
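A hedged sketch of that route, using the stock Hadoop CombineTextInputFormat instead of a hand-written reader and reusing jsc from the sketch above; the path and the 128 MB split cap are arbitrary example values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

Configuration hadoopConf = new Configuration();
// Cap each combined split so that thousands of 10-50 KB files
// are grouped into a manageable number of partitions.
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

// Key = byte offset within a file, Value = one line of text;
// many small files are combined into each split/partition.
JavaPairRDD<LongWritable, Text> combined = jsc.newAPIHadoopFile(
    "hdfs:///path/to/small-files-dir",
    CombineTextInputFormat.class,
    LongWritable.class,
    Text.class,
    hadoopConf);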

Source: https://stackoverflow.com/questions/43895728/apache-spark-on-hdfs-read-10k-100k-of-small-files-at-once
