Apache Spark on HDFS: read 10k-100k of small files at once


Question


I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:

// return a list of paths to small files
List<String> paths = getAllPaths(); 
// read up to 100000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);

Problem

The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files. It takes 38 seconds to read 490 small files, and 266 seconds to read 3420 files. I suppose it would take a very long time to read 100,000 files.

Questions

Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?

Will HAR or sequence files slow down persisting those small files? Why?

P.S.

Batch reading is the only operation required for these small files; I don't need to read them by id or anything else.


Answer 1:


From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?

wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file


Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html

RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.


Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala

A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
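A minimal sketch of what that looks like with the Java API; the HDFS path and the partition hint below are placeholder assumptions, not values from the question:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class WholeTextFilesSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("small-files").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Each record is (filePath, entireFileContent); the underlying
        // CombineFileInputFormat packs many small files into each partition.
        // The second argument is only a hint for the minimum number of partitions.
        JavaPairRDD<String, String> files =
            jsc.wholeTextFiles("hdfs:///path/to/small-files-dir", 32);

        System.out.println("files read: " + files.count());
        spark.stop();
    }
}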


For the record, Hadoop CombineInputFormat is the standard way to pack multiple small files into a single Mapper; it can be used in Hive via the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.

Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory; you can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required (a sketch follows below)
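For drawback (b), the post-processing is typically a flatMap over the (path, content) pairs. A hedged sketch, continuing from the wholeTextFiles example above and assuming each small file holds newline-separated records (that record format is an assumption):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Split every file's content into individual records.
// Uses the Spark 2.x flatMap signature (the lambda returns an Iterator);
// in Spark 1.6 the function returns an Iterable instead.
JavaRDD<String> records = files.flatMap(
    pathAndContent -> Arrays.asList(pathAndContent._2().split("\n")).iterator());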

That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing


Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
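A hedged sketch of that route, using the stock Hadoop CombineTextInputFormat instead of a hand-written reader and reusing jsc from the sketch above; the path and the 128 MB split cap are arbitrary example values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

Configuration hadoopConf = new Configuration();
// Cap each combined split so that thousands of 10-50 KB files
// are grouped into a manageable number of partitions.
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

// Key = byte offset within a file, Value = one line of text;
// many small files are combined into each split/partition.
JavaPairRDD<LongWritable, Text> combined = jsc.newAPIHadoopFile(
    "hdfs:///path/to/small-files-dir",
    CombineTextInputFormat.class,
    LongWritable.class,
    Text.class,
    hadoopConf);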

Source: https://stackoverflow.com/questions/43895728/apache-spark-on-hdfs-read-10k-100k-of-small-files-at-once
