Spark lists all leaf node even in partitioned data

前端 未结 2 1856
盖世英雄少女心
盖世英雄少女心 2020-12-01 16:31

I have parquet data partitioned by date & hour, folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-0         


        
2条回答
  •  忘掉有多难
    2020-12-01 17:04

    To clarify Gaurav's answer, that code snipped is from Hadoop branch-2, Probably not going to surface until Hadoop 2.9 (see HADOOP-13208); and someone needs to update Spark to use that feature (which won't harm code using HDFS, just won't show any speedup there).

    One thing to consider is: what makes a good file layout for Object Stores.

    • Don't have deep directory trees with only a few files per directory
    • Do have shallow trees with many files
    • Consider using the first few characters of a file for the most changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones ... if you give your names more uniqueness then they get spread out over more servers, with better bandwidth/less risk of throttling.
    • If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.

    Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.

提交回复
热议问题