Control data locality in Impala by partitioning

Posted by 那年仲夏 on 2019-12-04 12:53:58

The slides you mention ("Co-located block replicas") describe an HDFS feature, HDFS-2576, implemented in Hadoop 2.1. It provides a Java API that lets a client give HDFS hints about where blocks should be placed.

As of 2014 Impala does not use it, but it looks like groundwork for that: it would give Impala the performance equivalent of specifying a distribution key in a traditional MPP database.
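To make the distribution-key comparison concrete, here is a minimal toy model in Python. It is not HDFS or Impala code: the function names, the sample tables, and the CRC32-based placement rule are all assumptions made for illustration. It only shows that when two tables are placed by the same key, every join match is local to one node.

```python
import zlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Map a distribution key to a node with a stable hash (hypothetical rule)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_of, num_nodes):
    """Group rows by the node that owns their distribution key."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for_key(key_of(row), num_nodes)].append(row)
    return dict(placement)


def remote_matches(left, right, left_key, right_key):
    """Count join matches whose two rows are stored on different nodes."""
    remote = 0
    for left_node, left_rows in left.items():
        for right_node, right_rows in right.items():
            if left_node == right_node:
                continue  # same node: the match needs no network transfer
            for l_row in left_rows:
                for r_row in right_rows:
                    if left_key(l_row) == right_key(r_row):
                        remote += 1
    return remote


# Hypothetical sample data: (customer_id, name) and (order_id, customer_id).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Di")]
orders = [(101, 1), (102, 1), (103, 3), (104, 4)]

# Both tables are distributed on customer_id across 4 nodes.
customers_by_node = distribute(customers, lambda c: c[0], 4)
orders_by_node = distribute(orders, lambda o: o[1], 4)

colocated_remote = remote_matches(
    customers_by_node, orders_by_node, lambda c: c[0], lambda o: o[1]
)
print(colocated_remote)  # 0: equal keys always hash to the same node
```

Without control over placement, matching rows can land on different nodes and have to be shipped over the network at query time, which is the cost a distribution key (or co-located block replicas) is meant to avoid.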

No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck, especially for a 250GB table that is joined to itself. These are exactly the kinds of problems that Hadoop was designed to solve. Partitioning data creates sub-directories in HDFS (the namenode tracks them as metadata), and the data blocks under those directories are still replicated across the datanodes in the cluster.
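As a small illustration of the point about partition sub-directories, the sketch below builds the directory path that a Hive-style partitioned table uses for one partition, following the `column=value` naming convention. The helper name, the table root, and the partition columns are hypothetical.

```python
from posixpath import join


def partition_path(table_root, partition_values):
    """Build the HDFS directory for one partition, one column=value level per key."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return join(table_root, *parts)


# Hypothetical table location and partition columns.
path = partition_path("/user/hive/warehouse/sales", [("year", 2014), ("month", 6)])
print(path)  # /user/hive/warehouse/sales/year=2014/month=6
```

Nothing in this path names a datanode. The partition key selects a directory only, and HDFS still decides where the blocks of the files in that directory and their replicas are stored.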
