Question
Imagine I do some Spark operations on a file hosted in HDFS. Something like this:
val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...
Because in the Hadoop world the code should go where the data is, right?
So my question is: how do Spark workers know about the HDFS DataNodes? How does Spark decide on which DataNodes to execute the code?
Answer 1:
Spark reuses the Hadoop input classes: when you call textFile, it creates a TextInputFormat, whose getSplits method carves the input into splits (a split roughly corresponds to a partition, i.e. an HDFS block), and each InputSplit exposes getLocations and getLocationInfo methods that report the hosts storing that block. Spark's scheduler then uses those hosts as locality preferences when deciding where to launch each task.
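Here is a minimal sketch of that mechanism, using Hadoop's newer mapreduce API (Spark's textFile actually goes through the older mapred variant of TextInputFormat, but the calls are analogous); the HDFS URI, hostname, and file path are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.collection.JavaConverters._

val job = Job.getInstance(new Configuration())
// hypothetical input path; point this at a real HDFS file
FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/data/input.tsv"))

// one InputSplit per HDFS block (roughly); each split knows where its replicas live
val splits = new TextInputFormat().getSplits(job).asScala
for (split <- splits) {
  // getLocations returns the hostnames of the DataNodes holding this block's replicas
  println(s"$split -> ${split.getLocations.mkString(", ")}")
}

On the Spark side, HadoopRDD surfaces those hostnames as each partition's preferred locations, and the scheduler tries to run the corresponding task on (or near) one of those nodes.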
Source: https://stackoverflow.com/questions/28481693/how-does-apache-spark-know-about-hdfs-data-nodes