How does Apache Spark know about HDFS data nodes?

Posted by 爱⌒轻易说出口 on 2019-12-30 02:45:54

Question


Imagine I do some Spark operations on a file hosted in HDFS. Something like this:

val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...

After all, in the Hadoop world the code is supposed to go to where the data is, right?

So my question is: how do Spark workers find out about the HDFS DataNodes? How does Spark decide on which nodes to execute the code?


Answer 1:


Spark reuses Hadoop's input-format classes: when you call textFile, it creates a TextInputFormat, whose getSplits method produces the input splits (a split roughly corresponds to a partition or an HDFS block). Each InputSplit then exposes getLocations and getLocationInfo methods, which report the hosts that hold that block's data. Spark's scheduler uses these hosts as preferred locations when assigning tasks, so tasks tend to run on (or near) the DataNodes storing their block.
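You can see the same locality information Spark consumes by calling the Hadoop `mapred` API directly. A minimal sketch (the `hdfs://namenode:8020/data/input.tsv` path is a placeholder for your own file, and this needs the Hadoop client libraries on the classpath plus a reachable HDFS cluster):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

object ShowSplitLocations {
  def main(args: Array[String]): Unit = {
    val conf = new JobConf()
    // Hypothetical input path; replace with a real HDFS file.
    FileInputFormat.setInputPaths(conf, new Path("hdfs://namenode:8020/data/input.tsv"))

    // The same InputFormat that sc.textFile uses under the hood.
    val format = new TextInputFormat()
    format.configure(conf)

    // getSplits: one split per HDFS block (roughly); the second arg is only a hint.
    val splits = format.getSplits(conf, 1)
    for (split <- splits) {
      // getLocations returns the hostnames of the DataNodes holding this block.
      println(s"$split -> hosts: ${split.getLocations.mkString(", ")}")
    }
  }
}
```

For each split you should see the hostnames of the DataNodes that store the corresponding block replicas; these are exactly the preferred locations Spark's scheduler tries to honor (NODE_LOCAL placement) when it launches the tasks for your `textFile` RDD.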



Source: https://stackoverflow.com/questions/28481693/how-does-apache-spark-know-about-hdfs-data-nodes
