Apache Spark Dataset API: head(n:Int) vs take(n:Int)

Submitted by 老子叫甜甜 on 2020-08-23 03:45:46

Question


The Apache Spark Dataset API has two methods, head(n:Int) and take(n:Int).

The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n) 

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?


Answer 1:


I experimented and found that head(n) and take(n) produce exactly the same output. Both return the rows as a list of Row objects.

DF.head(2)

[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]

DF.take(2)

[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]




Answer 2:


In my view, the reason is that the Apache Spark Dataset API tries to mimic the pandas DataFrame API, which has a head method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.




Answer 3:


  package org.apache.spark.sql
  /* ... */

  def take(n: Int): Array[T] = head(n)
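
To make the delegation in that one-liner concrete, here is a minimal pure-Python sketch of the same alias pattern. ToyDataset is a hypothetical stand-in invented for illustration; it is not Spark's Dataset class and does not reproduce Spark's execution path. It only shows that when take(n) forwards to head(n), the two calls cannot differ in their result.

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class ToyDataset(Generic[T]):
    """Hypothetical stand-in for a Dataset; not part of Spark."""

    def __init__(self, rows: Sequence[T]) -> None:
        self._rows = list(rows)

    def head(self, n: int) -> List[T]:
        # Return at most the first n rows.
        return self._rows[:n]

    def take(self, n: int) -> List[T]:
        # Alias: forwards to head(n), mirroring `def take(n) = head(n)`.
        return self.head(n)


ds = ToyDataset(["row1", "row2", "row3"])
print(ds.head(2))
print(ds.take(2))
```

Because take has no logic of its own, any change to head is automatically reflected in take, which is the usual reason for writing an alias as a forwarding call rather than a second implementation.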



Answer 4:


I think this is because the Spark developers tend to provide a rich API; there are also the two methods where and filter, which do exactly the same thing.



Source: https://stackoverflow.com/questions/45138742/apache-spark-dataset-api-headnint-vs-takenint
