Apache Spark Dataset API: head(n: Int) vs take(n: Int)

Question

The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).

The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?

Answer 1:

I experimented and found that head(n) and take(n) give exactly the same output. Both return the result as Row objects.

DF.head(2)

[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]

DF.take(2)

Answer 2:

In my view, the reason is that the Apache Spark Dataset API tries to mimic the pandas DataFrame API, which has a head method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.

Answer 3:

  package org.apache.spark.sql
  /* ... */

  def take(n: Int): Array[T] = head(n)
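
To make the delegation concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's code: the class name ToyDataset and its list-backed storage are hypothetical and chosen only for illustration. It shows that when one method simply forwards to the other, both calls necessarily return the same rows.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")


class ToyDataset(Generic[T]):
    """Hypothetical stand-in for a Dataset; NOT Spark's implementation."""

    def __init__(self, rows: List[T]) -> None:
        # Copy the input so later changes to the caller's list do not leak in.
        self._rows = list(rows)

    def head(self, n: int) -> List[T]:
        # The single real implementation: return the first n rows.
        return self._rows[:n]

    def take(self, n: int) -> List[T]:
        # Pure alias, mirroring `def take(n: Int): Array[T] = head(n)`.
        return self.head(n)


ds = ToyDataset(["a", "b", "c"])
print(ds.head(2) == ds.take(2))
```

Because take has no logic of its own, any change to head is automatically reflected in take, so the two names cannot drift apart.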

Answer 4:

I think this is because the Spark developers tend to give it a rich API. There are also the two methods where and filter, which do exactly the same thing.

Source: https://stackoverflow.com/questions/45138742/apache-spark-dataset-api-headnint-vs-takenint

Tags

apache-spark

apache-spark-sql

spark-dataframe