Question
The Apache Spark Dataset API has two methods, head(n: Int)
and take(n: Int).
The Dataset.scala source contains
def take(n: Int): Array[T] = head(n)
I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?
Answer 1:
I have experimented and found that head(n) and take(n) give exactly the same output. Both produce the result as Row objects (the examples below are from PySpark).
DF.head(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
DF.take(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
Answer 2:
In my view, the reason is that the Apache Spark Dataset API is trying to mimic the Pandas DataFrame API, which has a head method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.
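To illustrate the convention this answer refers to: pandas' DataFrame.head(n) also returns the first n rows, which is the naming Spark's head mirrors. A minimal sketch, assuming pandas is installed; the sample data is made up, loosely echoing the transactions output above:

```python
import pandas as pd

# Toy data, roughly mirroring the transactions shown in Answer 1.
df = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product3"],
    "Price": [1200, 1200, 3600],
})

# pandas' head(n) returns the first n rows as a new DataFrame,
# the same "give me the top of the data" idiom Spark's head(n) follows.
first_two = df.head(2)
print(first_two)
```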
Answer 3:
package org.apache.spark.sql
/* ... */
def take(n: Int): Array[T] = head(n)
Answer 4:
I think this is because Spark developers tend to give it a rich API; there are also the two methods where
and filter,
which do exactly the same thing.
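The where/filter pair follows the same aliasing pattern as take/head: one method simply delegates to the other, so both names appear in the API at no extra cost. A minimal sketch of that pattern, using a made-up Numbers class (not Spark code):

```python
class Numbers:
    """A toy collection illustrating the alias pattern used in Spark's Dataset API."""

    def __init__(self, xs):
        self.xs = list(xs)

    def filter(self, predicate):
        # The "real" implementation: keep only elements matching the predicate.
        return Numbers(x for x in self.xs if predicate(x))

    def where(self, predicate):
        # `where` is just an alias that delegates to `filter`,
        # the same way Dataset.take(n) delegates to head(n).
        return self.filter(predicate)


ns = Numbers([1, 2, 3, 4])
print(ns.where(lambda x: x > 2).xs)  # identical result to ns.filter(lambda x: x > 2).xs
```

Offering both names lets users pick the vocabulary they know (SQL's WHERE vs. functional filter) while keeping a single implementation to maintain.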
Source: https://stackoverflow.com/questions/45138742/apache-spark-dataset-api-headnint-vs-takenint