Get specific row by using SparkR

Question

I have a dataset "data" in SparkR of type DataFrame. I want to get entry number 50, for example. In R I would simply type data[50,], but when I do this in SparkR I get this message:

"Error: object of type 'S4' is not subsettable"

What can I do to solve this?

Furthermore, how can I add a column (with the same number of rows) to the data?

Answer 1:

The only thing you can do is

all50 <- take(data,50)
row50 <- tail(all50,1)

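A SparkR DataFrame has no row.names, hence you cannot subset it by index. This approach works, but you do not want to use it on big datasets: take collects all 50 rows (or, in general, the first n rows) to the driver as a local R data.frame just to keep the last one.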

The second part of your question is not possible yet either. You can only add columns based on literal values (e.g. a constant column) or by transforming columns that already belong to your DataFrame. This was already asked in "How to do bind two dataframe columns in sparkR?".
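As a minimal sketch of the two cases that do work, the snippet below adds a constant column and a column derived from an existing one. It assumes a Spark 1.5+ SparkR session initialized in the Spark 1.x style and uses mtcars as stand-in data; the column names source and wt_kg are made up for illustration.

```r
library(SparkR)

# Assumption: Spark 1.x style initialization (newer releases use sparkR.session()).
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# mtcars stands in for the asker's `data`
df <- createDataFrame(sqlContext, mtcars)

# Constant column: lit() wraps a literal value as a Column
df2 <- withColumn(df, "source", lit("mtcars"))

# Derived column: a transformation of a column that already belongs to df2
# (wt is in units of 1000 lbs; 453.592 converts that to kilograms)
df2 <- withColumn(df2, "wt_kg", df2$wt * 453.592)

head(select(df2, "wt", "wt_kg", "source"))
```

What this cannot do is attach an arbitrary local R vector or a column from an unrelated DataFrame, which is the limitation described above.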

Answer 2:

Depending on previous transformations, the order of values in RDDs, which are the data containers behind Spark Data Frames, is not guaranteed. Unless you explicitly order your data, for example using orderBy, asking for the nth row is not even meaningful.

If you combine explicit ordering with a little bit of raw SQL, you can select a single row as follows:

sqlContext <- sparkRHive.init(sc)
df <- createDataFrame(sqlContext, mtcars)
registerTempTable(df, "df")

# First, order the data frame and add a row number
df_ordered <- sql(
    sqlContext,
    "SELECT *, row_number() OVER (ORDER BY wt) as rn FROM df")

# This could be done with a nested SQL query, but where is more convenient
head(where(df_ordered, df_ordered$rn == 5))

Please note that window functions require a HiveContext. The default sparkRSQL context you get in a SparkR shell won't work.

It is worth noting that Spark Data Frames (like any RDD) are not designed with random access in mind, and operations like accessing a single value or row are not obvious for a reason. Sorting a large dataset is an expensive process, and without a specific partitioner, extracting a single row may require a scan of the whole RDD.
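For a small n, explicit ordering can also be combined with the take/head idea from the first answer, without a window function. The following is only a sketch, assuming a Spark 1.x style SparkR session and mtcars as stand-in data: sort, fetch the first n rows to the driver, and pick the last one locally. It still sorts the whole dataset and moves n rows to the driver, so it is not suitable for a large n.

```r
library(SparkR)

# Assumption: Spark 1.x style initialization; a plain sparkRSQL context is used here.
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, mtcars)

n <- 5

# arrange() sorts the distributed DataFrame; head() returns the first n rows
# as an ordinary local R data.frame
first_n <- head(arrange(df, df$wt), n)

# Local data.frames support positional indexing
row_n <- first_n[n, ]
row_n
```

If the sort column contains ties, which of the tied rows ends up in position n is not determined, so sort on a key that is unique when that matters.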

Source: https://stackoverflow.com/questions/31676691/get-specific-row-by-using-sparkr

Tags

apache-spark

sparkr