How to make a new DataFrame in sparkR

Question

In SparkR I have data stored as a DataFrame named data. I can pull out the rows that match a single value like this:

newdata <- filter(data, data$column == 1)

How can I match more than just one value?
Say I want all rows whose value is in the vector list <- c(1,6,10,11,14), or in a DataFrame holding the values 1 6 10 11 14.

newdata <- filter(data, data$column == list)

If I do it like this I get an error.

Answer 1:

If you are ultimately trying to filter a Spark DataFrame by a list of unique values, you can do this with a merge operation. If you are talking about going from a long to a wide data format, you need to make sure each 'level' of the factor variable you are considering has the same number of observations.

If you want to subset a Spark DataFrame by columns, you could also use a select statement, or build one up by pasting data$blah into a string and then calling eval(parse(text=bigTextObject)), as @Wannes suggested. In short: a function that generates a big select statement is what you want if you are filtering by column name, and a merge is what you want if you are trying to extract values from a single column.
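The merge idea can be sketched as follows. This is only an illustration, run on a local data.frame with base R's merge so that it works without a Spark session. The names dataLocal, wanted and newdataLocal are made up for the example, and the SparkR variant mentioned in the last comment is an assumption that has not been tested here.

```r
# Local stand-in for the Spark DataFrame (hypothetical example data).
dataLocal <- data.frame(column = c(1, 2, 6, 6, 10, 12, 14), column2 = 1:7)

# One-column lookup table holding the values to keep (taken from the question).
wanted <- data.frame(column = c(1, 6, 10, 11, 14))

# Inner join on the shared column: rows whose value is not in `wanted` are dropped.
newdataLocal <- merge(dataLocal, wanted, by = "column")
print(newdataLocal)

# Assumed SparkR counterpart (untested here): turn `wanted` into a Spark
# DataFrame with createDataFrame(sqlContext, wanted) and join it to `data`
# on `column`.
```

Because the join is an inner join, a value that appears in the lookup table but not in the data (11 here) simply contributes no rows.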

From what I understand, it seems as if you want to take a big Spark DataFrame with lots of columns and only take the ones you are interested in, as indicated by list in your question.

Here is a little snippet that generates the Spark select statement:

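# Names of the columns to keep; each entry must match a column name in data
list <- c(1,2,5,8,90,200)
# Build "data$`1`", "data$`2`", ...; the backticks are needed because data$1 does not parse in R
listWithDataPrePended <- paste0('data$`', list, '`')
gettingCloser <- paste0(listWithDataPrePended, collapse = ',')
finalSelectStatement <- paste("select(data,", gettingCloser, ")")
# Parse and evaluate the generated call, then bring the result into a local data.frame
finalData <- eval(parse(text=finalSelectStatement))
finalData <- SparkR::collect(finalData)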

Maybe this is what you're looking for...maybe not. Nonetheless, I hope it's helpful.

Good luck, nate

Answer 2:

Neither == list nor %in% list (which would make more sense) will work, but you can do it as follows (I have included an example data.frame):

dataLocal <- data.frame(column=c(rep(1,10),rep(2,10),rep(3,10)),column2=1:30)
data      <- createDataFrame(sqlContext,dataLocal)
newdata   <- filter(data, (data$column == 1)|(data$column == 2))

or, more generally (list2 can now be of arbitrary length):

list2 <- c(1,2)
listEquals  <- paste("(data$column == ",list2,")",sep="")
checkEquals <- paste(listEquals,collapse="|")
func  <- paste("filter(data, ",checkEquals,")",sep="")
newdata <- eval(parse(text=func))

Do not forget to run

collect(newdata)

to check the result.
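As a possible alternative to building a string and calling eval(parse(...)), the same OR-ed condition can be folded together with Reduce. The sketch below runs on a local data.frame so that it needs no Spark session; dataLocal, tests, keep and newdataLocal are illustrative names. It is an assumption, not verified here, that the same fold works on SparkR Column objects, in which case the combined condition would be passed to filter(data, ...) just like the hand-written version above.

```r
# Local stand-in for the Spark DataFrame from the example above.
dataLocal <- data.frame(column = c(rep(1, 10), rep(2, 10), rep(3, 10)),
                        column2 = 1:30)
list2 <- c(1, 2)

# One equality test per wanted value ...
tests <- lapply(list2, function(v) dataLocal$column == v)

# ... folded into a single condition with `|`.
keep <- Reduce(`|`, tests)

# Keep only the rows where at least one test is TRUE.
newdataLocal <- dataLocal[keep, ]
print(nrow(newdataLocal))
```

This keeps the values as R objects instead of pasting them into code, so list2 can be of any length without the generated string growing with it.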

Source: https://stackoverflow.com/questions/31743612/how-to-make-a-new-dataframe-in-sparkr

Tags

sparkr