I have a Spark dataframe containing a column named COL whose values are structured like this:
VALUE1###VALUE2
The following code works:
library(sparklyr)
library(tidyr)
library(dplyr)
mParams <- collect(filter(input_DF, TYPE == "MIN"))
mParams <- separate(mParams, COL, c("col1", "col2"), sep = "\\###", remove = FALSE)
If I remove the collect, I get this error:
Error in UseMethod("separate_") :
no applicable method for 'separate_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
Is there an alternative way to achieve this without collecting everything onto my Spark driver?
You can use ft_regex_tokenizer followed by sdf_separate_column.
ft_regex_tokenizer will split a column into a vector (array) type based on a regex, and sdf_separate_column will then split that vector into multiple columns.
mydf %>%
  ft_regex_tokenizer(input_col = "mycolumn", output_col = "mycolumnSplit", pattern = ";") %>%
  sdf_separate_column("mycolumnSplit", into = c("column1", "column2"))
UPDATE: in recent versions of sparklyr, the parameters input.col and output.col have been renamed to input_col and output_col, respectively.
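For the question's data, the two functions combine like this. A minimal sketch, assuming input_DF is a tbl_spark and using COL_split as a hypothetical name for the intermediate array column:

library(sparklyr)
library(dplyr)

# Split COL on the literal "###" delimiter on the Spark side, without
# collecting to the driver. "###" contains no regex metacharacters, so
# it can be used as the pattern directly. to_lower_case = FALSE keeps
# the original case (Spark's RegexTokenizer lowercases by default).
input_DF %>%
  filter(TYPE == "MIN") %>%
  ft_regex_tokenizer(input_col = "COL", output_col = "COL_split",
                     pattern = "###", to_lower_case = FALSE) %>%
  sdf_separate_column("COL_split", into = c("col1", "col2"))

The names in into are assigned in the order of the split values, so col1 receives VALUE1 and col2 receives VALUE2.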
Sparklyr version 0.5 has just been released, and it contains the ft_regex_tokenizer() function that can do that:
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).
library(dplyr)
library(sparklyr)
ft_regex_tokenizer(input_DF, input.col = "COL", output.col = "ResultCols", pattern = '\\###')
The split column "ResultCols" will be a list (array) column.
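To go from that list column to two regular columns, still without collecting, it can be passed to sdf_separate_column as in the other answer. A sketch under the same assumptions (the col1/col2 names are placeholders):

library(sparklyr)

# ResultCols is an array column; sdf_separate_column expands its
# elements into one new column per name in `into`.
tokenized <- ft_regex_tokenizer(input_DF, input.col = "COL",
                                output.col = "ResultCols", pattern = '\\###')
sdf_separate_column(tokenized, "ResultCols", into = c("col1", "col2"))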
Source: https://stackoverflow.com/questions/41810015/sparklyr-separate-one-spark-dataframe-column-into-two-columns