sparklyr: create new column with mutate function

只愿长相守 submitted on 2019-12-06 03:25:23

I would strongly recommend reading the sparklyr documentation before proceeding, in particular the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a limited subset of R functions can be used on sparklyr data frames, and gsub is not among them (toupper is). If you really need gsub, you will have to collect the data into a local data frame, apply gsub there (you can still use mutate), and then copy_to the result back to Spark.
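The collect-then-copy round trip described above might look like the following minimal sketch. It assumes a character column named d (the column used later in this answer); the helper names strip_dashes_locally and round_trip_gsub and the table name "cleaned" are hypothetical. Note that collect() pulls the whole table into the R session, so this only suits data that fits in local memory.

```r
# Local step: this runs in ordinary R, so gsub() works inside mutate() here.
strip_dashes_locally <- function(local_df) {
  dplyr::mutate(local_df, clnD = gsub("-", "", d, fixed = TRUE))
}

# Round trip: Spark -> local R -> Spark.
# `sc` is a connection created by sparklyr::spark_connect();
# `spark_tbl` is a tbl_spark that has a character column `d`.
round_trip_gsub <- function(sc, spark_tbl, name = "cleaned") {
  local_df <- dplyr::collect(spark_tbl)       # pull all rows into local memory
  cleaned  <- strip_dashes_locally(local_df)  # gsub() is evaluated by R, not Spark
  dplyr::copy_to(sc, cleaned, name = name, overwrite = TRUE)  # push result back
}
```

With a live connection this would be called as round_trip_gsub(sc, aDataFrame), which returns a new Spark table reference.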

It may be worth adding that the available documentation states:

Hive Functions

Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.

Hive

As stated in the documentation, a viable solution should be achievable using regexp_replace:

Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

sparklyr approach

Given the above, it should be possible to combine a sparklyr pipeline with regexp_replace to achieve an effect equivalent to applying gsub to the desired column. Tested code that removes the - character from the variable d within sparklyr could be built as follows:

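# regexp_replace is not an R function; it is passed through to Spark SQL
aDataFrame %>%
  mutate(clnD = regexp_replace(d, "-", ""))
  # ... chain further steps here with %>%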

where class(aDataFrame) returns: "tbl_spark" ....
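To see why this works, it can help to look at the SQL that dplyr generates. The sketch below wraps the pipeline in a hypothetical helper, remove_dashes, and renders the query without a cluster by using dbplyr's simulated Hive backend. That backend is an assumption standing in for a real tbl_spark, so the exact SQL text may differ from what sparklyr sends.

```r
library(dplyr)

# The pipeline from the answer, wrapped so it can be applied to any lazy table.
# regexp_replace is unknown to R's translator, so it is left as a function
# call in the generated SQL and evaluated by the SQL engine.
remove_dashes <- function(tbl) {
  tbl %>% mutate(clnD = regexp_replace(d, "-", ""))
}

# Stand-in for a Spark table: a lazy frame on a simulated Hive connection.
fake_tbl <- dbplyr::lazy_frame(d = "2019-12-06", con = dbplyr::simulate_hive())

# Render the SQL as a plain string and print it for inspection.
generated_sql <- as.character(dbplyr::sql_render(remove_dashes(fake_tbl)))
print(generated_sql)
```

Against a real connection, remove_dashes(aDataFrame) returns another lazy Spark table; show_query() prints its SQL and collect() retrieves the result.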
