Question
I would be very surprised if this kind of problem could not be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector whose elements
# are in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
...
aDataFrame %>% mutate(newValue = gsub("-", "", d))
...
}
I receive this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun
But with this line:
aDataFrame %>% mutate(newValue=toupper("hello"))
everything works. Can anyone help?
Answer 1:
I would strongly recommend reading the sparklyr documentation before proceeding, in particular the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a limited subset of R functions can be used on sparklyr data frames, and gsub is not one of them (toupper is). If you really need gsub, you will have to collect the data into a local data frame, apply gsub there (you can still use mutate), and then copy_to the result back to Spark.
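The collect, gsub, copy_to round trip described above could look roughly like the sketch below. The helper names strip_dashes and round_trip, the column name date, and the table name cleaned_tbl are all hypothetical, chosen only for illustration. round_trip is defined but not called, since it needs a live Spark connection (sc); keep in mind that collect pulls every row into local R memory, so this only makes sense for data that fits there.

```r
# Hypothetical helper: strip dashes from one column of a *local* data frame.
strip_dashes <- function(df, column) {
  df[[column]] <- gsub("-", "", df[[column]], fixed = TRUE)
  df
}

# Sketch of the Spark round trip. `sc` is assumed to be a sparklyr connection
# and `spark_tbl` a tbl_spark. Not called here: it needs a running Spark session.
round_trip <- function(sc, spark_tbl, column, name = "cleaned_tbl") {
  local_df <- dplyr::collect(spark_tbl)        # pull all rows into R memory
  local_df <- strip_dashes(local_df, column)   # gsub works on local data
  dplyr::copy_to(sc, local_df, name = name, overwrite = TRUE)  # push back to Spark
}

# Local demonstration on made-up sample dates (no Spark required).
demo <- data.frame(date = c("2016-10-27", "2016-10-28"),
                   stringsAsFactors = FALSE)
result <- strip_dashes(demo, "date")
print(result$date)
```

With a real connection, the call would be something like round_trip(sc, aDataFrame, "date"), where "date" stands in for whatever the actual column is called.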
Answer 2:
It may be worth adding that the available documentation states:
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.
Hive
As the documentation suggests, a viable solution should be achievable using regexp_replace:
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
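As a quick local sanity check of the documented example, base R's gsub can stand in for regexp_replace on simple patterns. This is only an analogy run in plain R, not in Spark: Hive uses Java regular expressions while gsub uses R's own regex engine, so the two are assumed to agree only for simple patterns like the ones below. The sample date is made up.

```r
# Local analogy only: base R gsub standing in for Hive's regexp_replace.
# Mirrors the documented call regexp_replace("foobar", "oo|ar", "").
documented <- gsub("oo|ar", "", "foobar")
print(documented)

# The case from the question: remove dashes from a YYYY-MM-DD string.
cleaned <- gsub("-", "", "2016-10-27")
print(cleaned)
```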
sparklyr approach
Considering the above, it should be possible to combine a sparklyr pipeline with regexp_replace to achieve an effect equivalent to applying gsub to the desired column. Tested code that removes the - character from the variable d within sparklyr could be built as follows:
aDataFrame %>%
mutate(clnD = regexp_replace(d, "-", "")) %>%
# ...
where class(aDataFrame) returns: "tbl_spark" ....
Source: https://stackoverflow.com/questions/40285594/sparklyr-create-new-column-with-mutate-function