sparklyr

Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?

早过忘川 submitted on 2019-11-27 23:18:58
Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:

    # set R environment variables
    Sys.setenv(YARN_CONF_DIR=...)
    Sys.setenv(SPARK_CONF_DIR=...)
    Sys.setenv(LD_LIBRARY_PATH=...)
    Sys.setenv(SPARKR_SUBMIT_ARGS=...)

    sparkr_lib_dir <- ...  # install-specific location
    library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
    sc <- sparkR.init(master = "yarn-client")

However, when I swapped the last lines above with

    library(sparklyr)
    sc <- spark_connect(master = "yarn-client")

…
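For reference, sparklyr does expose a YARN client-mode connection through its documented spark_connect() interface. A minimal sketch follows; the spark_home path, the configuration directory, and the executor memory value are placeholders that depend on the cluster's layout:

    # Minimal sketch: connecting sparklyr to a YARN-managed cluster.
    # All paths below are placeholders for your installation.
    library(sparklyr)

    # Point Spark at the cluster's YARN/Hadoop configuration
    Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
    Sys.setenv(YARN_CONF_DIR   = "/etc/hadoop/conf")

    config <- spark_config()
    config$spark.executor.memory <- "4g"  # optional tuning, illustrative value

    sc <- spark_connect(
      master     = "yarn-client",
      spark_home = "/usr/lib/spark",      # placeholder SPARK_HOME
      config     = config
    )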

How to train an ML model in sparklyr and predict new values on another dataframe?

一世执手 submitted on 2019-11-27 22:33:38
Consider the following example:

    dtrain <- data_frame(
      text   = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
                 "Chinese Macao", "Tokyo Japan Chinese"),
      doc_id = 1:4,
      class  = c(1, 1, 1, 0)
    )
    dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

    > dtrain_spark
    # Source:   table<dtrain> [?? x 3]
    # Database: spark_connection
      text                     doc_id class
      <chr>                     <int> <dbl>
    1 Chinese Beijing Chinese       1     1
    2 Chinese Chinese Shanghai      2     1
    3 Chinese Macao                 3     1
    4 Tokyo Japan Chinese           4     0

Here I have the classic Naive Bayes example where class identifies documents falling into the China category. I am able to run a Naive Bayes classifier …
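A hedged sketch of one way to do this: wrap the feature transformations and the estimator in an ml_pipeline(), fit it once, and reuse the fitted pipeline on any new Spark data frame. This assumes a sparklyr version that ships ft_count_vectorizer() and an open connection sc; dtest and its contents are illustrative:

    # Sketch: train on dtrain_spark, then score a different Spark data frame.
    library(sparklyr)
    library(dplyr)

    pipeline <- ml_pipeline(sc) %>%
      ft_tokenizer(input_col = "text", output_col = "tokens") %>%
      ft_count_vectorizer(input_col = "tokens", output_col = "features") %>%
      ml_naive_bayes(label_col = "class", features_col = "features")

    model <- ml_fit(pipeline, dtrain_spark)

    # New documents to score (illustrative)
    dtest <- copy_to(sc, data.frame(text = "Chinese Chinese Tokyo Japan",
                                    doc_id = 5L), "dtest", overwrite = TRUE)
    predictions <- ml_transform(model, dtest)

Because the tokenizer and count vectorizer live inside the fitted pipeline, the vocabulary learned from the training set is applied unchanged to dtest.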

Splitting a <dbl [2]> result of sparklyr as a Spark object

随声附和 submitted on 2019-11-27 08:18:10
I have a problem with splitting the outcome of my random forest generated by sparklyr. I'm using the following code to generate a model, which predicts a {0 | 1} value, and to predict the outcome for a specified validation set:

    model <- ml_random_forest(tbl(sc, "train_set"), formulea)

    prediction <- sdf_predict(model, tbl(sc, "validation_set")) %>%
      select(account_no, probability, prediction)

The generated prediction object looks like:

    Source:   query [3.744e+06 x 3]
    Database: spark connection
    …
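The probability column here is a Spark ML vector of length two (one probability per class), which is why it prints as <dbl [2]> and cannot be used directly as two numeric columns. A hedged sketch using sdf_separate_column(), which expands a vector column into scalar columns on the Spark side (the names p_0 and p_1 are illustrative):

    # Sketch: split the two-element probability vector into scalar columns.
    library(sparklyr)
    library(dplyr)

    prediction_split <- prediction %>%
      sdf_separate_column("probability", into = c("p_0", "p_1")) %>%
      select(account_no, p_0, p_1, prediction)

The result remains a Spark table, so no collect() is needed before further Spark-side processing.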

Sparklyr: Use group_by and then concatenate strings from rows in a group

喜欢而已 submitted on 2019-11-27 07:06:58
I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group. Here is a simple example that I think should work but doesn't:

    library(sparklyr)
    d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                    x  = c("200", "200", "200", "201", "201", "201"),
                    y  = c("This", "That", "The", "Other", "End", "End"))
    d_sdf <- copy_to(sc, d, "d")
    d_sdf %>% group_by(id, x) %>% mutate(y = paste(y, collapse = " "))

What I'd like it to produce is:

    Source: local data frame [6 x 3]
    …
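paste() with collapse has no Spark SQL translation, but sparklyr passes unrecognised function calls through to Spark SQL verbatim, so Spark's own aggregate functions can be used instead. A hedged sketch with collect_list() and concat_ws():

    # Sketch: concatenate y within each (id, x) group on the Spark side.
    library(sparklyr)
    library(dplyr)

    d_sdf %>%
      group_by(id, x) %>%
      summarise(y = concat_ws(" ", collect_list(y)))

Note this uses summarise(), so it returns one row per group; to keep all six original rows with the concatenated string repeated (the mutate() semantics above), join this result back to d_sdf on id and x. Also, collect_list() does not guarantee row order within a group.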

Gather in sparklyr

烈酒焚心 submitted on 2019-11-26 21:30:07
I am using sparklyr to manipulate some data. Given a,

    a <- tibble(id          = rep(c(1, 10), each = 10),
                attribute1  = rep(c("This", "That", "These", "Those", "The",
                                    "Other", "Test", "End", "Start", "Beginning"), 2),
                value       = rep(seq(10, 100, by = 10), 2),
                average     = rep(c(50, 100), each = 10),
                upper_bound = rep(c(80, 130), each = 10),
                lower_bound = rep(c(20, 70), each = 10))

I would like to use "gather" to manipulate the data, like this:

    b <- a %>% gather(key = type_data, value = value_data, -c(id:attribute1))
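At the time, tidyr::gather() had no translation for Spark tables, but the same reshape can be written by hand: select each measure column in turn, tag it with its name, and row-bind the pieces. A hedged sketch, assuming a has been copied to Spark as a_sdf:

    # Sketch: a manual "gather" for a Spark table via per-column selects.
    library(sparklyr)
    library(dplyr)

    a_sdf <- copy_to(sc, a, "a", overwrite = TRUE)

    gather_cols <- c("value", "average", "upper_bound", "lower_bound")

    b_sdf <- do.call(sdf_bind_rows, lapply(gather_cols, function(col) {
      a_sdf %>%
        transmute(id, attribute1,
                  type_data  = !!col,              # column name as the key
                  value_data = !!rlang::sym(col))  # column contents as the value
    }))

Newer sparklyr releases also gained direct support for tidyr::pivot_longer() on Spark tables, which may be the simpler route if upgrading is an option.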

Sparklyr: how to center a Spark table based on column means?

流过昼夜 submitted on 2019-11-26 06:49:16
I have a Spark table:

    simx
    x0: num 1.00 2.00 3.00 ...
    x1: num 2.00 3.00 4.00 ...
    ...
    x788: num 2.00 3.00 4.00 ...

and a handle named simX_tbl in the R environment that is connected to this simx table. I want to center this table, i.e. subtract each column's mean from that column: for example, calculate x0 - mean(x0), and so on. So far my best effort is:

    meanX <- simX_tbl %>% summarise_all(funs("mean")) %>% collect()
    x_centered <- simX_tbl
    for (i in 1:789) {
      colName <- …
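A hedged sketch of one way to finish this without a 789-iteration loop: collect the column means once, then build a single mutate() whose expressions subtract each mean as a literal, so the centering itself runs in Spark. This assumes every column of simX_tbl is numeric:

    # Sketch: center every column of a Spark table by its own mean.
    library(sparklyr)
    library(dplyr)
    library(rlang)

    # One-row tibble of per-column means, computed in Spark
    means <- simX_tbl %>%
      summarise_all(mean) %>%
      collect()

    # Build "x0 - <mean of x0>", "x1 - <mean of x1>", ... as quoted expressions
    center_exprs <- lapply(names(means), function(col) {
      expr(!!sym(col) - !!means[[col]])
    })
    names(center_exprs) <- names(means)

    x_centered <- simX_tbl %>% mutate(!!!center_exprs)

With ~789 columns this generates one large SQL projection; an alternative is Spark ML's ft_standard_scaler() with with_mean = TRUE, which instead operates on an assembled vector column.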