sparklyr

R: How can I extract an element from a column of data in spark connection (sparklyr) in pipe

主宰稳场 submitted on 2019-12-04 05:31:45
Question: I have a dataset as below. Because of its large size, I loaded it through the sparklyr package, so I can only use pipe statements.

pos <- str_sub(csj$helpful, 2)
neg1 <- str_sub(csj$helpful, 4)
csj <- csj %>% mutate(neg = replace(helpful, stringr::str_sub(csj$helpful, 4) == 1, 0))
csj <- csj %>% mutate(help = pos / neg)
csj
is.null(csj$helpful)

I want to make a column named 'help' which is the first number of the helpful column divided by its second number. If the second number is 0, I need to change it to 1 and then divide. The data frame name is csj, but the code above doesn't work.
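A possible direction (a sketch, not taken from the original thread): because csj lives in Spark, the string work has to go through something sparklyr can translate to Spark SQL, for example Hive's regexp_extract(), which passes through mutate() and runs inside Spark. The pattern below assumes helpful is stored as text containing two numbers, e.g. "[3, 5]"; adjust it to the real format.

library(sparklyr)
library(dplyr)

# Assumes `sc` is an open connection and `csj` is a tbl_spark whose `helpful`
# column is a string holding two numbers, e.g. "[3, 5]".
csj <- csj %>%
  mutate(
    pos = as.numeric(regexp_extract(helpful, "([0-9]+)[^0-9]+([0-9]+)", 1L)),
    neg = as.numeric(regexp_extract(helpful, "([0-9]+)[^0-9]+([0-9]+)", 2L))
  ) %>%
  mutate(
    neg  = if_else(neg == 0, 1, neg),   # treat a zero denominator as 1
    help = pos / neg
  )

Everything stays lazy in Spark; collect() would bring the result back to R only if needed.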

Out of memory error when collecting data out of Spark cluster

徘徊边缘 submitted on 2019-12-04 00:30:44
I know there are plenty of questions on SO about out-of-memory errors in Spark, but I haven't found a solution to mine. I have a simple workflow:

- read in ORC files from Amazon S3
- filter down to a small subset of rows
- select a small subset of columns
- collect into the driver node (so I can do additional operations in R)

When I run the above and then cache the table to Spark memory it takes up <2 GB, tiny compared to the memory available to my cluster, yet I get an OOM error when I try to collect the data to my driver node. I have tried running on the following setups: local mode on a computer …
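A possible direction (a sketch, not taken from the thread): when the collected subset is genuinely small, the failure is often the driver's JVM heap or the spark.driver.maxResultSize cap rather than cluster memory. The sizes, table name, and columns below are placeholders.

library(sparklyr)
library(dplyr)

# Raise driver-side limits before connecting; tune the values to your machine.
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8G"   # driver JVM heap
conf$spark.driver.maxResultSize <- "4G"       # cap on data returned to the driver

sc <- spark_connect(master = "local", config = conf)

# Filter and select inside Spark first, then collect only the reduced result.
# The path, `group`, `id`, and `value` are hypothetical names.
small_df <- spark_read_orc(sc, name = "orc_data",
                           path = "s3a://my-bucket/path/", memory = FALSE) %>%
  filter(group == "A") %>%
  select(id, value) %>%
  collect()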

SparklyR removing a Table from Spark Context

Anonymous (unverified) submitted on 2019-12-03 08:59:04
Question: I would like to remove a single data table from the Spark context (sc). I know a single cached table can be un-cached, but as far as I can gather that isn't the same as removing the object from sc.

library(sparklyr)
library(dplyr)
library(titanic)
library(Lahman)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")

batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
titanic_tbl <- copy_to(sc, titanic_train, "titanic", overwrite = TRUE)
src_tbls(sc)
# [1] "batting" "titanic"

tbl_cache(sc, "batting")  # Speeds up …
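One way this is commonly handled (a sketch, assuming "batting" was registered as a temporary view by copy_to()): un-cache it and drop the view through the Spark 2.x catalog API.

library(sparklyr)
library(dplyr)

# Assumes `sc` is the open connection from the question.
tbl_uncache(sc, "batting")          # release the cached blocks (optional)

# Drop the temporary view via the Spark catalog
spark_session(sc) %>%
  invoke("catalog") %>%
  invoke("dropTempView", "batting")

src_tbls(sc)   # "batting" should no longer be listed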

how to train a ML model in sparklyr and predict new values on another dataframe?

Anonymous (unverified) submitted on 2019-12-03 01:12:01
Question: Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive Bayes example where class identifies documents …
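A sketch of one way to do this (dtest_spark, a second Spark table with the same text column, is an assumption, and the argument names follow recent sparklyr releases): wrap the feature transformers and the Naive Bayes estimator in an ML pipeline, fit it on the training table, and apply the fitted pipeline to the new table.

library(sparklyr)
library(dplyr)

# Tokenize, build term counts, and attach a Naive Bayes estimator.
pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "features") %>%
  ml_naive_bayes(features_col = "features", label_col = "class")

fitted <- ml_fit(pipeline, dtrain_spark)

# Score a different Spark data frame with the same fitted pipeline.
predictions <- ml_transform(fitted, dtest_spark)
predictions %>% select(text, prediction)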

Efficiently calculate row totals of a wide Spark DF

旧城冷巷雨未停 submitted on 2019-12-02 04:27:19
Question: I have a wide Spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used "dplyr - sum of multiple columns using regular expressions" and https://github.com/tidyverse/rlang/issues/116.

library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)

sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
col_eqn = paste0(colnames(wide_df), collapse = "+")
# build up the SQL query …
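A sketch of one way to finish this (an assumption-laden variant, not the accepted answer): paste the column names into a single `+` expression and splice it into mutate() with rlang, so Spark computes the row total in one projection.

library(sparklyr)
library(dplyr)
library(rlang)

# Build "V1 + V2 + ... + V200" as an unevaluated expression and splice it in.
sum_expr <- parse_expr(paste(colnames(wide_df), collapse = " + "))

row_totals <- wide_sdf %>%
  mutate(row_total = !!sum_expr) %>%
  select(row_total)

With a few thousand columns the generated SQL expression gets long, so the query plan is worth keeping an eye on, but the computation itself stays distributed.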

Complete time-series with sparklyr

霸气de小男生 submitted on 2019-12-02 01:36:32
Question: I'm trying to find missing minutes in my time-series dataset. I wrote R code that works locally on a small sample:

test <- dfv %>%
  mutate(timestamp = as.POSIXct(DaySecFrom.UTC.)) %>%
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'min'), ElemUID)

But you can't use complete() from tidyr on a spark_tbl:

Error in UseMethod("complete_") :
  no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Here is some …
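A sketch of a possible work-around (dfv_spark, a Spark table that already has a proper timestamp column, is an assumption; the other column names follow the question): build the full minute grid, cross-join it with the distinct ElemUIDs, and left-join the original data back, which is roughly what tidyr::complete() does locally.

library(sparklyr)
library(dplyr)

# Get the time range from Spark, build the minute grid locally, copy it up.
rng <- dfv_spark %>%
  summarise(min_t = min(timestamp, na.rm = TRUE),
            max_t = max(timestamp, na.rm = TRUE)) %>%
  collect()

grid_sdf <- copy_to(sc,
                    data.frame(timestamp = seq.POSIXt(rng$min_t, rng$max_t, by = "min")),
                    "minute_grid", overwrite = TRUE)

# Cross-join every ElemUID with every minute, then join the data back in;
# rows with NA in the value columns are the missing minutes.
completed <- dfv_spark %>%
  distinct(ElemUID) %>%
  mutate(dummy = 1L) %>%
  inner_join(grid_sdf %>% mutate(dummy = 1L), by = "dummy") %>%
  select(-dummy) %>%
  left_join(dfv_spark, by = c("ElemUID", "timestamp"))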

how to convert a timestamp into string (without changing timezone)?

ε祈祈猫儿з submitted on 2019-12-02 00:13:36
I have some unix times that I convert to timestamps in sparklyr, and for some reason I also need to convert them into strings. Unfortunately, it seems that during the conversion to string Hive converts to EST (my locale).

df_new <- spark_read_parquet(sc, "/mypath/parquet_*", overwrite = TRUE,
                             name = "df_new", memory = FALSE,
                             options = list(mergeSchema = "true"))

> df_new %>%
    mutate(unix_t = from_utc_timestamp(timestamp(t), 'UTC'),
           date_str = date_format(unix_t, 'yyyy-MM-dd HH:mm:ss z'),
           date_alt = to_date(from_utc_timestamp(timestamp(t), 'UTC'))) %>%
    select(t, unix_t, date_str, date_alt) %>% …
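A sketch of one common fix (an assumption, not confirmed by the thread): pin the Spark SQL session time zone to UTC so that date_format() renders the string without shifting to the local zone. spark.sql.session.timeZone is honoured by Spark 2.2 and later.

library(sparklyr)
library(dplyr)

# Option 1: set it for the current session.
DBI::dbGetQuery(sc, "SET spark.sql.session.timeZone = UTC")

# Option 2: set it at connection time instead.
# conf <- spark_config()
# conf$spark.sql.session.timeZone <- "UTC"
# sc <- spark_connect(master = "local", config = conf)

df_new %>%
  mutate(unix_t   = from_utc_timestamp(timestamp(t), 'UTC'),
         date_str = date_format(unix_t, 'yyyy-MM-dd HH:mm:ss z')) %>%
  select(t, unix_t, date_str)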
