sparklyr

sparklyr: change all column names of a Spark DataFrame

孤街醉人 submitted on 2019-12-01 11:23:14
I intended to change all the column names, but the current rename/select approach is too laborious. I don't know if anybody has a better solution. Example as below:

df <- data.frame(oldname1 = LETTERS, oldname2 = 1, ... oldname200 = "APPLE")
df_tbl <- copy_to(sc, df, "df")
newnamelist <- paste("Name", 1:200, sep = "_")

How do I assign newnamelist as the new colnames? I probably can't do this:

df_new <- df_tbl %>% dplyr::select(Name_1 = oldname1, Name_2 = oldname2, ...)

You can use select_ with .dots:

df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep = "_")
df %>% select_(.dots = setNames(colnames
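Below is a minimal sketch of how that (truncated) idea can be completed, assuming an open connection sc; it uses the current tidy-eval style rather than the soft-deprecated select_:

library(sparklyr)
library(dplyr)
library(rlang)

df <- copy_to(sc, iris, "iris_tbl", overwrite = TRUE)
newnames <- paste("Name", seq_along(colnames(df)), sep = "_")

# Named vector: names are the new column names, values are the old ones;
# this is the same vector the truncated select_(.dots = setNames(...)) call builds.
mapping <- set_names(colnames(df), newnames)

# Splicing the named vector into select() renames every column in one call.
df_renamed <- df %>% select(!!!syms(mapping))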

Number of unique values with sparklyr

偶尔善良 submitted on 2019-12-01 10:48:23
The following example describes how you can't calculate the number of distinct values without aggregating the rows when using dplyr with sparklyr. Is there a workaround that doesn't break the chain of commands? More generally, how can you use SQL-like window functions on sparklyr data frames?

## generating a data set
set.seed(.328)
df <- data.frame(
  ids = floor(runif(10, 1, 10)),
  cats = sample(letters[1:3], 10, replace = TRUE),
  vals = rnorm(10)
)

## copying to Spark
df.spark <- copy_to(sc, df, "df_spark", overwrite = TRUE)

# Source: table<df_spark> [?? x 3]
# Database: spark_connection
#   ids cats
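A hedged sketch of one common workaround, assuming the df.spark table above: Spark does not accept COUNT(DISTINCT ...) as a window function, but its approximate counterpart can be used inside mutate(), which keeps every row and therefore doesn't break the chain:

library(sparklyr)
library(dplyr)

df.spark %>%
  group_by(cats) %>%
  # approx_count_distinct() is passed through to Spark SQL and evaluated as a
  # window aggregate over each cats partition, so no rows are aggregated away.
  mutate(n_ids = approx_count_distinct(ids)) %>%
  ungroup()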

Is sample_n really a random sample when used with sparklyr?

纵饮孤独 submitted on 2019-12-01 03:18:54
Question: I have 500 million rows in a Spark dataframe. I'm interested in using sample_n from dplyr because it lets me explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to compute sdf_nrow(), then derive the fraction sample_size / nrow, and then pass this fraction to sdf_sample(). This isn't a big deal, but sdf_nrow() can take a while to complete. So it would be ideal to use dplyr::sample_n() directly. However,
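For reference, a sketch of the two-step sdf_sample() workflow the question describes (the table name and sample_size are assumptions for illustration):

library(sparklyr)
library(dplyr)

sdf <- tbl(sc, "big_table")          # assumed existing Spark table
sample_size <- 10000                 # assumed target sample size

n <- sdf_nrow(sdf)                   # full row count: this is the slow step on 500M rows
frac <- sample_size / n

sampled <- sdf %>%
  sdf_sample(fraction = frac, replacement = FALSE, seed = 42)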

Sparklyr: how to explode a list column into separate columns in a Spark table?

心不动则不痛 submitted on 2019-12-01 00:22:55
My question is similar to the one here, but I'm having problems implementing the answer, and I cannot comment in that thread. I have a big CSV file containing nested data: 2 columns separated by whitespace (say the first column is Y, the second column is X). Column X itself is a comma-separated list of values.

21.66 2.643227,1.2698358,2.6338573,1.8812188,3.8708665,...
35.15 3.422151,-0.59515584,2.4994135,-0.19701914,4.0771823,...
15.22 2.8302398,1.9080592,-0.68780196,3.1878228,4.6600842,...
...

I want to read this CSV into 2 different Spark tables using sparklyr. So far this
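A hedged sketch of one way to go from that layout to separate columns (the path "data.txt" and the number of X components n_x are assumptions for illustration):

library(sparklyr)
library(dplyr)

raw <- spark_read_csv(sc, name = "raw", path = "data.txt",
                      delimiter = " ", header = FALSE, infer_schema = FALSE,
                      columns = c(Y = "double", X = "character"))

n_x <- 5  # assumed number of comma-separated values in each X string

wide <- raw %>%
  mutate(X_arr = split(X, ",")) %>%                          # Spark SQL split() -> array column
  sdf_separate_column("X_arr", into = paste0("X", seq_len(n_x))) %>%
  select(-X_arr)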

How to use a predicate while reading from JDBC connection?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-30 07:55:45
By default, spark_read_jdbc() reads an entire database table into Spark. I've used the following syntax to create these connections.

library(sparklyr)
library(dplyr)

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

sc <- spark_connect(master = "local",
                    version = "1.6.0",
                    hadoop_version = 2.4,
                    config = config)

db_tbl <- sc %>%
  spark_read_jdbc(sc = .,
                  name = "table_name",
                  options = list(url = "jdbc:mysql://localhost:3306/schema_name",
                                 user = "root",
                                 password = "password",
                                 dbtable = "table_name"))

However, I've
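A common way to push a predicate down to the database, shown as a hedged sketch (the WHERE clause and column name are illustrative assumptions, not taken from the truncated question): pass a subquery instead of a bare table name in the dbtable option, so only the matching rows ever reach Spark.

db_tbl <- spark_read_jdbc(
  sc,
  name = "table_name_filtered",
  options = list(
    url = "jdbc:mysql://localhost:3306/schema_name",
    user = "root",
    password = "password",
    # any valid subquery works here; MySQL requires an alias on it
    dbtable = "(SELECT * FROM table_name WHERE created_at >= '2019-01-01') AS t"
  )
)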

Convert Double to Date using Spark in R

依然范特西╮ submitted on 2019-11-29 17:02:26
I have an R data frame as below:

Date        @AD.CC_CC  @AD.CC_CC.1  @CL.CC_CC  @CL.CC_CC.1
2018-02-05       -380         -380      -1580        -1580
2018-02-06         20           20       -280         -280
2018-02-07       -700         -700      -1730        -1730
2018-02-08       -460         -460      -1100        -1100
2018-02-09        260          260      -1780        -1780
2018-02-12        480          480        380          380

I use the copy_to function to copy the data frame to Spark. After the copy, the Date column has been converted to a double:

# Source: lazy query [?? x 5]
# Database: spark_connection
   Date AD_CC_CC AD_CC_CC_1 CL_CC_CC CL_CC_CC_1
  <dbl>    <dbl>      <dbl>    <dbl>      <dbl>
  17567     -380       -380    -1580      -1580
  17568       20         20     -280       -280
  17569     -700       -700    -1730      -1730
  17570     -460
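One hedged workaround (an assumption about the fix, not necessarily the thread's answer; the local data frame is assumed to be called df): send the dates across as ISO strings and cast them back to dates on the Spark side, where to_date() is passed straight through to Spark SQL.

library(sparklyr)
library(dplyr)

df_chr <- df %>% mutate(Date = as.character(Date))      # "YYYY-MM-DD" strings survive copy_to
sdf <- copy_to(sc, df_chr, "df_spark", overwrite = TRUE)

sdf <- sdf %>% mutate(Date = to_date(Date))             # Spark SQL to_date() via passthrough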

SparklyR separate one Spark DataFrame column into two columns

坚强是说给别人听的谎言 submitted on 2019-11-29 16:15:32
I have a dataframe containing a column named COL which is structured in this way:

VALUE1###VALUE2

The following code is working:

library(sparklyr)
library(tidyr)
library(dplyr)

mParams <- collect(filter(input_DF, TYPE == ('MIN')))
mParams <- separate(mParams, COL, c('col1', 'col2'), '\\###', remove = FALSE)

If I remove the collect, I get this error:

Error in UseMethod("separate_") : no applicable method for 'separate_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Is there any alternative to achieve what I want without collecting everything on my Spark driver?
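A sketch of one Spark-side alternative, assuming input_DF is a tbl_spark with the columns described above: split the string with Spark SQL and expand the resulting array, so nothing is collected to the driver.

library(sparklyr)
library(dplyr)

result <- input_DF %>%
  filter(TYPE == "MIN") %>%
  mutate(COL_split = split(COL, "###")) %>%              # Spark SQL split() -> array column
  sdf_separate_column("COL_split", into = c("col1", "col2")) %>%
  select(-COL_split)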

R Shiny and Spark: how to free Spark resources?

痞子三分冷 submitted on 2019-11-29 15:24:29
Say we have a Shiny app deployed on a Shiny Server. We expect that the app will be used by several users via their web browsers, as usual. The Shiny app's server.R includes some sparklyr package code which connects to a Spark cluster for classic filter, select, mutate, and arrange operations on data located on HDFS. Is it mandatory to disconnect from Spark, i.e. to include a spark_disconnect at the end of the server.R code to free resources? I think we should never disconnect and let Spark handle the load for each arriving and leaving user. Can somebody please help me confirm this? TL
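For context, a minimal sketch of the pattern being asked about (the app structure, master, and table name are assumptions, and this is not the thread's conclusion): one connection opened when the app starts, shared across user sessions, and released only when the whole app shuts down.

library(shiny)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")        # assumed cluster master

ui <- fluidPage(
  numericInput("threshold", "Threshold", 0),
  tableOutput("tbl")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    tbl(sc, "hdfs_table") %>%                      # assumed table registered in Spark
      filter(value > input$threshold) %>%
      head(20) %>%
      collect()
  })
}

onStop(function() spark_disconnect(sc))            # free Spark resources when the app stops

shinyApp(ui, server)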

Using SparkR and Sparklyr simultaneously

两盒软妹~` submitted on 2019-11-29 14:30:40
As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in scope of functionality. I therefore think that one currently needs to use both packages to get the full scope of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use the packages in parallel, I guess. But is it actually possible? What are your best practices?

These two packages use different mechanisms and are not designed for interoperability. Their internals