sparklyr

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense &
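One possible approach (a sketch of mine, not taken from the question or an answer): sparklyr's feature transformers produce the sparse term-frequency vectors that Spark's tree learners expect, so the text can be tokenized, hashed, and passed to ml_random_forest_classifier inside an ML pipeline. The column names and num_features value are assumptions.

library(sparklyr)
library(dplyr)

pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%           # split text into words
  ft_hashing_tf(input_col = "tokens", output_col = "features",
                num_features = 2^12) %>%                                # sparse term-frequency vectors
  ft_string_indexer(input_col = "label", output_col = "label_idx") %>%  # give the label class metadata
  ml_random_forest_classifier(label_col = "label_idx",
                              features_col = "features")

model <- ml_fit(pipeline, mytext_spark)
predictions <- ml_transform(model, mytext_spark)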

Sparklyr: how to calculate correlation coefficient between 2 Spark tables?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-21 23:43:37
Question: I have these two Spark tables:

simx
  x0: num 1.00 2.00 3.00 ...
  x1: num 2.00 3.00 4.00 ...
  ...
  x788: num 2.00 3.00 4.00 ...

and simy

  y0: num 1.00 2.00 3.00 ...

In both tables, each column has the same number of values. The tables are saved in the handles simX_tbl and simY_tbl respectively. The actual data is quite big and may reach 40 GB. I want to calculate the correlation coefficient of each column in simx with simy (say, like cor(x0, y0, 'pearson')). I searched everywhere and I
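One hedged sketch (mine, not from the thread): bind the single simy column onto simx with sdf_bind_cols() and let Spark compute the correlations with ml_corr(), which wraps Spark's Correlation routine. Computing the full matrix over all 790 columns may be heavy, so the column subset below is only illustrative.

library(sparklyr)
library(dplyr)

combined <- sdf_bind_cols(simX_tbl, simY_tbl)   # x0..x788 plus y0 in one Spark table

# Correlation matrix for a few x columns against y0; the y0 row/column
# holds cor(xN, y0).
corr_mat <- ml_corr(combined,
                    columns = c("x0", "x1", "x2", "y0"),
                    method  = "pearson")
corr_mat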

count the number of unique elements in each column with dplyr in sparklyr

Submitted by 点点圈 on 2019-12-20 06:37:20
Question: I'm trying to count the number of unique elements in each column of the Spark dataset s. However, it seems that Spark doesn't recognize tally():

k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.
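A hedged workaround sketch (mine, not from the thread): only functions that dbplyr can translate to Spark SQL work on a remote table, and n_distinct() is one of them (it becomes COUNT(DISTINCT ...)), so it can stand in for tally(distinct(.)).

library(sparklyr)
library(dplyr)

k <- s %>%
  group_by(grouping_type) %>%
  summarise_all(n_distinct) %>%   # translated to COUNT(DISTINCT col) for each column
  collect()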

changing the JVM timezone in sparklyr

Submitted by 岁酱吖の on 2019-12-20 05:14:26
Question: I am desperately trying to change the timezone of my JVM in sparklyr (using Spark 2.1.0). I want GMT everywhere. I am setting:

config$`driver.extraJavaOptions` <- "Duser.timezone=GMT"

in my spark_config(), but unfortunately, in the Spark UI I still see (under System Properties) that user.timezone is set to America/New_York. Any ideas? Thanks!

Answer 1: A few things: the name of the property is spark.driver.extraJavaOptions, and the value is missing the leading -; it should be -Duser.timezone=GMT. For
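Putting the answer's two corrections together (the executor-side option and the spark_connect call are my additions, shown only as a sketch):

library(sparklyr)

config <- spark_config()
config$spark.driver.extraJavaOptions   <- "-Duser.timezone=GMT"  # full property name, leading -D
config$spark.executor.extraJavaOptions <- "-Duser.timezone=GMT"  # assumed: same fix on executors

sc <- spark_connect(master = "local", config = config)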

How to flatten the data of different data types by using Sparklyr package?

Submitted by 别来无恙 on 2019-12-19 05:19:41
Question: Introduction: R code is written using the Sparklyr package to create a database schema. [Reproducible code and database are given.]

Existing result:

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|   |-- hashtags (array) : [string]
|   |-- media (array)
|   |   |-- additional_media_info (struct)
|   |   |   |-- description : string
|   |   |   |-- embeddable : boolean
|   |   |   |-- monetizable : boolean
|   |   |-- display_url : string
|   |   |-- id : long
|   |   |-- id_str : string
|   |-- urls (array)
|-
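One hedged way to flatten a schema like this (my sketch, not from the thread; it assumes the sparklyr.nested package and a Spark handle named tweets_tbl): explode the nested array so each element becomes a row, then select struct fields by their dotted path.

library(sparklyr)
library(sparklyr.nested)

flat <- tweets_tbl %>%
  sdf_explode(entities.media) %>%   # one row per media element
  sdf_select(
    contributors,
    created_at,
    media_description = entities.media.additional_media_info.description,
    media_display_url = entities.media.display_url,
    media_id          = entities.media.id
  )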

R Shiny and Spark: how to free Spark resources?

Submitted by 谁说胖子不能爱 on 2019-12-18 09:03:25
Question: Say we have a Shiny app deployed on a Shiny Server. We expect that the app will be used by several users via their web browsers, as usual. The Shiny app's server.R includes some sparklyr package code which connects to a Spark cluster for classic filter, select, mutate, and arrange operations on data located on HDFS. Is it mandatory to disconnect from Spark, i.e. to include a spark_disconnect at the end of the server.R code to free resources? I think we should never disconnect and let Spark
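One hedged pattern (mine, not from the thread): keep a single connection per R process, shared by all sessions, and release it only when the whole app stops, via shiny::onStop(). The master setting and table name are placeholders.

library(shiny)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # one connection shared by every session

onStop(function() {
  spark_disconnect(sc)                        # free Spark resources when the app shuts down
})

server <- function(input, output, session) {
  output$preview <- renderTable({
    tbl(sc, "my_hdfs_table") %>%              # hypothetical table name
      head(10) %>%
      collect()
  })
}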

Using SparkR and Sparklyr simultaneously

Submitted by 时光怂恿深爱的人放手 on 2019-12-18 08:48:02
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. Sparklyr is newer and still needs to grow in scope of functionality. I therefore think that one currently needs to use both packages to get the full scope of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use the packages in parallel, I guess. But is it actually possible? What are your best practices?

Access a table outside the default schema (database) from sparklyr

Submitted by 送分小仙女□ on 2019-12-18 08:27:57
Question: After I managed to connect to our (new) cluster using sparklyr with the yarn-client method, I can now see only the tables in the default schema. How can I access schema.table? Using DBI it works, e.g. with the following line:

dbGetQuery(sc, "SELECT * FROM schema.table LIMIT 10")

In HUE, I can see all tables from all schemas. ~g

Answer 1: You can either use a fully qualified name to register a temporary view:

spark_session(sc) %>% invoke("table", "my_database.my_table") %>% invoke(
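A hedged completion of the truncated answer above (the createOrReplaceTempView call is my assumption), plus an alternative via dbplyr::in_schema(); the database and table names are placeholders.

library(sparklyr)
library(dplyr)

# Register the fully qualified table as a temporary view, then query it with dplyr:
spark_session(sc) %>%
  invoke("table", "my_database.my_table") %>%
  invoke("createOrReplaceTempView", "my_table")
tbl(sc, "my_table")

# Alternative sketch: point tbl() at the other database directly.
tbl(sc, dbplyr::in_schema("my_database", "my_table"))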

Why can't I use the double colon operator with dplyr when the dataset is in sparklyr?

Submitted by 江枫思渺然 on 2019-12-13 16:08:34
Question: A reproducible example (adapted from @forestfanjoe's answer):

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")

> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30
5     5          1.54
6     6          0.936

fix_PaymentHistory <- function(df){df %>% dplyr::mutate(PaymentHistory = dplyr::if_else(PaymentHistory < 0, 0, dplyr:
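A hedged note and sketch (mine, not from the thread): dbplyr translates bare function names such as if_else() into Spark SQL, but a namespace-qualified call like dplyr::if_else() is treated as a local R function and fails against a remote table. Dropping the dplyr:: prefixes inside mutate() is one workaround.

fix_PaymentHistory <- function(df) {
  df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0, PaymentHistory))
}

df %>% fix_PaymentHistory() %>% head()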

R: NaN came up after as.numeric() while using dplyr with sparklyr / manipulating data with pipes in a sparklyr connection [duplicate]

Submitted by 大憨熊 on 2019-12-12 19:35:04
Question: This question already has an answer here: "R: How can I extract an element from a column of data in spark connection (sparklyr) in pipe" (1 answer). Closed last year.

I began to use sparklyr to handle big data, so I need to use only pipelines. But while manipulating the data frame I ran into trouble. Below is how my data looks, as shown by csj %>% head(). [screenshot of the data not preserved]

What I want to do first is make a new column, length_of_review, counting the number of