sparklyr

How can I train a random forest with a sparse matrix in Spark?

Submitted by 假如想象 on 2019-12-22 07:45:06
Question: Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense &
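One possible approach (a sketch of mine, not taken from the question or an answer): sparklyr's feature transformers produce the sparse term-frequency vectors that Spark's tree learners expect, so the text can be tokenized, hashed, and passed to ml_random_forest_classifier inside an ML pipeline. The column names and num_features value are assumptions.

library(sparklyr)
library(dplyr)

pipeline <- ml_pipeline(sc) %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%           # split text into words
  ft_hashing_tf(input_col = "tokens", output_col = "features",
                num_features = 2^12) %>%                                # sparse term-frequency vectors
  ft_string_indexer(input_col = "label", output_col = "label_idx") %>%  # give the label class metadata
  ml_random_forest_classifier(label_col = "label_idx",
                              features_col = "features")

model <- ml_fit(pipeline, mytext_spark)
predictions <- ml_transform(model, mytext_spark)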

Sparklyr: how to calculate correlation coefficient between 2 Spark tables?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-21 23:43:37
Question: I have these two Spark tables:

simx
  x0: num 1.00 2.00 3.00 ...
  x1: num 2.00 3.00 4.00 ...
  ...
  x788: num 2.00 3.00 4.00 ...

and simy

  y0: num 1.00 2.00 3.00 ...

In both tables, each column has the same number of values. The tables are saved in the handles simX_tbl and simY_tbl respectively. The actual data is quite big and may reach 40 GB. I want to calculate the correlation coefficient of each column in simx with simy (say, like cor(x0, y0, 'pearson')). I searched everywhere and I
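One hedged sketch (mine, not from the thread): bind the single simy column onto simx with sdf_bind_cols() and let Spark compute the correlations with ml_corr(), which wraps Spark's Correlation routine. Computing the full matrix over all 790 columns may be heavy, so the column subset below is only illustrative.

library(sparklyr)
library(dplyr)

combined <- sdf_bind_cols(simX_tbl, simY_tbl)   # x0..x788 plus y0 in one Spark table

# Correlation matrix for a few x columns against y0; the y0 row/column
# holds cor(xN, y0).
corr_mat <- ml_corr(combined,
                    columns = c("x0", "x1", "x2", "y0"),
                    method  = "pearson")
corr_mat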

count the number of unique elements in each column with dplyr in sparklyr

Submitted by 点点圈 on 2019-12-20 06:37:20
Question: I'm trying to count the number of unique elements in each column of the Spark dataset s. However, it seems that Spark doesn't recognize tally():

k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.
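A hedged workaround sketch (mine, not from the thread): only functions that dbplyr can translate to Spark SQL work on a remote table, and n_distinct() is one of them (it becomes COUNT(DISTINCT ...)), so it can stand in for tally(distinct(.)).

library(sparklyr)
library(dplyr)

k <- s %>%
  group_by(grouping_type) %>%
  summarise_all(n_distinct) %>%   # translated to COUNT(DISTINCT col) for each column
  collect()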

changing the JVM timezone in sparklyr

Submitted by 岁酱吖の on 2019-12-20 05:14:26
Question: I am desperately trying to change the timezone of my JVM in sparklyr (using Spark 2.1.0). I want GMT everywhere. I am setting:

config$`driver.extraJavaOptions` <- "Duser.timezone=GMT"

in my spark_config(), but unfortunately, in the Spark UI I still see (under System Properties) that user.timezone is set to America/New_York. Any ideas? Thanks!

Answer 1: A few things: the name of the property is spark.driver.extraJavaOptions, and the value is missing the leading -; it should be -Duser.timezone=GMT. For
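Putting the answer's two corrections together (the executor-side option and the spark_connect call are my additions, shown only as a sketch):

library(sparklyr)

config <- spark_config()
config$spark.driver.extraJavaOptions   <- "-Duser.timezone=GMT"  # full property name, leading -D
config$spark.executor.extraJavaOptions <- "-Duser.timezone=GMT"  # assumed: same fix on executors

sc <- spark_connect(master = "local", config = config)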

How to flatten the data of different data types by using Sparklyr package?

Submitted by 别来无恙 on 2019-12-19 05:19:41
Question: Introduction: R code is written using the Sparklyr package to create a database schema. [Reproducible code and database are given.]

Existing result:

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|   |-- hashtags (array) : [string]
|   |-- media (array)
|   |   |-- additional_media_info (struct)
|   |   |   |-- description : string
|   |   |   |-- embeddable : boolean
|   |   |   |-- monetizable : boolean
|   |   |-- display_url : string
|   |   |-- id : long
|   |   |-- id_str : string
|   |-- urls (array)
|-
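One hedged way to flatten a schema like this (my sketch, not from the thread; it assumes the sparklyr.nested package and a Spark handle named tweets_tbl): explode the nested array so each element becomes a row, then select struct fields by their dotted path.

library(sparklyr)
library(sparklyr.nested)

flat <- tweets_tbl %>%
  sdf_explode(entities.media) %>%   # one row per media element
  sdf_select(
    contributors,
    created_at,
    media_description = entities.media.additional_media_info.description,
    media_display_url = entities.media.display_url,
    media_id          = entities.media.id
  )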

R Shiny and Spark: how to free Spark resources?

Submitted by 谁说胖子不能爱 on 2019-12-18 09:03:25
Question: Say we have a Shiny app deployed on a Shiny Server. We expect that the app will be used by several users via their web browsers, as usual. The Shiny app's server.R includes some sparklyr package code which connects to a Spark cluster for classic filter, select, mutate, and arrange operations on data located on HDFS. Is it mandatory to disconnect from Spark, i.e. to include a spark_disconnect at the end of the server.R code to free resources? I think we should never disconnect and let Spark
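One hedged pattern (mine, not from the thread): keep a single connection per R process, shared by all sessions, and release it only when the whole app stops, via shiny::onStop(). The master setting and table name are placeholders.

library(shiny)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # one connection shared by every session

onStop(function() {
  spark_disconnect(sc)                        # free Spark resources when the app shuts down
})

server <- function(input, output, session) {
  output$preview <- renderTable({
    tbl(sc, "my_hdfs_table") %>%              # hypothetical table name
      head(10) %>%
      collect()
  })
}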

Using SparkR and Sparklyr simultaneously

Submitted by 时光怂恿深爱的人放手 on 2019-12-18 08:48:02
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. Sparklyr is newer and still needs to grow in scope of functionality. I therefore think that one currently needs to use both packages to get the full scope of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use the packages in parallel, I guess. But is it actually possible? What are your best practices?

Access a table outside the default schema (database) from sparklyr

Submitted by 送分小仙女□ on 2019-12-18 08:27:57
Question: After I managed to connect to our (new) cluster using sparklyr with the yarn-client method, I can now see only the tables in the default schema. How can I access schema.table? Using DBI it works, e.g. with the following line:

dbGetQuery(sc, "SELECT * FROM schema.table LIMIT 10")

In HUE, I can see all tables from all schemas. ~g

Answer 1: You can either use a fully qualified name to register a temporary view:

spark_session(sc) %>% invoke("table", "my_database.my_table") %>% invoke(
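A hedged completion of the truncated answer above (the createOrReplaceTempView call is my assumption), plus an alternative via dbplyr::in_schema(); the database and table names are placeholders.

library(sparklyr)
library(dplyr)

# Register the fully qualified table as a temporary view, then query it with dplyr:
spark_session(sc) %>%
  invoke("table", "my_database.my_table") %>%
  invoke("createOrReplaceTempView", "my_table")
tbl(sc, "my_table")

# Alternative sketch: point tbl() at the other database directly.
tbl(sc, dbplyr::in_schema("my_database", "my_table"))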

Why can't I use the double colon operator with dplyr when the dataset is in sparklyr?

Submitted by 江枫思渺然 on 2019-12-13 16:08:34
Question: A reproducible example (adapted from @forestfanjoe's answer):

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")

> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30
5     5          1.54
6     6          0.936

fix_PaymentHistory <- function(df){df %>% dplyr::mutate(PaymentHistory = dplyr::if_else(PaymentHistory < 0, 0, dplyr:
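A hedged note and sketch (mine, not from the thread): dbplyr translates bare function names such as if_else() into Spark SQL, but a namespace-qualified call like dplyr::if_else() is treated as a local R function and fails against a remote table. Dropping the dplyr:: prefixes inside mutate() is one workaround.

fix_PaymentHistory <- function(df) {
  df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0, PaymentHistory))
}

df %>% fix_PaymentHistory() %>% head()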

R: NaN came up after as.numeric() while using dplyr with sparklyr / manipulating data with pipes in a sparklyr connection [duplicate]

Submitted by 大憨熊 on 2019-12-12 19:35:04
Question: This question already has an answer here: "R: How can I extract an element from a column of data in spark connection (sparklyr) in pipe" (1 answer). Closed last year.

I began to use sparklyr to handle big data, so I need to use only pipelines. But while manipulating the data frame I ran into trouble. Below is how my data looks, as shown by csj %>% head(). [screenshot of the data not preserved]

What I want to do first is make a new column, length_of_review, counting the number of