sparklyr

Access a table in a schema (database) other than the default from sparklyr

Submitted by 扶醉桌前 on 2019-11-29 14:19:30
After managing to connect to our (new) cluster using sparklyr with the yarn-client method, I can only list the tables from the default schema. How can I access scheme.table? Using DBI it works, e.g. with the following line:

dbGetQuery(sc, "SELECT * FROM scheme.table LIMIT 10")

In HUE, I can see all tables from all schemas. ~g

You can either use a fully qualified name to register a temporary view:

spark_session(sc) %>%
  invoke("table", "my_database.my_table") %>%
  invoke("createOrReplaceTempView", "my_view")

tbl(sc, "my_view")

or use the sql method to switch databases:

spark_session(sc) %>% invoke("sql", "USE my_database")
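Another option, not mentioned in the excerpt above, is to point dplyr at the fully qualified table through dbplyr::in_schema(). A minimal sketch, assuming a database named my_database containing my_table (both names are placeholders):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

# The query stays lazy and runs inside Spark; in_schema() just qualifies the name.
my_tbl <- tbl(sc, dbplyr::in_schema("my_database", "my_table"))

my_tbl %>% head(10)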

Running out of heap space in sparklyr, but have plenty of memory

Submitted by 寵の児 on 2019-11-29 11:16:20
I am getting heap space errors on even fairly small datasets, and I can be sure that I'm not running out of system memory. For example, consider a dataset of about 20M rows and 9 columns that takes up 1GB on disk. I am working with it on a Google Compute node with 30GB of memory. Let's say that I have this data in a dataframe called df. The following works fine, albeit somewhat slowly:

library(tidyverse)

uniques <- df %>% group_by(my_key) %>% summarise() %>% ungroup()

The equivalent operation through sparklyr throws java.lang.OutOfMemoryError: Java heap space:

library(tidyverse)
library(sparklyr)

sc <- spark_connect(master = "local")
df_tbl <- copy_to(sc, df)
uniques <- df_tbl %>% group_by(my_key) %>% summarise() %>% ungroup() %>% collect()
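A common cause here is that local-mode Spark does all of its work in the driver JVM, whose default heap is far smaller than the machine's 30GB of RAM. A minimal sketch of raising the driver heap through spark_config() before connecting; the 16G figure is illustrative, not from the original post:

library(sparklyr)

config <- spark_config()
# In local mode the driver executes every task, so give it most of the available heap.
config$`sparklyr.shell.driver-memory` <- "16G"

sc <- spark_connect(master = "local", config = config)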

How to use a predicate while reading from JDBC connection?

Submitted by £可爱£侵袭症+ on 2019-11-29 09:54:29
Question: By default, spark_read_jdbc() reads an entire database table into Spark. I've used the following syntax to create these connections:

library(sparklyr)
library(dplyr)

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

sc <- spark_connect(master = "local",
                    version = "1.6.0",
                    hadoop_version = 2.4,
                    config = config)

db_tbl <- sc %>%
  spark_read_jdbc(sc = .,
                  name = "table_name",
                  options = list(url = "jdbc:mysql:/
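The excerpt is cut off mid-options, but Spark's JDBC source accepts a parenthesised subquery wherever a table name is expected, which is one way to push a predicate down to the database at read time. A minimal sketch reusing the MySQL connection above; the URL, credentials, and the status filter are placeholders:

db_tbl <- spark_read_jdbc(
  sc,
  name = "filtered_table",
  options = list(
    url = "jdbc:mysql://localhost:3306/mydb",
    user = "user",
    password = "password",
    # only the matching rows are pulled from MySQL into Spark
    dbtable = "(SELECT * FROM table_name WHERE status = 'active') AS t"
  )
)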

Transfer data from database to Spark using sparklyr

Submitted by 末鹿安然 on 2019-11-29 02:39:30
I have some data in a database, and I want to work with it in Spark using sparklyr. I can use a DBI-based package to import the data from the database into R:

dbconn <- dbConnect(<some connection args>)
data_in_r <- dbReadTable(dbconn, "a table")

then copy the data from R to Spark using:

sconn <- spark_connect(<some connection args>)
data_ptr <- copy_to(sconn, data_in_r)

Copying twice is slow for big datasets. How can I copy data directly from the database into Spark? sparklyr has several spark_read_*() functions for import, but nothing obviously database related. sdf_import() looks like a possibility, but it isn't clear how to use it in this context.
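A route that skips the R round-trip entirely is sparklyr's spark_read_jdbc(), which has Spark's executors read from the database themselves. A minimal sketch with placeholder connection details (the driver jar path, URL, and credentials are illustrative):

library(sparklyr)

config <- spark_config()
# the JDBC driver jar must be on Spark's classpath
config$`sparklyr.shell.driver-class-path` <- "/path/to/postgresql-42.2.5.jar"

sc <- spark_connect(master = "local", config = config)

data_ptr <- spark_read_jdbc(
  sc,
  name = "a_table_in_spark",
  options = list(
    url = "jdbc:postgresql://dbhost:5432/mydb",
    user = "user",
    password = "password",
    dbtable = "a_table"
  )
)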

SparkR vs sparklyr [closed]

Submitted by 江枫思渺然 on 2019-11-28 15:53:01
Does anyone have an overview of the advantages and disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr interface). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code? Best

The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark: https://spark.apache.org/docs/2.0.1/sparkr.html#applying
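To make that UDF capability concrete, here is a small sketch of SparkR's dapply(), which applies an ordinary R function to each partition of a Spark DataFrame; the mtcars data and the doubled-mpg column are illustrative only:

library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply() runs an arbitrary R function on each partition;
# the result schema must be declared up front.
schema <- structType(structField("mpg", "double"),
                     structField("mpg_doubled", "double"))

result <- dapply(df, function(part) {
  data.frame(mpg = part$mpg, mpg_doubled = part$mpg * 2)
}, schema)

head(collect(result))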

Convert Double to Date using Spark in R

Submitted by 痴心易碎 on 2019-11-28 10:50:36
Question: I have an R data frame as below:

Date        @AD.CC_CC  @AD.CC_CC.1  @CL.CC_CC  @CL.CC_CC.1
2018-02-05       -380         -380      -1580        -1580
2018-02-06         20           20       -280         -280
2018-02-07       -700         -700      -1730        -1730
2018-02-08       -460         -460      -1100        -1100
2018-02-09        260          260      -1780        -1780
2018-02-12        480          480        380          380

I use the copy_to function to copy the dataframe to Spark. After the copy, every column, including Date, has been converted to double:

# Source: lazy query [?? x 5]
# Database: spark_connection
  Date  AD_CC_CC  AD_CC_CC_1  CL_CC_CC  CL_CC_CC_1
 <dbl>     <dbl>       <dbl>     <dbl>       <dbl>
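The excerpt stops before any answer. If the double in the Date column is the number of days since 1970-01-01 (which is how R stores Date values), one way to recover a proper date inside Spark is to lean on Spark SQL's date functions; sparklyr passes function names it does not recognise, such as to_date() and date_add(), straight through to Spark SQL. A sketch under that assumption, with spark_df standing in for the copied table:

library(dplyr)

spark_df_fixed <- spark_df %>%
  # interpret the double as a day offset from the Unix epoch
  mutate(Date = date_add(to_date("1970-01-01"), as.integer(Date)))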

SparklyR separate one Spark DataFrame column into two columns

Submitted by 两盒软妹~` on 2019-11-28 10:16:23
Question: I have a dataframe containing a column named COL whose values are structured like this: VALUE1###VALUE2. The following code works:

library(sparklyr)
library(tidyr)
library(dplyr)

mParams <- collect(filter(input_DF, TYPE == ('MIN')))
mParams <- separate(mParams, COL, c('col1', 'col2'), '\\###', remove = FALSE)

If I remove the collect, I get this error:

Error in UseMethod("separate_") :
  no applicable method for 'separate_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Is there a way to split the column without first collecting the data into R?
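The excerpt ends before any answer. One approach that stays inside Spark is to let Spark SQL's split() build an array column and then expand it with sparklyr::sdf_separate_column(). A minimal sketch under those assumptions, keeping the names from the question:

library(sparklyr)
library(dplyr)

result <- input_DF %>%
  filter(TYPE == 'MIN') %>%
  # split() is evaluated by Spark SQL and returns an array column
  mutate(COL_parts = split(COL, "###")) %>%
  sdf_separate_column("COL_parts", into = c("col1", "col2")) %>%
  select(-COL_parts)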

How to filter on partial match using sparklyr

Submitted by 左心房为你撑大大i on 2019-11-28 03:51:51
Question: I'm new to sparklyr (but familiar with Spark and PySpark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, I'd write the operation as:

businesses %>%
  filter(grepl('test', biz_name)) %>%
  head

Running that code on a Spark dataframe, however, gives me:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GREPL'.
This function is neither a registered temporary function nor a permanent function registered in the database
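The excerpt ends before any answer. The usual workaround is to call a function that Spark SQL itself understands, because sparklyr passes unrecognised function names straight through to the SQL engine. A minimal sketch using Spark SQL's instr(), with the table and column names taken from the question:

library(dplyr)

businesses %>%
  # instr() returns the position of 'test' inside biz_name, or 0 if absent
  filter(instr(biz_name, "test") > 0) %>%
  head

Spark SQL's rlike operator is an alternative when a regular expression rather than a plain substring match is needed.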

Gather in sparklyr

Submitted by 早过忘川 on 2019-11-28 00:30:25
I am using sparklyr to manipulate some data. Given a:

a <- tibble(id = rep(c(1, 10), each = 10),
            attribute1 = rep(c("This", "That", "These", "Those", "The",
                               "Other", "Test", "End", "Start", "Beginning"), 2),
            value = rep(seq(10, 100, by = 10), 2),
            average = rep(c(50, 100), each = 10),
            upper_bound = rep(c(80, 130), each = 10),
            lower_bound = rep(c(20, 70), each = 10))

I would like to use "gather" to reshape the data, like this:

b <- a %>% gather(key = type_data, value = value_data, -c(id:attribute1))

However, "gather" is not available on sparklyr. I have seen some people using sdf_pivot to mimic "gather", but it is not clear how to make it produce this result.
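The excerpt is cut off before any answer. One pattern that works entirely inside Spark is to build one long slice per value column and stack the slices with sparklyr::sdf_bind_rows(); the helper below (sdf_gather is a made-up name, not a sparklyr function) is a sketch of that idea, assuming an existing connection sc:

library(sparklyr)
library(dplyr)

sdf_gather <- function(tbl, gather_cols, key = "type_data", value = "value_data") {
  id_cols <- setdiff(colnames(tbl), gather_cols)
  pieces <- lapply(gather_cols, function(col) {
    tbl %>%
      select(all_of(id_cols), all_of(col)) %>%
      mutate(!!key := col) %>%                 # record which column this slice came from
      rename(!!value := !!rlang::sym(col))
  })
  # stack the per-column slices into one long Spark table
  do.call(sdf_bind_rows, pieces)
}

# usage: copy `a` to Spark, then gather the four numeric columns
a_spk <- copy_to(sc, a, overwrite = TRUE)
b_spk <- sdf_gather(a_spk, c("value", "average", "upper_bound", "lower_bound"))

Recent sparklyr releases have also been adding support for tidyr verbs such as pivot_longer(), which may remove the need for a helper like this.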