sparklyr

What is the equivalent of R's list() function in sparklyr?

戏子无情 submitted on 2020-01-21 09:50:55
Question: Below is a sample R code. I would like to do the same in sparklyr.

    custTrans1 <- Pdt_table %>%
      group_by(Main_CustomerID) %>%
      summarise(Invoice = as.vector(list(Invoice_ID)),
                Industry = as.vector(list(Industry)))

where Pdt_table is a Spark data frame and Main_CustomerID, Invoice_ID and Industry are variables. I would like to create a list of the above variables and convert it to a vector. How can I do it in sparklyr?

Answer 1: You can use collect_list or collect_set:

    set.seed(1)
    df <- copy_to(sc, tibble…
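The answer above is cut off mid-snippet. As a hedged sketch of the approach it names, collect_list() is a Spark SQL aggregate (not an R function) that sparklyr forwards to Spark unchanged, yielding an array column per group; the table and column names are taken from the question:

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")

    # collect_list() has no R translation; dplyr passes it through to Spark
    # SQL, so each group collapses into an array of its Invoice_ID / Industry
    # values (use collect_set() instead to drop duplicates).
    custTrans1 <- Pdt_table %>%
      group_by(Main_CustomerID) %>%
      summarise(Invoice = collect_list(Invoice_ID),
                Industry = collect_list(Industry))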

How to delete a Spark DataFrame using sparklyr?

人走茶凉 submitted on 2020-01-16 16:47:32
Question: I have created a Spark dataframe called "iris" using the code below:

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris)

Now I want to delete the Spark dataframe "iris" (not the dataframe in R). How do I do that?

Answer 1: This strictly depends on what you mean when you say delete dataframe. You have to remember that, in general, Spark data frames are not the same type of objects as your plain local data structures. A Spark DataFrame is rather a description than a…
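The answer is truncated above. A minimal sketch of one common way to remove the table, assuming "delete" means freeing the cached data and dropping the temporary view that copy_to() registered (tbl_uncache() is a sparklyr call; db_drop_table() is a dplyr/dbplyr generic that sparklyr implements):

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris)

    tbl_uncache(sc, "iris")      # unpersist the cached data from executor memory
    db_drop_table(sc, "iris")    # drop the temporary view named "iris"
    rm(iris_tbl)                 # remove the R-side reference as well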

How to find columns having missing data in sparklyr

社会主义新天地 submitted on 2020-01-14 11:58:31
Question: Example sample data:

    Si     K     Ca    Ba  Fe    Type
    71.78  0.06  8.75  0   0     1
    72.73  0.48  7.83  0   0     1
    72.99  0.39  7.78  0   0     1
    72.61  0.57  na    0   0     na
    73.08  0.55  8.07  0   0     1
    72.97  0.64  8.07  0   na    1
    73.09  na    8.17  0   0     1
    73.24  0.57  8.24  0   0     1
    72.08  0.56  8.3   0   0     1
    72.99  0.57  8.4   0   0.11  1
    na     0.67  8.09  0   0.24  1

We can load the data into sparklyr with the following code:

    sdf_copy_to(sc, sampledata)

I am looking for a query that returns the columns having NA values, for example like:

    si  k  ca  fe
    1   1  1   2

Answer 1: This problem is actually a…
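The answer is cut off above. A hedged sketch of one way to get per-column missing counts, assuming the "na" entries arrive in Spark as real NULL/NaN values (is.na() translates to Spark SQL's IS NULL test):

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    sdf <- sdf_copy_to(sc, sampledata, overwrite = TRUE)

    # Sum an is-missing flag per column; the aggregation runs inside Spark
    # and only the small table of counts is collected back to R.
    sdf %>%
      summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE)) %>%
      collect()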

CSV file creation Error in spark_expect_jobj_class

人盡茶涼 submitted on 2020-01-05 13:09:28
Question: I want to create a CSV file. While running the following sparklyr code, it gives an error.

    sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
    sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                                  header = TRUE, memory = FALSE, overwrite = TRUE)
    sdf_schema_viewer(sample_tbl)  # to create db schema
    df <- spark_dataframe(sample_tbl)
    spark_write_table(df, path = "data.csv", header = TRUE, delimiter = ",",
                      charset = "UTF-8", null_value = NULL, options = list(), mode =…
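No answer is included in this excerpt. A hedged sketch of the usual fix for this kind of spark_expect_jobj_class error: write CSV with spark_write_csv() (spark_write_table() targets tables, not CSV paths) and pass the tbl returned by spark_read_json() directly rather than wrapping it in spark_dataframe():

    # Assumes sample_tbl from the question; header, delimiter, charset and
    # mode are arguments spark_write_csv() accepts in sparklyr.
    spark_write_csv(sample_tbl, path = "data.csv",
                    header = TRUE, delimiter = ",",
                    charset = "UTF-8", mode = "overwrite")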

Why doesn't ml_create_dummy_variables show new dummy variable columns in sparklyr

落爺英雄遲暮 submitted on 2020-01-02 08:43:07
Question: I'm trying to create a model matrix in sparklyr. There is a function, ml_create_dummy_variables(), for creating dummy variables for one categorical variable at a time. As far as I can tell, there is no model.matrix() equivalent for creating a model matrix in one step. It's easy to use ml_create_dummy_variables(), but I don't understand why the new dummy variables aren't stored in the Spark dataframe. Consider this example:

    ### create dummy data to figure out how model matrix formulas work in…
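The example and any answer are truncated above. A hedged guess at the explanation, based on how Spark behaves in general rather than on the missing answer: Spark dataframes are immutable, so ml_create_dummy_variables() returns a new tbl carrying the dummy columns instead of modifying its input in place, and that result has to be captured. The column name below is hypothetical:

    # Assign the returned tbl; the original df_tbl is left untouched.
    df_tbl <- ml_create_dummy_variables(df_tbl, "category_col")
    colnames(df_tbl)   # the dummy columns appear on the returned tbl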

Out of memory error when collecting data out of Spark cluster

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-01 04:16:27
Question: I know there are plenty of questions on SO about out-of-memory errors on Spark, but I haven't found a solution to mine. I have a simple workflow:

1. read in ORC files from Amazon S3
2. filter down to a small subset of rows
3. select a small subset of columns
4. collect into the driver node (so I can do additional operations in R)

When I run the above and then cache the table to Spark memory it takes up <2GB - tiny compared to the memory available to my cluster - then I get an OOM error when I try to…
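The post is truncated before any answer. A hedged sketch of the usual first remedy when collect() triggers the OOM: the failure is typically on the driver, whose default heap is small regardless of cluster size, so raise the driver memory and result-size cap in the connection config (the sizes below are placeholders):

    library(sparklyr)
    conf <- spark_config()
    conf$`sparklyr.shell.driver-memory` <- "8G"   # heap for the driver JVM
    conf$spark.driver.maxResultSize <- "4G"       # cap on results sent to the driver
    sc <- spark_connect(master = "local", config = conf)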

Sparklyr - Change column names in a Spark dataframe

前提是你 submitted on 2019-12-29 08:06:07
Question:

    df <- data.frame(old1 = LETTERS, old2 = 1)
    df_tbl <- copy_to(sc, df, "df")
    df_tbl <- df_tbl %>% dplyr::rename(old1 = new1, old2 = new2)

returns:

    > head(df_tbl)
    Error: `new1`, `new2` contains unknown variables

Is there an easy way to change the column names using sparklyr?

Answer 1: First of all, you mixed up the order:

    df_tbl %>% rename(new1 = old1, new2 = old2)

but with sparklyr you have to use select:

    df_tbl %>% select(new1 = old1, new2 = old2)

Source: https://stackoverflow.com/questions/45514906
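A brief usage note on the answer (the assignment back to df_tbl is my addition, not part of the answer): select() returns a new tbl rather than renaming in place, so capture the result to keep working with the new names:

    df_tbl <- df_tbl %>% select(new1 = old1, new2 = old2)
    head(df_tbl)   # columns now appear as new1 and new2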

Transfer data from database to Spark using sparklyr

半城伤御伤魂 submitted on 2019-12-29 05:36:05
Question: I have some data in a database, and I want to work with it in Spark, using sparklyr. I can use a DBI-based package to import the data from the database into R:

    dbconn <- dbConnect(<some connection args>)
    data_in_r <- dbReadTable(dbconn, "a table")

then copy the data from R to Spark using:

    sconn <- spark_connect(<some connection args>)
    data_ptr <- copy_to(sconn, data_in_r)

Copying twice is slow for big datasets. How can I copy data directly from the database into Spark? sparklyr has several…
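The post is truncated above. A hedged sketch of the standard direct route, having Spark read from the database over JDBC so the data never passes through the R session; the driver jar path, JDBC URL, credentials, and table name are all placeholders:

    library(sparklyr)
    conf <- spark_config()
    # Ship the database's JDBC driver to Spark (the path is an assumption).
    conf$sparklyr.jars.default <- "/path/to/jdbc-driver.jar"
    sc <- spark_connect(master = "local", config = conf)

    data_ptr <- spark_read_jdbc(
      sc, name = "a_table",
      options = list(
        url      = "jdbc:postgresql://host:5432/dbname",
        dbtable  = "a_table",
        user     = "user",
        password = "password",
        driver   = "org.postgresql.Driver"
      )
    )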

How to export sparklyr (Spark ML) models to PMML?

喜欢而已 submitted on 2019-12-25 09:31:08
Question: I know that Spark ML pipelines can be exported to PMML using the JPMML-SparkML library. I am just struggling to find out how I could do it from R using sparklyr. I am aware of an open GitHub issue where two ideas were raised:

1. using the Scala API, something like:

       model <- ml_kmeans(<...>)
       sparkapi::invoke(model$.model, "toPMML", "./myModelPMML.xml")

2. leveraging https://github.com/jpmml/jpmml-converter and https://github.com/jpmml/jpmml-sparkml

However, I could not find any follow-ups on those tips.
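No answer is included in this excerpt. A hedged expansion of the invoke() idea quoted above, with two caveats: in the sparklyr of that era the fitted model kept its Java reference in model$.model (sparkapi was later folded into sparklyr, whose invoke() does the same thing), and the call only succeeds when the underlying Java class actually implements toPMML(String) - Spark's older mllib models do, while most Spark ML pipeline models need JPMML-SparkML instead:

    library(sparklyr)
    # Old-style sparklyr k-means API, as in the question's snippet.
    model <- ml_kmeans(iris_tbl, centers = 3)
    # Calls the model's toPMML(path) on the JVM side, if the class supports it.
    invoke(model$.model, "toPMML", "./myModelPMML.xml")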
