sparklyr

What is the equivalent of R's list() function in sparklyr?

戏子无情 submitted on 2020-01-21 09:50:55
Question: Below is a sample R code. I would like to do the same in sparklyr.

    custTrans1 <- Pdt_table %>%
      group_by(Main_CustomerID) %>%
      summarise(Invoice = as.vector(list(Invoice_ID)),
                Industry = as.vector(list(Industry)))

where Pdt_table is a Spark data frame and Main_CustomerID, Invoice_ID and Industry are variables. I would like to create a list of the above variables and convert it to a vector. How can I do it in sparklyr?

Answer 1: You can use collect_list or collect_set:

    set.seed(1)
    df <- copy_to(sc, tibble…
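The answer above is cut off mid-snippet. As a hedged sketch of the approach it names, collect_list() is a Spark SQL aggregate (not an R function) that sparklyr forwards to Spark unchanged, yielding an array column per group; the table and column names are taken from the question:

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")

    # collect_list() has no R translation; dplyr passes it through to Spark
    # SQL, so each group collapses into an array of its Invoice_ID / Industry
    # values (use collect_set() instead to drop duplicates).
    custTrans1 <- Pdt_table %>%
      group_by(Main_CustomerID) %>%
      summarise(Invoice = collect_list(Invoice_ID),
                Industry = collect_list(Industry))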

How to delete a Spark DataFrame using sparklyr?

人走茶凉 submitted on 2020-01-16 16:47:32
Question: I have created a Spark dataframe called "iris" using the code below:

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris)

Now I want to delete the Spark dataframe "iris" (not the dataframe in R). How do I do that?

Answer 1: This strictly depends on what you mean when you say delete dataframe. You have to remember that, in general, Spark data frames are not the same type of objects as your plain local data structures. A Spark DataFrame is rather a description than a…
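The answer is truncated above. A minimal sketch of one common way to remove the table, assuming "delete" means freeing the cached data and dropping the temporary view that copy_to() registered (tbl_uncache() is a sparklyr call; db_drop_table() is a dplyr/dbplyr generic that sparklyr implements):

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris)

    tbl_uncache(sc, "iris")      # unpersist the cached data from executor memory
    db_drop_table(sc, "iris")    # drop the temporary view named "iris"
    rm(iris_tbl)                 # remove the R-side reference as well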

How to find columns having missing data in sparklyr

社会主义新天地 submitted on 2020-01-14 11:58:31
Question: Example sample data:

    Si     K     Ca    Ba  Fe    Type
    71.78  0.06  8.75  0   0     1
    72.73  0.48  7.83  0   0     1
    72.99  0.39  7.78  0   0     1
    72.61  0.57  na    0   0     na
    73.08  0.55  8.07  0   0     1
    72.97  0.64  8.07  0   na    1
    73.09  na    8.17  0   0     1
    73.24  0.57  8.24  0   0     1
    72.08  0.56  8.3   0   0     1
    72.99  0.57  8.4   0   0.11  1
    na     0.67  8.09  0   0.24  1

We can load the data into sparklyr with the following code:

    sdf_copy_to(sc, sampledata)

I am looking for a query that returns the columns having NA values, for example like:

    si  k  ca  fe
    1   1  1   2

Answer 1: This problem is actually a…
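The answer is cut off above. A hedged sketch of one way to get per-column missing counts, assuming the "na" entries arrive in Spark as real NULL/NaN values (is.na() translates to Spark SQL's IS NULL test):

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    sdf <- sdf_copy_to(sc, sampledata, overwrite = TRUE)

    # Sum an is-missing flag per column; the aggregation runs inside Spark
    # and only the small table of counts is collected back to R.
    sdf %>%
      summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE)) %>%
      collect()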

CSV file creation Error in spark_expect_jobj_class

人盡茶涼 submitted on 2020-01-05 13:09:28
Question: I want to create a CSV file. While running the following sparklyr code, it gives an error.

    sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
    sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                                  header = TRUE, memory = FALSE, overwrite = TRUE)
    sdf_schema_viewer(sample_tbl)  # to create db schema
    df <- spark_dataframe(sample_tbl)
    spark_write_table(df, path = "data.csv", header = TRUE, delimiter = ",",
                      charset = "UTF-8", null_value = NULL, options = list(), mode =…
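No answer is included in this excerpt. A hedged sketch of the usual fix for this kind of spark_expect_jobj_class error: write CSV with spark_write_csv() (spark_write_table() targets tables, not CSV paths) and pass the tbl returned by spark_read_json() directly rather than wrapping it in spark_dataframe():

    # Assumes sample_tbl from the question; header, delimiter, charset and
    # mode are arguments spark_write_csv() accepts in sparklyr.
    spark_write_csv(sample_tbl, path = "data.csv",
                    header = TRUE, delimiter = ",",
                    charset = "UTF-8", mode = "overwrite")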

Why doesn't ml_create_dummy_variables show new dummy variable columns in sparklyr

落爺英雄遲暮 submitted on 2020-01-02 08:43:07
Question: I'm trying to create a model matrix in sparklyr. There is a function, ml_create_dummy_variables(), for creating dummy variables for one categorical variable at a time. As far as I can tell, there is no model.matrix() equivalent for creating a model matrix in one step. It's easy to use ml_create_dummy_variables(), but I don't understand why the new dummy variables aren't stored in the Spark dataframe. Consider this example:

    ### create dummy data to figure out how model matrix formulas work in…
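The example and any answer are truncated above. A hedged guess at the explanation, based on how Spark behaves in general rather than on the missing answer: Spark dataframes are immutable, so ml_create_dummy_variables() returns a new tbl carrying the dummy columns instead of modifying its input in place, and that result has to be captured. The column name below is hypothetical:

    # Assign the returned tbl; the original df_tbl is left untouched.
    df_tbl <- ml_create_dummy_variables(df_tbl, "category_col")
    colnames(df_tbl)   # the dummy columns appear on the returned tbl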

Out of memory error when collecting data out of Spark cluster

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-01 04:16:27
Question: I know there are plenty of questions on SO about out-of-memory errors on Spark, but I haven't found a solution to mine. I have a simple workflow:

1. read in ORC files from Amazon S3
2. filter down to a small subset of rows
3. select a small subset of columns
4. collect into the driver node (so I can do additional operations in R)

When I run the above and then cache the table to Spark memory it takes up <2GB - tiny compared to the memory available to my cluster - then I get an OOM error when I try to…
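The post is truncated before any answer. A hedged sketch of the usual first remedy when collect() triggers the OOM: the failure is typically on the driver, whose default heap is small regardless of cluster size, so raise the driver memory and result-size cap in the connection config (the sizes below are placeholders):

    library(sparklyr)
    conf <- spark_config()
    conf$`sparklyr.shell.driver-memory` <- "8G"   # heap for the driver JVM
    conf$spark.driver.maxResultSize <- "4G"       # cap on results sent to the driver
    sc <- spark_connect(master = "local", config = conf)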

Sparklyr - Change column names in a Spark dataframe

前提是你 submitted on 2019-12-29 08:06:07
Question:

    df <- data.frame(old1 = LETTERS, old2 = 1)
    df_tbl <- copy_to(sc, df, "df")
    df_tbl <- df_tbl %>% dplyr::rename(old1 = new1, old2 = new2)

returns:

    > head(df_tbl)
    Error: `new1`, `new2` contains unknown variables

Is there an easy way to change the column names using sparklyr?

Answer 1: First of all, you mixed up the order:

    df_tbl %>% rename(new1 = old1, new2 = old2)

but with sparklyr you have to use select:

    df_tbl %>% select(new1 = old1, new2 = old2)

Source: https://stackoverflow.com/questions/45514906
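A brief usage note on the answer (the assignment back to df_tbl is my addition, not part of the answer): select() returns a new tbl rather than renaming in place, so capture the result to keep working with the new names:

    df_tbl <- df_tbl %>% select(new1 = old1, new2 = old2)
    head(df_tbl)   # columns now appear as new1 and new2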

Transfer data from database to Spark using sparklyr

半城伤御伤魂 submitted on 2019-12-29 05:36:05
Question: I have some data in a database, and I want to work with it in Spark, using sparklyr. I can use a DBI-based package to import the data from the database into R:

    dbconn <- dbConnect(<some connection args>)
    data_in_r <- dbReadTable(dbconn, "a table")

then copy the data from R to Spark using:

    sconn <- spark_connect(<some connection args>)
    data_ptr <- copy_to(sconn, data_in_r)

Copying twice is slow for big datasets. How can I copy data directly from the database into Spark? sparklyr has several…
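The post is truncated above. A hedged sketch of the standard direct route, having Spark read from the database over JDBC so the data never passes through the R session; the driver jar path, JDBC URL, credentials, and table name are all placeholders:

    library(sparklyr)
    conf <- spark_config()
    # Ship the database's JDBC driver to Spark (the path is an assumption).
    conf$sparklyr.jars.default <- "/path/to/jdbc-driver.jar"
    sc <- spark_connect(master = "local", config = conf)

    data_ptr <- spark_read_jdbc(
      sc, name = "a_table",
      options = list(
        url      = "jdbc:postgresql://host:5432/dbname",
        dbtable  = "a_table",
        user     = "user",
        password = "password",
        driver   = "org.postgresql.Driver"
      )
    )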

How to export sparklyr (Spark ML) models to PMML?

喜欢而已 submitted on 2019-12-25 09:31:08
Question: I know that Spark ML pipelines can be exported to PMML using the JPMML-SparkML library. I am just struggling to find out how I could do it from R using sparklyr. I am aware of an open GitHub issue where two ideas were raised:

1. using the Scala API, something like:

       model <- ml_kmeans(<...>)
       sparkapi::invoke(model$.model, "toPMML", "./myModelPMML.xml")

2. leveraging https://github.com/jpmml/jpmml-converter and https://github.com/jpmml/jpmml-sparkml

However, I could not find any follow-ups on those tips.
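No answer is included in this excerpt. A hedged expansion of the invoke() idea quoted above, with two caveats: in the sparklyr of that era the fitted model kept its Java reference in model$.model (sparkapi was later folded into sparklyr, whose invoke() does the same thing), and the call only succeeds when the underlying Java class actually implements toPMML(String) - Spark's older mllib models do, while most Spark ML pipeline models need JPMML-SparkML instead:

    library(sparklyr)
    # Old-style sparklyr k-means API, as in the question's snippet.
    model <- ml_kmeans(iris_tbl, centers = 3)
    # Calls the model's toPMML(path) on the JVM side, if the class supports it.
    invoke(model$.model, "toPMML", "./myModelPMML.xml")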
