sparklyr

Extract and Visualize Model Trees from Sparklyr

Submitted by 泪湿孤枕 on 2019-12-06 03:27:34
Question: Does anyone have any advice about how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into (a) a format that can be understood by other R tree-related libraries and, ultimately, (b) a visualization of the trees for non-technical consumption? This would include the ability to convert back to the actual feature names from the substituted string-indexing values produced during vector assembly. The following code is copied liberally from a sparklyr blog post for the purposes of providing an example.
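
A minimal sketch of one way to get at the raw tree structure, not taken from the question itself: the fitted Spark model exposes a toDebugString method on its underlying Java object, which sparklyr's invoke() can reach. Whether the fitted stage lives at model$model varies across sparklyr versions, so treat that accessor as an assumption; the splits printed are still reported by feature index, not name.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit a decision tree via the formula interface
dt_model <- ml_decision_tree_classifier(iris_tbl, Species ~ .)

# Reach the underlying Java model and print Spark's text rendering of
# the tree (assumes dt_model$model holds the fitted stage)
dt_model$model %>%
  spark_jobj() %>%
  invoke("toDebugString") %>%
  cat()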

sparklyr: create new column with mutate function

Submitted by 只愿长相守 on 2019-12-06 03:25:23
I'd be very surprised if this kind of problem cannot be solved with sparklyr:

iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
  ...
  aDataFrame %>% mutate(newValue = gsub("-", "", d))
  ...
}

I receive this error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86 at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup
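
The error appears because dplyr verbs on a Spark tbl are translated to Spark SQL, where gsub() does not exist; Spark's own regexp_replace() performs the same substitution and is passed through to Spark untranslated. A minimal sketch, with a hypothetical dates table standing in for aDataFrame:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical stand-in for aDataFrame with a YYYY-MM-DD character column
dates_tbl <- copy_to(sc, data.frame(d = c("2017-01-01", "2017-02-15")), "dates")

# regexp_replace is a Spark SQL function, so the mutate() translates cleanly
dates_tbl %>%
  mutate(newValue = regexp_replace(d, "-", ""))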

Importing cassandra table into spark via sparklyr - possible to select only some columns?

Submitted by 本小妞迷上赌 on 2019-12-06 01:34:14
I've been working with sparklyr to bring large Cassandra tables into Spark, register them with R, and conduct dplyr operations on them. I have been successfully importing Cassandra tables with code that looks like this:

# import cassandra table into spark
cass_df <- sparklyr:::spark_data_read_generic(
  sc, "org.apache.spark.sql.cassandra", "format",
  list(keyspace = "cass_keyspace", table = "cass_table")
) %>%
  invoke("load")

# register table in R
cass_tbl <- sparklyr:::spark_partition_register_df(
  sc, cass_df, name = "cass_table", repartition = 0, memory = TRUE
)

Some of these Cassandra
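
Column pruning can happen lazily after the load: if the table is registered without memory = TRUE, a dplyr select() is folded into the query the Cassandra connector executes, so only the chosen columns are fetched when the result is materialized. A sketch under that assumption (column names are illustrative):

library(dplyr)

# Prune columns before forcing computation; only these columns
# need to be pulled from Cassandra
small_tbl <- tbl(sc, "cass_table") %>%
  select(id, value) %>%          # hypothetical column names
  compute("cass_table_small")    # materialize the pruned result in Spark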

SparklyR removing a Table from Spark Context

Submitted by 孤者浪人 on 2019-12-06 01:27:25
Question: I would like to remove a single data table from the Spark context ('sc'). I know a single cached table can be un-cached, but as far as I can tell this isn't the same as removing an object from the sc.

library(sparklyr)
library(dplyr)
library(titanic)
library(Lahman)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
titanic_tbl <- copy_to(sc, titanic_train, "titanic", overwrite = TRUE)
src_tbls(sc)
# [1] "batting"
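
One way to drop the table itself, rather than just its cache, is the Spark 2.x catalog API, whose dropTempView method sparklyr can reach through invoke(). A sketch (not from the question):

# Drop the temporary view backing the tbl, then discard the R reference
spark_session(sc) %>%
  invoke("catalog") %>%
  invoke("dropTempView", "batting")

rm(batting_tbl)
src_tbls(sc)  # "batting" should no longer be listed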

Unnest (separate) multiple column values into new rows using Sparklyr

Submitted by *爱你&永不变心* on 2019-12-05 12:53:04
I am trying to split column values separated by a comma (,) into new rows based on ids. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.

id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id, name, value)

R solution:

separate_rows(dt, name, sep = ",") %>%
  separate_rows(value, sep = ",")

Desired output from the Spark frame (sparklyr package):

> final_result
  id name value
1  1    A     1
2  1    A     2
3  1    A
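
Spark has no separate_rows(), but its SQL functions split() and explode() do the same row-unnesting, and sparklyr passes them through inside mutate(). A sketch, with each explode in its own mutate() so Spark sees one generator per SELECT (row order may differ from the tidyr output):

library(sparklyr)
library(dplyr)

dt_spark <- copy_to(sc, dt, "dt", overwrite = TRUE)

final_result <- dt_spark %>%
  mutate(name = explode(split(name, ","))) %>%
  mutate(value = explode(split(value, ",")))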

How can I train a random forest with a sparse matrix in Spark?

Submitted by 此生再无相见时 on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr)  # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great')))  # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &
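
Spark represents term-frequency features as sparse vectors natively, so one route is to build the features inside Spark rather than in R: tokenize the text, hash it into a sparse TF vector, and hand that column to the forest. A sketch under those assumptions (num_features is an illustrative choice; label_col and features_col name the columns built above):

pipeline_tbl <- mytext_spark %>%
  ft_tokenizer("text", "tokens") %>%
  ft_hashing_tf("tokens", "features", num_features = 2^12)

# Fit on the named label/features columns; Spark keeps "features" sparse
rf_model <- ml_random_forest_classifier(
  pipeline_tbl,
  label_col = "label",
  features_col = "features"
)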

How to implement Stanford CoreNLP wrapper for Apache Spark using sparklyr?

Submitted by 廉价感情. on 2019-12-05 08:14:28
I am trying to create an R package so I can use the Stanford CoreNLP wrapper for Apache Spark (by Databricks) from R. I am using the sparklyr package to connect to my local Spark instance. I created a package with the following dependency function:

spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    jars = c(
      system.file(
        sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0.jar"),
        package = "sparkNLP"
      ),
      system.file(
        sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0-models.jar"),
        package = "sparkNLP"
      ),
      system.file(
        sprintf("stanford-corenlp-full
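
For spark_dependencies() to be picked up at connection time, sparklyr's extension mechanism expects the package to register itself on load; the documented pattern is an .onLoad hook that calls register_extension. A sketch (the package name sparkNLP is taken from the question):

# In the package's zzz.R: register the extension so sparklyr invokes
# spark_dependencies() when a new connection is opened
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}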

How to pass variables to functions called in spark_apply()?

Submitted by 让人想犯罪 __ on 2019-12-04 19:45:01
I would like to be able to pass extra variables to functions that are called by spark_apply in sparklyr. For example:

# setup
library(sparklyr)
sc <- spark_connect(master = 'local', packages = TRUE)
iris2 <- iris[, 1:(ncol(iris) - 1)]
df1 <- sdf_copy_to(sc, iris2, repartition = 5, overwrite = T)

# This works fine
res <- spark_apply(df1, function(x) kmeans(x, 3)$centers)

# This does not
k <- 3
res <- spark_apply(df1, function(x) kmeans(x, k)$centers)

As an ugly workaround, I can do what I want by saving values into R packages and then referencing them, i.e.:

> myPackage::k_equals_three == 3
[1] TRUE
#
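
spark_apply() has a context argument built for exactly this: the object passed there is serialized to the workers and handed to the applied function as its second parameter. A sketch:

k <- 3

# context is shipped to each worker and arrives as the function's
# second argument
res <- spark_apply(
  df1,
  function(x, ctx) kmeans(x, ctx$k)$centers,
  context = list(k = k)
)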

How to aggregate data by 3-minute timestamps in sparklyr?

Submitted by 狂风中的少年 on 2019-12-04 17:09:21
I am using sparklyr for some quick analysis, and I am having some issues working with timestamps. I have two different dataframes: one with rows at a 1-minute interval and another at a 3-minute interval.

First dataset (1-minute interval):

id timefrom              timeto                value
10 "2017-06-06 10:30:00" "2017-06-06 10:31:00"    50
10 "2017-06-06 10:31:00" "2017-06-06 10:32:00"    80
10 "2017-06-06 10:32:00" "2017-06-06 10:33:00"    20
22 "2017-06-06 10:33:00" "2017-06-06 10:34:00"    30
22 "2017-06-06 10:34:00" "2017-06-06 10:35:00"    50
22 "2017-06-06 10:35:00" "2017-06-06 10:36:00"    50

Second dataset (3-minute interval):

id
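
One way to roll the 1-minute rows up to a 3-minute grid is to floor the Unix timestamp to 180-second buckets inside mutate(); unix_timestamp(), floor(), and from_unixtime() are Spark SQL functions that sparklyr passes through. A sketch, assuming the 1-minute table is called df_1min and that summing value per id is the aggregation wanted:

library(dplyr)

agg_3min <- df_1min %>%
  mutate(bucket = from_unixtime(floor(unix_timestamp(timefrom) / 180) * 180)) %>%
  group_by(id, bucket) %>%
  summarise(value = sum(value, na.rm = TRUE))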
