sparklyr

Extract and Visualize Model Trees from Sparklyr

Submitted by 泪湿孤枕 on 2019-12-06 03:27:34
Question: Does anyone have any advice about how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into (a) a format that can be understood by other R tree-related libraries and, ultimately, (b) a visualization of the trees for non-technical consumption? This would include the ability to convert back to the actual feature names from the substituted string-indexing values produced during vector assembly. The following code is copied liberally from a sparklyr blog post for the purposes of providing an example.
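
A minimal sketch of one way to get at the raw tree structure, not taken from the question itself: the fitted Spark model exposes a toDebugString method on its underlying Java object, which sparklyr's invoke() can reach. Whether the fitted stage lives at model$model varies across sparklyr versions, so treat that accessor as an assumption; the splits printed are still reported by feature index, not name.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit a decision tree via the formula interface
dt_model <- ml_decision_tree_classifier(iris_tbl, Species ~ .)

# Reach the underlying Java model and print Spark's text rendering of
# the tree (assumes dt_model$model holds the fitted stage)
dt_model$model %>%
  spark_jobj() %>%
  invoke("toDebugString") %>%
  cat()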

sparklyr: create new column with mutate function

Submitted by 只愿长相守 on 2019-12-06 03:25:23
I'd be very surprised if this kind of problem cannot be solved with sparklyr:

iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
  ...
  aDataFrame %>% mutate(newValue = gsub("-", "", d))
  ...
}

I receive this error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86 at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup
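
The error appears because dplyr verbs on a Spark tbl are translated to Spark SQL, where gsub() does not exist; Spark's own regexp_replace() performs the same substitution and is passed through to Spark untranslated. A minimal sketch, with a hypothetical dates table standing in for aDataFrame:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical stand-in for aDataFrame with a YYYY-MM-DD character column
dates_tbl <- copy_to(sc, data.frame(d = c("2017-01-01", "2017-02-15")), "dates")

# regexp_replace is a Spark SQL function, so the mutate() translates cleanly
dates_tbl %>%
  mutate(newValue = regexp_replace(d, "-", ""))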

Importing cassandra table into spark via sparklyr - possible to select only some columns?

Submitted by 本小妞迷上赌 on 2019-12-06 01:34:14
I've been working with sparklyr to bring large Cassandra tables into Spark, register them with R, and conduct dplyr operations on them. I have been successfully importing Cassandra tables with code that looks like this:

# import cassandra table into spark
cass_df <- sparklyr:::spark_data_read_generic(
  sc, "org.apache.spark.sql.cassandra", "format",
  list(keyspace = "cass_keyspace", table = "cass_table")
) %>%
  invoke("load")

# register table in R
cass_tbl <- sparklyr:::spark_partition_register_df(
  sc, cass_df, name = "cass_table", repartition = 0, memory = TRUE
)

Some of these Cassandra
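
Column pruning can happen lazily after the load: if the table is registered without memory = TRUE, a dplyr select() is folded into the query the Cassandra connector executes, so only the chosen columns are fetched when the result is materialized. A sketch under that assumption (column names are illustrative):

library(dplyr)

# Prune columns before forcing computation; only these columns
# need to be pulled from Cassandra
small_tbl <- tbl(sc, "cass_table") %>%
  select(id, value) %>%          # hypothetical column names
  compute("cass_table_small")    # materialize the pruned result in Spark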

SparklyR removing a Table from Spark Context

Submitted by 孤者浪人 on 2019-12-06 01:27:25
Question: I would like to remove a single data table from the Spark context ('sc'). I know a single cached table can be un-cached, but as far as I can tell this isn't the same as removing an object from the sc.

library(sparklyr)
library(dplyr)
library(titanic)
library(Lahman)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
titanic_tbl <- copy_to(sc, titanic_train, "titanic", overwrite = TRUE)
src_tbls(sc)
# [1] "batting"
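
One way to drop the table itself, rather than just its cache, is the Spark 2.x catalog API, whose dropTempView method sparklyr can reach through invoke(). A sketch (not from the question):

# Drop the temporary view backing the tbl, then discard the R reference
spark_session(sc) %>%
  invoke("catalog") %>%
  invoke("dropTempView", "batting")

rm(batting_tbl)
src_tbls(sc)  # "batting" should no longer be listed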

Unnest (separate) multiple column values into new rows using Sparklyr

Submitted by *爱你&永不变心* on 2019-12-05 12:53:04
I am trying to split column values separated by a comma (,) into new rows based on ids. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.

id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id, name, value)

R solution:

separate_rows(dt, name, sep = ",") %>%
  separate_rows(value, sep = ",")

Desired output from the Spark frame (sparklyr package):

> final_result
  id name value
1  1    A     1
2  1    A     2
3  1    A
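
Spark has no separate_rows(), but its SQL functions split() and explode() do the same row-unnesting, and sparklyr passes them through inside mutate(). A sketch, with each explode in its own mutate() so Spark sees one generator per SELECT (row order may differ from the tidyr output):

library(sparklyr)
library(dplyr)

dt_spark <- copy_to(sc, dt, "dt", overwrite = TRUE)

final_result <- dt_spark %>%
  mutate(name = explode(split(name, ","))) %>%
  mutate(value = explode(split(value, ",")))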

How can I train a random forest with a sparse matrix in Spark?

Submitted by 此生再无相见时 on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr)  # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great')))  # create a fake label variable

mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &
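
Spark represents term-frequency features as sparse vectors natively, so one route is to build the features inside Spark rather than in R: tokenize the text, hash it into a sparse TF vector, and hand that column to the forest. A sketch under those assumptions (num_features is an illustrative choice; label_col and features_col name the columns built above):

pipeline_tbl <- mytext_spark %>%
  ft_tokenizer("text", "tokens") %>%
  ft_hashing_tf("tokens", "features", num_features = 2^12)

# Fit on the named label/features columns; Spark keeps "features" sparse
rf_model <- ml_random_forest_classifier(
  pipeline_tbl,
  label_col = "label",
  features_col = "features"
)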

How to implement Stanford CoreNLP wrapper for Apache Spark using sparklyr?

Submitted by 廉价感情. on 2019-12-05 08:14:28
I am trying to create an R package so I can use the Stanford CoreNLP wrapper for Apache Spark (by Databricks) from R. I am using the sparklyr package to connect to my local Spark instance. I created a package with the following dependency function:

spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    jars = c(
      system.file(
        sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0.jar"),
        package = "sparkNLP"
      ),
      system.file(
        sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0-models.jar"),
        package = "sparkNLP"
      ),
      system.file(
        sprintf("stanford-corenlp-full
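
For spark_dependencies() to be picked up at connection time, sparklyr's extension mechanism expects the package to register itself on load; the documented pattern is an .onLoad hook that calls register_extension. A sketch (the package name sparkNLP is taken from the question):

# In the package's zzz.R: register the extension so sparklyr invokes
# spark_dependencies() when a new connection is opened
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}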

How to pass variables to functions called in spark_apply()?

Submitted by 让人想犯罪 __ on 2019-12-04 19:45:01
I would like to be able to pass extra variables to functions that are called by spark_apply in sparklyr. For example:

# setup
library(sparklyr)
sc <- spark_connect(master = 'local', packages = TRUE)
iris2 <- iris[, 1:(ncol(iris) - 1)]
df1 <- sdf_copy_to(sc, iris2, repartition = 5, overwrite = T)

# This works fine
res <- spark_apply(df1, function(x) kmeans(x, 3)$centers)

# This does not
k <- 3
res <- spark_apply(df1, function(x) kmeans(x, k)$centers)

As an ugly workaround, I can do what I want by saving values into R packages and then referencing them, i.e.:

> myPackage::k_equals_three == 3
[1] TRUE
#
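
spark_apply() has a context argument built for exactly this: the object passed there is serialized to the workers and handed to the applied function as its second parameter. A sketch:

k <- 3

# context is shipped to each worker and arrives as the function's
# second argument
res <- spark_apply(
  df1,
  function(x, ctx) kmeans(x, ctx$k)$centers,
  context = list(k = k)
)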

How to aggregate data by 3-minute timestamps in sparklyr?

Submitted by 狂风中的少年 on 2019-12-04 17:09:21
I am using sparklyr for some quick analysis, and I am having some issues working with timestamps. I have two different dataframes: one with rows at a 1-minute interval and another at a 3-minute interval.

First dataset (1-minute interval):

id timefrom              timeto                value
10 "2017-06-06 10:30:00" "2017-06-06 10:31:00"    50
10 "2017-06-06 10:31:00" "2017-06-06 10:32:00"    80
10 "2017-06-06 10:32:00" "2017-06-06 10:33:00"    20
22 "2017-06-06 10:33:00" "2017-06-06 10:34:00"    30
22 "2017-06-06 10:34:00" "2017-06-06 10:35:00"    50
22 "2017-06-06 10:35:00" "2017-06-06 10:36:00"    50

Second dataset (3-minute interval):

id
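
One way to roll the 1-minute rows up to a 3-minute grid is to floor the Unix timestamp to 180-second buckets inside mutate(); unix_timestamp(), floor(), and from_unixtime() are Spark SQL functions that sparklyr passes through. A sketch, assuming the 1-minute table is called df_1min and that summing value per id is the aggregation wanted:

library(dplyr)

agg_3min <- df_1min %>%
  mutate(bucket = from_unixtime(floor(unix_timestamp(timefrom) / 180) * 180)) %>%
  group_by(id, bucket) %>%
  summarise(value = sum(value, na.rm = TRUE))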
