sparklyr

spark_apply works on one dataset, but fails on another (both datasets are of the same type and structure)

Submitted by 人盡茶涼 on 2019-12-24 12:21:51
Question: I am working with sparklyr on Databricks. The issue I am facing is that spark_apply() throws an error when I run it on one dataset, but works fine when it is run on another dataset of the same structure and type. Am I missing something? The error message (reproduced) doesn't help much. The spark_apply() call is simple:

    spark_apply(hr2, function(y) y*2)

Schema and class of hr2:

    $LITRES
    $LITRES$name
    [1] "LITRES"

    $LITRES$type
    [1] "DoubleType"

    class(hr2)
    [1] "tbl_spark" "tbl_sql"   "tbl_lazy"
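
A common stumbling block is that the function passed to spark_apply() receives each partition as an R data frame and is expected to return a data frame. Below is a minimal sketch along those lines; the column name LITRES is taken from the schema above, while the local connection and toy data are assumptions for illustration:

    library(sparklyr)

    sc <- spark_connect(master = "local")                          # assumed connection
    hr2 <- copy_to(sc, data.frame(LITRES = c(1.5, 2.0, 3.25)), "hr2")

    # The applied function gets a data.frame per partition and should return one
    doubled <- spark_apply(hr2, function(df) {
      df$LITRES <- df$LITRES * 2
      df
    })
    doubled

Note also that spark_apply() requires R to be installed and reachable on every worker node, which can produce failures that look unrelated to the data itself.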

Creating and applying ml_lib pipeline with external parameter in sparklyr

Submitted by 柔情痞子 on 2019-12-24 11:24:20
Question: I am trying to create and apply a Spark ml_pipeline object that can handle an external parameter that will vary (typically a date). According to the Spark documentation it seems possible: see the part about ParamMap here. I haven't worked out exactly how to do it. I was thinking of something like this:

    table.df <- data.frame("a" = c(1, 2, 3))
    table.sdf <- sdf_copy_to(sc, table.df)
    param = 5
    param2 = 4

    # operation declaration
    table2.sdf <- table.sdf %>% mutate(test = param)

    # pipeline creation
    pipeline_1
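
One workaround (a sketch under stated assumptions, not the ParamMap mechanism from the Spark docs) is to rebuild the pipeline whenever the external parameter changes, wrapping the parameterized dplyr step in ft_dplyr_transformer(); make_pipeline below is a hypothetical helper introduced for illustration:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")                           # assumed connection
    table.sdf <- sdf_copy_to(sc, data.frame(a = c(1, 2, 3)),
                             name = "table_df", overwrite = TRUE)

    # Hypothetical helper: the parameter value is baked in when the pipeline is built
    make_pipeline <- function(sc, sdf, param) {
      ml_pipeline(sc) %>%
        ft_dplyr_transformer(sdf %>% mutate(test = !!param))
    }

    pipeline_1 <- make_pipeline(sc, table.sdf, 5)
    fitted_1   <- ml_fit(pipeline_1, table.sdf)
    ml_transform(fitted_1, table.sdf)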

Sparklyr Ports File and Java Error MAC OS

Submitted by 懵懂的女人 on 2019-12-24 08:18:18
Question:

    > sc <- spark_connect(master = "local")
    Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
      Failed to launch Spark shell. Ports file does not exist.
        Path: /Users/XXX/Library/Caches/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit
        Parameters: --jars, '/Users/XXX/Library/R/3.3/library/sparklyr/java/sparklyr.jar',
          --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34',
          sparkr-shell, /var/folders/dy/jy43zcgd7gv27qc0mzlxxvd1qt7rhg/T/
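
This particular failure on macOS is frequently tied to the Java setup rather than to sparklyr itself. A sketch of the usual first checks, assuming a Java 8 JDK is installed; the JDK path below is an example and must be adjusted to the local machine:

    library(sparklyr)

    # See which Java R/Spark will pick up; Spark 1.6.x expects Java 7/8
    system("java -version")

    # Point the session at a Java 8 JDK before connecting (example path)
    Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")

    sc <- spark_connect(master = "local")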

Calculating order statistics (percentile) using sparklyr

Submitted by 血红的双手。 on 2019-12-24 03:43:10
Question: A useful feature in dplyr is the ability to create calculated columns on the fly using mutate; one of those calculations is a quantile, something I used to be able to do in sparklyr with the percentile function, but for some reason it doesn't work anymore. Here is a detailed example. First, create a sample data set:

    require(dplyr)
    require(sparklyr)
    # sc is a connection to spark

    my_df <- data.frame(col1 = sample(1:100, 30)) %>% as_tibble()
    my_df
    # # A tibble: 30 x 1
    #     col1
    #    <int>
    #  1
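
percentile() is a Hive/Spark SQL aggregate function, so one way to reach it from sparklyr is inside summarise(), where unknown function names are passed through to Spark SQL untranslated, rather than inside mutate(), which would need a window specification. A minimal sketch, assuming an existing connection sc:

    library(sparklyr)
    library(dplyr)

    my_sdf <- copy_to(sc, data.frame(col1 = sample(1:100, 30)), "my_df", overwrite = TRUE)

    # percentile() is not a dplyr verb; it is forwarded to Spark SQL as-is
    my_sdf %>%
      summarise(p50 = percentile(col1, 0.5),
                p90 = percentile(col1, 0.9))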

How to use sdf_pivot() in sparklyr and concatenate strings?

Submitted by 别来无恙 on 2019-12-24 03:42:41
Question: I am trying to use the sdf_pivot() function in sparklyr to "gather" a long-format data frame into a wide format. The values of the variables are strings that I would like to concatenate. Here is a simple example that I think should work but doesn't:

    library(sparklyr)
    d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                    x  = c("200", "200", "200", "201", "201", "201"),
                    y  = c("This", "That", "The", "Other", "End", "End"))
    d_sdf <- copy_to(sc, d, "d")
    sdf_pivot(d_sdf, id ~ x, paste)

What I'd like it
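
sdf_pivot()'s fun.aggregate argument expects a Spark aggregation rather than an R function such as paste. A sketch of one approach, assuming an existing connection sc: collect each cell's values into an array with collect_list, then join the arrays into strings with concat_ws (the pivoted column names 200 and 201 come from the x values above):

    library(sparklyr)
    library(dplyr)

    d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                    x  = c("200", "200", "200", "201", "201", "201"),
                    y  = c("This", "That", "The", "Other", "End", "End"))
    d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)

    # Aggregate each cell with Spark's collect_list ...
    wide <- sdf_pivot(d_sdf, id ~ x, fun.aggregate = list(y = "collect_list"))

    # ... then concatenate the resulting arrays into single strings
    wide %>%
      mutate(`200` = concat_ws(" ", `200`),
             `201` = concat_ws(" ", `201`))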

sparklyr: can I pass format and path options into spark_write_table, or use saveAsTable with spark_write_orc?

Submitted by 拜拜、爱过 on 2019-12-23 23:25:29
Question: Spark 2.0 with Hive. Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore. In Spark I would do that like this:

    irisDf.write.format("orc")
      .mode("overwrite")
      .option("path", "s3://my_bucket/iris/")
      .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function:

    data("iris")
    iris_spark <- copy_to(sc, iris, name = "iris")
    output <- spark_write_table(
      iris_spark,
      name = 'my_database.iris',
      mode = 'overwrite'
    )

But this doesn't allow me
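
One way around this (a sketch, not an official spark_write_table option) is to drop down to the underlying Java DataFrameWriter through sparklyr's invoke() API, which exposes format(), option("path", ...) and saveAsTable() directly; the bucket path is the one from the question:

    library(sparklyr)
    library(dplyr)

    iris_spark <- copy_to(sc, iris, name = "iris", overwrite = TRUE)   # sc assumed

    iris_spark %>%
      spark_dataframe() %>%                      # get the Java DataFrame object
      invoke("write") %>%                        # DataFrameWriter
      invoke("format", "orc") %>%
      invoke("mode", "overwrite") %>%
      invoke("option", "path", "s3://my_bucket/iris/") %>%
      invoke("saveAsTable", "my_database.iris")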

SQL, sparklyr, and SparkR dataframe conversions on Databricks

Submitted by 痞子三分冷 on 2019-12-23 20:55:20
Question: I have a SQL table on Databricks created using the following code:

    %sql
    CREATE TABLE data
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "url/data.csv"

The following code converts that table to a SparkR and an R dataframe, respectively:

    %r
    library(SparkR)
    data_spark <- sql("SELECT * FROM data")
    data_r_df <- as.data.frame(data_spark)

But I don't know how I should convert any or all of these dataframes into a sparklyr dataframe to leverage sparklyr's parallelization.

Answer 1: Just sc
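
Since the table is registered in the metastore, a sparklyr connection on the same cluster can simply reference it. A sketch, assuming a Databricks notebook where spark_connect(method = "databricks") reuses the cluster's Spark session:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(method = "databricks")

    # Lazily reference the metastore table as a tbl_spark ...
    data_tbl  <- dplyr::tbl(sc, "data")

    # ... or pull it in via an explicit query or the table reader
    data_tbl2 <- sdf_sql(sc, "SELECT * FROM data")
    data_tbl3 <- spark_read_table(sc, "data")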

Sparklyr connection to S3 bucket throwing an error

Submitted by 喜夏-厌秋 on 2019-12-23 19:11:07
Question: I am trying to connect to S3 buckets from R sparklyr. I am able to read local files into the Spark context; however, connecting to S3 seems to be the issue and throws up a big dump of errors. Here is the code used. Note: a single S3 bucket has multiple csv files that follow the same schema.

    library(sparklyr)
    library(tidyverse)
    sparklyr::spark_install(version = "2.0.2", hadoop_version = "2.7")
    sparklyr::spark_install(version = "2.0.2", hadoop_version = "2.7")
    Sys.setenv(AWS
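
A sketch of one way to wire up S3 access, assuming the hadoop-aws (S3A) connector and placeholder credentials; the package version, bucket and prefix below are examples, not values from the question:

    library(sparklyr)

    Sys.setenv(AWS_ACCESS_KEY_ID     = "your-access-key",     # placeholder
               AWS_SECRET_ACCESS_KEY = "your-secret-key")     # placeholder

    conf <- spark_config()
    # Pull in an S3A connector matching the Hadoop build (example version)
    conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"

    sc <- spark_connect(master = "local", config = conf)

    # All csv files under the prefix share a schema, so read the prefix as one table
    s3_tbl <- spark_read_csv(sc, name = "s3_data",
                             path   = "s3a://my-bucket/my-prefix/",
                             header = TRUE, infer_schema = TRUE)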

Change string in DF using Hive command and mutate with sparklyr

Submitted by 廉价感情. on 2019-12-23 03:57:11
Question: Using the Hive command regexp_extract I am trying to change strings like 201703170455 to 2017-03-17:04:55, and 2017031704555675 to 2017-03-17:04:55.0010. I am doing this in sparklyr, trying to use code that works with gsub in R:

    newdf <- df %>%
      mutate(Time1 = regexp_extract(Time, "(....)(..)(..)(..)(..)", "\\1-\\2-\\3:\\4:\\5"))

and this code:

    newdf <- df %>%
      mutate(TimeTrans = regexp_extract("(....)(..)(..)(..)(..)(....)", "\\1-\\2-\\3:\\4:\\5.\\6"))

but it does not work at all.
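
regexp_extract() returns a single capture group, so it cannot perform this kind of substitution. In Hive/Spark SQL the substitution function is regexp_replace(), and its backreferences are written $1..$n rather than \\1..\\n. A minimal sketch for the 12-digit case, assuming an existing connection sc and an example column of timestamps:

    library(sparklyr)
    library(dplyr)

    df <- copy_to(sc,
                  data.frame(Time = c("201703170455", "201703170456"),
                             stringsAsFactors = FALSE),
                  "times", overwrite = TRUE)

    # regexp_replace is forwarded to Spark SQL; note the $1..$5 backreferences
    newdf <- df %>%
      mutate(Time1 = regexp_replace(Time,
                                    "^(....)(..)(..)(..)(..)$",
                                    "$1-$2-$3:$4:$5"))
    newdf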