sparklyr

spark_apply works on one dataset, but fails on another (both datasets are of the same type and structure)

Submitted by 人盡茶涼 on 2019-12-24 12:21:51
Question: I am working with sparklyr on Databricks. The issue I am facing is that spark_apply() throws an error when I run it on one dataset, but works fine when it is run on another dataset of the same structure and type. Am I missing something? The error message (reproduced) doesn't help much. The spark_apply() call is simple:

    spark_apply(hr2, function(y) y*2)

Schema and class of hr2:

    $LITRES
    $LITRES$name
    [1] "LITRES"

    $LITRES$type
    [1] "DoubleType"

    class(hr2)
    [1] "tbl_spark" "tbl_sql"   "tbl_lazy"
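
A common stumbling block is that the function passed to spark_apply() receives each partition as an R data frame and is expected to return a data frame. Below is a minimal sketch along those lines; the column name LITRES is taken from the schema above, while the local connection and toy data are assumptions for illustration:

    library(sparklyr)

    sc <- spark_connect(master = "local")                          # assumed connection
    hr2 <- copy_to(sc, data.frame(LITRES = c(1.5, 2.0, 3.25)), "hr2")

    # The applied function gets a data.frame per partition and should return one
    doubled <- spark_apply(hr2, function(df) {
      df$LITRES <- df$LITRES * 2
      df
    })
    doubled

Note also that spark_apply() requires R to be installed and reachable on every worker node, which can produce failures that look unrelated to the data itself.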

Creating and applying ml_lib pipeline with external parameter in sparklyr

Submitted by 柔情痞子 on 2019-12-24 11:24:20
Question: I am trying to create and apply a Spark ml_pipeline object that can handle an external parameter that will vary (typically a date). According to the Spark documentation it seems possible: see the part about ParamMap here. I haven't worked out exactly how to do it. I was thinking of something like this:

    table.df <- data.frame("a" = c(1, 2, 3))
    table.sdf <- sdf_copy_to(sc, table.df)
    param = 5
    param2 = 4

    # operation declaration
    table2.sdf <- table.sdf %>% mutate(test = param)

    # pipeline creation
    pipeline_1
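
One workaround (a sketch under stated assumptions, not the ParamMap mechanism from the Spark docs) is to rebuild the pipeline whenever the external parameter changes, wrapping the parameterized dplyr step in ft_dplyr_transformer(); make_pipeline below is a hypothetical helper introduced for illustration:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")                           # assumed connection
    table.sdf <- sdf_copy_to(sc, data.frame(a = c(1, 2, 3)),
                             name = "table_df", overwrite = TRUE)

    # Hypothetical helper: the parameter value is baked in when the pipeline is built
    make_pipeline <- function(sc, sdf, param) {
      ml_pipeline(sc) %>%
        ft_dplyr_transformer(sdf %>% mutate(test = !!param))
    }

    pipeline_1 <- make_pipeline(sc, table.sdf, 5)
    fitted_1   <- ml_fit(pipeline_1, table.sdf)
    ml_transform(fitted_1, table.sdf)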

Sparklyr Ports File and Java Error MAC OS

Submitted by 懵懂的女人 on 2019-12-24 08:18:18
Question:

    > sc <- spark_connect(master = "local")
    Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
      Failed to launch Spark shell. Ports file does not exist.
        Path: /Users/XXX/Library/Caches/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit
        Parameters: --jars, '/Users/XXX/Library/R/3.3/library/sparklyr/java/sparklyr.jar',
          --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34',
          sparkr-shell, /var/folders/dy/jy43zcgd7gv27qc0mzlxxvd1qt7rhg/T/
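
This particular failure on macOS is frequently tied to the Java setup rather than to sparklyr itself. A sketch of the usual first checks, assuming a Java 8 JDK is installed; the JDK path below is an example and must be adjusted to the local machine:

    library(sparklyr)

    # See which Java R/Spark will pick up; Spark 1.6.x expects Java 7/8
    system("java -version")

    # Point the session at a Java 8 JDK before connecting (example path)
    Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")

    sc <- spark_connect(master = "local")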

Calculating order statistics (percentile) using sparklyr

Submitted by 血红的双手。 on 2019-12-24 03:43:10
Question: A useful feature in dplyr is the ability to create calculated columns on the fly using mutate; one of those calculations is a quantile, something I used to be able to do in sparklyr with the percentile function, but for some reason it doesn't work anymore. Here is a detailed example. First, create a sample data set:

    require(dplyr)
    require(sparklyr)
    # sc is a connection to spark

    my_df <- data.frame(col1 = sample(1:100, 30)) %>% as_tibble()
    my_df
    # # A tibble: 30 x 1
    #     col1
    #    <int>
    #  1
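
percentile() is a Hive/Spark SQL aggregate function, so one way to reach it from sparklyr is inside summarise(), where unknown function names are passed through to Spark SQL untranslated, rather than inside mutate(), which would need a window specification. A minimal sketch, assuming an existing connection sc:

    library(sparklyr)
    library(dplyr)

    my_sdf <- copy_to(sc, data.frame(col1 = sample(1:100, 30)), "my_df", overwrite = TRUE)

    # percentile() is not a dplyr verb; it is forwarded to Spark SQL as-is
    my_sdf %>%
      summarise(p50 = percentile(col1, 0.5),
                p90 = percentile(col1, 0.9))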

How to use sdf_pivot() in sparklyr and concatenate strings?

Submitted by 别来无恙 on 2019-12-24 03:42:41
Question: I am trying to use the sdf_pivot() function in sparklyr to "gather" a long-format data frame into a wide format. The values of the variables are strings that I would like to concatenate. Here is a simple example that I think should work but doesn't:

    library(sparklyr)
    d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                    x  = c("200", "200", "200", "201", "201", "201"),
                    y  = c("This", "That", "The", "Other", "End", "End"))
    d_sdf <- copy_to(sc, d, "d")
    sdf_pivot(d_sdf, id ~ x, paste)

What I'd like it
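
sdf_pivot()'s fun.aggregate argument expects a Spark aggregation rather than an R function such as paste. A sketch of one approach, assuming an existing connection sc: collect each cell's values into an array with collect_list, then join the arrays into strings with concat_ws (the pivoted column names 200 and 201 come from the x values above):

    library(sparklyr)
    library(dplyr)

    d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                    x  = c("200", "200", "200", "201", "201", "201"),
                    y  = c("This", "That", "The", "Other", "End", "End"))
    d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)

    # Aggregate each cell with Spark's collect_list ...
    wide <- sdf_pivot(d_sdf, id ~ x, fun.aggregate = list(y = "collect_list"))

    # ... then concatenate the resulting arrays into single strings
    wide %>%
      mutate(`200` = concat_ws(" ", `200`),
             `201` = concat_ws(" ", `201`))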

sparklyr: can I pass format and path options into spark_write_table, or use saveAsTable with spark_write_orc?

Submitted by 拜拜、爱过 on 2019-12-23 23:25:29
Question: Spark 2.0 with Hive. Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore. In Spark I would do that like this:

    irisDf.write.format("orc")
      .mode("overwrite")
      .option("path", "s3://my_bucket/iris/")
      .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function:

    data("iris")
    iris_spark <- copy_to(sc, iris, name = "iris")
    output <- spark_write_table(
      iris_spark,
      name = 'my_database.iris',
      mode = 'overwrite'
    )

But this doesn't allow me
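
One way around this (a sketch, not an official spark_write_table option) is to drop down to the underlying Java DataFrameWriter through sparklyr's invoke() API, which exposes format(), option("path", ...) and saveAsTable() directly; the bucket path is the one from the question:

    library(sparklyr)
    library(dplyr)

    iris_spark <- copy_to(sc, iris, name = "iris", overwrite = TRUE)   # sc assumed

    iris_spark %>%
      spark_dataframe() %>%                      # get the Java DataFrame object
      invoke("write") %>%                        # DataFrameWriter
      invoke("format", "orc") %>%
      invoke("mode", "overwrite") %>%
      invoke("option", "path", "s3://my_bucket/iris/") %>%
      invoke("saveAsTable", "my_database.iris")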

SQL, sparklyr, and SparkR dataframe conversions on Databricks

Submitted by 痞子三分冷 on 2019-12-23 20:55:20
Question: I have a SQL table on Databricks created using the following code:

    %sql
    CREATE TABLE data
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "url/data.csv"

The following code converts that table to a SparkR and an R dataframe, respectively:

    %r
    library(SparkR)
    data_spark <- sql("SELECT * FROM data")
    data_r_df <- as.data.frame(data_spark)

But I don't know how I should convert any or all of these dataframes into a sparklyr dataframe to leverage sparklyr's parallelization.

Answer 1: Just sc
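
Since the table is registered in the metastore, a sparklyr connection on the same cluster can simply reference it. A sketch, assuming a Databricks notebook where spark_connect(method = "databricks") reuses the cluster's Spark session:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(method = "databricks")

    # Lazily reference the metastore table as a tbl_spark ...
    data_tbl  <- dplyr::tbl(sc, "data")

    # ... or pull it in via an explicit query or the table reader
    data_tbl2 <- sdf_sql(sc, "SELECT * FROM data")
    data_tbl3 <- spark_read_table(sc, "data")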

Sparklyr connection to S3 bucket throwing an error

Submitted by 喜夏-厌秋 on 2019-12-23 19:11:07
Question: I am trying to connect to S3 buckets from R sparklyr. I am able to read local files into the Spark context; however, connecting to S3 seems to be the issue and throws up a big dump of errors. Here is the code used. Note: a single S3 bucket has multiple csv files that follow the same schema.

    library(sparklyr)
    library(tidyverse)
    sparklyr::spark_install(version = "2.0.2", hadoop_version = "2.7")
    sparklyr::spark_install(version = "2.0.2", hadoop_version = "2.7")
    Sys.setenv(AWS
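
A sketch of one way to wire up S3 access, assuming the hadoop-aws (S3A) connector and placeholder credentials; the package version, bucket and prefix below are examples, not values from the question:

    library(sparklyr)

    Sys.setenv(AWS_ACCESS_KEY_ID     = "your-access-key",     # placeholder
               AWS_SECRET_ACCESS_KEY = "your-secret-key")     # placeholder

    conf <- spark_config()
    # Pull in an S3A connector matching the Hadoop build (example version)
    conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"

    sc <- spark_connect(master = "local", config = conf)

    # All csv files under the prefix share a schema, so read the prefix as one table
    s3_tbl <- spark_read_csv(sc, name = "s3_data",
                             path   = "s3a://my-bucket/my-prefix/",
                             header = TRUE, infer_schema = TRUE)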

Change string in DF using Hive command and mutate with sparklyr

Submitted by 廉价感情. on 2019-12-23 03:57:11
Question: Using the Hive command regexp_extract I am trying to change strings like 201703170455 to 2017-03-17:04:55, and 2017031704555675 to 2017-03-17:04:55.0010. I am doing this in sparklyr, trying to use code that works with gsub in R:

    newdf <- df %>%
      mutate(Time1 = regexp_extract(Time, "(....)(..)(..)(..)(..)", "\\1-\\2-\\3:\\4:\\5"))

and this code:

    newdf <- df %>%
      mutate(TimeTrans = regexp_extract("(....)(..)(..)(..)(..)(....)", "\\1-\\2-\\3:\\4:\\5.\\6"))

but it does not work at all.
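
regexp_extract() returns a single capture group, so it cannot perform this kind of substitution. In Hive/Spark SQL the substitution function is regexp_replace(), and its backreferences are written $1..$n rather than \\1..\\n. A minimal sketch for the 12-digit case, assuming an existing connection sc and an example column of timestamps:

    library(sparklyr)
    library(dplyr)

    df <- copy_to(sc,
                  data.frame(Time = c("201703170455", "201703170456"),
                             stringsAsFactors = FALSE),
                  "times", overwrite = TRUE)

    # regexp_replace is forwarded to Spark SQL; note the $1..$5 backreferences
    newdf <- df %>%
      mutate(Time1 = regexp_replace(Time,
                                    "^(....)(..)(..)(..)(..)$",
                                    "$1-$2-$3:$4:$5"))
    newdf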