sparklyr

Egg/JAR equivalent for Sparklyr projects

隐身守侯 submitted on 2020-06-26 14:17:26
Question: We have a sparklyr project which is set up like this:

    # load functions
    source('./a.R')
    source('./b.R')
    source('./c.R')
    ....

    # main script computations
    sc <- spark_connect(...)
    read_csv(sc, "s3://path")
    ....

We run it on EMR with:

    spark-submit --deploy-mode client s3://path/to/my/script.R

Running the script with spark-submit as above fails, since spark-submit appears to accept only a single R script while we source functions from multiple files. Is there a way we can package this as an egg/jar file with all of …
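
A minimal sketch of one workaround, assuming all the sourced helpers and the main script are available locally at build time: concatenate them into a single self-contained script and hand only that file to spark-submit. The file names a.R, b.R, c.R come from the question; main.R and bundle.R are hypothetical names used here for illustration.

    # build_bundle.R -- concatenate the sourced helpers and the main script
    # into one file, so spark-submit only has to ship a single R script
    parts <- c("a.R", "b.R", "c.R", "main.R")   # main.R: hypothetical name for the main script
    bundle <- unlist(lapply(parts, readLines))  # read each file and stack the lines
    writeLines(bundle, "bundle.R")              # write the combined script
    # then upload bundle.R to S3 and run:
    #   spark-submit --deploy-mode client s3://path/to/bundle.R

A longer-term alternative is to move the helper functions into a proper R package installed on the cluster nodes, so the entry-point script only needs a library() call.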

Read CSV file from Azure Blob storage in Rstudio Server with spark_read_csv()

☆樱花仙子☆ submitted on 2020-04-30 12:28:48
Question: I have provisioned an Azure HDInsight cluster of type ML Services (R Server), operating system Linux, version ML Services 9.3 on Spark 2.2, with Java 8 (HDI 3.6). Within RStudio Server I am trying to read a CSV file from my blob storage:

    Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")
    Sys.setenv(YARN_CONF_DIR = "/etc/hadoop/conf")
    Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
    Sys.setenv(SPARK_CONF_DIR = "/etc/spark/conf")
    options(rsparkling.sparklingwater.version = "2.2.28")
    library(sparklyr …
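
A minimal sketch of the read itself, assuming the connection succeeds and using a hypothetical container, storage account, and file name in the wasbs:// path (the URI scheme HDInsight uses for blob storage):

    library(sparklyr)

    sc <- spark_connect(master = "yarn-client",
                        spark_home = Sys.getenv("SPARK_HOME"))

    # container, storage account, and file name below are placeholders
    flights <- spark_read_csv(
      sc,
      name = "flights",
      path = "wasbs://mycontainer@myaccount.blob.core.windows.net/data/flights.csv",
      header = TRUE,
      infer_schema = TRUE
    )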

How to use Sparklyr to summarize Categorical Variable Level

本小妞迷上赌 submitted on 2020-02-16 07:23:16
Question: For each categorical variable in a dataset, I want to get counts and summary statistics for each level. I can do this with the dlookr R package using its diagnose_category() function. Since I don't have that package at work, I recreated the function using dplyr. In sparklyr I am able to get counts for a single variable at a time; I need help extending it to all categorical variables. Need help: implement the function via sparklyr. Table 1, the final output needed:

    # A tibble: 20 x 6
      variables levels N freq ratio …
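
A minimal sketch of one way to loop over categorical columns with dplyr verbs that sparklyr can translate, assuming sdf is an existing Spark tbl and the categorical column names are supplied by hand (the column names below are hypothetical). The counts run in Spark; only the small per-level summaries are collected back to R, and only the count and frequency columns of the desired output are shown:

    library(sparklyr)
    library(dplyr)
    library(purrr)

    cat_cols <- c("gender", "region")   # hypothetical categorical column names

    diagnose_category_spark <- function(sdf, cols) {
      map_dfr(cols, function(col) {
        sdf %>%
          count(!!rlang::sym(col)) %>%   # GROUP BY / COUNT executed in Spark
          collect() %>%                  # bring only the level counts back to R
          transmute(variables = col,
                    levels    = as.character(.data[[col]]),
                    N         = n,
                    freq      = N / sum(N))
      })
    }

    diagnose_category_spark(sdf, cat_cols)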

Unable to connect Spark to Cassandra DB in RStudio

回眸只為那壹抹淺笑 submitted on 2020-01-26 00:39:05
Question: I've spent the last week trying to figure out how to use sparklyr to get Spark to connect to Cassandra on our local cluster, and I've hit a wall; any help would be greatly appreciated. I'm the only one trying to use R/RStudio to make this connection (everyone else uses Java on NetBeans and Maven), and I'm not sure what I need to do to make this work. The stack I'm using is:

    Ubuntu 16.04 (in a VM)
    sparklyr: 0.5.3
    Spark: 2.0.0
    Scala: 2.11
    Cassandra: 3.7

Relevant config.yml settings:

    # …
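
A minimal sketch of the usual sparklyr-to-Cassandra wiring, assuming the DataStax spark-cassandra-connector coordinates shown are the right ones for Spark 2.0 / Scala 2.11 (not confirmed by the question), and using placeholder host, keyspace, and table names:

    library(sparklyr)

    config <- spark_config()
    # pull the connector from spark-packages; version coordinate is an assumption
    config[["sparklyr.defaultPackages"]] <- "datastax:spark-cassandra-connector:2.0.0-s_2.11"
    config[["spark.cassandra.connection.host"]] <- "10.0.0.10"   # placeholder Cassandra host

    sc <- spark_connect(master = "local", config = config)

    # expose a Cassandra table to Spark as a DataFrame
    my_tbl <- spark_read_source(
      sc,
      name    = "my_table",
      source  = "org.apache.spark.sql.cassandra",
      options = list(keyspace = "my_keyspace", table = "my_table")
    )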

Sparklyr on RStudio EC2 with invoke error hadoopConfiguration standalone cluster

ⅰ亾dé卋堺 submitted on 2020-01-24 13:20:48
Question: I have a 1 master / 2 slave standalone cluster on EC2. I am running RStudio from EC2, and after I run the following code:

    library(aws.s3)
    library(sparklyr)
    library(tidyverse)
    library(RCurl)

    Sys.setenv("AWS_ACCESS_KEY_ID" = "myaccesskeyid",
               "AWS_SECRET_ACCESS_KEY" = "myaccesskey",
               "SPARK_CONF_DIR" = "/home/rstudio/spark/spark-2.1.0-bin-hadoop2.7/bin/",
               "JAVA_HOME" = "/usr/lib/jvm/java-8-oracle")

    ctx <- spark_context(sc)
    jsc <- invoke_static(sc, "org.apache.spark.api.java.JavaSparkContext", …
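
A minimal sketch of how the Hadoop configuration is usually reached from sparklyr, assuming a working spark_connect() call comes first (the master URL is a placeholder). The configuration lives on the underlying Scala SparkContext, so it can be fetched with invoke() on that object rather than via invoke_static():

    library(sparklyr)

    sc <- spark_connect(master = "spark://master-node:7077")   # placeholder master URL

    ctx   <- spark_context(sc)                    # the underlying SparkContext
    hconf <- invoke(ctx, "hadoopConfiguration")   # org.apache.hadoop.conf.Configuration

    # pass the S3 credentials through to Hadoop's s3a filesystem
    invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
    invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))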

How to implement Stanford CoreNLP wrapper for Apache Spark using sparklyr?

久未见 submitted on 2020-01-23 06:23:51
Question: I am trying to create an R package so I can use the Stanford CoreNLP wrapper for Apache Spark (by Databricks) from R. I am using the sparklyr package to connect to my local Spark instance. I created a package with the following dependency function:

    spark_dependencies <- function(spark_version, scala_version, ...) {
      sparklyr::spark_dependency(
        jars = c(
          system.file(
            sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0.jar"),
            package = "sparkNLP"
          ),
          system.file(
            sprintf("stanford-corenlp-full …
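
A minimal sketch of the full sparklyr extension pattern the question is building toward, assuming the CoreNLP jar ships inside the sparkNLP package as above, and that the Databricks spark-corenlp artifact coordinates below match the Spark/Scala version in use (both are assumptions, not confirmed by the question):

    spark_dependencies <- function(spark_version, scala_version, ...) {
      sparklyr::spark_dependency(
        jars = c(
          system.file("stanford-corenlp-full/stanford-corenlp-3.6.0.jar",
                      package = "sparkNLP")
        ),
        # spark-packages coordinate for the Databricks wrapper; version is an assumption
        packages = sprintf("databricks:spark-corenlp:0.2.0-s_%s", scala_version)
      )
    }

    # register the extension when the package loads, so sparklyr picks up
    # these dependencies at spark_connect() time
    .onLoad <- function(libname, pkgname) {
      sparklyr::register_extension(pkgname)
    }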