sparklyr

Egg/JAR equivalent for Sparklyr projects

隐身守侯 submitted on 2020-06-26 14:17:26
Question: We have a sparklyr project which is set up like this:

    # load functions
    source('./a.R')
    source('./b.R')
    source('./c.R')
    ....

    # main script computations
    sc <- spark_connect(...)
    read_csv(sc, "s3://path")
    ....

We run it on EMR with:

    spark-submit --deploy-mode client s3://path/to/my/script.R

Running the script with spark-submit as above fails, since spark-submit appears to accept only a single R script while we source functions from multiple files. Is there a way we can package this as an egg/jar file with all of …
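
A minimal sketch of one workaround, assuming all the sourced helpers and the main script are available locally at build time: concatenate them into a single self-contained script and hand only that file to spark-submit. The file names a.R, b.R, c.R come from the question; main.R and bundle.R are hypothetical names used here for illustration.

    # build_bundle.R -- concatenate the sourced helpers and the main script
    # into one file, so spark-submit only has to ship a single R script
    parts <- c("a.R", "b.R", "c.R", "main.R")   # main.R: hypothetical name for the main script
    bundle <- unlist(lapply(parts, readLines))  # read each file and stack the lines
    writeLines(bundle, "bundle.R")              # write the combined script
    # then upload bundle.R to S3 and run:
    #   spark-submit --deploy-mode client s3://path/to/bundle.R

A longer-term alternative is to move the helper functions into a proper R package installed on the cluster nodes, so the entry-point script only needs a library() call.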

Read CSV file from Azure Blob storage in Rstudio Server with spark_read_csv()

☆樱花仙子☆ submitted on 2020-04-30 12:28:48
Question: I have provisioned an Azure HDInsight cluster of type ML Services (R Server), operating system Linux, version ML Services 9.3 on Spark 2.2, with Java 8 (HDI 3.6). Within RStudio Server I am trying to read a CSV file from my blob storage:

    Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")
    Sys.setenv(YARN_CONF_DIR = "/etc/hadoop/conf")
    Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
    Sys.setenv(SPARK_CONF_DIR = "/etc/spark/conf")
    options(rsparkling.sparklingwater.version = "2.2.28")
    library(sparklyr …
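
A minimal sketch of the read itself, assuming the connection succeeds and using a hypothetical container, storage account, and file name in the wasbs:// path (the URI scheme HDInsight uses for blob storage):

    library(sparklyr)

    sc <- spark_connect(master = "yarn-client",
                        spark_home = Sys.getenv("SPARK_HOME"))

    # container, storage account, and file name below are placeholders
    flights <- spark_read_csv(
      sc,
      name = "flights",
      path = "wasbs://mycontainer@myaccount.blob.core.windows.net/data/flights.csv",
      header = TRUE,
      infer_schema = TRUE
    )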

How to use Sparklyr to summarize Categorical Variable Level

本小妞迷上赌 submitted on 2020-02-16 07:23:16
Question: For each categorical variable in a dataset, I want to get counts and summary statistics for each level. I can do this with the dlookr R package using its diagnose_category() function. Since I don't have that package at work, I recreated the function using dplyr. In sparklyr I am able to get counts for a single variable at a time; I need help extending it to all categorical variables. Need help: implement the function via sparklyr. Table 1, the final output needed:

    # A tibble: 20 x 6
      variables levels N freq ratio …
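
A minimal sketch of one way to loop over categorical columns with dplyr verbs that sparklyr can translate, assuming sdf is an existing Spark tbl and the categorical column names are supplied by hand (the column names below are hypothetical). The counts run in Spark; only the small per-level summaries are collected back to R, and only the count and frequency columns of the desired output are shown:

    library(sparklyr)
    library(dplyr)
    library(purrr)

    cat_cols <- c("gender", "region")   # hypothetical categorical column names

    diagnose_category_spark <- function(sdf, cols) {
      map_dfr(cols, function(col) {
        sdf %>%
          count(!!rlang::sym(col)) %>%   # GROUP BY / COUNT executed in Spark
          collect() %>%                  # bring only the level counts back to R
          transmute(variables = col,
                    levels    = as.character(.data[[col]]),
                    N         = n,
                    freq      = N / sum(N))
      })
    }

    diagnose_category_spark(sdf, cat_cols)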

Unable to connect Spark to Cassandra DB in RStudio

回眸只為那壹抹淺笑 submitted on 2020-01-26 00:39:05
Question: I've spent the last week trying to figure out how to use sparklyr to get Spark to connect to Cassandra on our local cluster, and I've hit a wall; any help would be greatly appreciated. I'm the only one trying to use R/RStudio to make this connection (everyone else uses Java on NetBeans and Maven), and I'm not sure what I need to do to make this work. The stack I'm using is:

    Ubuntu 16.04 (in a VM)
    sparklyr: 0.5.3
    Spark: 2.0.0
    Scala: 2.11
    Cassandra: 3.7

Relevant config.yml settings:

    # …
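
A minimal sketch of the usual sparklyr-to-Cassandra wiring, assuming the DataStax spark-cassandra-connector coordinates shown are the right ones for Spark 2.0 / Scala 2.11 (not confirmed by the question), and using placeholder host, keyspace, and table names:

    library(sparklyr)

    config <- spark_config()
    # pull the connector from spark-packages; version coordinate is an assumption
    config[["sparklyr.defaultPackages"]] <- "datastax:spark-cassandra-connector:2.0.0-s_2.11"
    config[["spark.cassandra.connection.host"]] <- "10.0.0.10"   # placeholder Cassandra host

    sc <- spark_connect(master = "local", config = config)

    # expose a Cassandra table to Spark as a DataFrame
    my_tbl <- spark_read_source(
      sc,
      name    = "my_table",
      source  = "org.apache.spark.sql.cassandra",
      options = list(keyspace = "my_keyspace", table = "my_table")
    )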

Sparklyr on RStudio EC2 with invoke error hadoopConfiguration standalone cluster

ⅰ亾dé卋堺 submitted on 2020-01-24 13:20:48
Question: I have a 1 master / 2 slave standalone cluster on EC2. I am running RStudio from EC2, and after I run the following code:

    library(aws.s3)
    library(sparklyr)
    library(tidyverse)
    library(RCurl)

    Sys.setenv("AWS_ACCESS_KEY_ID" = "myaccesskeyid",
               "AWS_SECRET_ACCESS_KEY" = "myaccesskey",
               "SPARK_CONF_DIR" = "/home/rstudio/spark/spark-2.1.0-bin-hadoop2.7/bin/",
               "JAVA_HOME" = "/usr/lib/jvm/java-8-oracle")

    ctx <- spark_context(sc)
    jsc <- invoke_static(sc, "org.apache.spark.api.java.JavaSparkContext", …
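
A minimal sketch of how the Hadoop configuration is usually reached from sparklyr, assuming a working spark_connect() call comes first (the master URL is a placeholder). The configuration lives on the underlying Scala SparkContext, so it can be fetched with invoke() on that object rather than via invoke_static():

    library(sparklyr)

    sc <- spark_connect(master = "spark://master-node:7077")   # placeholder master URL

    ctx   <- spark_context(sc)                    # the underlying SparkContext
    hconf <- invoke(ctx, "hadoopConfiguration")   # org.apache.hadoop.conf.Configuration

    # pass the S3 credentials through to Hadoop's s3a filesystem
    invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
    invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))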

How to implement Stanford CoreNLP wrapper for Apache Spark using sparklyr?

久未见 submitted on 2020-01-23 06:23:51
Question: I am trying to create an R package so I can use the Stanford CoreNLP wrapper for Apache Spark (by Databricks) from R. I am using the sparklyr package to connect to my local Spark instance. I created a package with the following dependency function:

    spark_dependencies <- function(spark_version, scala_version, ...) {
      sparklyr::spark_dependency(
        jars = c(
          system.file(
            sprintf("stanford-corenlp-full/stanford-corenlp-3.6.0.jar"),
            package = "sparkNLP"
          ),
          system.file(
            sprintf("stanford-corenlp-full …
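
A minimal sketch of the full sparklyr extension pattern the question is building toward, assuming the CoreNLP jar ships inside the sparkNLP package as above, and that the Databricks spark-corenlp artifact coordinates below match the Spark/Scala version in use (both are assumptions, not confirmed by the question):

    spark_dependencies <- function(spark_version, scala_version, ...) {
      sparklyr::spark_dependency(
        jars = c(
          system.file("stanford-corenlp-full/stanford-corenlp-3.6.0.jar",
                      package = "sparkNLP")
        ),
        # spark-packages coordinate for the Databricks wrapper; version is an assumption
        packages = sprintf("databricks:spark-corenlp:0.2.0-s_%s", scala_version)
      )
    }

    # register the extension when the package loads, so sparklyr picks up
    # these dependencies at spark_connect() time
    .onLoad <- function(libname, pkgname) {
      sparklyr::register_extension(pkgname)
    }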