SparkR

Loading com.databricks.spark.csv via RStudio

断了今生、忘了曾经 submitted on 2019-11-30 07:37:25
Question: I have installed Spark 1.4.0 along with its R package SparkR, and I can use it both via the Spark shell and via RStudio; however, there is one difference I cannot resolve. When launching the SparkR shell with

    ./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3

I can read a .csv file as follows:

    flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")

Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME), I get the following error message:

    15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
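A common fix, consistent with the commented-out line in "How to load csv file into SparkR on RStudio?" further down this page, is to hand the package to the JVM backend before SparkR starts, via the SPARKR_SUBMIT_ARGS environment variable. A minimal sketch, assuming Spark 1.4.x; the SPARK_HOME path is illustrative:

    # tell the backend to fetch spark-csv before launching sparkr-shell
    Sys.setenv(SPARK_HOME = "/path/to/spark-1.4.0")   # hypothetical path
    Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

    library(SparkR)
    sc <- sparkR.init(master = "local[7]")
    sqlContext <- sparkRSQL.init(sc)
    flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header = "true")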

Difference between createOrReplaceTempView and registerTempTable

时光总嘲笑我的痴心妄想 submitted on 2019-11-30 06:44:18
Question: I am new to Spark and was trying out a few commands in Spark SQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two? They seem to have the same set of functionality.

Answer: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that, registerTempTable and createOrReplaceTempView are functionally equivalent, and the former calls the latter.
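The same pair exists in SparkR, where the distinction is easy to demonstrate. A minimal sketch against Spark 2.x; the view name is made up:

    library(SparkR)
    sparkR.session(master = "local")

    df <- createDataFrame(faithful)                 # built-in R dataset
    createOrReplaceTempView(df, "faithful_tbl")     # Spark 2.0+ API
    head(sql("SELECT * FROM faithful_tbl WHERE waiting > 70"))

    # registerTempTable(df, "faithful_tbl")         # deprecated 1.x equivalent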

Using SparkR and Sparklyr simultaneously

两盒软妹~` submitted on 2019-11-29 14:30:40
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in scope, so I think one currently needs to use both packages to get the full range of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use them in parallel, I guess. But is it actually possible? What are your best practices?

Answer: These two packages use different mechanisms and are not designed for interoperability.
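If you do load both anyway, the main practical hazard is namespace masking, since SparkR and dplyr/sparklyr export many functions with the same names (filter, select, and so on). A sketch of keeping the calls unambiguous; note that each package still talks to its own session, and objects are not interchangeable between them:

    library(sparklyr)
    library(dplyr)
    library(SparkR)    # attached last, so it masks several dplyr generics

    sc_lyr <- sparklyr::spark_connect(master = "local")   # sparklyr's connection
    SparkR::sparkR.session(master = "local")              # SparkR's separate session

    # qualify calls explicitly instead of relying on attach order
    tbl <- dplyr::copy_to(sc_lyr, mtcars, "mtcars_tbl")   # a sparklyr tbl
    sdf <- SparkR::createDataFrame(mtcars)                # a SparkR DataFrame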

How to handle null entries in SparkR

这一生的挚爱 submitted on 2019-11-29 13:33:06
Question: I have a SparkSQL DataFrame. Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas? In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods. Thanks.

Answer 1: SparkR Column provides a long list of useful methods, including isNull and isNotNull:

    > people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
    > people <- createDataFrame(sqlContext, people_local)
    > head(people)
      Id Age
    1  1  21
    2  2  18
    3  3  30
    4  4  NA
    > filter(people, isNotNull(people$Age)) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    > filter(people, isNull(people$Age)) %>% head()
      Id Age
    1  4  NA
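For dropping such rows wholesale rather than filtering column by column, SparkR also exposes the DataFrame NA-handling functions. A short sketch, assuming Spark 1.4+ where dropna is available:

    > head(dropna(people))    # drops rows containing any NA/null
      Id Age
    1  1  21
    2  2  18
    3  3  30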

How to load csv file into SparkR on RStudio?

流过昼夜 submitted on 2019-11-29 04:48:32
Question: How do you load a csv file into SparkR on RStudio? Below are the steps I had to perform to run SparkR on RStudio. I have used read.df to read the .csv, but I am not sure how else to write this, and not sure whether this step counts as creating RDDs.

    # Set sys environment variables
    Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    #Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

    # Load libraries
    library(SparkR)
    library(magrittr)

    sc <- sparkR.init(master="local")
    sc
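One way to finish this flow, sketched from the other questions on this page: pull spark-csv in at init time via sparkR.init's sparkPackages argument (used in the question below), then read with read.df. The file path here is illustrative:

    sc <- sparkR.init(master = "local",
                      sparkPackages = "com.databricks:spark-csv_2.10:1.0.3")
    sqlContext <- sparkRSQL.init(sc)

    flights <- read.df(sqlContext, "C:/path/to/nycflights13.csv",   # hypothetical path
                       source = "com.databricks.spark.csv", header = "true")
    head(flights)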

SparkR vs sparklyr [closed]

江枫思渺然 submitted on 2019-11-28 15:53:01
Question: Does someone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward, both to install and to use, especially with the dplyr inputs. Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code? Best

Answer: The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark: https://spark.apache.org/docs/2.0.1/sparkr.html#applying
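For reference, that UDF support looks roughly like the following in SparkR 2.x. A sketch only; the column names and the doubling function are made up:

    library(SparkR)
    sparkR.session(master = "local")

    df <- createDataFrame(data.frame(x = 1:10))

    # run arbitrary R code over each partition; the output schema must be declared
    schema <- structType(structField("x", "integer"),
                         structField("x2", "integer"))
    doubled <- dapply(df, function(part) {
      part$x2 <- part$x * 2L
      part
    }, schema)
    head(collect(doubled))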

Empty output when reading a csv file into Rstudio using SparkR

*爱你&永不变心* submitted on 2019-11-28 12:41:05
Question: I'm a new user of SparkR. I'm trying to load a csv file into R using SparkR.

    Sys.setenv(SPARK_HOME="/usr/local/bin/spark-1.5.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
    sqlContext <- sparkRSQL.init(sc)

I used a subset of the nyc flights dataset just for testing. It only has 4 rows and 4 columns:

    gyear month day dep_time
    2013  1    1   517
    2013  1    1   533
    2013  1    1   542
    2013  1    1   544

    n5 <- read.df(sqlContext, "/users/zhiyi.zhang/Downloads/n5.csv", "com.databricks.spark.csv")
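Two things are worth checking in this setup, offered as guesses rather than a confirmed diagnosis: the init line above pulls the Scala 2.11 build of spark-csv (spark-csv_2.11) while the stock Spark 1.5.1 binaries are built against Scala 2.10, a mismatch that can fail in confusing ways; and without header="true", spark-csv treats the header row as data. A sketch of the read with the header option set:

    n5 <- read.df(sqlContext, "/users/zhiyi.zhang/Downloads/n5.csv",
                  source = "com.databricks.spark.csv", header = "true")
    printSchema(n5)   # confirm the four columns were picked up
    head(n5)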

Writing R data frames returned from SparkR:::map

岁酱吖の submitted on 2019-11-28 10:00:42
Question: I am using SparkR:::map, and my function returns a large-ish R data frame for each input row, each of the same shape. I would like to write these data frames out as parquet files without collecting them. Can I map write.df over my output list? Can I get the worker tasks to write the parquet instead? I now have a working example. I am happy with this, except that I did not expect the reduce to implicitly collect, as I wanted to write the resultant DF as parquet.
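For the writing side, SparkR's public API is write.df, which writes a distributed DataFrame from the workers without collecting it; the gap the question circles is that the per-row R data frames produced inside SparkR:::map are not such DataFrames. A minimal sketch of the public call, with an illustrative output path:

    df <- createDataFrame(sqlContext, mtcars)    # any SparkR DataFrame
    write.df(df, path = "/tmp/mtcars_parquet",   # hypothetical path
             source = "parquet", mode = "overwrite")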
