SparkR

Loading com.databricks.spark.csv via RStudio

断了今生、忘了曾经 submitted on 2019-11-30 07:37:25
Question: I have installed Spark 1.4.0 along with its R package SparkR, and I can use it both via the Spark shell and via RStudio; however, there is one difference I cannot resolve. When launching the SparkR shell with

    ./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3

I can read a .csv file as follows:

    flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")

Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME), I get the following error message:

    15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
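A common fix, consistent with the commented-out line in "How to load csv file into SparkR on RStudio?" further down this page, is to hand the package to the JVM backend before SparkR starts, via the SPARKR_SUBMIT_ARGS environment variable. A minimal sketch, assuming Spark 1.4.x; the SPARK_HOME path is illustrative:

    # tell the backend to fetch spark-csv before launching sparkr-shell
    Sys.setenv(SPARK_HOME = "/path/to/spark-1.4.0")   # hypothetical path
    Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

    library(SparkR)
    sc <- sparkR.init(master = "local[7]")
    sqlContext <- sparkRSQL.init(sc)
    flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header = "true")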

Difference between createOrReplaceTempView and registerTempTable

时光总嘲笑我的痴心妄想 submitted on 2019-11-30 06:44:18
Question: I am new to Spark and was trying out a few commands in Spark SQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two? They seem to have the same set of functionality.

Answer: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that, registerTempTable and createOrReplaceTempView are functionally equivalent, and the former calls the latter.
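The same pair exists in SparkR, where the distinction is easy to demonstrate. A minimal sketch against Spark 2.x; the view name is made up:

    library(SparkR)
    sparkR.session(master = "local")

    df <- createDataFrame(faithful)                 # built-in R dataset
    createOrReplaceTempView(df, "faithful_tbl")     # Spark 2.0+ API
    head(sql("SELECT * FROM faithful_tbl WHERE waiting > 70"))

    # registerTempTable(df, "faithful_tbl")         # deprecated 1.x equivalent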

Using SparkR and Sparklyr simultaneously

两盒软妹~` submitted on 2019-11-29 14:30:40
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in scope, so I think one currently needs to use both packages to get the full range of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use them in parallel, I guess. But is it actually possible? What are your best practices?

Answer: These two packages use different mechanisms and are not designed for interoperability.
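If you do load both anyway, the main practical hazard is namespace masking, since SparkR and dplyr/sparklyr export many functions with the same names (filter, select, and so on). A sketch of keeping the calls unambiguous; note that each package still talks to its own session, and objects are not interchangeable between them:

    library(sparklyr)
    library(dplyr)
    library(SparkR)    # attached last, so it masks several dplyr generics

    sc_lyr <- sparklyr::spark_connect(master = "local")   # sparklyr's connection
    SparkR::sparkR.session(master = "local")              # SparkR's separate session

    # qualify calls explicitly instead of relying on attach order
    tbl <- dplyr::copy_to(sc_lyr, mtcars, "mtcars_tbl")   # a sparklyr tbl
    sdf <- SparkR::createDataFrame(mtcars)                # a SparkR DataFrame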

How to handle null entries in SparkR

这一生的挚爱 submitted on 2019-11-29 13:33:06
Question: I have a SparkSQL DataFrame. Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas? In R I can easily remove them, but in SparkR it says there is a problem with the S4 system/methods. Thanks.

Answer 1: SparkR Column provides a long list of useful methods, including isNull and isNotNull:

    > people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
    > people <- createDataFrame(sqlContext, people_local)
    > head(people)
      Id Age
    1  1  21
    2  2  18
    3  3  30
    4  4  NA
    > filter(people, isNotNull(people$Age)) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    > filter(people, isNull(people$Age)) %>% head()
      Id Age
    1  4  NA
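For dropping such rows wholesale rather than filtering column by column, SparkR also exposes the DataFrame NA-handling functions. A short sketch, assuming Spark 1.4+ where dropna is available:

    > head(dropna(people))    # drops rows containing any NA/null
      Id Age
    1  1  21
    2  2  18
    3  3  30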

How to load csv file into SparkR on RStudio?

流过昼夜 submitted on 2019-11-29 04:48:32
Question: How do you load a csv file into SparkR on RStudio? Below are the steps I had to perform to run SparkR on RStudio. I have used read.df to read the .csv, but I am not sure how else to write this, and not sure whether this step counts as creating RDDs.

    # Set sys environment variables
    Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    #Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

    # Load libraries
    library(SparkR)
    library(magrittr)

    sc <- sparkR.init(master="local")
    sc
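One way to finish this flow, sketched from the other questions on this page: pull spark-csv in at init time via sparkR.init's sparkPackages argument (used in the question below), then read with read.df. The file path here is illustrative:

    sc <- sparkR.init(master = "local",
                      sparkPackages = "com.databricks:spark-csv_2.10:1.0.3")
    sqlContext <- sparkRSQL.init(sc)

    flights <- read.df(sqlContext, "C:/path/to/nycflights13.csv",   # hypothetical path
                       source = "com.databricks.spark.csv", header = "true")
    head(flights)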

SparkR vs sparklyr [closed]

江枫思渺然 submitted on 2019-11-28 15:53:01
Question: Does someone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward, both to install and to use, especially with the dplyr inputs. Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code? Best

Answer: The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark: https://spark.apache.org/docs/2.0.1/sparkr.html#applying
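For reference, that UDF support looks roughly like the following in SparkR 2.x. A sketch only; the column names and the doubling function are made up:

    library(SparkR)
    sparkR.session(master = "local")

    df <- createDataFrame(data.frame(x = 1:10))

    # run arbitrary R code over each partition; the output schema must be declared
    schema <- structType(structField("x", "integer"),
                         structField("x2", "integer"))
    doubled <- dapply(df, function(part) {
      part$x2 <- part$x * 2L
      part
    }, schema)
    head(collect(doubled))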

Empty output when reading a csv file into Rstudio using SparkR

*爱你&永不变心* submitted on 2019-11-28 12:41:05
Question: I'm a new user of SparkR. I'm trying to load a csv file into R using SparkR.

    Sys.setenv(SPARK_HOME="/usr/local/bin/spark-1.5.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
    sqlContext <- sparkRSQL.init(sc)

I used a subset of the nyc flights dataset just for testing. It only has 4 rows and 4 columns:

    gyear month day dep_time
    2013  1    1   517
    2013  1    1   533
    2013  1    1   542
    2013  1    1   544

    n5 <- read.df(sqlContext, "/users/zhiyi.zhang/Downloads/n5.csv", "com.databricks.spark.csv")
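Two things are worth checking in this setup, offered as guesses rather than a confirmed diagnosis: the init line above pulls the Scala 2.11 build of spark-csv (spark-csv_2.11) while the stock Spark 1.5.1 binaries are built against Scala 2.10, a mismatch that can fail in confusing ways; and without header="true", spark-csv treats the header row as data. A sketch of the read with the header option set:

    n5 <- read.df(sqlContext, "/users/zhiyi.zhang/Downloads/n5.csv",
                  source = "com.databricks.spark.csv", header = "true")
    printSchema(n5)   # confirm the four columns were picked up
    head(n5)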

Writing R data frames returned from SparkR:::map

岁酱吖の submitted on 2019-11-28 10:00:42
Question: I am using SparkR:::map, and my function returns a large-ish R data frame for each input row, each of the same shape. I would like to write these data frames out as parquet files without collecting them. Can I map write.df over my output list? Can I get the worker tasks to write the parquet instead? I now have a working example. I am happy with this, except that I did not expect the reduce to implicitly collect, as I wanted to write the resultant DF as parquet.
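For the writing side, SparkR's public API is write.df, which writes a distributed DataFrame from the workers without collecting it; the gap the question circles is that the per-row R data frames produced inside SparkR:::map are not such DataFrames. A minimal sketch of the public call, with an illustrative output path:

    df <- createDataFrame(sqlContext, mtcars)    # any SparkR DataFrame
    write.df(df, path = "/tmp/mtcars_parquet",   # hypothetical path
             source = "parquet", mode = "overwrite")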
