SparkR

How do I apply a function on each value of a column in a SparkR DataFrame?

Submitted by 送分小仙女 on 2019-12-04 09:30:54
I am relatively new to SparkR. I downloaded Spark 1.4 and set up RStudio to use the SparkR library. I want to know how I can apply a function to each value in a column of a distributed DataFrame; can someone please help? For example, this works perfectly:

    myFunc <- function(x) { paste(x, "_hello") }
    c <- c("a", "b", "c")
    d <- lapply(c, myFunc)

How do I make this work for a distributed DataFrame? The intention is to append "_hello" to each value of the column Name of DF:

    DF <- read.df(sqlContext, "TV_Flattened_2.csv", source = "com.databricks.spark.csv", header = "true")
    SparkR:::lapply(DF$Name,
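One common alternative to a row-wise lapply is to express the transformation as a column expression. A minimal sketch, assuming the DF above and a SparkR version that exports concat() and lit() (they may not be available in 1.4):

    # Sketch: append "_hello" to every value of the Name column via column expressions.
    DF2 <- withColumn(DF, "NameHello", concat(DF$Name, lit("_hello")))
    head(select(DF2, "NameHello"))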

SparkR Error in sparkR.init(master=“local”) in RStudio

Submitted by 允我心安 on 2019-12-04 01:16:43
Question: I have installed the SparkR package from the Spark distribution into the R library. I can call the following command and it seems to work properly:

    library(SparkR)

However, when I try to get the Spark context using the following code,

    sc <- sparkR.init(master="local")

it fails after some time with the following message:

    Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds

I have set JAVA_HOME, and I have a working RStudio where I can access other packages like ggplot2. I don
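The "JVM is not ready" error usually means sparkR.init could not launch the Spark backend via spark-submit. A minimal sketch of the usual environment checks; both paths below are illustrative placeholders, not values from the question:

    # Sketch: make sure SparkR can find a Spark installation and a working Java.
    Sys.setenv(SPARK_HOME = "/opt/spark-1.4.0")
    Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java-8-openjdk")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    sc <- sparkR.init(master = "local")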

Unable to launch SparkR in RStudio

Submitted by ≯℡__Kan透↙ on 2019-12-04 00:15:36
After a long and difficult installation process of SparkR, I am running into new problems launching SparkR.

My settings:
R 3.2.0
RStudio 0.98.1103
Rtools 3.3
Spark 1.4.0
Java Version 8
SparkR 1.4.0
Windows 7 SP1 64 Bit

Now I try to use the following code in R:

    library(devtools)
    library(SparkR)
    Sys.setenv(SPARK_MEM="1g")
    Sys.setenv(SPARK_HOME="C:/spark-1.4.0")
    sc <- sparkR.init(master="local")

I receive the following:

    JVM is not ready after 10 seconds

I was also trying to add some system variables like the Spark path or the Java path. Do you have any advice for me to fix these problems? The next step for me after

Install SparkR that comes with Spark 1.4

Submitted by 微笑、不失礼 on 2019-12-03 20:42:40
The newest version of Spark (1.4) now comes with SparkR. Does anyone know how to go about installing the SparkR implementation on Windows? The sparkR.R script is currently located in C:/spark-1.4.0/R/pkgs/R/. This appears to be a step in the right direction, but the instructions don't work for Windows as there is no sparkR directory that they relate to. @DavidArenburg put me on the right track. Following the Windows documentation in C:\spark-1.4.0\R\WINDOWS.md, I installed Rtools and added R.exe and Rtools to my computer's PATH. Then I ran install-dev.bat in C:\spark-1.4.0\R. This added the lib
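Once install-dev.bat has built the package, loading it from an R session looks roughly like this. This is a sketch based on the paths mentioned above, not an official recipe:

    # Sketch: load the SparkR package built by install-dev.bat and start a context.
    Sys.setenv(SPARK_HOME = "C:/spark-1.4.0")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master = "local")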

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

Submitted by ≡放荡痞女 on 2019-12-03 20:16:59
I read a parquet file from an HDFS system:

    path <- "hdfs://part_2015"
    AppDF <- parquetFile(sqlContext, path)
    printSchema(AppDF)
    root
     |-- app: binary (nullable = true)
     |-- category: binary (nullable = true)
     |-- date: binary (nullable = true)
     |-- user: binary (nullable = true)

    class(AppDF)
    [1] "DataFrame"
    attr(,"package")
    [1] "SparkR"

    collect(AppDF)
    .....error: arguments imply differing number of rows: 46021, 39175, 62744, 27137

    head(AppDF)
    .....error: arguments imply differing number of rows: 36, 30, 48

I've read some threads about this problem, but it's not my case. In fact, I just read a table from
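All four columns have the binary type, which collect()/head() struggle to turn into an R data.frame. One commonly suggested workaround is to cast the columns to string before collecting; a sketch, assuming the AppDF above:

    # Sketch: cast the binary columns (names taken from the schema above) to string
    # so that collect()/head() can assemble an R data.frame.
    AppDF$app <- cast(AppDF$app, "string")
    AppDF$category <- cast(AppDF$category, "string")
    AppDF$date <- cast(AppDF$date, "string")
    AppDF$user <- cast(AppDF$user, "string")
    localDF <- collect(AppDF)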

SparkR - ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

Submitted by Anonymous (unverified) on 2019-12-03 10:10:24
Question: When trying to connect to a Spark cluster using SparkR in RStudio:

    if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
      Sys.setenv(SPARK_HOME = "/usr/lib/spark/spark-2.1.1-bin-hadoop2.6")
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    }
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

    # Starting a sparkR session
    sparkR.session(master = "spark://myIpAddress.eu-west-1.compute.internal:7077")

I am getting the following error message:

    Spark package found in SPARK_HOME: /usr/lib/spark/spark-2.1.1-bin

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

Submitted by Anonymous (unverified) on 2019-12-03 08:46:08
Question: I am new to Spark and would like to know if there are options other than the ones below to read data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (pure text, csv, json, xml, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities to read such kinds of data besides the read.df function? The following code uses read.df and jsonFile, but jsonFile produces an error:

    Sys.setenv
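For comparison, plain read.df calls for JSON and CSV in SparkR 1.5 look roughly like this. The paths are illustrative, and the CSV source assumes the spark-csv package is on the classpath:

    # Sketch: reading JSON and CSV with read.df in SparkR 1.5 (illustrative paths).
    peopleJSON <- read.df(sqlContext, "hdfs:///data/people.json", source = "json")
    peopleCSV <- read.df(sqlContext, "hdfs:///data/people.csv",
                         source = "com.databricks.spark.csv", header = "true")
    head(peopleJSON)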

Add column to DataFrame in sparkR

Submitted by Anonymous (unverified) on 2019-12-03 02:24:01
Question: I would like to add a column filled with a character N in a DataFrame in SparkR. I would do it like this with non-SparkR code:

    df$new_column <- "N"

But with SparkR, I get the following error:

    Error: class(value) == "Column" || is.null(value) is not TRUE

I've tried insane things to manage it. I was able to create a column using another (existing) one with df <- withColumn(df, "new_column", df$existing_column), but this simple thing, nope... Any help? Thanks.

Answer 1: The straightforward solution is to use the SparkR::lit() function:

    df_new =
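The answer is cut off above; a sketch of how lit() is typically combined with withColumn for this, using the names from the question:

    # Sketch: add a constant-valued column by wrapping the literal in a Column.
    df_new <- withColumn(df, "new_column", lit("N"))
    # The $<- assignment also accepts a Column expression:
    # df$new_column <- lit("N")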

SparkR window function

Submitted by Anonymous (unverified) on 2019-12-03 01:57:01
Question: I found on JIRA that the 1.6 release of SparkR has implemented window functions including lag and rank, but the over function is not implemented yet. How can I use a window function like lag without over in SparkR (not the SparkSQL way)? Can someone provide an example?

Answer 1:

Spark 2.0.0+: SparkR provides DSL wrappers with over, window.partitionBy / partitionBy, window.orderBy / orderBy and rowsBetween / rangeBetween functions.

Spark <= 1.6.0: Unfortunately it is not possible in 1.6.0. While some window functions, including lag, have been implemented
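For the Spark 2.0.0+ case, a lag-over-window call looks roughly like this. The column names are illustrative; windowPartitionBy, over and lag are part of the SparkR 2.x column DSL:

    # Sketch: previous value within each group, ordered by time (illustrative names).
    ws <- orderBy(windowPartitionBy("group"), "time")
    df <- withColumn(df, "prev_value", over(lag(df$value, 1), ws))
    head(df)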