sparkr

How to connect to an existing Spark session

空扰寡人 submitted on 2019-12-02 02:54:17
I installed Spark (spark-2.1.0-bin-hadoop2.7) locally with success. Running Spark from the terminal worked via the command below:

$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/01/08 12:30:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/08 12:30:30 WARN ObjectStore: Failed to get database global_temp, returning …
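A minimal sketch of attaching SparkR to that local install (Spark 2.x; the install path is an assumption, not from the question). sparkR.session() returns the existing session if one is already running in this process, and creates one otherwise:

Sys.setenv(SPARK_HOME = "/path/to/spark-2.1.0-bin-hadoop2.7")  # assumed install path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Get-or-create: reuses the running session in this JVM if there is one
sparkR.session(master = "local[*]")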

How to unnest data with SparkR?

这一生的挚爱 submitted on 2019-12-01 12:57:58
Question: Using SparkR, how can nested arrays be "exploded along"? I've tried using explode like so:

dat <- nested_spark_df %>%
  mutate(a = explode(metadata)) %>%
  head()

Although the above does not throw an exception, it does not promote the nested fields in metadata to the top level. Essentially I'm after behavior similar to Hive's LATERAL VIEW explode() functionality without relying on a HiveContext. Note that in the code snippet I'm using the NSE enabled via SparkRext. I think …
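One plain-SparkR route worth trying (a sketch, assuming metadata is an array of structs; the alias "meta" is made up):

# explode() yields one row per array element
exploded <- select(nested_spark_df, alias(explode(nested_spark_df$metadata), "meta"))

# then promote the struct's fields to top-level columns
flat <- select(exploded, "meta.*")
head(flat)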

How to use Jupyter + SparkR with a custom R install

妖精的绣舞 submitted on 2019-12-01 09:49:16
Question: I am using a Dockerized image and a Jupyter notebook with a SparkR kernel. When I create a SparkR notebook, it uses an install of Microsoft R (3.3.2) instead of the vanilla CRAN R install (3.2.3). The Docker image I'm using installs some custom R libraries and Python packages, but I don't explicitly install Microsoft R. Regardless of whether I can remove Microsoft R or have it side by side, how can I get my SparkR kernel to use a custom installation of R? Thanks in advance.

Answer 1: Docker …
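A quick diagnostic sketch to run inside the SparkR notebook, to confirm which R installation the kernel actually picked up:

R.version.string      # e.g. a 3.3.2 string here would point at Microsoft R
Sys.getenv("R_HOME")  # home directory of the R binary the kernel launched
.libPaths()           # library paths this R searches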

Should I pre-install CRAN R packages on worker nodes when using SparkR?

百般思念 submitted on 2019-12-01 08:10:08
I want to use CRAN R packages such as forecast with SparkR, and I hit the following two problems.

Should I pre-install all those packages on the worker nodes? When I read the Spark source code (this file), it seems that Spark automatically zips packages and distributes them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

Suppose I need to use functions provided by forecast in a map transformation: how should I import the package? Do I need to do something like the following and import the package inside the map function, and will it make multiple …
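A minimal sketch of the usual pattern, using spark.lapply (Spark 2.0+) and assuming forecast has been pre-installed on every worker node; library() inside the function loads the package on the executor, not the driver:

library(SparkR)
sparkR.session()

results <- spark.lapply(1:4, function(i) {
  library(forecast)                         # runs on the worker
  fit <- auto.arima(AirPassengers)          # toy model on a built-in series
  as.numeric(forecast(fit, h = i)$mean[i])  # i-step-ahead point forecast
})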

Convert date to end of month in Spark

自古美人都是妖i submitted on 2019-12-01 05:12:41
Question: I have a Spark DataFrame as shown below:

# Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)

# Make sure the df$dates column is in 'date' format
df <- withColumn(df, 'dates', cast(df$dates, 'date'))

name    | dates
--------+-----------
Thomas  | 2017-01-05
William | 2017-02-23
Bill    | 2017-03-16
John    | 2017-04-08

I want to change the dates to the end-of-month date, so they would look like …
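A one-line sketch using SparkR's built-in last_day() column function, which maps each date to the last day of its month (continuing from the setup above):

df <- withColumn(df, 'dates', last_day(df$dates))
head(df)  # e.g. 2017-01-05 becomes 2017-01-31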

Duplicate columns in Spark DataFrame

守給你的承諾、 submitted on 2019-12-01 04:45:31
I have a 10 GB CSV file with duplicate columns in a Hadoop cluster. I try to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame:

df <- read.df(
  sqlContext,
  FILE_PATH,
  source = "com.databricks.spark.csv",
  header = "true",
  mode = "DROPMALFORMED"
)

But since df has duplicate Email columns, selecting that column errors out:

select(df, 'Email')

15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email …
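One common workaround (a sketch): pass an explicit schema with distinct names when reading, so the two Email columns no longer collide. The field names and their order below are assumptions about the file, not facts from the question:

customSchema <- structType(
  structField("name", "string"),
  structField("Email_primary", "string"),    # first Email column, renamed
  structField("Email_secondary", "string"))  # second Email column, renamed

df <- read.df(sqlContext, FILE_PATH,
              source = "com.databricks.spark.csv",
              header = "true", schema = customSchema)
select(df, "Email_primary")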

SparkR Error in sparkR.init(master="local") in RStudio

家住魔仙堡 submitted on 2019-12-01 04:16:51
I have installed the SparkR package from the Spark distribution into my R library. I can call the following command and it seems to work properly:

library(SparkR)

However, when I try to get the Spark context with

sc <- sparkR.init(master = "local")

it fails after some time with the following message:

Error in sparkR.init(master = "local") :
  JVM is not ready after 10 seconds

I have set JAVA_HOME, and I have a working RStudio where I can access other packages like ggplot2. I don't know why it is not working, and I don't even know where to start investigating the issue. I had the same …
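A sketch of the setup that usually makes sparkR.init() work from RStudio: point SPARK_HOME at the Spark install (the path below is an assumption) and load SparkR from that install before initializing:

Sys.setenv(SPARK_HOME = "/opt/spark")  # assumed install path; adjust to yours
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")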

Add column to DataFrame in sparkR

眉间皱痕 submitted on 2019-12-01 02:48:36
I would like to add a column filled with the character N to a DataFrame in SparkR. With non-SparkR code I would do it like this:

df$new_column <- "N"

But with SparkR I get the following error:

Error: class(value) == "Column" || is.null(value) is not TRUE

I've tried all sorts of things to get around it; I was able to create a column from another (existing) one with df <- withColumn(df, "new_column", df$existing_column), but this simple thing, no luck... Any help? Thanks.

Dmitriy Selivanov: The straightforward solution is to use the SparkR::lit() function:

df_new <- withColumn(df, "new_column_name", lit("N"))
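A quick sanity check of the result (a sketch, reusing the names from the answer):

head(select(df_new, "new_column_name"))  # every row shows "N"
printSchema(df_new)                      # new_column_name appears as string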

SparkR: write DF as a csv/txt file

我的梦境 submitted on 2019-12-01 01:26:36
Hi, I'm working with SparkR in YARN mode. I need to write a SparkR df to a csv/txt file. I saw that there is write.df, but it writes Parquet files. I tried this:

RdataFrame <- collect(SparkRDF)
write.table(RdataFrame, ..)

But I got many WARNs and some ERRORs from the ContextCleaner. Is there any way?

Answer 1: Spark 2.0+ — you can use the write.text function: "Save the content of the SparkDataFrame in a text file at the specified path. The SparkDataFrame must have only one column of string type with the name 'value'. Each row becomes a new line in the output file."

write.text(df, path)

or write.df with the built-in …
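Sketches of both routes mentioned above (Spark 2.0+; the output paths are made up):

# Plain text: df must have a single string column named "value"
write.text(df, "/tmp/out_txt")

# CSV via the built-in csv data source
write.df(df, path = "/tmp/out_csv", source = "csv", mode = "overwrite")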
