sparkr

How to connect to an existing Spark session

空扰寡人 submitted on 2019-12-02 02:54:17
I installed Spark (spark-2.1.0-bin-hadoop2.7) locally with success. Running Spark from the terminal worked via the command below:

$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/01/08 12:30:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/08 12:30:30 WARN ObjectStore: Failed to get database global_temp, returning …
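A minimal sketch of attaching SparkR to that local install (Spark 2.x; the install path is an assumption, not from the question). sparkR.session() returns the existing session if one is already running in this process, and creates one otherwise:

Sys.setenv(SPARK_HOME = "/path/to/spark-2.1.0-bin-hadoop2.7")  # assumed install path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Get-or-create: reuses the running session in this JVM if there is one
sparkR.session(master = "local[*]")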

How to unnest data with SparkR?

这一生的挚爱 submitted on 2019-12-01 12:57:58
Question: Using SparkR, how can nested arrays be "exploded along"? I've tried using explode like so:

dat <- nested_spark_df %>%
  mutate(a = explode(metadata)) %>%
  head()

Although the above does not throw an exception, it does not promote the nested fields in metadata to the top level. Essentially I'm after behavior similar to Hive's LATERAL VIEW explode() functionality without relying on a HiveContext. Note that in the code snippet I'm using the NSE enabled via SparkRext. I think …
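One plain-SparkR route worth trying (a sketch, assuming metadata is an array of structs; the alias "meta" is made up):

# explode() yields one row per array element
exploded <- select(nested_spark_df, alias(explode(nested_spark_df$metadata), "meta"))

# then promote the struct's fields to top-level columns
flat <- select(exploded, "meta.*")
head(flat)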

How to use Jupyter + SparkR with a custom R install

妖精的绣舞 submitted on 2019-12-01 09:49:16
Question: I am using a Dockerized image and a Jupyter notebook with a SparkR kernel. When I create a SparkR notebook, it uses an install of Microsoft R (3.3.2) instead of the vanilla CRAN R install (3.2.3). The Docker image I'm using installs some custom R libraries and Python packages, but I don't explicitly install Microsoft R. Regardless of whether I can remove Microsoft R or have it side by side, how can I get my SparkR kernel to use a custom installation of R? Thanks in advance.

Answer 1: Docker …
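A quick diagnostic sketch to run inside the SparkR notebook, to confirm which R installation the kernel actually picked up:

R.version.string      # e.g. a 3.3.2 string here would point at Microsoft R
Sys.getenv("R_HOME")  # home directory of the R binary the kernel launched
.libPaths()           # library paths this R searches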

Should I pre-install CRAN R packages on worker nodes when using SparkR?

百般思念 submitted on 2019-12-01 08:10:08
I want to use CRAN R packages such as forecast with SparkR, and I hit the following two problems.

Should I pre-install all those packages on the worker nodes? When I read the Spark source code (this file), it seems that Spark automatically zips packages and distributes them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

Suppose I need to use functions provided by forecast in a map transformation: how should I import the package? Do I need to do something like the following and import the package inside the map function, and will it make multiple …
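A minimal sketch of the usual pattern, using spark.lapply (Spark 2.0+) and assuming forecast has been pre-installed on every worker node; library() inside the function loads the package on the executor, not the driver:

library(SparkR)
sparkR.session()

results <- spark.lapply(1:4, function(i) {
  library(forecast)                         # runs on the worker
  fit <- auto.arima(AirPassengers)          # toy model on a built-in series
  as.numeric(forecast(fit, h = i)$mean[i])  # i-step-ahead point forecast
})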

Convert date to end of month in Spark

自古美人都是妖i submitted on 2019-12-01 05:12:41
Question: I have a Spark DataFrame as shown below:

# Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)

# Make sure the df$dates column is in 'date' format
df <- withColumn(df, 'dates', cast(df$dates, 'date'))

name    | dates
--------+-----------
Thomas  | 2017-01-05
William | 2017-02-23
Bill    | 2017-03-16
John    | 2017-04-08

I want to change the dates to the end-of-month date, so they would look like …
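A one-line sketch using SparkR's built-in last_day() column function, which maps each date to the last day of its month (continuing from the setup above):

df <- withColumn(df, 'dates', last_day(df$dates))
head(df)  # e.g. 2017-01-05 becomes 2017-01-31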

Duplicate columns in Spark DataFrame

守給你的承諾、 submitted on 2019-12-01 04:45:31
I have a 10 GB CSV file with duplicate columns in a Hadoop cluster. I try to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame:

df <- read.df(
  sqlContext,
  FILE_PATH,
  source = "com.databricks.spark.csv",
  header = "true",
  mode = "DROPMALFORMED"
)

But since df has duplicate Email columns, selecting that column errors out:

select(df, 'Email')

15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email …
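One common workaround (a sketch): pass an explicit schema with distinct names when reading, so the two Email columns no longer collide. The field names and their order below are assumptions about the file, not facts from the question:

customSchema <- structType(
  structField("name", "string"),
  structField("Email_primary", "string"),    # first Email column, renamed
  structField("Email_secondary", "string"))  # second Email column, renamed

df <- read.df(sqlContext, FILE_PATH,
              source = "com.databricks.spark.csv",
              header = "true", schema = customSchema)
select(df, "Email_primary")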

SparkR Error in sparkR.init(master="local") in RStudio

家住魔仙堡 submitted on 2019-12-01 04:16:51
I have installed the SparkR package from the Spark distribution into my R library. I can call the following command and it seems to work properly:

library(SparkR)

However, when I try to get the Spark context with

sc <- sparkR.init(master = "local")

it fails after some time with the following message:

Error in sparkR.init(master = "local") :
  JVM is not ready after 10 seconds

I have set JAVA_HOME, and I have a working RStudio where I can access other packages like ggplot2. I don't know why it is not working, and I don't even know where to start investigating the issue. I had the same …
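A sketch of the setup that usually makes sparkR.init() work from RStudio: point SPARK_HOME at the Spark install (the path below is an assumption) and load SparkR from that install before initializing:

Sys.setenv(SPARK_HOME = "/opt/spark")  # assumed install path; adjust to yours
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")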

Add column to DataFrame in sparkR

眉间皱痕 submitted on 2019-12-01 02:48:36
I would like to add a column filled with the character N to a DataFrame in SparkR. With non-SparkR code I would do it like this:

df$new_column <- "N"

But with SparkR I get the following error:

Error: class(value) == "Column" || is.null(value) is not TRUE

I've tried all sorts of things to get around it; I was able to create a column from another (existing) one with df <- withColumn(df, "new_column", df$existing_column), but this simple thing, no luck... Any help? Thanks.

Dmitriy Selivanov: The straightforward solution is to use the SparkR::lit() function:

df_new <- withColumn(df, "new_column_name", lit("N"))
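A quick sanity check of the result (a sketch, reusing the names from the answer):

head(select(df_new, "new_column_name"))  # every row shows "N"
printSchema(df_new)                      # new_column_name appears as string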

SparkR: write DF as a csv/txt file

我的梦境 submitted on 2019-12-01 01:26:36
Hi, I'm working with SparkR in YARN mode. I need to write a SparkR df to a csv/txt file. I saw that there is write.df, but it writes Parquet files. I tried this:

RdataFrame <- collect(SparkRDF)
write.table(RdataFrame, ..)

But I got many WARNs and some ERRORs from the ContextCleaner. Is there any way?

Answer 1: Spark 2.0+ — you can use the write.text function: "Save the content of the SparkDataFrame in a text file at the specified path. The SparkDataFrame must have only one column of string type with the name 'value'. Each row becomes a new line in the output file."

write.text(df, path)

or write.df with the built-in …
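Sketches of both routes mentioned above (Spark 2.0+; the output paths are made up):

# Plain text: df must have a single string column named "value"
write.text(df, "/tmp/out_txt")

# CSV via the built-in csv data source
write.df(df, path = "/tmp/out_csv", source = "csv", mode = "overwrite")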
