SparkR

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

痴心易碎 submitted on 2019-12-21 05:34:12
Question: I read a parquet file from an HDFS system:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)
class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

collect(AppDF) fails with the error: arguments imply differing number of rows: 46021, 39175, 62744, 27137, and head(AppDF) fails with the error: arguments imply differing number of rows: 36,
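A workaround that is often suggested for this symptom (a sketch only, assuming the binary columns actually contain UTF-8 text) is to cast them to string on the Spark side before collecting:

# Sketch: assumes the binary columns hold UTF-8 encoded text.
# Cast each binary column to string before bringing data to the driver.
AppDF$app      <- cast(AppDF$app, "string")
AppDF$category <- cast(AppDF$category, "string")
AppDF$date     <- cast(AppDF$date, "string")
AppDF$user     <- cast(AppDF$user, "string")
localDF <- collect(AppDF)   # should now build a regular R data.frame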

SparkR shows Chinese characters incorrectly

送分小仙女□ submitted on 2019-12-20 05:36:10
Question: I am new to SparkR. Recently I encountered a problem: after converting a file containing Chinese characters into SparkR, they are no longer shown properly. Like this:

city = c("北京","上海","杭州")
A <- as.data.frame(city)
A
  city
1 北京
2 上海
3 杭州

Then I created a DataFrame in SparkR based on that and collected it out, and everything changed:

collect(createDataFrame(sqlContext, A))
  city
1 \027\xac
2 \nw
3 m\xde

I don't know how to transfer them back to readable Chinese characters, or even I hope I can get
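A first thing worth checking (a sketch only, assuming the local strings are not UTF-8 encoded to begin with, which is common on Windows or other non-UTF-8 locales) is to re-encode the column before building the SparkDataFrame:

# Sketch: re-encode the local column as UTF-8 before createDataFrame.
A$city <- enc2utf8(as.character(A$city))
Encoding(A$city)                          # should now report "UTF-8"
df <- createDataFrame(sqlContext, A)
collect(df)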

SparkR from RStudio gives Error in invokeJava(isStatic = TRUE, className, methodName, …) :

戏子无情 submitted on 2019-12-20 05:00:10
Question: I am using RStudio. After creating a session, if I try to create a DataFrame from R data it gives an error.

Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c
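For reference, a minimal Spark 2.x session setup on Windows might look like the sketch below; the paths and the sample data values are placeholders, and E:/winutils is assumed to contain bin/winutils.exe:

# Sketch: adjust paths to your installation; values below are illustrative.
Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")          # must contain bin/winutils.exe
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.warehouse.dir = "file:///C:/Temp"))
df <- as.DataFrame(data.frame(name = c("John", "Smith", "Sarah"),
                              age  = c(19, 23, 18)))
head(df)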

Duplicate columns in Spark Dataframe

徘徊边缘 submitted on 2019-12-19 07:23:30
Question: I have a 10GB csv file in a Hadoop cluster with duplicate columns. I try to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame:

df <- read.df(
  sqlContext,
  FILE_PATH,
  source = "com.databricks.spark.csv",
  header = "true",
  mode = "DROPMALFORMED"
)

But since df has duplicate Email columns, if I want to select this column, it errors out:

select(df, 'Email')
15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id,
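One way around the duplicate-name problem (a sketch; the column names and types below are illustrative, not the file's real schema) is to pass an explicit schema so each Email column gets a unique name:

# Sketch: column names/types are placeholders for the real CSV layout.
customSchema <- structType(
  structField("Name",   "string"),
  structField("Email1", "string"),
  structField("Email2", "string")
)
df <- read.df(sqlContext, FILE_PATH,
              source = "com.databricks.spark.csv",
              schema = customSchema,
              header = "true",
              mode   = "DROPMALFORMED")
head(select(df, "Email1"))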

SparkR: write DF as a csv/txt file

淺唱寂寞╮ submitted on 2019-12-19 04:21:20
Question: Hi, I'm working with SparkR in YARN mode. I need to write a SparkR DataFrame to a csv/txt file. I saw that there is write.df, but it writes parquet files. I tried this:

RdataFrame <- collect(SparkRDF)
write.table(RdataFrame, ..)

But I got many WARN and some ERROR messages from the ContextCleaner. Is there any way?

Answer 1: Spark 2.0+: You can use the write.text function: "Save the content of the SparkDataFrame in a text file at the specified path. The SparkDataFrame must have only one column of string type with the
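For Spark 2.0+, a minimal sketch of both options, writing straight from the SparkDataFrame instead of collecting to the driver; the column name and output paths are placeholders:

# Sketch: write directly from the SparkDataFrame; paths/column are illustrative.
write.text(select(SparkRDF, "some_string_column"), path = "hdfs:///out/txt")  # needs one string column
write.df(SparkRDF, path = "hdfs:///out/csv", source = "csv", mode = "overwrite")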

Difference between createOrReplaceTempView and registerTempTable

ⅰ亾dé卋堺 submitted on 2019-12-18 12:24:17
Question: I am new to Spark and was trying out a few commands in sparkSql using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of functionality.

Answer 1: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that
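The same distinction exists in SparkR; a minimal sketch (the view name and query are illustrative, and df is assumed to be an existing SparkDataFrame):

# Sketch: df is assumed to be an existing SparkDataFrame.
createOrReplaceTempView(df, "people")          # Spark 2.0+ API
# registerTempTable(df, "people")              # deprecated 1.x equivalent
adults <- sql("SELECT * FROM people WHERE age >= 18")
head(adults)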

How do I read a Parquet in R and convert it to an R DataFrame?

五迷三道 submitted on 2019-12-18 11:01:36
Question: I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: there are Java and C++ bindings: https://github.com/apache/parquet-mr

Answer 1: You can use the arrow package for this. It is the same thing as pyarrow in Python, but it nowadays also comes packaged for R without the need for Python. As it is not yet available on CRAN, you
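A minimal sketch of the arrow route (the file path is a placeholder); read_parquet returns an ordinary R data frame (tibble):

# Sketch: install arrow first; the path is illustrative.
library(arrow)
df <- read_parquet("part_2015.parquet")   # returns a regular R data frame
str(df)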

Using SparkR and Sparklyr simultaneously

时光怂恿深爱的人放手 submitted on 2019-12-18 08:48:02
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in scope of functionality. I therefore think that one currently needs to use both packages to get the full scope of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use the packages in parallel, I guess. But is it actually possible? What are your best practices?
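Loading both packages side by side does at least work at the R level, though each manages its own session; a sketch (name masking is the main practical issue, so calls are qualified explicitly; the master URL is a placeholder):

# Sketch: both packages export overlapping names (filter, count, select, ...),
# so qualify calls when they are attached together.
library(sparklyr)
library(SparkR)                                  # last attached wins unqualified names
sc <- sparklyr::spark_connect(master = "local")  # sparklyr-managed connection
SparkR::sparkR.session(master = "local")         # separate SparkR-managed session
df1 <- sparklyr::sdf_len(sc, 10)
df2 <- SparkR::createDataFrame(data.frame(x = 1:10))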

How to load csv file into SparkR on RStudio?

邮差的信 submitted on 2019-12-18 04:18:05
Question: How do you load a csv file into SparkR on RStudio? Below are the steps I had to perform to run SparkR on RStudio. I have used read.df to read the .csv; not sure how else to write this, or whether this step is considered to create RDDs.

# Set sys environment variables
Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3"
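For SparkR 1.x with the spark-csv package, a minimal sketch of the read step; the file path and package version below are placeholders to adapt:

# Sketch: set submit args before loading SparkR; path/version are illustrative.
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "C:/Users/Desktop/flights.csv",
                   source = "com.databricks.spark.csv",
                   header = "true", inferSchema = "true")
head(flights)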

Installing SparkR

夙愿已清 submitted on 2019-12-17 04:25:13
Question: I have the latest version of R, 3.2.1. Now I want to install SparkR. After I execute:

> install.packages("SparkR")

I get back:

Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘SparkR’ is not available (for R version 3.2.1)

I have also installed Spark 1.4.0 on my machine. How can I solve this problem?

Answer 1: You can install directly from a GitHub repository: if (!require('devtools')) install.packages(
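The answer above is cut off; a commonly shown variant of the GitHub route is sketched below. The repository tag and subdirectory are assumptions and should match the installed Spark version (1.4.0 here):

# Sketch: tag/subdir are assumptions; match them to your Spark version.
if (!require('devtools')) install.packages('devtools')
devtools::install_github('apache/spark@v1.4.0', subdir = 'R/pkg')
library(SparkR)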