SparkR

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

痴心易碎 submitted on 2019-12-21 05:34:12
Question: I read a parquet file from an HDFS system:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)
class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

collect(AppDF) fails with the error: arguments imply differing number of rows: 46021, 39175, 62744, 27137, and head(AppDF) fails with the error: arguments imply differing number of rows: 36,
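A workaround that is often suggested for this symptom (a sketch only, assuming the binary columns actually contain UTF-8 text) is to cast them to string on the Spark side before collecting:

# Sketch: assumes the binary columns hold UTF-8 encoded text.
# Cast each binary column to string before bringing data to the driver.
AppDF$app      <- cast(AppDF$app, "string")
AppDF$category <- cast(AppDF$category, "string")
AppDF$date     <- cast(AppDF$date, "string")
AppDF$user     <- cast(AppDF$user, "string")
localDF <- collect(AppDF)   # should now build a regular R data.frame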

SparkR shows Chinese characters incorrectly

送分小仙女□ submitted on 2019-12-20 05:36:10
Question: I am new to SparkR. Recently I encountered a problem: after converting a file containing Chinese characters into SparkR, they are no longer shown properly. Like this:

city = c("北京","上海","杭州")
A <- as.data.frame(city)
A
  city
1 北京
2 上海
3 杭州

Then I created a DataFrame in SparkR based on that and collected it out, and everything changed:

collect(createDataFrame(sqlContext, A))
  city
1 \027\xac
2 \nw
3 m\xde

I don't know how to transfer them back to readable Chinese characters, or even I hope I can get
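A first thing worth checking (a sketch only, assuming the local strings are not UTF-8 encoded to begin with, which is common on Windows or other non-UTF-8 locales) is to re-encode the column before building the SparkDataFrame:

# Sketch: re-encode the local column as UTF-8 before createDataFrame.
A$city <- enc2utf8(as.character(A$city))
Encoding(A$city)                          # should now report "UTF-8"
df <- createDataFrame(sqlContext, A)
collect(df)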

SparkR from RStudio gives Error in invokeJava(isStatic = TRUE, className, methodName, …) :

戏子无情 submitted on 2019-12-20 05:00:10
Question: I am using RStudio. After creating a session, if I try to create a DataFrame from R data it gives an error.

Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c
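For reference, a minimal Spark 2.x session setup on Windows might look like the sketch below; the paths and the sample data values are placeholders, and E:/winutils is assumed to contain bin/winutils.exe:

# Sketch: adjust paths to your installation; values below are illustrative.
Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")          # must contain bin/winutils.exe
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.warehouse.dir = "file:///C:/Temp"))
df <- as.DataFrame(data.frame(name = c("John", "Smith", "Sarah"),
                              age  = c(19, 23, 18)))
head(df)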

Duplicate columns in Spark Dataframe

徘徊边缘 submitted on 2019-12-19 07:23:30
Question: I have a 10GB csv file in a Hadoop cluster with duplicate columns. I try to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame:

df <- read.df(
  sqlContext,
  FILE_PATH,
  source = "com.databricks.spark.csv",
  header = "true",
  mode = "DROPMALFORMED"
)

But since df has duplicate Email columns, if I want to select this column, it errors out:

select(df, 'Email')
15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id,
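One way around the duplicate-name problem (a sketch; the column names and types below are illustrative, not the file's real schema) is to pass an explicit schema so each Email column gets a unique name:

# Sketch: column names/types are placeholders for the real CSV layout.
customSchema <- structType(
  structField("Name",   "string"),
  structField("Email1", "string"),
  structField("Email2", "string")
)
df <- read.df(sqlContext, FILE_PATH,
              source = "com.databricks.spark.csv",
              schema = customSchema,
              header = "true",
              mode   = "DROPMALFORMED")
head(select(df, "Email1"))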

SparkR: write DF as a csv/txt file

淺唱寂寞╮ submitted on 2019-12-19 04:21:20
Question: Hi, I'm working with SparkR in YARN mode. I need to write a SparkR DataFrame to a csv/txt file. I saw that there is write.df, but it writes parquet files. I tried this:

RdataFrame <- collect(SparkRDF)
write.table(RdataFrame, ..)

But I got many WARN and some ERROR messages from the ContextCleaner. Is there any way?

Answer 1: Spark 2.0+: You can use the write.text function: "Save the content of the SparkDataFrame in a text file at the specified path. The SparkDataFrame must have only one column of string type with the
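For Spark 2.0+, a minimal sketch of both options, writing straight from the SparkDataFrame instead of collecting to the driver; the column name and output paths are placeholders:

# Sketch: write directly from the SparkDataFrame; paths/column are illustrative.
write.text(select(SparkRDF, "some_string_column"), path = "hdfs:///out/txt")  # needs one string column
write.df(SparkRDF, path = "hdfs:///out/csv", source = "csv", mode = "overwrite")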

Difference between createOrReplaceTempView and registerTempTable

ⅰ亾dé卋堺 submitted on 2019-12-18 12:24:17
Question: I am new to Spark and was trying out a few commands in sparkSql using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of functionality.

Answer 1: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that
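The same distinction exists in SparkR; a minimal sketch (the view name and query are illustrative, and df is assumed to be an existing SparkDataFrame):

# Sketch: df is assumed to be an existing SparkDataFrame.
createOrReplaceTempView(df, "people")          # Spark 2.0+ API
# registerTempTable(df, "people")              # deprecated 1.x equivalent
adults <- sql("SELECT * FROM people WHERE age >= 18")
head(adults)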

How do I read a Parquet in R and convert it to an R DataFrame?

五迷三道 submitted on 2019-12-18 11:01:36
Question: I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: there are Java and C++ bindings: https://github.com/apache/parquet-mr

Answer 1: You can use the arrow package for this. It is the same thing as pyarrow in Python, but it nowadays also comes packaged for R without the need for Python. As it is not yet available on CRAN, you
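A minimal sketch of the arrow route (the file path is a placeholder); read_parquet returns an ordinary R data frame (tibble):

# Sketch: install arrow first; the path is illustrative.
library(arrow)
df <- read_parquet("part_2015.parquet")   # returns a regular R data frame
str(df)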

Using SparkR and Sparklyr simultaneously

时光怂恿深爱的人放手 submitted on 2019-12-18 08:48:02
Question: As far as I understand, these two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in scope of functionality. I therefore think that one currently needs to use both packages to get the full scope of functionality. As both packages essentially wrap references to Java instances of Scala classes, it should be possible to use the packages in parallel, I guess. But is it actually possible? What are your best practices?
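Loading both packages side by side does at least work at the R level, though each manages its own session; a sketch (name masking is the main practical issue, so calls are qualified explicitly; the master URL is a placeholder):

# Sketch: both packages export overlapping names (filter, count, select, ...),
# so qualify calls when they are attached together.
library(sparklyr)
library(SparkR)                                  # last attached wins unqualified names
sc <- sparklyr::spark_connect(master = "local")  # sparklyr-managed connection
SparkR::sparkR.session(master = "local")         # separate SparkR-managed session
df1 <- sparklyr::sdf_len(sc, 10)
df2 <- SparkR::createDataFrame(data.frame(x = 1:10))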

How to load csv file into SparkR on RStudio?

邮差的信 submitted on 2019-12-18 04:18:05
Question: How do you load a csv file into SparkR on RStudio? Below are the steps I had to perform to run SparkR on RStudio. I have used read.df to read the .csv; not sure how else to write this, or whether this step is considered to create RDDs.

# Set sys environment variables
Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3"
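For SparkR 1.x with the spark-csv package, a minimal sketch of the read step; the file path and package version below are placeholders to adapt:

# Sketch: set submit args before loading SparkR; path/version are illustrative.
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "C:/Users/Desktop/flights.csv",
                   source = "com.databricks.spark.csv",
                   header = "true", inferSchema = "true")
head(flights)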

Installing SparkR

夙愿已清 submitted on 2019-12-17 04:25:13
Question: I have the latest version of R, 3.2.1. Now I want to install SparkR. After I execute:

> install.packages("SparkR")

I get back:

Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘SparkR’ is not available (for R version 3.2.1)

I have also installed Spark 1.4.0 on my machine. How can I solve this problem?

Answer 1: You can install directly from a GitHub repository: if (!require('devtools')) install.packages(
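The answer above is cut off; a commonly shown variant of the GitHub route is sketched below. The repository tag and subdirectory are assumptions and should match the installed Spark version (1.4.0 here):

# Sketch: tag/subdir are assumptions; match them to your Spark version.
if (!require('devtools')) install.packages('devtools')
devtools::install_github('apache/spark@v1.4.0', subdir = 'R/pkg')
library(SparkR)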