SparkR

Remove column names in a DataFrame

这一生的挚爱 submitted on 2019-12-13 21:07:37
Question: In SparkR I have a DataFrame data. When I type head(data) I get this output:

      C0      C1               C2         C3
    1 id user_id foreign_model_id machine_id
    2  1    3145                4         12
    3  2    4079                1          8
    4  3    1174                7          1
    5  4    2386                9          9
    6  5    5524                1          7

I want to remove C0, C1, C2, C3 because they cause problems later on. For example, the filter call

    filter(data, data$machine_id == 1)

can't run because of this. I read the data like this:

    data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")

Answer 1: SparkR made
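A minimal sketch of the usual fix for this symptom, assuming the first CSV row holds the real column names (as the head() output suggests): pass the header option so spark-csv takes the names from row 1 instead of generating C0 through C3.

    # hedged sketch: header = "true" makes spark-csv use row 1 as column names
    # (assumes Spark 1.x with the spark-csv package)
    data <- read.df(sqlContext, "/home/ole/.../data",
                    source = "com.databricks.spark.csv", header = "true")
    filter(data, data$machine_id == 1)   # column names now come from the file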

SparkR Stage X contains a task of very large size

人盡茶涼 submitted on 2019-12-13 20:07:27
Question: I'm getting this warning when invoking createOrReplaceTempView with an R data frame:

    createOrReplaceTempView(as.DataFrame(products), "prod")

Should I ignore this warning? Is it a sign of inefficiency? Thanks!

Answer 1: Those are just warnings. If you want to try to avoid them, repartition your data and call an action on it before registering the temp table and executing functions on the data. The repartition causes a shuffle. For example:

    set.seed(123)
    df <- data.frame(thing1 = rnorm(100000), thing2
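A runnable sketch of the pattern the answer describes, assuming SparkR 2.x; the contents of thing2 and the partition count 8L are illustrative assumptions:

    set.seed(123)
    df <- data.frame(thing1 = rnorm(100000),
                     thing2 = runif(100000))      # thing2 values are assumed
    sdf <- as.DataFrame(df)
    sdf <- repartition(sdf, 8L)   # the shuffle spreads rows across 8 partitions
    cache(sdf)
    count(sdf)                    # an action that materializes the partitions
    createOrReplaceTempView(sdf, "prod")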

How to call SparkMLLib algorithms using R or SparkR?

﹥>﹥吖頭↗ submitted on 2019-12-13 12:12:13
Question: I am trying to use SparkR and R as a front end to develop machine learning models. I want to make use of Spark's MLlib, which works on distributed data frames. Is there any way to call Spark MLlib algorithms from R?

Answer 1: Unfortunately no. We will have to wait for Apache Spark 1.5 for SparkR-MLlib bindings.

Source: https://stackoverflow.com/questions/31309983/how-to-call-sparkmllib-algorithms-using-r-or-sparkr
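For reference, a minimal sketch of what those bindings look like once they landed, assuming Spark 2.x where spark.glm() is the SparkR entry point (Spark 1.5 exposed a similar glm() method on DataFrames):

    # hedged sketch, assuming an active SparkR 2.x session
    df <- as.DataFrame(iris)   # SparkR turns dots in column names into underscores
    model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
    summary(model)
    head(predict(model, df))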

R package which imports SparkR (not on CRAN)

匆匆过客 submitted on 2019-12-13 05:33:57
Question: This question is related to this one: when you are writing a package, how do you specify a dependency (in either Imports or Depends) on an existing R package that is not on CRAN? I am writing an R package that imports SparkR, which is not on CRAN anymore (it is delivered with Spark, in the R folder). I have tried adding the GitHub link http://github.com/apache/spark/tree/master/R/pkg in the Additional_repositories field of my DESCRIPTION file, with no luck, since the R CMD commands (install,
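Additional_repositories expects a CRAN-style repository index, which a GitHub tree URL is not. A hedged sketch of the devtools/remotes convention instead; the package name and version bound below are placeholders, and the Remotes field is honored by remotes::install_deps() rather than by CRAN or R CMD itself:

    Package: mypackage
    Imports:
        SparkR (>= 2.0.0)
    Remotes:
        github::apache/spark/R/pkg

With this in DESCRIPTION, remotes::install_deps(dependencies = TRUE) pulls SparkR from the R/pkg subdirectory of the apache/spark repository.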

How to subset SparkR data frame

别说谁变了你拦得住时间么 submitted on 2019-12-13 01:15:17
Question: Assume we have a dataset people which contains Id and Age as a 2-by-3 matrix:

    Id  =  1  2  3
    Age = 21 18 30

In SparkR I want to create a new dataset people2 which contains every Id whose Age is over 18; in this case that is Id 1 and 3. In SparkR I would do

    people2 <- people$Age > 18

but it does not work. How would you create the new dataset?

Answer 1: For those who appreciate R's multitude of options to do any given task, you can also use the SparkR::subset() function:

    > people <- createDataFrame
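A minimal sketch of both spellings, assuming a Spark 1.x sqlContext:

    # filter() and subset() both keep the rows where Age > 18
    people <- createDataFrame(sqlContext,
                              data.frame(Id = c(1, 2, 3), Age = c(21, 18, 30)))
    people2 <- filter(people, people$Age > 18)
    # equivalently: people2 <- subset(people, people$Age > 18)
    head(people2)   # the rows with Id 1 and 3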

Add a column full of NAs in SparkR

帅比萌擦擦* submitted on 2019-12-12 05:39:10
Question: How do I add a column full of NAs to a SparkR DataFrame? This doesn't work:

    > df <- data.frame(cola = 1:4)
    > sprkrDF <- createDataFrame(sqlContext, df)
    > sprkrDF$colb <- NA
    Error: class(value) == "Column" || is.null(value) is not TRUE

Thanks. NB: I want to add it directly to the SparkR DataFrame, so this is not the solution I'm looking for:

    > df <- data.frame(cola = 1:4, colb = NA)
    > sprkrDF <- createDataFrame(sqlContext, df)

Answer 1: We could use lit() to create a new column and fill it with
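A hedged completion of that idea: lit(NA) builds a literal Column of nulls, and cast() gives it a concrete type; the choice of "string" below is an assumption:

    # assign a typed null column via lit() + cast()
    sprkrDF$colb <- cast(lit(NA), "string")   # "string" is an assumed type
    head(sprkrDF)                             # colb is NA on every row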

Working with duplicated columns in SparkR

安稳与你 submitted on 2019-12-12 03:55:22
Question: I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR. I need to infer the schema whenever I can (to detect integers and so on), and I have to assume that I can't hard-code the schema: the number of columns in each file is unknown, and the schema can't be inferred from the column names alone. I can't infer the schema from a CSV file with a duplicated header value; it simply won't let you. I load them like so:

    df1 <- read.df(sqlContext, file, "com.databricks.spark
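A hedged sketch of that load with spark-csv's inference options spelled out; falling back to header = "false" sidesteps duplicated header names, since spark-csv then generates unique C0, C1, ... names (with the header row landing in the data). Both option settings are assumptions about the intended call:

    # assuming Spark 1.x with the spark-csv package
    df1 <- read.df(sqlContext, file, source = "com.databricks.spark.csv",
                   header = "true", inferSchema = "true")
    # if duplicated header values break the read, use generated names instead:
    # df1 <- read.df(sqlContext, file, source = "com.databricks.spark.csv",
    #                header = "false", inferSchema = "true")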

Loading data from on-premises hdfs to local SparkR

两盒软妹~` submitted on 2019-12-12 02:45:48
Question: I'm trying to load data from an on-premises HDFS into RStudio with SparkR. When I do this:

    df_hadoop <- read.df(sqlContext, "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                         source = "com.databricks.spark.csv")

and then this:

    str(df_hadoop)

I get this:

    Formal class 'DataFrame' [package "SparkR"] with 2 slots
      ..@ env: <environment: 0x000000000xxxxxxx>
      ..@ sdf: Class 'jobj' <environment: 0x000000000xxxxxx>

This is, however, not the df I'm looking for, because there are 13 fields in the CSV I
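A SparkR DataFrame is only a handle to data that stays distributed on the cluster, so str() shows the R wrapper around the backing Java object rather than the rows; a hedged sketch of how to inspect the actual contents:

    printSchema(df_hadoop)           # should list the 13 fields if parsing worked
    head(df_hadoop)                  # the first rows, as a local R data.frame
    local_df <- collect(df_hadoop)   # pulls all rows to the driver; use with care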

Loading a SparkR data frame into Hive

爱⌒轻易说出口 submitted on 2019-12-12 02:21:04
Question: I need the DataFrame created in SparkR to be loaded into Hive.

    # created a dataframe df_test
    df_test <- createDataFrame(sqlContext, data.frame(mon = c(1,2,3,4,5),
                                                      year = c(2011,2012,2013,2014,2015)))
    # initialized the Hive context
    > sc <- sparkR.init()
    > hiveContext <- sparkRHive.init(sc)
    # used saveAsTable to save the dataframe "df_test" in a Hive table named "table_hive"
    > saveAsTable(df_test, "table_hive")
    16/08/24 23:08:36 ERROR RBackendHandler: saveAsTable on 13 failed
    Error in
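A hedged sketch of a likely fix: df_test was built against the plain sqlContext, so saveAsTable has no Hive support behind it; creating the DataFrame against the Hive context should let the save go through (the exact sequence below is an assumption based on the error shown):

    sc <- sparkR.init()
    hiveContext <- sparkRHive.init(sc)
    df_test <- createDataFrame(hiveContext,
                               data.frame(mon  = c(1, 2, 3, 4, 5),
                                          year = c(2011, 2012, 2013, 2014, 2015)))
    saveAsTable(df_test, "table_hive")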

SparkR - cast to date format

筅森魡賤 submitted on 2019-12-11 15:55:49
Question: How do I cast a string to a date with a specific format for a Spark dataframe? In dplyr I would do this:

    df = data.frame(dt1 = c("22DEC16", "12JUN16"), x = c(10, 20))
    df = df %>% mutate(dt2 = as.Date(dt1, "%d%b%y"))
    > df
          dt1  x        dt2
    1 22DEC16 10 2016-12-22
    2 12JUN16 20 2016-06-12

Answer 1: In Spark 2.2 or later:

    library(magrittr)
    df <- createDataFrame(data.frame(dt = c("22DEC16", "12JUN16")))
    df %>% withColumn("parsed", to_date(.$dt, "ddMMMyy")) %>% head()
           dt     parsed
    1 22DEC16 2016-12-22
    2 12JUN16 2016-06-12
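For Spark versions before 2.2, where to_date() takes no format argument, a hedged sketch of the usual workaround: parse with unix_timestamp() and cast the resulting timestamp down to a date.

    # pre-2.2 workaround, assuming df from the answer above
    df$parsed <- cast(from_unixtime(unix_timestamp(df$dt, "ddMMMyy")), "date")
    head(df)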