SparkR

Unnest (separate) multiple column values into new rows using sparklyr

Question: I am trying to split column values separated by commas into new rows based on id. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.

id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id, name, value)

R solution:

separate_rows(dt, name, sep=",") %>% separate_rows(value, sep=",")
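A sparklyr approach that stays distributed is to push Hive's split() and explode() through dplyr::mutate, since sparklyr forwards functions it does not recognize to Spark SQL. A minimal sketch, assuming a local connection (the connection settings and table name are placeholders):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
dt_tbl <- copy_to(sc, dt, "dt", overwrite = TRUE)

# split() turns a comma-separated string into an array and explode()
# emits one row per element; Spark allows only one generator per
# projection, hence two separate mutate() calls.
dt_tbl %>%
  mutate(name = explode(split(name, ","))) %>%
  mutate(value = explode(split(value, ",")))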

Using apply functions in SparkR

Question: I am currently trying to implement some functions using SparkR version 1.5.1. I have seen older (version 1.3) examples where people used the apply function on DataFrames, but it looks like this is no longer directly available. Example:

x = c(1,2)
xDF_R = data.frame(x)
colnames(xDF_R) = c("number")
xDF_S = createDataFrame(sqlContext, xDF_R)

Now I can use the function sapply on the data.frame object:

xDF_R$result = sapply(xDF_R$number, ppois, q=10)

When I use a similar logic on the DataFrame
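For reference, row-wise R UDFs returned in Spark 2.0 as dapply() and gapply(); they do not exist in 1.5.1, where the usual workaround is the built-in column functions or SQL. A sketch of the dapply route, assuming SparkR >= 2.0 (the schema field names are illustrative):

result <- dapply(xDF_S,
                 function(pdf) {
                   # pdf arrives as an ordinary local data.frame, one per partition
                   pdf$result <- sapply(pdf$number, ppois, q = 10)
                   pdf
                 },
                 schema = structType(structField("number", "double"),
                                     structField("result", "double")))
head(result)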

java.lang.OutOfMemoryError: Java heap space when SparkR collect

Question: My collected data is 1.3 GB and every driver-memory setting is set to 3 GB. Why does the OutOfMemoryError still happen? This is my SparkR configuration in detail, followed by the OOM exception message:

spark.default.confs = list(spark.cores.max="8",
                           spark.executor.memory="15g",
                           spark.driver.maxResultSize="3g",
                           spark.driver.memory="3g",
                           spark.driver.extraJavaOptions="-Xms3g")
sc <- sparkR.init(master="spark://10.58.70.155:7077", sparkEnvir = spark.default.confs)

ERROR RBackendHandler: dfToCols
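A common pitfall worth checking (an assumption, not a confirmed diagnosis of this case): spark.driver.memory and the -Xms/-Xmx Java options are read when the driver JVM launches, so passing them to sparkR.init() from an already-running R session has no effect. In client mode they have to be supplied at launch time, for example (the 8g figure is only illustrative):

bin/sparkR --driver-memory 8g --master spark://10.58.70.155:7077

Note also that collect() needs headroom beyond the raw result size for deserialization and copying, so a heap only slightly larger than the 1.3 GB result can still overflow.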

Convert string to date in SparkR

I have this data.frame in SparkR:

df <- data.frame(user_id=c(1,1,2,2), time=c("2015-7-10","2015-8-04","2015-8-8","2015-7-10"))

I turn this into a DataFrame:

dft <- createDataFrame(sqlContext, df)

I want to convert the date (which is now a string) to a 'date' type. I use the cast function:

dft$time <- cast(dft$time, 'date')

But now when I use head(dft) I can see that 'time' contains only NA. Maybe something should be added to the cast call, or maybe a package should be loaded before using it? Alternatively one could use as.Date on the local data.frame, but that takes time for large data.
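A likely cause is the non-zero-padded strings ("2015-7-10" rather than "2015-07-10"), which the plain 'date' cast cannot parse. A sketch of a workaround, assuming SparkR >= 1.6 where unix_timestamp() accepts a Java SimpleDateFormat pattern:

# "yyyy-M-d" tolerates single-digit month and day values
dft$time <- cast(cast(unix_timestamp(dft$time, "yyyy-M-d"), "timestamp"), "date")
head(dft)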

How to do cross-validation in SparkR

Question: I am working with the MovieLens dataset. I have a matrix (m x n) with user ids as rows and movie ids as columns, and I have applied dimensionality-reduction and matrix-factorization techniques to shrink my sparse matrix (to m x k, where k < n). I want to evaluate the performance using the k-nearest-neighbor algorithm (my own code, not a library). I am using SparkR 1.6.2. I don't know how to split my dataset into training data and test data in SparkR. I have tried native R functions (sample, subset, caret) but it is
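A distributed train/test split can be done with DataFrame methods that SparkR 1.6 already ships (randomSplit() only arrived in Spark 2.0), rather than with base R functions that operate on local data. A minimal sketch; df, the fraction, and the seed are placeholders:

train <- sample(df, withReplacement = FALSE, fraction = 0.8, seed = 42)
test  <- except(df, train)   # the rows of df that were not drawn into train
count(train); count(test)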

Job fails on loading com.databricks.spark.csv in SparkR shell

Question: When I open the SparkR shell as below, I am able to run jobs successfully:

bin/sparkR
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
> df
DataFrame[name:string, age:double]

Whereas when I include the spark-csv package while loading the SparkR shell, the job fails:

bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
> rdf = data.frame(name =c("a", "b"),
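As an aside (an assumption, not a verified fix for this exact failure): the _2.10 suffix must match the Scala build of the Spark distribution, and a later spark-csv release such as com.databricks:spark-csv_2.10:1.2.0 may load where 1.0.3 does not. Once the package does load, a csv is read like this (the path is a placeholder):

df <- read.df(sqlContext, "hdfs:///path/to/file.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")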

Options to read large files (pure text, xml, json, csv) from HDFS in RStudio with SparkR 1.5

Question: I am new to Spark and would like to know whether there are options other than those below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (plain text, csv, json, xml, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities for reading such data besides the read.df function? The following code uses read.df and
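For the formats with built-in data sources, read.df covers them by naming the source. A sketch; the namenode address and paths are placeholders, and csv assumes the spark-csv package is on the classpath in Spark 1.5:

json_df <- read.df(sqlContext, "hdfs://namenode:8020/data/logs.json", source = "json")
pq_df   <- read.df(sqlContext, "hdfs://namenode:8020/data/table.parquet", source = "parquet")
csv_df  <- read.df(sqlContext, "hdfs://namenode:8020/data/table.csv",
                   source = "com.databricks.spark.csv", header = "true")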

Big Data Fundamentals: The R Language (answers to the exercises in Liu Peng's Big Data)

1. Is R an interpreted language or a compiled language?
Interpreted.

2. Briefly describe the basic capabilities of R.
R is a complete software system for data handling, computation, and graphics. Its main capabilities are:
(1) data storage and processing, with rich facilities for reading, writing, and transforming data;
(2) tools for array and matrix computation;
(3) a complete and coherent set of statistical analysis tools;
(4) excellent statistical graphics.

3. In which fields is R commonly used?
Artificial intelligence, statistical analysis, applied mathematics, econometrics, financial analysis, bioinformatics, data visualization, data mining, and so on.

4. What classification and prediction algorithms are commonly used in R?
(1) k-nearest neighbors: if most of the k samples most similar to a given sample (its nearest neighbors in feature space) belong to one class, the sample is assigned to that class as well (see the sketch after this list).
(2) Decision trees: prediction trees built on classified training data that use what is already known to classify and predict future cases.
(3) Support vector machines: a binary classification method, i.e. one that separates the data in a dataset into two classes.

5. Briefly describe how to use R packages for data analysis, modeling, and prediction.
Load the dataset -> analyze the data in the dataset -> handle invalid data -> build the prediction model -> evaluate and select a model -> predict for the actual requirement -> deliver the prediction the application needs.

6. How can "clustering" and "classification" be used to group data samples?
Both clustering and classification can automatically derive a general description of given data from historical records and thereby predict future data. The difference is that a "classification" algorithm needs labeled training samples to build a classifier
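As referenced in answer 4, a minimal k-nearest-neighbor illustration in R, assuming the class package and using the built-in iris data:

library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)            # random 100-row training split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])            # accuracy on the held-out rows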

How do I apply a function on each value of a column in a SparkR DataFrame?

Question: I am relatively new to SparkR. I downloaded Spark 1.4 and set up RStudio to use the SparkR library. However, I want to know how I can apply a function to each value in a column of a distributed DataFrame; can someone please help? For example, this works perfectly on local data:

myFunc <- function(x) { paste(x, "_hello") }
c <- c("a", "b", "c")
d <- lapply(c, myFunc)

How do I make this work for a distributed DataFrame? The intention is to append "_hello" to each value of column Name of DF.

DF <- read.df(sqlContext,
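A sketch that avoids shipping an R closure to the cluster by using a Spark SQL expression instead; selectExpr is available in SparkR 1.4, and the column name Name is taken from the question:

# "*" keeps the existing columns; the concat expression adds the suffixed copy
DF2 <- selectExpr(DF, "*", "concat(Name, '_hello') AS NameHello")
head(DF2)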