SparkR

SparkR collect method crashes with OutOfMemory on Java heap space

Submitted by 可紊 on 2019-12-05 21:07:02
With SparkR, I'm trying, as a PoC, to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, was deployed with bdutil, and consists of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use:
Sys.setenv("SPARK_MEM" = "1g")
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <-
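
One way to keep the proof of concept moving without blowing up the 1 GB driver heap is to pull only a bounded number of lines to the driver instead of collecting all ~4M at once. A minimal sketch, assuming the same SparkR 1.x RDD API (textFile/take) that the script above already uses; the 1000-line limit is an arbitrary placeholder:
lines <- textFile(sc, "gs://xxxx/dir/")
sample_lines <- take(lines, 1000)   # only 1000 lines are shipped to the driver
length(sample_lines)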

Convert date to end of month in Spark

Submitted by 久未见 on 2019-12-05 16:59:30
I have a Spark DataFrame as shown below:
# Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)
# Make sure df$dates column is in 'date' format
df <- withColumn(df, 'dates', cast(df$dates, 'date'))
name    | dates
--------------------
Thomas  | 2017-01-05
William | 2017-02-23
Bill    | 2017-03-16
John    | 2017-04-08
I want to change each date to the corresponding end-of-month date, so the result would look like the table below. How do I do this? Either SparkR or PySpark code is fine.
name    | dates
--------------------
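
On the SparkR side, Spark SQL's last_day() function is exposed as a column function and performs exactly this mapping; a minimal sketch building on the df created above (it assumes a SparkR version where withColumn can replace an existing column, as the cast above already does):
df <- withColumn(df, 'dates', last_day(df$dates))   # e.g. 2017-01-05 -> 2017-01-31
showDF(df)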

Unable to launch SparkR in RStudio

Submitted by a 夏天 on 2019-12-05 13:38:23
Question: After a long and difficult SparkR installation process, I am running into new problems launching SparkR. My settings: R 3.2.0, RStudio 0.98.1103, Rtools 3.3, Spark 1.4.0, Java version 8, SparkR 1.4.0, Windows 7 SP1 64-bit. Now I try to use the following code in R:
library(devtools)
library(SparkR)
Sys.setenv(SPARK_MEM="1g")
Sys.setenv(SPARK_HOME="C:/spark-1.4.0")
sc <- sparkR.init(master="local")
I receive the following: JVM is not ready after 10 seconds. I was also trying to add some system variables like
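
The "JVM is not ready after 10 seconds" message typically indicates that the backend JVM launched by spark-submit never came up, so it can help to sanity-check SPARK_HOME and to load the SparkR package that ships inside the Spark installation rather than a separately installed copy. A sketch under those assumptions, not a guaranteed fix:
Sys.setenv(SPARK_HOME = "C:/spark-1.4.0")
# should be TRUE; if not, SPARK_HOME does not point at a Spark installation
file.exists(file.path(Sys.getenv("SPARK_HOME"), "bin", "spark-submit.cmd"))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sc <- sparkR.init(master = "local")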

Unnest (separate) multiple column values into new rows using sparklyr

Submitted by *爱你&永不变心* on 2019-12-05 12:53:04
I am trying to split column values separated by commas (,) into new rows based on ids. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id, name, value)
R solution:
separate_rows(dt, name, sep=",") %>% separate_rows(value, sep=",")
Desired output from the Spark frame (sparklyr package):
> final_result
  id name value
1  1    A     1
2  1    A     2
3  1    A
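
One sparklyr-side route is to lean on Spark SQL's split() and explode(), which dplyr's translation passes straight through to Spark; a sketch assuming dt has been copied into Spark (sc here is a sparklyr connection, and dt_tbl is a name chosen for illustration):
library(sparklyr)
library(dplyr)
dt_tbl <- sdf_copy_to(sc, dt, overwrite = TRUE)
final_result <- dt_tbl %>%
  mutate(name = explode(split(name, ","))) %>%    # one row per name
  mutate(value = explode(split(value, ",")))      # then one row per value
final_result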

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

Submitted by 余生颓废 on 2019-12-05 10:54:16
I am new to Spark and would like to know whether there are options other than the ones below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (plain text, CSV, JSON, XML, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other ways to read such data besides the read.df function? The following code uses read.df and jsonFile, but jsonFile produces an error:
Sys.setenv(SPARK_HOME = "C:\\Users\\--\\Downloads\\spark-1.5.0
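
For comparison, a minimal read.df sketch against SparkR 1.5 (the paths are placeholders, and CSV in Spark 1.x assumes the external com.databricks:spark-csv package has been added to the classpath):
sqlContext <- sparkRSQL.init(sc)
people <- read.df(sqlContext, "hdfs:///path/to/people.json", source = "json")
cars   <- read.df(sqlContext, "hdfs:///path/to/cars.csv",
                  source = "com.databricks.spark.csv", header = "true")
head(people)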

[2019 好程序员 Big Data Tutorial] Spark GraphX from Beginner to Expert (33 video episodes + source code + notes)

Submitted by 懵懂的女人 on 2019-12-05 07:32:22
1. What is Spark GraphX? Spark GraphX is a distributed graph-processing framework. In a social network, users are linked by intricate relationships: the friend and follower relations between users of WeChat, QQ, and Weibo form a huge graph that a single machine cannot handle, so a distributed graph-processing framework is required, and Spark GraphX is one such framework.
2. Advantages of Spark GraphX: Compared with other distributed graph-computing frameworks, GraphX's biggest contribution, and the reason most developers like it, is that it offers a one-stop solution on top of Spark, so a complete graph-computing pipeline can be built conveniently and efficiently. In practice, the core module can be used for cleansing and analysing massive data, the SQL module connects to the data warehouse, Streaming provides a real-time stream-processing channel, GraphX graph algorithms compute the complex business relationships among web pages, and finally MLlib and SparkR handle the data-mining algorithms.
The overall architecture of Spark GraphX:
(1) Storage and primitive layer: the Graph class is the core class of graph computation; internally it holds VertexRDD, EdgeRDD, and RDD instances. GraphImpl is a subclass of Graph that implements the graph operations.
(2) Interface layer: implements the Pregel model, a BSP-style computation interface, on top of the underlying RDDs.
(3) Algorithm layer: common graph algorithms implemented on top of the Pregel interface, including PageRank, SVDPlusPlus, TriangleCount

Using apply functions in SparkR

Submitted by 笑着哭i on 2019-12-05 05:44:27
I am currently trying to implement some functions using SparkR version 1.5.1. I have seen older (version 1.3) examples where people used the apply function on DataFrames, but it looks like this is no longer directly available. Example:
x = c(1,2)
xDF_R = data.frame(x)
colnames(xDF_R) = c("number")
xDF_S = createDataFrame(sqlContext, xDF_R)
Now I can use the function sapply on the data.frame object:
xDF_R$result = sapply(xDF_R$number, ppois, q=10)
When I use similar logic on the DataFrame:
xDF_S$result = sapply(xDF_S$number, ppois, q=10)
I get the error message "Error in as.list.default(X) :
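
For reference, later SparkR releases (2.0 and up) added dapply()/gapply() for exactly this kind of partition-wise R function; a sketch of what the ppois example could look like there (not available in 1.5.1, and the schema below is an assumption matching the columns above):
schema <- structType(structField("number", "double"),
                     structField("result", "double"))
xDF_out <- dapply(xDF_S, function(pdf) {
  pdf$result <- sapply(pdf$number, ppois, q = 10)  # same call as the local sapply
  pdf
}, schema)
head(collect(xDF_out))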

Install SparkR that comes with Spark 1.4

Submitted by 笑着哭i on 2019-12-05 04:08:39
Question: The newest version of Spark (1.4) now comes with SparkR. Does anyone know how to go about installing the SparkR implementation on Windows? The sparkR.R script is currently located in C:/spark-1.4.0/R/pkgs/R/. This appears to be a step in the right direction, but the instructions don't work for Windows, as there is no sparkR directory that they refer to.
Answer 1: @DavidArenburg put me on the right track. Following the Windows documentation in C:\spark-1.4.0\R\WINDOWS.md, I installed RTools and
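
Once install-dev.bat from that documentation has built the SparkR package (by default into C:\spark-1.4.0\R\lib, an assumption based on the install-dev workflow), loading it is just a matter of pointing the R library path there; a short sketch of that final step:
.libPaths(c("C:/spark-1.4.0/R/lib", .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")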

java.lang.OutOfMemoryError: Java heap space when SparkR collect

Submitted by 与世无争的帅哥 on 2019-12-04 20:17:31
The data I collect is 1.3 GB in size and every driver-memory setting is set to 3 GB. Why is the out-of-memory error still happening? Here is my detailed SparkR configuration and the OOM exception message:
spark.default.confs=list(spark.cores.max="8", spark.executor.memory="15g",
                         spark.driver.maxResultSize="3g", spark.driver.memory="3g",
                         spark.driver.extraJavaOptions="-Xms3g")
sc <- sparkR.init(master="spark://10.58.70.155:7077", sparkEnvir = spark.default.confs)
ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.reflect.InvocationTargetException at sun
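
One detail worth checking (a sketch, and the 6g figure is an arbitrary placeholder): in client mode the driver JVM is already running by the time the sparkEnvir values reach it, so spark.driver.memory and -Xms3g set this way may never be applied to the driver heap. A commonly cited workaround is to pass the driver memory through SPARKR_SUBMIT_ARGS before calling sparkR.init():
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--driver-memory 6g sparkr-shell")
sc <- sparkR.init(master = "spark://10.58.70.155:7077",
                  sparkEnvir = list(spark.executor.memory = "15g",
                                    spark.driver.maxResultSize = "6g"))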

Extracting Class Probabilities from SparkR ML Classification Functions

Submitted by 我们两清 on 2019-12-04 10:06:51
I'm wondering whether it is possible (using the built-in features of SparkR or any other workaround) to extract the class probabilities from some of the classification algorithms included in SparkR. The ones of particular interest are:
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence". I've seen several other questions that are similar to this topic, but none that are specific to SparkR, and many have not been answered with regard to Spark's most recent updates.
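
A pragmatic first step, before reaching for workarounds, is to inspect which columns predict() actually returns, since the availability of probability-style columns varies by model and Spark version. A sketch using hypothetical training/testing SparkDataFrames and a "label" column (names chosen for illustration):
model <- spark.randomForest(training, label ~ ., type = "classification")
pred  <- predict(model, testing)
printSchema(pred)                    # look for probability / rawPrediction columns
head(select(pred, "prediction"))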