sparkr

SparkR 2.0 classification: how to get performance metrics?

Question: How can I get performance metrics for a SparkR classification model, e.g., F1 score, precision, recall, and a confusion matrix?

    # Load training data
    df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
    training <- df
    testing <- df

    # Fit a random forest classification model with spark.randomForest
    model <- spark.randomForest(training, label ~ features, "classification",
                                numTrees = 10)

    # Model summary
    summary(model)

    # Prediction
    predictions <- predict(model, testing)
    head(predictions)
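One approach is to collect the predicted and actual labels to the driver (fine here, since the test set is the small libsvm sample) and compute the metrics in base R. This is a minimal sketch, assuming the predicted class lands in a column named prediction next to the original label column:

    # Pull actual and predicted labels back to R; coerce them to a common type
    # if the prediction column comes back as character
    preds_local <- collect(select(predictions, "label", "prediction"))

    # Confusion matrix (rows = actual, columns = predicted)
    cm <- table(actual = preds_local$label, predicted = preds_local$prediction)

    # Per-class precision, recall and F1 derived from the confusion matrix
    precision <- diag(cm) / colSums(cm)
    recall    <- diag(cm) / rowSums(cm)
    f1        <- 2 * precision * recall / (precision + recall)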

Why isn't SparkR available in the CRAN package list? [duplicate]

Question: This question already has answers here: Installing of SparkR (4 answers). Closed 4 years ago.

I checked for the SparkR package in the CRAN package list via the following link: https://cran.r-project.org/web/packages/available_packages_by_date.html. The list does not include SparkR, so it cannot be installed with install.packages("package_name"). Why isn't SparkR listed in the package list?

Answer 1: Since Spark 1.4, SparkR is not a separate package anymore but has been bundled with the Apache Spark distribution itself.
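Because SparkR ships with Spark rather than with CRAN, it is loaded from the Spark installation's own R library. A minimal sketch, with a placeholder SPARK_HOME path:

    # Point R at the SparkR package bundled with a Spark installation
    # (the path below is a placeholder; use your own Spark directory)
    Sys.setenv(SPARK_HOME = "/opt/spark")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Start a session (SparkR 2.x API)
    sparkR.session(master = "local[*]")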

How to identify repeated occurrences of a string column in Hive?

Question: I have a view like this in Hive:

    id         sequencenumber  appname
    242539622  1               A
    242539622  2               A
    242539622  3               A
    242539622  4               B
    242539622  5               B
    242539622  6               C
    242539622  7               D
    242539622  8               D
    242539622  9               D
    242539622  10              B
    242539622  11              B
    242539622  12              D
    242539622  13              D
    242539622  14              F

I'd like to have, for each id, the following view:

    id         sequencenumber  appname  appname_c
    242539622  1               A        A
    242539622  2               A        A
    242539622  3               A        A
    242539622  4               B        B_1
    242539622  5               B        B_1
    242539622  6               C        C
    242539622  7               D        D_1
    242539622  8               D        D_1
    242539622  9               D        ...
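The desired appname_c, where a name gets a run index only when it recurs in separate consecutive blocks, is a classic gaps-and-islands problem. The sketch below, run here through SparkR's sql() with my_view as a placeholder table name, is an assumption about the intended logic rather than code from the question:

    result <- sql("
      SELECT id, sequencenumber, appname,
             -- suffix only the names that occur in more than one run
             CASE WHEN MAX(island) OVER (PARTITION BY id, appname) > 1
                  THEN CONCAT(appname, '_', CAST(island AS STRING))
                  ELSE appname
             END AS appname_c
      FROM (
        SELECT id, sequencenumber, appname,
               -- number each consecutive run of the same appname within an id
               DENSE_RANK() OVER (PARTITION BY id, appname ORDER BY grp) AS island
        FROM (
          SELECT id, sequencenumber, appname,
                 sequencenumber -
                 ROW_NUMBER() OVER (PARTITION BY id, appname ORDER BY sequencenumber) AS grp
          FROM my_view
        ) runs
      ) islands
      ORDER BY id, sequencenumber
    ")
    head(result, 20)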

SQL, sparklyr, and SparkR data frame conversions on Databricks

Question: I have a SQL table on Databricks created using the following code:

    %sql
    CREATE TABLE data
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "url/data.csv"

The following code converts that table to a SparkR DataFrame and an R data frame, respectively:

    %r
    library(SparkR)
    data_spark <- sql("SELECT * FROM data")
    data_r_df <- as.data.frame(data_spark)

But how should I convert any or all of these data frames into a sparklyr data frame to leverage sparklyr's parallelization?

Answer 1: Just sc...
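Since the table data is already registered in the metastore, sparklyr can reference it directly through its own connection instead of converting the SparkR object. A rough sketch, assuming a Databricks notebook (the spark_connect() arguments are an assumption):

    %r
    library(sparklyr)
    library(dplyr)

    # Attach sparklyr to the cluster's existing Spark session on Databricks
    sc <- spark_connect(method = "databricks")

    # Lazy sparklyr reference to the same metastore table
    data_sparklyr <- tbl(sc, "data")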

How to bind two DataFrame columns in SparkR?

Question: How can I bind two columns of a DataFrame in SparkR with Spark 1.4? TIA, Arun

Answer 1: There is no direct way to do this. Here is a similar question on Spark (1.3) in Scala. The only way to do it is to add some kind of row numbering, because then you are able to join on the row number. Why? Because you can only join tables, or add columns, based on other already existing columns.

    data1 <- createDataFrame(sqlContext, data.frame(a = c(1, 2, 3)))
    data2 <- createDataFrame(sqlContext, data.frame(b = c(2, 3, 4)))

Then...
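A sketch of that row-number-and-join idea, using the newer SparkR API (monotonically_increasing_id() did not exist in 1.4, and the generated ids only line up when both DataFrames share the same simple partitioning, so treat this as an illustration of the pattern, not a general cbind):

    # Add a synthetic row id to each DataFrame, join on it, keep the original columns
    data1 <- withColumn(data1, "row_id", monotonically_increasing_id())
    data2 <- withColumn(data2, "row_id", monotonically_increasing_id())

    joined   <- join(data1, data2, data1$row_id == data2$row_id)
    combined <- select(joined, joined$a, joined$b)
    head(combined)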

SparkR collect method crashes with OutOfMemory on Java heap space

Question: With SparkR, I'm working on a PoC to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, is deployed with bdutil, and consists of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use:

    Sys.setenv("SPARK_MEM" = "1g")
    sc <- sparkR.init("spark:/...
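With SPARK_MEM at 1g the driver heap is small relative to the data being collected, so the usual first steps are to give the driver and executors more memory and to avoid collect() on the full dataset. A rough sketch with placeholder values; note that in client mode the driver's heap generally has to be set before the JVM starts (e.g. via spark-submit --driver-memory), so the settings below may not be sufficient on their own:

    # Placeholder master URL and memory sizes; adjust to the cluster
    Sys.setenv("SPARK_MEM" = "6g")
    sc <- sparkR.init(
      master = "spark://spark-master:7077",
      sparkEnvir = list(spark.executor.memory = "4g")
    )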

R SparkR - equivalent to melt function

Question: Is there a function similar to melt in the SparkR library? I want to transform data with 1 row and 50 columns to 50 rows and 3 columns.

Answer 1: There is no built-in function that provides similar functionality in SparkR. You can build your own with explode:

    library(magrittr)

    df <- createDataFrame(data.frame(
      A = c('a', 'b', 'c'),
      B = c(1, 3, 5),
      C = c(2, 4, 6)
    ))

    melt <- function(df, id.vars, measure.vars,
                     variable.name = "key", value.name = "value") {
      measure.vars.exploded <- purrr::map(
        measure.vars, ...
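As an alternative sketch (not the original answer's code), the same wide-to-long reshape can be done with Spark SQL's stack() through selectExpr, assuming the toy columns A, B and C from the example above:

    # Keep A as the id column and unpivot B and C into key/value pairs
    long <- selectExpr(df, "A",
                       "stack(2, 'B', B, 'C', C) as (key, value)")
    head(long)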

Sparklyr: how to calculate correlation coefficient between 2 Spark tables?

Question: I have these 2 Spark tables:

    simx
      x0:   num 1.00 2.00 3.00 ...
      x1:   num 2.00 3.00 4.00 ...
      ...
      x788: num 2.00 3.00 4.00 ...

and

    simy
      y0: num 1.00 2.00 3.00 ...

In both tables, each column has the same number of values. Tables x and y are saved in the handles simX_tbl and simY_tbl, respectively. The actual data size is quite big and may reach 40 GB. I want to calculate the correlation coefficient of each column in simx with simy (say, like cor(x0, y0, 'pearson')). I searched everywhere and I...
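One way to do this without collecting 40 GB into R is to bind the two tables column-wise and let Spark's corr aggregate do the work. A rough sparklyr sketch, assuming an existing connection behind simX_tbl and simY_tbl and a single y column named y0 as above:

    library(sparklyr)
    library(dplyr)

    # Column-bind the two Spark tables (their row order must correspond)
    combined <- sdf_bind_cols(simX_tbl, simY_tbl)

    # Pearson correlation of every x column against y0, computed inside Spark;
    # corr() is translated to Spark SQL's CORR aggregate
    x_cols <- colnames(simX_tbl)
    cors <- sapply(x_cols, function(col) {
      combined %>%
        summarise(r = corr(!!rlang::sym(col), y0)) %>%
        pull(r)
    })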

How best to handle converting a large local data frame to a SparkR data frame?

Question: How can I convert a large local data frame to a SparkR DataFrame efficiently? On my local dev machine, a ~650 MB local data frame quickly exceeds available memory when I try to convert it to a SparkR DataFrame, even though the machine has 40 GB of RAM.

    library(reshape2)

    years  <- sample(1:10, 100, replace = T)
    storms <- sample(1:10, 100, replace = T)
    wind_speeds <- matrix(ncol = 316387, nrow = 100,
                          data = sample(0:250, 31638700, replace = T))

    df <- data.frame(year = years, storm = storms, ws = ...
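One workaround sketch (not from the question): createDataFrame() serializes the entire local object through the R driver, so for data this wide it is often cheaper to write the local data frame to disk and let Spark read it back. The path and options below are placeholders, using the SparkR 2.x CSV reader:

    # Write the local data frame out, then read it with Spark's CSV reader
    write.csv(df, "/tmp/wind_speeds.csv", row.names = FALSE)
    sdf <- read.df("/tmp/wind_speeds.csv", source = "csv",
                   header = "true", inferSchema = "true")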

SparkR: dplyr-style split-apply-combine on DataFrame

Question: Under the previous RDD paradigm, I could specify a key and then map an operation to RDD elements corresponding to each key. I don't see a clear way to do this with DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

    new.df <- old.df %>%
      group_by("column1") %>%
      do(myfunc(.))

I currently have a large SparkR DataFrame of the form:

    timestamp              value  id
    2015-09-01 05:00:00.0  1.132  24
    2015-09-01 05:10:00.0  null   24
    2015-09-01 05:20:00.0  1.129  24
    2015-09-01 05...
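In later SparkR releases (2.0+), gapply() provides exactly this split-apply-combine pattern; it did not exist in 1.5.1. A sketch for the newer API, with a placeholder per-group aggregation and output schema standing in for myfunc:

    # Apply an R function to each id group and return one row per group
    schema <- structType(structField("id", "integer"),
                         structField("mean_value", "double"))

    result <- gapply(old.df, "id",
                     function(key, x) {
                       data.frame(id = key[[1]],
                                  mean_value = mean(x$value, na.rm = TRUE))
                     },
                     schema)
    head(result)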