sparkr

SparkR 2.0 classification: how to get performance metrics?

Question: How can I get performance metrics for a SparkR classification model, e.g., F1 score, precision, recall, and a confusion matrix?

    # Load training data
    df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
    training <- df
    testing <- df

    # Fit a random forest classification model with spark.randomForest
    model <- spark.randomForest(training, label ~ features, "classification",
                                numTrees = 10)

    # Model summary
    summary(model)

    # Prediction
    predictions <- predict(model, testing)
    head(predictions)
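One approach is to collect the predicted and actual labels to the driver (fine here, since the test set is the small libsvm sample) and compute the metrics in base R. This is a minimal sketch, assuming the predicted class lands in a column named prediction next to the original label column:

    # Pull actual and predicted labels back to R; coerce them to a common type
    # if the prediction column comes back as character
    preds_local <- collect(select(predictions, "label", "prediction"))

    # Confusion matrix (rows = actual, columns = predicted)
    cm <- table(actual = preds_local$label, predicted = preds_local$prediction)

    # Per-class precision, recall and F1 derived from the confusion matrix
    precision <- diag(cm) / colSums(cm)
    recall    <- diag(cm) / rowSums(cm)
    f1        <- 2 * precision * recall / (precision + recall)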

Why isn't SparkR available in the CRAN package list? [duplicate]

Question: This question already has answers here: Installing of SparkR (4 answers). Closed 4 years ago.

I checked for the SparkR package in the CRAN package list via the following link: https://cran.r-project.org/web/packages/available_packages_by_date.html. The list does not include SparkR, so it cannot be installed with install.packages("package_name"). Why isn't SparkR listed in the package list?

Answer 1: Since Spark 1.4, SparkR is not a separate package anymore but has been bundled with the Apache Spark distribution itself.
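Because SparkR ships with Spark rather than with CRAN, it is loaded from the Spark installation's own R library. A minimal sketch, with a placeholder SPARK_HOME path:

    # Point R at the SparkR package bundled with a Spark installation
    # (the path below is a placeholder; use your own Spark directory)
    Sys.setenv(SPARK_HOME = "/opt/spark")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Start a session (SparkR 2.x API)
    sparkR.session(master = "local[*]")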

How to identify repeated occurrences of a string column in Hive?

Question: I have a view like this in Hive:

    id         sequencenumber  appname
    242539622  1               A
    242539622  2               A
    242539622  3               A
    242539622  4               B
    242539622  5               B
    242539622  6               C
    242539622  7               D
    242539622  8               D
    242539622  9               D
    242539622  10              B
    242539622  11              B
    242539622  12              D
    242539622  13              D
    242539622  14              F

I'd like to have, for each id, the following view:

    id         sequencenumber  appname  appname_c
    242539622  1               A        A
    242539622  2               A        A
    242539622  3               A        A
    242539622  4               B        B_1
    242539622  5               B        B_1
    242539622  6               C        C
    242539622  7               D        D_1
    242539622  8               D        D_1
    242539622  9               D        ...
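The desired appname_c, where a name gets a run index only when it recurs in separate consecutive blocks, is a classic gaps-and-islands problem. The sketch below, run here through SparkR's sql() with my_view as a placeholder table name, is an assumption about the intended logic rather than code from the question:

    result <- sql("
      SELECT id, sequencenumber, appname,
             -- suffix only the names that occur in more than one run
             CASE WHEN MAX(island) OVER (PARTITION BY id, appname) > 1
                  THEN CONCAT(appname, '_', CAST(island AS STRING))
                  ELSE appname
             END AS appname_c
      FROM (
        SELECT id, sequencenumber, appname,
               -- number each consecutive run of the same appname within an id
               DENSE_RANK() OVER (PARTITION BY id, appname ORDER BY grp) AS island
        FROM (
          SELECT id, sequencenumber, appname,
                 sequencenumber -
                 ROW_NUMBER() OVER (PARTITION BY id, appname ORDER BY sequencenumber) AS grp
          FROM my_view
        ) runs
      ) islands
      ORDER BY id, sequencenumber
    ")
    head(result, 20)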

SQL, sparklyr, and SparkR data frame conversions on Databricks

Question: I have a SQL table on Databricks created using the following code:

    %sql
    CREATE TABLE data
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "url/data.csv"

The following code converts that table to a SparkR DataFrame and an R data frame, respectively:

    %r
    library(SparkR)
    data_spark <- sql("SELECT * FROM data")
    data_r_df <- as.data.frame(data_spark)

But how should I convert any or all of these data frames into a sparklyr data frame to leverage sparklyr's parallelization?

Answer 1: Just sc...
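Since the table data is already registered in the metastore, sparklyr can reference it directly through its own connection instead of converting the SparkR object. A rough sketch, assuming a Databricks notebook (the spark_connect() arguments are an assumption):

    %r
    library(sparklyr)
    library(dplyr)

    # Attach sparklyr to the cluster's existing Spark session on Databricks
    sc <- spark_connect(method = "databricks")

    # Lazy sparklyr reference to the same metastore table
    data_sparklyr <- tbl(sc, "data")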

How to bind two DataFrame columns in SparkR?

Question: How can I bind two columns of a DataFrame in SparkR with Spark 1.4? TIA, Arun

Answer 1: There is no direct way to do this. Here is a similar question on Spark (1.3) in Scala. The only way to do it is to add some kind of row numbering, because then you are able to join on the row number. Why? Because you can only join tables, or add columns, based on other already existing columns.

    data1 <- createDataFrame(sqlContext, data.frame(a = c(1, 2, 3)))
    data2 <- createDataFrame(sqlContext, data.frame(b = c(2, 3, 4)))

Then...
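A sketch of that row-number-and-join idea, using the newer SparkR API (monotonically_increasing_id() did not exist in 1.4, and the generated ids only line up when both DataFrames share the same simple partitioning, so treat this as an illustration of the pattern, not a general cbind):

    # Add a synthetic row id to each DataFrame, join on it, keep the original columns
    data1 <- withColumn(data1, "row_id", monotonically_increasing_id())
    data2 <- withColumn(data2, "row_id", monotonically_increasing_id())

    joined   <- join(data1, data2, data1$row_id == data2$row_id)
    combined <- select(joined, joined$a, joined$b)
    head(combined)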

SparkR collect method crashes with OutOfMemory on Java heap space

Question: With SparkR, I'm working on a PoC to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, is deployed with bdutil, and consists of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use:

    Sys.setenv("SPARK_MEM" = "1g")
    sc <- sparkR.init("spark:/...
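With SPARK_MEM at 1g the driver heap is small relative to the data being collected, so the usual first steps are to give the driver and executors more memory and to avoid collect() on the full dataset. A rough sketch with placeholder values; note that in client mode the driver's heap generally has to be set before the JVM starts (e.g. via spark-submit --driver-memory), so the settings below may not be sufficient on their own:

    # Placeholder master URL and memory sizes; adjust to the cluster
    Sys.setenv("SPARK_MEM" = "6g")
    sc <- sparkR.init(
      master = "spark://spark-master:7077",
      sparkEnvir = list(spark.executor.memory = "4g")
    )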

R SparkR - equivalent to melt function

Question: Is there a function similar to melt in the SparkR library? I want to transform data with 1 row and 50 columns to 50 rows and 3 columns.

Answer 1: There is no built-in function that provides similar functionality in SparkR. You can build your own with explode:

    library(magrittr)

    df <- createDataFrame(data.frame(
      A = c('a', 'b', 'c'),
      B = c(1, 3, 5),
      C = c(2, 4, 6)
    ))

    melt <- function(df, id.vars, measure.vars,
                     variable.name = "key", value.name = "value") {
      measure.vars.exploded <- purrr::map(
        measure.vars, ...
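As an alternative sketch (not the original answer's code), the same wide-to-long reshape can be done with Spark SQL's stack() through selectExpr, assuming the toy columns A, B and C from the example above:

    # Keep A as the id column and unpivot B and C into key/value pairs
    long <- selectExpr(df, "A",
                       "stack(2, 'B', B, 'C', C) as (key, value)")
    head(long)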

Sparklyr: how to calculate correlation coefficient between 2 Spark tables?

Question: I have these 2 Spark tables:

    simx
      x0:   num 1.00 2.00 3.00 ...
      x1:   num 2.00 3.00 4.00 ...
      ...
      x788: num 2.00 3.00 4.00 ...

and

    simy
      y0: num 1.00 2.00 3.00 ...

In both tables, each column has the same number of values. Tables x and y are saved in the handles simX_tbl and simY_tbl, respectively. The actual data size is quite big and may reach 40 GB. I want to calculate the correlation coefficient of each column in simx with simy (say, like cor(x0, y0, 'pearson')). I searched everywhere and I...
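One way to do this without collecting 40 GB into R is to bind the two tables column-wise and let Spark's corr aggregate do the work. A rough sparklyr sketch, assuming an existing connection behind simX_tbl and simY_tbl and a single y column named y0 as above:

    library(sparklyr)
    library(dplyr)

    # Column-bind the two Spark tables (their row order must correspond)
    combined <- sdf_bind_cols(simX_tbl, simY_tbl)

    # Pearson correlation of every x column against y0, computed inside Spark;
    # corr() is translated to Spark SQL's CORR aggregate
    x_cols <- colnames(simX_tbl)
    cors <- sapply(x_cols, function(col) {
      combined %>%
        summarise(r = corr(!!rlang::sym(col), y0)) %>%
        pull(r)
    })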

How best to handle converting a large local data frame to a SparkR data frame?

Question: How can I convert a large local data frame to a SparkR DataFrame efficiently? On my local dev machine, a ~650 MB local data frame quickly exceeds available memory when I try to convert it to a SparkR DataFrame, even though the machine has 40 GB of RAM.

    library(reshape2)

    years  <- sample(1:10, 100, replace = T)
    storms <- sample(1:10, 100, replace = T)
    wind_speeds <- matrix(ncol = 316387, nrow = 100,
                          data = sample(0:250, 31638700, replace = T))

    df <- data.frame(year = years, storm = storms, ws = ...
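One workaround sketch (not from the question): createDataFrame() serializes the entire local object through the R driver, so for data this wide it is often cheaper to write the local data frame to disk and let Spark read it back. The path and options below are placeholders, using the SparkR 2.x CSV reader:

    # Write the local data frame out, then read it with Spark's CSV reader
    write.csv(df, "/tmp/wind_speeds.csv", row.names = FALSE)
    sdf <- read.df("/tmp/wind_speeds.csv", source = "csv",
                   header = "true", inferSchema = "true")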

SparkR: dplyr-style split-apply-combine on DataFrame

Question: Under the previous RDD paradigm, I could specify a key and then map an operation to RDD elements corresponding to each key. I don't see a clear way to do this with DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

    new.df <- old.df %>%
      group_by("column1") %>%
      do(myfunc(.))

I currently have a large SparkR DataFrame of the form:

    timestamp              value  id
    2015-09-01 05:00:00.0  1.132  24
    2015-09-01 05:10:00.0  null   24
    2015-09-01 05:20:00.0  1.129  24
    2015-09-01 05...
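In later SparkR releases (2.0+), gapply() provides exactly this split-apply-combine pattern; it did not exist in 1.5.1. A sketch for the newer API, with a placeholder per-group aggregation and output schema standing in for myfunc:

    # Apply an R function to each id group and return one row per group
    schema <- structType(structField("id", "integer"),
                         structField("mean_value", "double"))

    result <- gapply(old.df, "id",
                     function(key, x) {
                       data.frame(id = key[[1]],
                                  mean_value = mean(x$value, na.rm = TRUE))
                     },
                     schema)
    head(result)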