SparkR

How to apply a function to each row in SparkR?

Question: I have a CSV file containing a table with the columns "id", "timestamp", "action", "value" and "location". I want to apply a function to each row of the table, and I have already written the code in R as follows:

    user <- read.csv(file_path, sep = ";")
    num <- nrow(user)
    curLocation <- "1"
    for (i in 1:num) {
      row <- user[i, ]
      if (row$action != "power") curLocation <- row$value
      user[i, "location"] <- curLocation
    }

The R script works fine, and now I want to apply it in SparkR. However, I couldn't
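A minimal sketch of one way to port this to SparkR, assuming the file is read into a SparkDataFrame and that all columns can be treated as strings (both assumptions for illustration). dapply() runs the R function once per partition, so the carry-forward of curLocation only holds within a partition, not across the whole ordered file:

    library(SparkR)
    sparkR.session()

    # Read the CSV as a SparkDataFrame (file_path and the options are placeholders)
    user <- read.df(file_path, source = "csv", sep = ";", header = "true")

    # Output schema: the input columns plus the filled-in "location" column
    schema <- structType(
      structField("id", "string"),
      structField("timestamp", "string"),
      structField("action", "string"),
      structField("value", "string"),
      structField("location", "string")
    )

    # Each partition arrives in R as an ordinary data.frame, so the original
    # loop can be reused almost verbatim inside the function passed to dapply()
    result <- dapply(user, function(part) {
      curLocation <- "1"
      for (i in seq_len(nrow(part))) {
        if (part$action[i] != "power") curLocation <- part$value[i]
        part$location[i] <- curLocation
      }
      part
    }, schema)

    head(result)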

How can I convert groupedData into Dataframe in R

Question: Consider I have the dataframe below:

    AccountId,CloseDate
    1,2015-05-07
    2,2015-05-09
    3,2015-05-01
    4,2015-05-07
    1,2015-05-09
    1,2015-05-12
    2,2015-05-12
    3,2015-05-01
    3,2015-05-01
    3,2015-05-02
    4,2015-05-17
    1,2015-05-12

I want to group it by AccountId and then add another column, date_diff, containing the difference in CloseDate between the current row and the previous row. Note that I want date_diff to be calculated only between rows having the same AccountId. So I need
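One way to get this in plain R is with dplyr's group_by() and lag(). This is a sketch using the sample data above; the column name date_diff comes from the question, while ordering by CloseDate within each account and reporting the difference in days are assumptions:

    library(dplyr)

    df <- data.frame(
      AccountId = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 3, 4, 1),
      CloseDate = as.Date(c("2015-05-07", "2015-05-09", "2015-05-01", "2015-05-07",
                            "2015-05-09", "2015-05-12", "2015-05-12", "2015-05-01",
                            "2015-05-01", "2015-05-02", "2015-05-17", "2015-05-12"))
    )

    # Per-account difference (in days) between each CloseDate and the previous
    # one; the first row of each account gets NA because it has no predecessor
    result <- df %>%
      arrange(AccountId, CloseDate) %>%
      group_by(AccountId) %>%
      mutate(date_diff = as.numeric(CloseDate - lag(CloseDate))) %>%
      ungroup()

    result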

How to count number of missing values for each column of a data frame with SparkR?

Question: I am processing a 2.5 GB CSV file containing 1.1 million rows and 1000 numeric columns that seem to be sparsely populated. I currently run Spark on a 1-core VM with 8 GB of RAM, and the data has been split into 16 partitions. I tried something like the following, but it takes ages:

    ldf <- dapplyCollect(df, function(df.partition) {
      apply(df.partition, 2, function(col) { sum(is.na(col)) })
    })

Answer 1: Here's one way to do it, using sparklyr and dplyr. For the sake of a reproducible example, I
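A sketch of an alternative that stays inside SparkR and avoids shipping whole partitions to R: build one sum(isNull(...)) aggregate per column and collect only the resulting single row. This assumes df is a SparkDataFrame; note that isNull() counts NULLs, so NaN values would need a separate isNaN() term:

    library(SparkR)

    # One aggregate expression per column: 1 where the value is NULL, 0 otherwise
    exprs <- lapply(columns(df), function(name) {
      alias(sum(cast(isNull(column(name)), "integer")), name)
    })

    # A single Spark job that returns one row with the missing-value count of
    # every column
    na_counts <- collect(do.call(agg, c(list(df), exprs)))
    na_counts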

SparkR and Packages

Question: How does one call packages from Spark to be used for data operations with R? For example, I am trying to access my test.csv in HDFS as below:

    Sys.setenv(SPARK_HOME = "/opt/spark14")
    library(SparkR)
    sc <- sparkR.init(master = "local")
    sqlContext <- sparkRSQL.init(sc)
    flights <- read.df(sqlContext,
                       "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                       "com.databricks.spark.csv", header = "true")

but I am getting the error below:

    Caused by: java.lang.RuntimeException: Failed to load class for data source:
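The usual cause of this error is that the spark-csv package has never been put on the classpath. A sketch of one way to do that from a plain R session, assuming Spark 1.4 and the com.databricks:spark-csv_2.10:1.0.3 coordinates (both are assumptions; adjust to your Spark and Scala versions; from the sparkR shell you would pass --packages on the command line instead):

    # Must be set before SparkR is initialized; "sparkr-shell" stays at the end
    Sys.setenv(SPARK_HOME = "/opt/spark14")
    Sys.setenv(SPARKR_SUBMIT_ARGS =
      "--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell")

    library(SparkR)
    sc <- sparkR.init(master = "local")
    sqlContext <- sparkRSQL.init(sc)

    flights <- read.df(sqlContext,
                       "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                       "com.databricks.spark.csv", header = "true")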

Getting application ID from SparkR to create Spark UI url

Question: From the SparkR shell, I'd like to generate a link to view the Spark UI while in YARN mode. Normally the Spark UI is at port 4040, but in YARN mode it is apparently at something like [host]:9046/proxy/application_1234567890123_0001/, where the last part of the path is the unique application ID. Other SO answers show how to get the application ID for the Scala and Python shells. How do we get the application ID from SparkR? As a stab in the dark I tried:

    SparkR:::callJMethod(sc, "applicationId")
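A sketch using SparkR's internal JVM bridge (a private, unsupported API, so this is an assumption that may break across versions): sc returned by sparkR.init() refers to the JavaSparkContext, whose underlying SparkContext exposes applicationId:

    # Fetch the Scala SparkContext wrapped by the JavaSparkContext, then its ID
    scalaSc <- SparkR:::callJMethod(sc, "sc")
    appId   <- SparkR:::callJMethod(scalaSc, "applicationId")

    # Assemble the YARN proxy URL; host and port here are placeholders
    paste0("http://", "yarn-proxy-host", ":9046/proxy/", appId, "/")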

R read ORC file from S3

Question: We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to use R as a kind of sandbox environment, reading the same data. I have the package aws.s3 (cloudyr) running correctly: I can read CSV files without a problem, but it does not seem to let me convert the ORC files into something readable. The two options I found online were SparkR and dataconnector (Vertica). Since
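A sketch of the SparkR route, assuming a Spark 2.x session on the EMR cluster and an illustrative bucket path; the URI scheme and credentials depend on the Hadoop S3 connector configured on the cluster:

    library(SparkR)
    sparkR.session()

    # Read the ORC data straight from S3 into a SparkDataFrame
    orc_df <- read.orc("s3://my-bucket/path/to/orc/")
    head(orc_df)

    # For further work in plain R, a (small enough) result can be pulled to the driver
    local_df <- collect(orc_df)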

Get specific row by using SparkR

Question: I have a dataset "data" in SparkR of type DataFrame. I want to get, for example, entry number 50. In R I would simply type data[50,], but when I do this in SparkR I get the message "Error: object of type 'S4' is not subsettable". What can I do to solve this? Furthermore, how can I add a column (of the same column size) to the data?

Answer 1: The only thing you can do is:

    all50 <- take(data, 50)
    row50 <- tail(all50, 1)

SparkR has no row.names, hence you cannot subset on an index. This approach works, but you
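For the second part of the question, a column derived from existing columns can be added with withColumn(). A sketch, where the new column name, the expression and the existing column "value" are purely illustrative; attaching an arbitrary local R vector as a new column is not supported this way:

    # Add a derived column; the expression operates on Spark Column objects
    data <- withColumn(data, "value_doubled", data$value * 2)
    head(data)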

How do I run R script for sparkR?

Question: I am running sparkR 2.0.0 from the terminal, and I can run R commands. However, how do I create a .r script and run it within the Spark session?

Answer 1: SparkR uses the standard R interpreter, so the same rules apply. If you want to execute an external script inside the current session, use the source function.

(The answer then shows a SparkR shell session: the Spark welcome banner for version 2.1.0-SNAPSHOT, the note "SparkSession available as 'spark'", and a truncated example that begins with a call to sink().)
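A sketch of the suggestion, with a placeholder path:

    # From inside the sparkR shell (or any R session with SparkR loaded),
    # run an external script in the current session:
    source("/path/to/analysis.R")

    # A standalone R script can also be submitted to the cluster from the
    # command line, e.g.:  spark-submit /path/to/analysis.R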

Reading Text file in SparkR 1.4.0

Question: Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages available for that?

Answer 1:

Spark 1.6+: You can use the text input format to read a text file as a DataFrame:

    read.df(sqlContext = sqlContext, source = "text", path = "README.md")

Spark <= 1.5: The short answer is that you don't. SparkR 1.4 has been almost completely stripped of its low-level API, leaving only a limited subset of DataFrame operations. As you can read on an old SparkR webpage: As of April 2015, SparkR has
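In Spark 2.x the same thing is a one-liner with read.text(), which returns a SparkDataFrame with a single value column (one row per line); the path below is just an example:

    library(SparkR)
    sparkR.session()

    lines <- read.text("README.md")
    head(lines)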

How to do map and reduce in SparkR

Question: How do I do map and reduce operations using SparkR? All I can find is material about SQL queries. Is there a way to do map and reduce using SQL?

Answer 1: See "Writing R data frames returned from SparkR:::map" for an example (the question itself). In short, the blog post referred to by sph21 is out of date. As of the current date, both map and reduce are hidden in SparkR as private methods; there are open tickets to resolve that issue.

Source: https://stackoverflow.com/questions/31012765/how-to-do
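A sketch of what calling the hidden functions looks like, using the ::: operator to reach SparkR's private RDD-level API (unsupported and liable to change across versions; sc here is the context from sparkR.init()):

    library(SparkR)
    sc <- sparkR.init(master = "local")

    # Private RDD API: parallelize a local vector, map a function over it,
    # then reduce the results back to a single value
    rdd     <- SparkR:::parallelize(sc, 1:10)
    squared <- SparkR:::map(rdd, function(x) x * x)
    total   <- SparkR:::reduce(squared, `+`)
    total   # 385

The supported alternative is to stay at the DataFrame level and use dapply()/gapply() for the "map" step and agg() for the "reduce" step.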