SparkR

Starting SparkR session using external config file

Posted by 我的未来我决定 on 2020-01-16 18:01:05
Question: I have an RStudio driver instance which is connected to a Spark cluster. I would like to know whether there is any way to connect to the Spark cluster from RStudio using an external configuration file that specifies the number of executors, memory, and other Spark parameters. I know this can be done with the command sparkR.session(sparkConfig = list(spark.cores.max='2', spark.executor.memory = '8g')), but I am specifically looking for a method which takes the Spark parameters from an external file to
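
A minimal sketch of one possible approach, assuming a hypothetical plain-text file named spark.conf holding key=value pairs (the file name and format are illustrative, not something given in the question): parse the pairs into a named list with base R and hand that list to sparkR.session(sparkConfig = ...), the same call the question already uses.

    library(SparkR)

    # Hypothetical config file "spark.conf" containing lines such as:
    #   spark.cores.max=2
    #   spark.executor.memory=8g
    lines <- readLines("spark.conf")
    lines <- lines[nzchar(trimws(lines))]                 # drop blank lines
    kv    <- strsplit(lines, "=", fixed = TRUE)
    conf  <- lapply(kv, function(x) trimws(x[2]))         # parameter values
    names(conf) <- sapply(kv, function(x) trimws(x[1]))   # parameter names

    # Start the session with the externally supplied parameters
    sparkR.session(sparkConfig = conf)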

SparkR: Cannot read data at deployed workers, but ok with local machine

Posted by 核能气质少年 on 2020-01-15 09:05:31
Question: I am new to Spark and SparkR. For Hadoop, I only have a file called winutils/bin/winutils.exe. Running system: OS: Windows 10; Java: version "1.8.0_101", Java(TM) SE Runtime Environment (build 1.8.0_101-b13), Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode); R platform: x86_64-w64-mingw32, arch: x86_64, os: mingw32; RStudio: Version 1.0.20 – © 2009-2016 RStudio, Inc.; Spark: 2.0.0. I can read data on my local machine, but on the deployed workers I cannot. Could anybody help me?

How to write to JDBC source with SparkR 1.6.0?

Posted by 陌路散爱 on 2020-01-15 07:10:33
Question: With SparkR 1.6.0 I can read from a JDBC source with the following code: jdbc_url <- "jdbc:mysql://localhost:3306/dashboard?user=<username>&password=<password>" df <- sqlContext %>% loadDF(source = "jdbc", url = jdbc_url, driver = "com.mysql.jdbc.Driver", dbtable = "db.table_name") But after performing a calculation, when I try to write the data back to the database, I hit a roadblock: attempting... write.df(df = df, path = "NULL", source = "jdbc", url = jdbc_url, driver = "com.mysql.jdbc
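
For readability, here is the read side of the snippet above laid out as a block (a sketch only: it assumes SparkR 1.6 with an existing sqlContext, the magrittr pipe used in the question, and the same placeholder credentials).

    library(SparkR)    # SparkR 1.6.x API
    library(magrittr)  # assumed here, to supply the %>% pipe used in the question

    jdbc_url <- "jdbc:mysql://localhost:3306/dashboard?user=<username>&password=<password>"

    # Read the table over JDBC into a SparkR DataFrame
    df <- sqlContext %>%
      loadDF(source  = "jdbc",
             url     = jdbc_url,
             driver  = "com.mysql.jdbc.Driver",
             dbtable = "db.table_name")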

SparkR foreach loop

Posted by 最后都变了- on 2020-01-15 03:26:14
Question: In the Java/Scala/Python implementations of Spark, one can simply call the foreach method of the RDD or DataFrame types to parallelize iteration over a dataset. In SparkR I can't find such an instruction. What would be the proper way to iterate over the rows of a DataFrame? I could only find the gapply and dapply functions, but I don't want to compute new column values; I just want to do something with each element of a list, in parallel. My previous attempt was with lapply
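
One possible direction (my assumption, not something confirmed in the question) is spark.lapply, available in SparkR 2.0+, which applies a function to each element of a local list in parallel on the cluster; a minimal sketch:

    library(SparkR)
    sparkR.session()

    items <- list("a", "b", "c")   # the list whose elements should be processed in parallel

    # spark.lapply ships the function to the workers, applies it to each element,
    # and returns the results as a local list.
    results <- spark.lapply(items, function(x) {
      # do something with a single element here
      paste0("processed-", x)
    })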

Reading csv data into SparkR after writing it out from a DataFrame

Posted by 独自空忆成欢 on 2020-01-07 06:38:14
Question: I followed the example in this post to write out a DataFrame as a CSV to an AWS S3 bucket. The result was not a single file but a folder containing many .csv files. I'm now having trouble reading that folder back in as a DataFrame in SparkR. Below is what I've tried, but it does not result in the same DataFrame that I wrote out. write.df(df, 's3a://bucket/df', source="csv") # Creates a folder named df in the S3 bucket df_in1 <- read.df("s3a://bucket/df", source="csv") df_in2 <- read.df("s3a://bucket/df
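
A hedged sketch of one way the folder might be read back: pointing read.df at the directory reads all the part files as one dataset, and the header/inferSchema options (assumptions here, since the question does not show how the data was written beyond source="csv") may be needed for the result to match the original DataFrame.

    library(SparkR)
    sparkR.session()

    # Read the whole folder of part files as a single DataFrame.
    # header and inferSchema are assumptions; adjust to match how the data was written.
    df_in <- read.df("s3a://bucket/df",
                     source      = "csv",
                     header      = "true",
                     inferSchema = "true")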

Trying to find R equivalent for SetConf from Java

Posted by 风格不统一 on 2020-01-04 05:38:05
Question: In Java, you can do something like: sc.setConf('spark.sql.parquet.binaryAsString','true') What would the equivalent be in R? I've looked at the methods available on the sc object and can't find any obvious way of doing this. Thanks. Answer 1: You can set environment variables during SparkContext initialization. sparkR.init has a number of optional arguments, including: sparkEnvir - a list of environment variables to set on worker nodes; sparkExecutorEnv - a list of environment variables to be used
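
Following the answer above, a minimal sketch for the SparkR 1.x API; the option name and value come straight from the question, while treating sparkEnvir as the place to put it is the answer's suggestion, not something verified here:

    library(SparkR)

    # Pass the configuration at SparkContext initialization time, as the answer suggests.
    sc <- sparkR.init(sparkEnvir = list(spark.sql.parquet.binaryAsString = "true"))
    sqlContext <- sparkRSQL.init(sc)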

SparkR Job 100 Minutes Timeout

Posted by |▌冷眼眸甩不掉的悲伤 on 2020-01-01 15:50:13
Question: I have written a somewhat complex SparkR script and run it using spark-submit. What the script basically does is read a big Hive/Impala Parquet-based table row by row and generate a new Parquet file with the same number of rows. But the job seems to stop after almost exactly 100 minutes, which looks like some timeout. For up to 500K rows the script works perfectly (because it needs less than 100 minutes). For 1, 2, 3 or more million rows the script exits after 100 minutes. I checked all possible parameters having values
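
For context, the read-and-rewrite workflow described above looks roughly like the following in SparkR (a sketch with placeholder table and path names, not the actual script; reading a Hive/Impala table this way also assumes the session was created with Hive support):

    library(SparkR)
    sparkR.session(enableHiveSupport = TRUE)

    # Placeholder table name; the real script reads a large Parquet-backed table.
    src <- sql("SELECT * FROM some_db.some_big_table")

    # ... the script's per-row processing would happen here ...

    # Write the result back out as Parquet with the same number of rows.
    write.df(src, path = "/tmp/output_parquet", source = "parquet", mode = "overwrite")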

Extracting Class Probabilities from SparkR ML Classification Functions

Posted by 天涯浪子 on 2020-01-01 12:14:07
Question: I'm wondering whether it's possible, using the built-in features of SparkR or any other workaround, to extract the class probabilities from some of the classification algorithms included in SparkR. The ones of particular interest are spark.gbt(), spark.mlp(), and spark.randomForest(). Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence". I've seen several other questions that are similar to this topic, but none
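
To make the current behaviour concrete, a small sketch (placeholder data and column names): training spark.randomForest and calling predict yields a prediction column, but no probability or confidence column is surfaced, which is exactly the gap the question is about.

    library(SparkR)
    sparkR.session()

    train_df <- createDataFrame(iris)   # placeholder training data

    # Classification random forest from SparkR's ML wrappers
    model <- spark.randomForest(train_df, Species ~ ., type = "classification", numTrees = 20)

    # predict() returns a SparkDataFrame with a "prediction" column;
    # the per-class probabilities asked about are not exposed here.
    preds <- predict(model, train_df)
    head(select(preds, "prediction"))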

Should I pre-install CRAN R packages on worker nodes when using SparkR?

Posted by 若如初见. on 2019-12-30 10:28:14
Question: I want to use CRAN R packages such as forecast with SparkR, and I have the following two problems. Should I pre-install all those packages on worker nodes? When I read the source code of this Spark file, it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on workers? Suppose I need to use functions provided by forecast in a map transformation; how should I import the package.
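
To illustrate the map-transformation scenario in the question (this is not an answer to whether pre-installation is required), a sketch that loads a CRAN package inside a function shipped to the workers with spark.lapply; forecast is the package named in the question, and the call only succeeds if that package can actually be loaded on the worker that runs it.

    library(SparkR)
    sparkR.session()

    series_list <- list(AirPassengers, UKgas)   # example time series from base R's datasets package

    results <- spark.lapply(series_list, function(ts_data) {
      # The package must be loadable on the worker that executes this function.
      library(forecast)
      fit <- forecast::auto.arima(ts_data)
      as.numeric(forecast::forecast(fit, h = 3)$mean)   # 3-step-ahead point forecasts
    })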