SparkR

Starting SparkR session using external config file

Posted by 我的未来我决定 on 2020-01-16 18:01:05
Question: I have an RStudio driver instance which is connected to a Spark cluster. I would like to know whether there is any way to connect to the Spark cluster from RStudio using an external configuration file that specifies the number of executors, memory, and other Spark parameters. I know this can be done with the command sparkR.session(sparkConfig = list(spark.cores.max='2', spark.executor.memory = '8g')), but I am specifically looking for a method which takes the Spark parameters from an external file to
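
A minimal sketch of one possible approach, assuming a hypothetical plain-text file named spark.conf holding key=value pairs (the file name and format are illustrative, not something given in the question): parse the pairs into a named list with base R and hand that list to sparkR.session(sparkConfig = ...), the same call the question already uses.

    library(SparkR)

    # Hypothetical config file "spark.conf" containing lines such as:
    #   spark.cores.max=2
    #   spark.executor.memory=8g
    lines <- readLines("spark.conf")
    lines <- lines[nzchar(trimws(lines))]                 # drop blank lines
    kv    <- strsplit(lines, "=", fixed = TRUE)
    conf  <- lapply(kv, function(x) trimws(x[2]))         # parameter values
    names(conf) <- sapply(kv, function(x) trimws(x[1]))   # parameter names

    # Start the session with the externally supplied parameters
    sparkR.session(sparkConfig = conf)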

SparkR: Cannot read data at deployed workers, but ok with local machine

Posted by 核能气质少年 on 2020-01-15 09:05:31
Question: I am new to Spark and SparkR. For Hadoop, I only have a file called winutils/bin/winutils.exe. Running system: OS: Windows 10; Java: version "1.8.0_101", Java(TM) SE Runtime Environment (build 1.8.0_101-b13), Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode); R platform: x86_64-w64-mingw32, arch: x86_64, os: mingw32; RStudio: Version 1.0.20 – © 2009-2016 RStudio, Inc.; Spark: 2.0.0. I can read data on my local machine, but on the deployed workers I cannot. Could anybody help me?

How to write to JDBC source with SparkR 1.6.0?

Posted by 陌路散爱 on 2020-01-15 07:10:33
Question: With SparkR 1.6.0 I can read from a JDBC source with the following code: jdbc_url <- "jdbc:mysql://localhost:3306/dashboard?user=<username>&password=<password>" df <- sqlContext %>% loadDF(source = "jdbc", url = jdbc_url, driver = "com.mysql.jdbc.Driver", dbtable = "db.table_name") But after performing a calculation, when I try to write the data back to the database, I hit a roadblock: attempting... write.df(df = df, path = "NULL", source = "jdbc", url = jdbc_url, driver = "com.mysql.jdbc
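
For readability, here is the read side of the snippet above laid out as a block (a sketch only: it assumes SparkR 1.6 with an existing sqlContext, the magrittr pipe used in the question, and the same placeholder credentials).

    library(SparkR)    # SparkR 1.6.x API
    library(magrittr)  # assumed here, to supply the %>% pipe used in the question

    jdbc_url <- "jdbc:mysql://localhost:3306/dashboard?user=<username>&password=<password>"

    # Read the table over JDBC into a SparkR DataFrame
    df <- sqlContext %>%
      loadDF(source  = "jdbc",
             url     = jdbc_url,
             driver  = "com.mysql.jdbc.Driver",
             dbtable = "db.table_name")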

SparkR foreach loop

Posted by 最后都变了- on 2020-01-15 03:26:14
Question: In the Java/Scala/Python implementations of Spark, one can simply call the foreach method of the RDD or DataFrame types to parallelize iteration over a dataset. In SparkR I can't find such an instruction. What would be the proper way to iterate over the rows of a DataFrame? I could only find the gapply and dapply functions, but I don't want to compute new column values; I just want to do something with each element of a list, in parallel. My previous attempt was with lapply
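
One possible direction (my assumption, not something confirmed in the question) is spark.lapply, available in SparkR 2.0+, which applies a function to each element of a local list in parallel on the cluster; a minimal sketch:

    library(SparkR)
    sparkR.session()

    items <- list("a", "b", "c")   # the list whose elements should be processed in parallel

    # spark.lapply ships the function to the workers, applies it to each element,
    # and returns the results as a local list.
    results <- spark.lapply(items, function(x) {
      # do something with a single element here
      paste0("processed-", x)
    })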

Reading csv data into SparkR after writing it out from a DataFrame

Posted by 独自空忆成欢 on 2020-01-07 06:38:14
Question: I followed the example in this post to write out a DataFrame as a CSV to an AWS S3 bucket. The result was not a single file but a folder containing many .csv files. I'm now having trouble reading that folder back in as a DataFrame in SparkR. Below is what I've tried, but it does not result in the same DataFrame that I wrote out. write.df(df, 's3a://bucket/df', source="csv") # Creates a folder named df in the S3 bucket df_in1 <- read.df("s3a://bucket/df", source="csv") df_in2 <- read.df("s3a://bucket/df
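
A hedged sketch of one way the folder might be read back: pointing read.df at the directory reads all the part files as one dataset, and the header/inferSchema options (assumptions here, since the question does not show how the data was written beyond source="csv") may be needed for the result to match the original DataFrame.

    library(SparkR)
    sparkR.session()

    # Read the whole folder of part files as a single DataFrame.
    # header and inferSchema are assumptions; adjust to match how the data was written.
    df_in <- read.df("s3a://bucket/df",
                     source      = "csv",
                     header      = "true",
                     inferSchema = "true")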

Trying to find R equivalent for SetConf from Java

Posted by 风格不统一 on 2020-01-04 05:38:05
Question: In Java, you can do something like: sc.setConf('spark.sql.parquet.binaryAsString','true') What would the equivalent be in R? I've looked at the methods available on the sc object and can't find any obvious way of doing this. Thanks. Answer 1: You can set environment variables during SparkContext initialization. sparkR.init has a number of optional arguments, including: sparkEnvir - a list of environment variables to set on worker nodes; sparkExecutorEnv - a list of environment variables to be used
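
Following the answer above, a minimal sketch for the SparkR 1.x API; the option name and value come straight from the question, while treating sparkEnvir as the place to put it is the answer's suggestion, not something verified here:

    library(SparkR)

    # Pass the configuration at SparkContext initialization time, as the answer suggests.
    sc <- sparkR.init(sparkEnvir = list(spark.sql.parquet.binaryAsString = "true"))
    sqlContext <- sparkRSQL.init(sc)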

SparkR Job 100 Minutes Timeout

Posted by |▌冷眼眸甩不掉的悲伤 on 2020-01-01 15:50:13
Question: I have written a somewhat complex SparkR script and run it using spark-submit. What the script basically does is read a big Hive/Impala Parquet-based table row by row and generate a new Parquet file with the same number of rows. But the job seems to stop after almost exactly 100 minutes, which looks like some timeout. For up to 500K rows the script works perfectly (because it needs less than 100 minutes). For 1, 2, 3 or more million rows the script exits after 100 minutes. I checked all possible parameters having values
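
For context, the read-and-rewrite workflow described above looks roughly like the following in SparkR (a sketch with placeholder table and path names, not the actual script; reading a Hive/Impala table this way also assumes the session was created with Hive support):

    library(SparkR)
    sparkR.session(enableHiveSupport = TRUE)

    # Placeholder table name; the real script reads a large Parquet-backed table.
    src <- sql("SELECT * FROM some_db.some_big_table")

    # ... the script's per-row processing would happen here ...

    # Write the result back out as Parquet with the same number of rows.
    write.df(src, path = "/tmp/output_parquet", source = "parquet", mode = "overwrite")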

Extracting Class Probabilities from SparkR ML Classification Functions

Posted by 天涯浪子 on 2020-01-01 12:14:07
Question: I'm wondering whether it's possible, using the built-in features of SparkR or any other workaround, to extract the class probabilities from some of the classification algorithms included in SparkR. The ones of particular interest are spark.gbt(), spark.mlp(), and spark.randomForest(). Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence". I've seen several other questions that are similar to this topic, but none
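
To make the current behaviour concrete, a small sketch (placeholder data and column names): training spark.randomForest and calling predict yields a prediction column, but no probability or confidence column is surfaced, which is exactly the gap the question is about.

    library(SparkR)
    sparkR.session()

    train_df <- createDataFrame(iris)   # placeholder training data

    # Classification random forest from SparkR's ML wrappers
    model <- spark.randomForest(train_df, Species ~ ., type = "classification", numTrees = 20)

    # predict() returns a SparkDataFrame with a "prediction" column;
    # the per-class probabilities asked about are not exposed here.
    preds <- predict(model, train_df)
    head(select(preds, "prediction"))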

Should I pre-install CRAN R packages on worker nodes when using SparkR?

Posted by 若如初见. on 2019-12-30 10:28:14
Question: I want to use CRAN R packages such as forecast with SparkR, and I have the following two problems. Should I pre-install all those packages on worker nodes? When I read the source code of this Spark file, it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on workers? Suppose I need to use functions provided by forecast in a map transformation; how should I import the package.
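
To illustrate the map-transformation scenario in the question (this is not an answer to whether pre-installation is required), a sketch that loads a CRAN package inside a function shipped to the workers with spark.lapply; forecast is the package named in the question, and the call only succeeds if that package can actually be loaded on the worker that runs it.

    library(SparkR)
    sparkR.session()

    series_list <- list(AirPassengers, UKgas)   # example time series from base R's datasets package

    results <- spark.lapply(series_list, function(ts_data) {
      # The package must be loadable on the worker that executes this function.
      library(forecast)
      fit <- forecast::auto.arima(ts_data)
      as.numeric(forecast::forecast(fit, h = 3)$mean)   # 3-step-ahead point forecasts
    })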