sparkr

Vectorized Execution in Apache Spark 3.0

Submitted by 橙三吉。 on 2020-08-14 09:20:41
R is one of the most popular languages in data science, built for statistical analysis and extended by packages such as RStudio addins and other R packages for data processing and machine learning tasks; it also lets data scientists visualize their data sets easily. With SparkR, R code scales out on Apache Spark with little effort: jobs can be run interactively on a distributed cluster simply by launching the R shell. When SparkR does not need to interact with the R process, its performance is virtually identical to that of the other language APIs such as Scala, Java and Python. However, when a SparkR job interacts with native R functions or data types, performance drops significantly. Using Apache Arrow for the data exchange between Spark and R improves performance substantially. This post outlines the interaction between Spark and R inside SparkR and compares performance with and without vectorized execution. Table of contents: 1 Spark and R interaction; 2 Native implementation; 3 Vectorized implementation; 4 Benchmark results. Spark and R interaction: SparkR supports not only a rich set of ML and SQL-like APIs but also a set of APIs for interacting directly with R code. For example, Spark DataFrame and R
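A minimal sketch of what enabling the vectorized (Arrow-based) path looks like in SparkR 3.0; it assumes the arrow R package is installed on the driver, and the mtcars data and the derived kpl column are purely illustrative:

library(SparkR)

# Start a session with Arrow-based transfer switched on
# (spark.sql.execution.arrow.sparkr.enabled is the Spark 3.0 setting for SparkR)
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

df <- createDataFrame(mtcars)

# dapplyCollect applies a native R function to each partition and collects the result;
# with Arrow enabled, rows move between the JVM and the R worker as Arrow batches
# instead of being serialized row by row
out <- dapplyCollect(df, function(pdf) {
  pdf$kpl <- pdf$mpg * 0.425   # illustrative derived column
  pdf
})
head(out)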

Egg/JAR equivalent for Sparklyr projects

Submitted by 隐身守侯 on 2020-06-26 14:17:26
Question: We have a SparklyR project which is set up like this:

# load functions
source('./a.R')
source('./b.R')
source('./c.R')
....

# main script computations
sc <- spark_connect(...)
read_csv(sc, s3://path)
....

Running it on EMR:

spark-submit --deploy-mode client s3://path/to/my/script.R

Running this script using spark-submit above fails, since it seems to only take a single R script, but we are sourcing functions from multiple files. Is there a way we can package this as an egg/jar file with all of
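The snippet is cut off before any answer. One workaround worth sketching (an assumption on my part, not something taken from the thread) is to bundle a.R, b.R and c.R into a small R package installed on the cluster, so that the single script passed to spark-submit only needs to load it:

# main script submitted with spark-submit
# (the package name "myhelpers" is hypothetical, as is the S3 path)
library(sparklyr)
library(myhelpers)   # a.R, b.R and c.R wrapped up as package functions

sc <- spark_connect(master = "yarn-client")
df <- spark_read_csv(sc, name = "input", path = "s3://bucket/path/data.csv")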

Access Azure blob storage from R notebook

Submitted by 我的梦境 on 2020-04-16 06:13:05
Question: In Python this is how I would access a CSV from Azure blobs:

storage_account_name = "testname"
storage_account_access_key = "..."
file_location = "wasb://example@testname.blob.core.windows.net/testfile.csv"

spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)

df = spark.read.format('csv').load(file_location, header = True, inferSchema = True)

How can I do this in R? I cannot find any documentation...

Answer 1: The AzureStor package
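A rough SparkR equivalent of the Python snippet (a sketch only: it assumes a notebook where SparkR is attached, and whether a running session actually picks up this storage key via sparkConfig depends on the environment; on Databricks it may need to go into the cluster's Spark configuration instead):

library(SparkR)

storage_account_name <- "testname"
storage_account_access_key <- "..."
file_location <- "wasb://example@testname.blob.core.windows.net/testfile.csv"

# Pass the storage account key as a Spark configuration entry
conf <- list(storage_account_access_key)
names(conf) <- paste0("fs.azure.account.key.", storage_account_name, ".blob.core.windows.net")
sparkR.session(sparkConfig = conf)

df <- read.df(file_location, source = "csv", header = "true", inferSchema = "true")
head(df)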

Getting last value of group in Spark

Submitted by 给你一囗甜甜゛ on 2020-04-07 03:44:12
Question: I have a SparkR DataFrame as shown below:

# Create R data.frame
custId <- c(rep(1001, 5), rep(1002, 3), 1003)
date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01')
desc <- c('New','New','Good','New','Bad','New','Good','Good','New')
newcust <- c(1,1,0,1,0,1,0,0,1)
df <- data.frame(custId, date, desc, newcust)

# Create SparkR DataFrame
df <- createDataFrame(df)
display(df)

custId | date | desc | newcust
--------------------
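The question and its output table are truncated here, but the title asks for the last value of each group. One way to get that in SparkR (a sketch using window functions, not necessarily the answer given in the thread) is to rank rows within each custId by date in descending order and keep the top-ranked row:

library(SparkR)

# Rank rows within each custId, newest date first, and keep the first row per group
ws <- orderBy(windowPartitionBy("custId"), desc(df$date))
ranked <- withColumn(df, "rn", over(row_number(), ws))
lastPerCust <- where(ranked, ranked$rn == 1)
head(lastPerCust)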

Core Components of the Spark Big Data Analysis Framework

Submitted by 為{幸葍}努か on 2020-03-20 22:28:41
The core components of the Spark big data analysis framework include the RDD in-memory data structure, the Streaming stream-computing framework, GraphX graph computation and network data mining, the MLlib machine learning framework, the Spark SQL query language, the Tachyon file system, and the SparkR compute engine. Here is a brief introduction. I. The RDD in-memory data structure. A big data analysis system generally includes subsystems for data acquisition, data cleansing, data processing, data analysis, and report output. To simplify data processing and improve performance, Spark introduces the RDD in-memory data structure, a mechanism quite similar to R's. A user program only needs to work with the RDD interface; scheduling and exchange of data with the storage system are handled by the provider's driver. RDDs can interact with Hadoop's HBase, HDFS and other systems acting as data storage, and support for many other storage systems can be added through extensions. Because of RDDs, the application model is decoupled from physical storage, and workloads that repeatedly traverse and search large numbers of records become much easier to handle. This matters a great deal: Hadoop's structure is mainly suited to sequential processing, so going back to rescan data repeatedly is very inefficient, and there was no unified framework for it, leaving algorithm developers to devise their own solutions, which is undoubtedly quite difficult. The arrival of RDDs solved this problem to a certain extent. But precisely because the RDD is the core component and hard to implement, its performance, capacity and stability directly determine how well the other algorithms can be implemented. As things stand, situations where the memory occupied by RDDs becomes overloaded still occur frequently.
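Since this page collects SparkR posts, here is a small SparkR-side illustration of the in-memory idea described above (a sketch: the HDFS path is hypothetical, and SparkR exposes DataFrames built on the RDD/Dataset machinery rather than raw RDDs):

library(SparkR)
sparkR.session()

# Read from a storage system into a SparkDataFrame (path is hypothetical)
df <- read.df("hdfs:///data/events.parquet", source = "parquet")

# Keep the data in cluster memory so repeated traversals do not go back to storage
cache(df)
nrow(df)   # first action materializes and caches the data
nrow(df)   # later scans are served from memory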

Summing multiple columns in Spark

Submitted by 情到浓时终转凉″ on 2020-02-02 02:25:27
Question: How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error.

# Create SparkDataFrame
df <- createDataFrame(faithful)

# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))   ## This works

# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))   ## This doesn't work

Either SparkR or PySpark code will
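The snippet ends before any answer; in SparkR, two variants that do work (a sketch, not necessarily what the thread settled on) are to compute each column's total in its own aggregate, or to add the columns row-wise first and sum the resulting column:

# Separate totals per column
head(agg(df, totalWaiting = sum(df$waiting), totalEruptions = sum(df$eruptions)))

# One combined total: add the columns first, then sum the derived column
head(agg(df, total = sum(df$waiting + df$eruptions)))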

Starting SparkR session using external config file

Submitted by 有些话、适合烂在心里 on 2020-01-16 18:03:43
Question: I have an RStudio driver instance which is connected to a Spark cluster. I wanted to know if there is any way to connect to the Spark cluster from RStudio using an external configuration file which can specify the number of executors, memory and other Spark parameters. I know we can do it using the command below:

sparkR.session(sparkConfig = list(spark.cores.max = '2', spark.executor.memory = '8g'))

I am specifically looking for a method which takes Spark parameters from an external file to
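The question is cut off before any answer. One simple way to get there (a sketch under the assumption that a plain key=value file is acceptable; the file path and helper name are made up) is to parse the file into a named list and hand it to sparkR.session:

library(SparkR)

# Example file /path/to/spark.conf (hypothetical), one setting per line:
#   spark.cores.max=2
#   spark.executor.memory=8g
read_spark_conf <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  kv <- strsplit(lines, "=", fixed = TRUE)
  setNames(lapply(kv, function(x) trimws(x[2])),
           vapply(kv, function(x) trimws(x[1]), character(1)))
}

sparkR.session(sparkConfig = read_spark_conf("/path/to/spark.conf"))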
