rdd

Convert JSON objects to RDD

Submitted by 主宰稳场 on 2019-12-05 11:49:55
I don't know if this question is a repetition, but somehow all the answers I came across don't seem to work for me (maybe I'm doing something wrong). I have a class defined thus: case class myRec( time: String, client_title: String, made_on_behalf: Double, country: String, email_address: String, phone: String) and a sample JSON file that contains records or objects in the form [{...}{...}{...}...], i.e. [{"time": "2015-05-01 02:25:47", "client_title": "Mr.", "made_on_behalf": 0, "country": "Brussel", "email_address": "15e29034@gmail.com"}, {"time": "2015-05-01 04:15:03", "client_title": "Mr.",
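
A minimal sketch of one common approach, assuming Spark 1.x with a SQLContext named sqlContext, the input reshaped to one JSON object per line (read.json in Spark 1.x does not accept a top-level JSON array), a hypothetical HDFS path, and that every field of myRec appears somewhere in the file:

    import org.apache.spark.sql.functions.col

    case class myRec(time: String, client_title: String, made_on_behalf: Double,
                     country: String, email_address: String, phone: String)

    // Infer a schema from the JSON-lines file, then map each Row to the case class
    val df = sqlContext.read.json("hdfs:///path/to/records.json")   // hypothetical path

    val recRDD = df
      .select(col("time"), col("client_title"), col("made_on_behalf").cast("double"),
              col("country"), col("email_address"), col("phone"))
      .map(r => myRec(r.getString(0), r.getString(1), r.getDouble(2),
                      r.getString(3), r.getString(4), r.getString(5)))

    recRDD.take(2).foreach(println)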

How to convert RDD to DataFrame in Spark Streaming, not just Spark

Submitted by 别等时光非礼了梦想. on 2019-12-05 10:19:23
How can I convert an RDD to a DataFrame in Spark Streaming, not just in Spark? I saw this example, but it requires a SparkContext: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ rdd.toDF() In my case I have a StreamingContext. Should I then create a SparkContext inside foreach? It looks too crazy... So, how do I deal with this issue? My final goal (if it might be useful) is to save the DataFrame to Amazon S3 using rdd.toDF.write.format("json").saveAsTextFile("s3://iiiii/ttttt.json");, which is not possible for an RDD without converting it to a DataFrame (as far as I know). myDstream.foreachRDD {
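
A minimal sketch of the usual pattern, assuming the DStream's elements are case classes or tuples (so toDF() can derive a schema) and a placeholder S3 prefix: reuse the SparkContext already attached to each micro-batch RDD and obtain a singleton SQLContext from it, rather than building new contexts inside the loop.

    import org.apache.spark.sql.SQLContext

    myDstream.foreachRDD { rdd =>
      // Reuse the driver's SparkContext carried by the batch RDD (Spark 1.5+)
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      if (!rdd.isEmpty()) {
        val df = rdd.toDF()
        // Append mode so successive batches do not collide on the same path;
        // "bucket/prefix" is a placeholder
        df.write.format("json").mode("append").save("s3://bucket/prefix")
      }
    }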

spark-on-yarn study notes

Submitted by ⅰ亾dé卋堺 on 2019-12-05 06:30:22
1. When HDFS stores a file, it splits the file into blocks, and the blocks are distributed across different nodes; with replication currently set to 3, each block exists on 3 nodes. 2. Spark runs around the concept of the RDD, which represents an abstract dataset. Take this code as an example: sc.textFile("abc.log"). The textFile() call creates an RDD object, which can be thought of as representing the data of the "abc.log" file; operations on the file's data are carried out by operating on this RDD object. 3. An RDD contains one or more partitions, each corresponding to a portion of the file's data. When Spark reads from HDFS, each HDFS block read into memory is abstracted as a Spark partition. So the RDD corresponds to the file, a partition corresponds to a block of the file, and the number of partitions equals the number of blocks; the point of this is to operate on the file's data in parallel. Because the blocks are distributed across different nodes, operations on the partitions are likewise spread across different nodes. 4. An RDD is a read-only, immutable dataset, so every operation on an RDD produces a new RDD object; likewise, partitions are read-only. In sc.textFile("abc.log").map(), textFile() builds a NewHadoopRDD, and running map() builds a MapPartitionsRDD. The map() function here is already a distributed operation
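
A minimal sketch of points 2-4, assuming a hypothetical HDFS path: textFile() yields an RDD whose partition count follows the block/split count, and each transformation returns a new RDD whose lineage can be inspected.

    val lines = sc.textFile("hdfs:///user/demo/abc.log")   // hypothetical path

    // Typically one partition per HDFS block (subject to minPartitions)
    println(lines.partitions.length)

    // map() does not mutate `lines`; it returns a new MapPartitionsRDD
    val lengths = lines.map(_.length)

    // Print the lineage of RDDs built by textFile() and map()
    println(lengths.toDebugString)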

Spark Theory Summary

Submitted by 限于喜欢 on 2019-12-05 06:13:57
I. Spark terminology. 1. Application: the Spark application code written by the user, containing the Driver-side code and the Executor code that runs on multiple nodes across the cluster. A Spark application consists of one or more jobs (because the code may invoke Actions several times); each job is one Action executed on an RDD. 2. Driver Program: the Driver in Spark runs the Application's main function and creates the SparkContext; the SparkContext is created to set up the runtime environment for the Spark application. In Spark, the SparkContext is responsible for communicating with the Cluster Manager, requesting resources, and assigning and monitoring tasks. The SparkContext requests resources from the ResourceManager or the Master and launches Executor processes (thread pools); when the Executors have finished running, the Driver is responsible for shutting down the SparkContext. 3. Cluster Manager: the external service that acquires resources on the cluster. Common options are Standalone, Spark's built-in resource manager, where the Master allocates resources; and Hadoop YARN, where YARN's ResourceManager allocates resources. 4. Worker: a compute node in the cluster that can provide resources and run Executor processes.
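
A minimal sketch tying the terms together, with a hypothetical input path: the object below is the Application, its main() runs in the Driver, the SparkContext it creates talks to the cluster manager, and the single action (collect) becomes one job on the Executors.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountApp {
      def main(args: Array[String]): Unit = {
        // The master (yarn, spark://..., local[*]) is normally supplied via spark-submit
        val conf = new SparkConf().setAppName("WordCountApp")
        val sc = new SparkContext(conf)   // Driver-side entry point; requests Executors

        val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collect()                      // the Action: triggers one job

        counts.foreach(println)
        sc.stop()
      }
    }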

Spark select top values in RDD

Submitted by 断了今生、忘了曾经 on 2019-12-05 05:49:52
The original dataset is: # (numbersofrating,title,avg_rating) newRDD =[(3,'monster',4),(4,'minions 3D',5),....] I want to select the top N avg_rating values in newRDD. I used the following code, but it has an error: selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......)) TypeError: map() takes no keyword arguments The expected data should be: # (numbersofrating,title,avg_rating) selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....] You can use either top or takeOrdered with the key argument: newRDD.top(2, key=lambda x: x[2]) or newRDD.takeOrdered(2, key=lambda x: -x[2]) Note that top is taking elements
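
For the Scala RDD API, a minimal analogue of the answer's PySpark top/takeOrdered calls (a sketch, assuming the same (numbersofrating, title, avg_rating) tuples; the Scala versions take an Ordering instead of a key function):

    val newRDD = sc.parallelize(Seq((3, "monster", 4), (4, "minions 3D", 5)))

    // Order tuples by their third field (avg_rating)
    val byRating = Ordering.by[(Int, String, Int), Int](_._3)

    val top2  = newRDD.top(2)(byRating)                 // highest avg_rating first
    // takeOrdered returns the smallest elements, so reverse the ordering for "top"
    val top2b = newRDD.takeOrdered(2)(byRating.reverse)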

A Short Spark Summary

Submitted by 老子叫甜甜 on 2019-12-05 04:26:56
The Spark programming model: the RDD. An RDD (Resilient Distributed Dataset) is a resilient distributed dataset, the most basic data abstraction in Spark; it represents an immutable, partitionable collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, location-aware scheduling, and scalability. RDDs allow users to explicitly cache a working set in memory across multiple queries, so subsequent queries can reuse the working set, which greatly speeds up queries. Characteristics of an RDD: (1) Partition: a list of data shards. The data can be split, and the split pieces can be computed in parallel; partitions are the atomic components of the dataset. Users can specify the number of partitions when creating an RDD; if none is specified, a default is used, namely the number of CPU cores allocated to the program. (2) Compute: a function that computes each shard of the RDD. RDD computation is done per shard, and every RDD implements a compute function for this purpose. The compute function composes the iterators and does not need to store the result of each computation. (3) Dependency: every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies on one another. When the data of some partitions are lost, Spark can recompute just the lost partitions through these dependencies instead of recomputing all partitions of the RDD. (4) Preferred locations (optional): a list storing the preferred location for accessing each Partition (preferred
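
A minimal sketch probing characteristics (1)-(3) on a small parallelized dataset (all names are illustrative):

    // (1) Partitions: request 4 shards explicitly at creation time
    val nums = sc.parallelize(1 to 100, numSlices = 4)
    println(nums.partitions.length)            // 4

    // (2) Compute runs per partition; immutability means map() yields a new RDD
    val doubled = nums.map(_ * 2)

    // (3) Dependencies: the new RDD records a narrow dependency on `nums`
    println(doubled.dependencies)
    println(doubled.toDebugString)             // the lineage used to recompute lost partitions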

How to partition a RDD

Submitted by 老子叫甜甜 on 2019-12-05 04:26:06
Question: I have a text file consisting of a large number of random floating-point values separated by spaces. I am loading this file into an RDD in Scala. How does this RDD get partitioned? Also, is there any method to generate custom partitions such that all partitions have an equal number of elements, along with an index for each partition? val dRDD = sc.textFile("hdfs://master:54310/Data/input*") keyval=dRDD.map(x =>process(x.trim().split(' ').map(_.toDouble),query_norm,m,r)) Here I am loading multiple text
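
A minimal sketch of the common levers, assuming the same input glob: ask textFile() for a minimum split count, use repartition() to spread elements roughly evenly, and use mapPartitionsWithIndex to tag each partition with its index.

    val dRDD = sc.textFile("hdfs://master:54310/Data/input*", minPartitions = 8)
    println(dRDD.partitions.length)

    // repartition() shuffles the data into exactly N partitions of roughly equal size
    val even = dRDD.repartition(8)

    // Expose the partition index alongside the elements of that partition
    val indexed = even.mapPartitionsWithIndex { (idx, it) =>
      it.map(line => (idx, line))
    }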

Spark: Not enough space to cache RDD in container while still a lot of total storage memory

Submitted by 耗尽温柔 on 2019-12-05 04:20:19
Question: I have a 30-node cluster; each node has 32 cores and 240 GB of memory (AWS cr1.8xlarge instances). I have the following configuration: --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 I can see from the job tracker that I still have a lot of total storage memory left, but in one of the containers I got the following message saying Storage limit = 28.3 GB. I am wondering where this 28.3 GB came from. My memoryFraction for storage is 0.45 And how
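
A hedged back-of-the-envelope check, assuming the legacy (pre-1.6) static memory manager: per-executor storage space is roughly executor memory x spark.storage.memoryFraction x spark.storage.safetyFraction (default 0.9), which lines up with the reported limit.

    // Sketch of the legacy storage-memory formula; the 70g and 0.45 come from the
    // question, the 0.9 safety fraction is the documented default and an assumption here
    val executorMemoryGb = 70.0
    val memoryFraction   = 0.45   // spark.storage.memoryFraction
    val safetyFraction   = 0.9    // spark.storage.safetyFraction (default)

    val storageLimitGb = executorMemoryGb * memoryFraction * safetyFraction
    // 70 * 0.45 * 0.9 = 28.35 GB, matching the "Storage limit = 28.3 GB" message
    // (the real figure is derived from the JVM's reported max heap, so it runs slightly lower)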

Compare data in two RDD in spark

Submitted by 半城伤御伤魂 on 2019-12-05 01:57:54
Question: I am able to print the data in two RDDs with the code below. usersRDD.foreach(println) empRDD.foreach(println) I need to compare the data in the two RDDs. How can I iterate and compare field data in one RDD with field data in another RDD? E.g.: iterate over the records and check if name and age in usersRDD have a matching record in empRDD; if not, put them in a separate RDD. I tried usersRDD.subtract(empRDD) but it was comparing all the fields. Answer 1: You need to key the data in each RDD so that there is something to
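
A minimal sketch of the keying idea; the (name, age, dept) tuple layout is a guess at the poster's schema, not known from the question.

    // Key both RDDs on (name, age) so only those fields drive the comparison
    val keyedUsers = usersRDD.map { case (name, age, dept) => ((name, age), dept) }
    val keyedEmps  = empRDD.map  { case (name, age, dept) => ((name, age), dept) }

    // Users whose (name, age) has no matching record in empRDD go into their own RDD
    val unmatched = keyedUsers.subtractByKey(keyedEmps)

    // Matching pairs, if needed, via an inner join on the same key
    val matched = keyedUsers.join(keyedEmps)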

How to convert a case-class-based RDD into a DataFrame?

Submitted by 青春壹個敷衍的年華 on 2019-12-05 01:53:23
The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass) , but my DataFrame ends up empty. Here's my Scala code: // sc is the SparkContext, while sqlContext is the SQLContext. // Define the case class and raw data case class Dog(name: String) val data = Array( Dog("Rex"), Dog("Fido") ) // Create an RDD from the raw data val dogRDD = sc.parallelize(data) // Print the RDD for debugging (this works, shows 2 dogs) dogRDD.collect().foreach(println) // Create
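
A minimal sketch of the usual working pattern, assuming spark-shell or any context where sc and sqlContext exist, and with Dog defined at the top level rather than inside a method (defining the case class inside a method is a commonly reported cause of schema-inference problems): rely on the implicits-based toDF() and inspect the result.

    // Define the case class at the top level, then bring in the SQLContext implicits
    case class Dog(name: String)

    val dogRDD = sc.parallelize(Seq(Dog("Rex"), Dog("Fido")))

    import sqlContext.implicits._
    val dogDF = dogRDD.toDF()

    dogDF.printSchema()   // should show a single string column: name
    dogDF.show()          // should list both rows, Rex and Fido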