rdd

DataFrame & Dataset

Submitted by ぃ、小莉子 on 2019-12-06 16:20:59
Background of DataFrame: the DataFrame was not invented by Spark SQL; it existed much earlier in R and in Pandas. The Spark RDD API and the MapReduce API give the big-data ecosystem simple, easy-to-use APIs based on general-purpose languages (Java, Python, Scala, etc.), and Spark processing needs very little code. R/Pandas, by contrast, are strongly limited: they only support single-machine processing.

DataFrame overview: a Dataset is a distributed collection of data. A DataFrame is a Dataset organized into columns (column name, column type, column value), a distributed dataset in which each column is given a name. It can be understood as a table in a relational database, and it provides abstractions for querying, filtering, aggregation and other processing. R and Pandas handle small data on a single machine; the DataFrame brings that experience to a distributed big-data platform. Before Spark 1.3 it was called SchemaRDD; from 1.3 on it was renamed DataFrame.

DataFrame vs. RDD: with an RDD, Java/Scala code runs on the JVM while Python code runs in the Python runtime; with a DataFrame, Java/Scala/Python are all converted into the same logical plan (Logical Plan).

Basic DataFrame API operations: a local file is used here, the data under the Spark directory used earlier, from the server path /home/hadoop/app/spark-2.2.0-bin-hadoop2.6
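The excerpt is cut off before the actual API walk-through. As a minimal sketch of the kind of basic DataFrame operations the post refers to, assuming Spark 2.2 and the people.json sample that ships under the Spark examples directory (the exact path is an assumption based on the excerpt):

import org.apache.spark.sql.SparkSession

object DataFrameApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate()

    // Read a local JSON file into a DataFrame (path is illustrative).
    val peopleDF = spark.read.json(
      "file:///home/hadoop/app/spark-2.2.0-bin-hadoop2.6/examples/src/main/resources/people.json")

    peopleDF.printSchema()                            // inspect column names and types
    peopleDF.show()                                   // print the first rows
    peopleDF.select("name").show()                    // column projection
    peopleDF.filter(peopleDF.col("age") > 19).show()  // row filtering
    peopleDF.groupBy("age").count().show()            // aggregation

    spark.stop()
  }
}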

Spark Kafka Streaming CommitAsync Error [duplicate]

Submitted by 无人久伴 on 2019-12-06 16:12:55
Question: This question already has an answer here: Exception while accessing KafkaOffset from RDD (1 answer). Closed last year. I am new to Scala and the RDD concept. I am reading messages from Kafka with the Kafka streaming API in Spark and trying to commit the offsets after the business logic finishes, but I am getting an error. Note: I am using repartition for parallel work. How do I read the offsets from the stream API and commit them back to Kafka? scalaVersion := "2.11.8" val sparkVersion = "2.2.0" val connectorVersion = "2.0.7" val kafka_stream_version = "1.6
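The excerpt stops before the code, but the usual cause of this error is casting a repartitioned RDD to HasOffsetRanges. A minimal sketch of the documented commit pattern for the spark-streaming-kafka-0-10 integration, assuming stream is the direct stream created by KafkaUtils.createDirectStream (the rest of the setup is not shown in the excerpt):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // The HasOffsetRanges cast only works on the RDD that comes straight from the
  // direct stream, so capture the offsets BEFORE repartitioning.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.repartition(8)            // parallel business work on the repartitioned data
     .foreach(record => ())     // placeholder for the actual processing

  // Commit asynchronously against the original stream once the work is done.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}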

How to access lookup(broadcast) RDD(or dataset) into other RDD map function

Submitted by ▼魔方 西西 on 2019-12-06 15:04:14
Question: I am new to Spark and Scala and have just started learning. I am using Spark 1.0.0 on CDH 5.1.3. I have an RDD named dbTableKeyValueMap: RDD[(String, String)] that I want to use as a lookup table against my fileRDD (each row has 300+ columns). This is the code: val get = fileRDD.map({x => val tmp = dbTableKeyValueMap.lookup(x) tmp }) Running this locally hangs, and/or after some time gives the error: scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions
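RDD operations cannot be nested, so calling lookup on one RDD inside another RDD's map function fails on the executors. The usual fix is to collect the small lookup RDD into a plain Map on the driver and broadcast it. A minimal sketch under that assumption (variable names follow the question, and sc is assumed to be the SparkContext):

// Collect the small key/value RDD to the driver and broadcast it as a plain Map.
val lookupMap = sc.broadcast(dbTableKeyValueMap.collectAsMap())

// Use the broadcast value inside the map function instead of calling lookup on an RDD.
val get = fileRDD.map { x =>
  lookupMap.value.getOrElse(x, "")   // default when the key is missing
}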

Get Top 3 values for every key in an RDD in Spark

Submitted by 夙愿已清 on 2019-12-06 12:56:01
I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format: (key, String, value). So imagine I had an RDD with content like this: [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9), ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)] I can currently display the top 3 values in the whole RDD like so: ("K1", "ddd", 9) ("B1", "iop", 8) ("B1", "rty", 7) using: top3RDD = rdd.takeOrdered(3, key = lambda x: x[2]) Instead what
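The question uses PySpark, but the per-key top-3 idea is the same in any RDD API. A sketch in Scala, assuming rdd holds (key, name, value) triples and each key has few enough values for a group-and-sort to be acceptable:

// Group the (key, name, value) triples by key, then keep the 3 largest values per key.
val top3PerKey = rdd
  .groupBy { case (key, _, _) => key }
  .flatMapValues(values => values.toList.sortBy { case (_, _, v) => -v }.take(3))
  .values

For keys with very many values, aggregateByKey with a small bounded buffer per key avoids materializing every value in memory.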

Apache Spark: Join two RDDs with different partitioners

Submitted by 一曲冷凌霜 on 2019-12-06 12:48:01
Question: I have two RDDs with different partitioners. case class Person(name: String, age: Int, school: String) case class School(name: String, address: String) rdd1 is the RDD of Person, which I have partitioned by the person's age and then re-keyed by school: val rdd1: RDD[Person] = rdd1.keyBy(person => (person.age, person)) .partitionBy(new HashPartitioner(10)) .mapPartitions(persons => persons.map{case(age,person) => (person.school, person) }) rdd2 is the RDD of School
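The excerpt cuts off before the actual question, but the usual way to keep such a join cheap is to key both RDDs by the same field and co-partition them with one partitioner. A minimal sketch, assuming persons: RDD[Person] and schools: RDD[School] before any keying (names here are illustrative, not the question's variables):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Key both sides by the same field (the school name) and co-partition them,
// so the join can reuse the partitioning instead of re-shuffling both RDDs.
val partitioner = new HashPartitioner(10)

val personsBySchool: RDD[(String, Person)] =
  persons.keyBy(_.school).partitionBy(partitioner)

val schoolsByName: RDD[(String, School)] =
  schools.keyBy(_.name).partitionBy(partitioner)

val joined: RDD[(String, (Person, School))] = personsBySchool.join(schoolsByName)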

It's already 9102, and we're still talking about Hadoop?

Submitted by 别来无恙 on 2019-12-06 12:19:02
Hadoop burst onto the scene as a big-data technology, and after years of development "Hadoop" no longer refers to a single technology but to a complete big-data ecosystem. At its core Hadoop is a distributed system: a single machine cannot store and process big data, so the data has to be spread across different machines while still letting users access and operate on it as if it all lived on one machine. To accomplish this, Hadoop originally introduced two concepts: HDFS and MapReduce.

HDFS is the distributed storage layer. Its job is to store a large dataset on a cluster of many machines, with each machine holding part of the data. Suppose the left side is the dataset we want to store; the HDFS cluster contains the storage nodes, i.e. Data Node 1, 2 and 3 on the right, plus a Name Node that records where each data block lives. If we now need to access the blue block and the green block, the steps are:

1. The client sends a request to the Name Node for the locations of the blue and green blocks.
2. The Name Node returns the addresses of Data Node 1 and Data Node 2.
3. The client accesses Data Node 1 and Data Node 2.

If we want to add new data to the cluster, the steps are:

1. The client sends a write request to the Name Node.
2. The Name Node acknowledges the request and returns a Data Node address.
3. The client writes the data to that address, and the machine returns a confirmation once the write succeeds.
4. The client sends a confirmation to the Name Node.

As you can see, the most critical node in the whole cluster is the Name
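The post is cut off here, but the point about accessing distributed data "as if it were on one machine" is exactly what the HDFS client API hides: the client asks the Name Node where blocks live, then reads and writes against the Data Nodes directly. A minimal sketch, assuming a reachable NameNode at hdfs://namenode:8020 (the address and path are illustrative):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    // One handle for the whole cluster; block placement and lookup stay hidden.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

    // Write a small file (the NameNode decides which DataNodes store its blocks).
    val out = fs.create(new Path("/tmp/hello.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    // Read it back as if it were a local file.
    val in = fs.open(new Path("/tmp/hello.txt"))
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    println(new String(buf, 0, n, "UTF-8"))
    in.close()

    fs.close()
  }
}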

Spark Accumulator value not read by task

Submitted by 三世轮回 on 2019-12-06 12:00:19
Question: I am initializing an accumulator: final Accumulator<Integer> accum = sc.accumulator(0); Then, inside a map function, I try to increment the accumulator and use its value to set a field: JavaRDD<UserSetGet> UserProfileRDD1 = temp.map(new Function<String, UserSetGet>() { @Override public UserSetGet call(String arg0) throws Exception { UserSetGet usg = new UserSetGet(); accum.add(1); usg.setPid(accum.value().toString()); return usg; } }); But I'm getting the following error. 16
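An accumulator's value can only be read on the driver; inside a task the executors may only add to it, and calling value() there raises an exception. If the goal is simply to give each record a unique id, zipWithUniqueId (or zipWithIndex) does this without an accumulator. A sketch of that alternative, written in Scala for brevity and assuming temp is an RDD[String] and UserSetGet/setPid are the question's own types:

// Each record gets a cluster-wide unique Long id, no accumulator involved.
val withIds = temp.zipWithUniqueId().map { case (line, id) =>
  val usg = new UserSetGet()
  usg.setPid(id.toString)
  usg
}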

Convert RDD of Array(Row) to RDD of Row?

Submitted by 我只是一个虾纸丫 on 2019-12-06 11:44:40
I have data like this in a file and I'd like to do some statistics using Spark. File content: aaa|bbb|ccc ddd|eee|fff|ggg I need to assign each line an id. I read the lines as an RDD and use zipWithIndex(). Then they should look like: (0, aaa|bbb|ccc) (1, ddd|eee|fff|ggg) I need to associate each string with its line id. I can get an RDD of Array(Row), but I can't get out of the array. How should I modify my code? import org.apache.spark.sql.{Row, SparkSession} val fileRDD = spark.sparkContext.textFile(filePath) val fileWithIdRDD = fileRDD.zipWithIndex() // make the line like this: (0, aaa), (0, bbb), (0,
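The excerpt stops mid-code, but using flatMap instead of map is what lets each line emit several Row objects and removes the Array nesting. A minimal sketch along those lines (note that zipWithIndex returns (line, index) pairs):

import org.apache.spark.sql.Row

val fileRDD = spark.sparkContext.textFile(filePath)

// flatMap flattens the per-line arrays, so the result is RDD[Row], not RDD[Array[Row]].
val rowRDD = fileRDD.zipWithIndex().flatMap { case (line, id) =>
  line.split("\\|").map(token => Row(id, token))
}
// rowRDD now holds Row(0, "aaa"), Row(0, "bbb"), Row(0, "ccc"), Row(1, "ddd"), ...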

Find average by department in Spark groupBy in Java 1.8

Submitted by 流过昼夜 on 2019-12-06 10:57:15
I have the data set below, where the first column is the department and the second is the salary. I want to calculate the average salary by department. IT 2000000 HR 2000000 IT 1950000 HR 2200000 Admin 1900000 IT 1900000 IT 2200000 I performed the operation below JavaPairRDD<String, Iterable<Long>> rddY = employees.groupByKey(); System.out.println("<=========================RDDY collect==================>" + rddY.collect()); and got the following output: <=========================RDDY collect==================>[(IT,[2000000, 1950000, 1900000, 2200000]), (HR,[2000000, 2200000]), (Admin,[1900000])] What I need is I want to
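The question asks for Java 1.8, but the approach is the same in any RDD API: sum and count per key in one pass, then divide, which also avoids groupByKey. A sketch in Scala, assuming employees is an RDD of (department, salary) pairs:

// Average per key without groupByKey: accumulate (sum, count), then divide.
val avgByDept = employees
  .mapValues(salary => (salary, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

avgByDept.collect().foreach { case (dept, avg) => println(s"$dept -> $avg") }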

Parsing multiline records in Scala

Submitted by 南楼画角 on 2019-12-06 09:47:05
Here is my RDD[String]: M1 module1 PIP a Z A PIP b Z B PIP c Y n4 M2 module2 PIP a I n4 PIP b O D PIP c O n5 and so on. Basically, I need an RDD whose key is the second word of each record's first line (e.g. module1) and whose values are the subsequent PIP lines, so that they can be iterated over. I've tried the following val usgPairRDD = usgRDD.map(x => (x.split("\\n")(0), x)) but this gives me the following output: (,) (M1 module1,M1 module1) (PIP a Z A,PIP a Z A) (PIP b Z B,PIP b Z B) (PIP c Y n4,PIP c Y n4) (,) (M2 module2,M2 module2) (PIP a I n4,PIP a I n4) (PIP b O D,PIP b O D) (PIP c O n5,PIP c O n5) Instead, I'd like the output to
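The output shows that each RDD element is already a single line (textFile splits on newlines), so split("\\n") can never group a whole record. One common workaround is to read whole files and split them into multi-line records first; a sketch under the assumptions that the files are small enough for wholeTextFiles and that every record starts with a line beginning "M" plus a digit (both assumptions, based only on the sample shown):

// Read whole files so records are not broken up line by line.
val records = spark.sparkContext.wholeTextFiles(filePath).flatMap { case (_, content) =>
  // A new record starts at a line whose first token looks like M1, M2, ...
  content.split("\n(?=M\\d)").map(_.trim).filter(_.nonEmpty)
}

// Key each record by the second word of its first line; values are the PIP lines.
val usgPairRDD = records.map { record =>
  val lines = record.split("\n")
  val key = lines.head.split("\\s+")(1)               // e.g. "module1"
  val pipLines = lines.tail.filter(_.startsWith("PIP"))
  (key, pipLines.toSeq)
}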