rdd

DataFrame & Dataset

Submitted by ぃ、小莉子 on 2019-12-06 16:20:59
Background of DataFrame: the DataFrame was not invented by Spark SQL; it existed much earlier in R and in Pandas. The Spark RDD API and the MapReduce API give the big-data ecosystem simple, easy-to-use APIs based on general-purpose languages (Java, Python, Scala, etc.), and Spark processing needs very little code. R/Pandas, by contrast, are strongly limited: they only support single-machine processing.

DataFrame overview: a Dataset is a distributed collection of data. A DataFrame is a Dataset organized into columns (column name, column type, column value), a distributed dataset in which each column is given a name. It can be understood as a table in a relational database, and it provides abstractions for querying, filtering, aggregation and other processing. R and Pandas handle small data on a single machine; the DataFrame brings that experience to a distributed big-data platform. Before Spark 1.3 it was called SchemaRDD; from 1.3 on it was renamed DataFrame.

DataFrame vs. RDD: with an RDD, Java/Scala code runs on the JVM while Python code runs in the Python runtime; with a DataFrame, Java/Scala/Python are all converted into the same logical plan (Logical Plan).

Basic DataFrame API operations: a local file is used here, the data under the Spark directory used earlier, from the server path /home/hadoop/app/spark-2.2.0-bin-hadoop2.6
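The excerpt is cut off before the actual API walk-through. As a minimal sketch of the kind of basic DataFrame operations the post refers to, assuming Spark 2.2 and the people.json sample that ships under the Spark examples directory (the exact path is an assumption based on the excerpt):

import org.apache.spark.sql.SparkSession

object DataFrameApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate()

    // Read a local JSON file into a DataFrame (path is illustrative).
    val peopleDF = spark.read.json(
      "file:///home/hadoop/app/spark-2.2.0-bin-hadoop2.6/examples/src/main/resources/people.json")

    peopleDF.printSchema()                            // inspect column names and types
    peopleDF.show()                                   // print the first rows
    peopleDF.select("name").show()                    // column projection
    peopleDF.filter(peopleDF.col("age") > 19).show()  // row filtering
    peopleDF.groupBy("age").count().show()            // aggregation

    spark.stop()
  }
}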

Spark Kafka Streaming CommitAsync Error [duplicate]

Submitted by 无人久伴 on 2019-12-06 16:12:55
Question: This question already has an answer here: Exception while accessing KafkaOffset from RDD (1 answer). Closed last year. I am new to Scala and the RDD concept. I am reading messages from Kafka with the Kafka streaming API in Spark and trying to commit the offsets after the business logic finishes, but I am getting an error. Note: I am using repartition for parallel work. How do I read the offsets from the stream API and commit them back to Kafka? scalaVersion := "2.11.8" val sparkVersion = "2.2.0" val connectorVersion = "2.0.7" val kafka_stream_version = "1.6
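The excerpt stops before the code, but the usual cause of this error is casting a repartitioned RDD to HasOffsetRanges. A minimal sketch of the documented commit pattern for the spark-streaming-kafka-0-10 integration, assuming stream is the direct stream created by KafkaUtils.createDirectStream (the rest of the setup is not shown in the excerpt):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // The HasOffsetRanges cast only works on the RDD that comes straight from the
  // direct stream, so capture the offsets BEFORE repartitioning.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.repartition(8)            // parallel business work on the repartitioned data
     .foreach(record => ())     // placeholder for the actual processing

  // Commit asynchronously against the original stream once the work is done.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}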

How to access lookup(broadcast) RDD(or dataset) into other RDD map function

Submitted by ▼魔方 西西 on 2019-12-06 15:04:14
Question: I am new to Spark and Scala and have just started learning. I am using Spark 1.0.0 on CDH 5.1.3. I have an RDD named dbTableKeyValueMap: RDD[(String, String)] that I want to use as a lookup table against my fileRDD (each row has 300+ columns). This is the code: val get = fileRDD.map({x => val tmp = dbTableKeyValueMap.lookup(x) tmp }) Running this locally hangs, and/or after some time gives the error: scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions
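RDD operations cannot be nested, so calling lookup on one RDD inside another RDD's map function fails on the executors. The usual fix is to collect the small lookup RDD into a plain Map on the driver and broadcast it. A minimal sketch under that assumption (variable names follow the question, and sc is assumed to be the SparkContext):

// Collect the small key/value RDD to the driver and broadcast it as a plain Map.
val lookupMap = sc.broadcast(dbTableKeyValueMap.collectAsMap())

// Use the broadcast value inside the map function instead of calling lookup on an RDD.
val get = fileRDD.map { x =>
  lookupMap.value.getOrElse(x, "")   // default when the key is missing
}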

Get Top 3 values for every key in an RDD in Spark

Submitted by 夙愿已清 on 2019-12-06 12:56:01
I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format: (key, String, value). So imagine I had an RDD with content like this: [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9), ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)] I can currently display the top 3 values in the whole RDD like so: ("K1", "ddd", 9) ("B1", "iop", 8) ("B1", "rty", 7) using: top3RDD = rdd.takeOrdered(3, key = lambda x: x[2]) Instead what
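The question uses PySpark, but the per-key top-3 idea is the same in any RDD API. A sketch in Scala, assuming rdd holds (key, name, value) triples and each key has few enough values for a group-and-sort to be acceptable:

// Group the (key, name, value) triples by key, then keep the 3 largest values per key.
val top3PerKey = rdd
  .groupBy { case (key, _, _) => key }
  .flatMapValues(values => values.toList.sortBy { case (_, _, v) => -v }.take(3))
  .values

For keys with very many values, aggregateByKey with a small bounded buffer per key avoids materializing every value in memory.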

Apache Spark: Join two RDDs with different partitioners

Submitted by 一曲冷凌霜 on 2019-12-06 12:48:01
Question: I have two RDDs with different partitioners. case class Person(name: String, age: Int, school: String) case class School(name: String, address: String) rdd1 is the RDD of Person, which I have partitioned by the person's age and then re-keyed by school: val rdd1: RDD[Person] = rdd1.keyBy(person => (person.age, person)) .partitionBy(new HashPartitioner(10)) .mapPartitions(persons => persons.map{case(age,person) => (person.school, person) }) rdd2 is the RDD of School
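The excerpt cuts off before the actual question, but the usual way to keep such a join cheap is to key both RDDs by the same field and co-partition them with one partitioner. A minimal sketch, assuming persons: RDD[Person] and schools: RDD[School] before any keying (names here are illustrative, not the question's variables):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Key both sides by the same field (the school name) and co-partition them,
// so the join can reuse the partitioning instead of re-shuffling both RDDs.
val partitioner = new HashPartitioner(10)

val personsBySchool: RDD[(String, Person)] =
  persons.keyBy(_.school).partitionBy(partitioner)

val schoolsByName: RDD[(String, School)] =
  schools.keyBy(_.name).partitionBy(partitioner)

val joined: RDD[(String, (Person, School))] = personsBySchool.join(schoolsByName)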

It's already 9102, and we're still talking about Hadoop?

Submitted by 别来无恙 on 2019-12-06 12:19:02
Hadoop burst onto the scene as a big-data technology, and after years of development "Hadoop" no longer refers to a single technology but to a complete big-data ecosystem. At its core Hadoop is a distributed system: a single machine cannot store and process big data, so the data has to be spread across different machines while still letting users access and operate on it as if it all lived on one machine. To accomplish this, Hadoop originally introduced two concepts: HDFS and MapReduce.

HDFS is the distributed storage layer. Its job is to store a large dataset on a cluster of many machines, with each machine holding part of the data. Suppose the left side is the dataset we want to store; the HDFS cluster contains the storage nodes, i.e. Data Node 1, 2 and 3 on the right, plus a Name Node that records where each data block lives. If we now need to access the blue block and the green block, the steps are:

1. The client sends a request to the Name Node for the locations of the blue and green blocks.
2. The Name Node returns the addresses of Data Node 1 and Data Node 2.
3. The client accesses Data Node 1 and Data Node 2.

If we want to add new data to the cluster, the steps are:

1. The client sends a write request to the Name Node.
2. The Name Node acknowledges the request and returns a Data Node address.
3. The client writes the data to that address, and the machine returns a confirmation once the write succeeds.
4. The client sends a confirmation to the Name Node.

As you can see, the most critical node in the whole cluster is the Name
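The post is cut off here, but the point about accessing distributed data "as if it were on one machine" is exactly what the HDFS client API hides: the client asks the Name Node where blocks live, then reads and writes against the Data Nodes directly. A minimal sketch, assuming a reachable NameNode at hdfs://namenode:8020 (the address and path are illustrative):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    // One handle for the whole cluster; block placement and lookup stay hidden.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

    // Write a small file (the NameNode decides which DataNodes store its blocks).
    val out = fs.create(new Path("/tmp/hello.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()

    // Read it back as if it were a local file.
    val in = fs.open(new Path("/tmp/hello.txt"))
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    println(new String(buf, 0, n, "UTF-8"))
    in.close()

    fs.close()
  }
}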

Spark Accumulator value not read by task

Submitted by 三世轮回 on 2019-12-06 12:00:19
Question: I am initializing an accumulator: final Accumulator<Integer> accum = sc.accumulator(0); Then, inside a map function, I try to increment the accumulator and use its value to set a field: JavaRDD<UserSetGet> UserProfileRDD1 = temp.map(new Function<String, UserSetGet>() { @Override public UserSetGet call(String arg0) throws Exception { UserSetGet usg = new UserSetGet(); accum.add(1); usg.setPid(accum.value().toString()); return usg; } }); But I'm getting the following error. 16
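An accumulator's value can only be read on the driver; inside a task the executors may only add to it, and calling value() there raises an exception. If the goal is simply to give each record a unique id, zipWithUniqueId (or zipWithIndex) does this without an accumulator. A sketch of that alternative, written in Scala for brevity and assuming temp is an RDD[String] and UserSetGet/setPid are the question's own types:

// Each record gets a cluster-wide unique Long id, no accumulator involved.
val withIds = temp.zipWithUniqueId().map { case (line, id) =>
  val usg = new UserSetGet()
  usg.setPid(id.toString)
  usg
}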

Convert RDD of Array(Row) to RDD of Row?

Submitted by 我只是一个虾纸丫 on 2019-12-06 11:44:40
I have data like this in a file and I'd like to do some statistics using Spark. File content: aaa|bbb|ccc ddd|eee|fff|ggg I need to assign each line an id. I read the lines as an RDD and use zipWithIndex(). Then they should look like: (0, aaa|bbb|ccc) (1, ddd|eee|fff|ggg) I need to associate each string with its line id. I can get an RDD of Array(Row), but I can't get out of the array. How should I modify my code? import org.apache.spark.sql.{Row, SparkSession} val fileRDD = spark.sparkContext.textFile(filePath) val fileWithIdRDD = fileRDD.zipWithIndex() // make the line like this: (0, aaa), (0, bbb), (0,
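The excerpt stops mid-code, but using flatMap instead of map is what lets each line emit several Row objects and removes the Array nesting. A minimal sketch along those lines (note that zipWithIndex returns (line, index) pairs):

import org.apache.spark.sql.Row

val fileRDD = spark.sparkContext.textFile(filePath)

// flatMap flattens the per-line arrays, so the result is RDD[Row], not RDD[Array[Row]].
val rowRDD = fileRDD.zipWithIndex().flatMap { case (line, id) =>
  line.split("\\|").map(token => Row(id, token))
}
// rowRDD now holds Row(0, "aaa"), Row(0, "bbb"), Row(0, "ccc"), Row(1, "ddd"), ...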

Find average by department in Spark groupBy in Java 1.8

Submitted by 流过昼夜 on 2019-12-06 10:57:15
I have the data set below, where the first column is the department and the second is the salary. I want to calculate the average salary by department. IT 2000000 HR 2000000 IT 1950000 HR 2200000 Admin 1900000 IT 1900000 IT 2200000 I performed the operation below JavaPairRDD<String, Iterable<Long>> rddY = employees.groupByKey(); System.out.println("<=========================RDDY collect==================>" + rddY.collect()); and got the following output: <=========================RDDY collect==================>[(IT,[2000000, 1950000, 1900000, 2200000]), (HR,[2000000, 2200000]), (Admin,[1900000])] What I need is I want to
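The question asks for Java 1.8, but the approach is the same in any RDD API: sum and count per key in one pass, then divide, which also avoids groupByKey. A sketch in Scala, assuming employees is an RDD of (department, salary) pairs:

// Average per key without groupByKey: accumulate (sum, count), then divide.
val avgByDept = employees
  .mapValues(salary => (salary, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

avgByDept.collect().foreach { case (dept, avg) => println(s"$dept -> $avg") }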

Parsing multiline records in Scala

Submitted by 南楼画角 on 2019-12-06 09:47:05
Here is my RDD[String]: M1 module1 PIP a Z A PIP b Z B PIP c Y n4 M2 module2 PIP a I n4 PIP b O D PIP c O n5 and so on. Basically, I need an RDD whose key is the second word of each record's first line (e.g. module1) and whose values are the subsequent PIP lines, so that they can be iterated over. I've tried the following val usgPairRDD = usgRDD.map(x => (x.split("\\n")(0), x)) but this gives me the following output: (,) (M1 module1,M1 module1) (PIP a Z A,PIP a Z A) (PIP b Z B,PIP b Z B) (PIP c Y n4,PIP c Y n4) (,) (M2 module2,M2 module2) (PIP a I n4,PIP a I n4) (PIP b O D,PIP b O D) (PIP c O n5,PIP c O n5) Instead, I'd like the output to
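The output shows that each RDD element is already a single line (textFile splits on newlines), so split("\\n") can never group a whole record. One common workaround is to read whole files and split them into multi-line records first; a sketch under the assumptions that the files are small enough for wholeTextFiles and that every record starts with a line beginning "M" plus a digit (both assumptions, based only on the sample shown):

// Read whole files so records are not broken up line by line.
val records = spark.sparkContext.wholeTextFiles(filePath).flatMap { case (_, content) =>
  // A new record starts at a line whose first token looks like M1, M2, ...
  content.split("\n(?=M\\d)").map(_.trim).filter(_.nonEmpty)
}

// Key each record by the second word of its first line; values are the PIP lines.
val usgPairRDD = records.map { record =>
  val lines = record.split("\n")
  val key = lines.head.split("\\s+")(1)               // e.g. "module1"
  val pipLines = lines.tail.filter(_.startsWith("PIP"))
  (key, pipLines.toSeq)
}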