
Simple Usage of Spark SQL

对着背影说爱祢 · Submitted on 2019-12-24 15:32:07
I. Getting to know Spark SQL

1. What is Spark SQL? Spark SQL is a Spark module for processing structured data; the core abstraction it provides is the DataFrame.

2. What does Spark SQL do? It provides a programming abstraction (the DataFrame) and serves as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

3. How it runs: Spark SQL is translated into RDD operations, which are then submitted to the cluster for execution.

4. Features: easy to integrate, a unified data-access approach, Hive compatibility, and standard data connectivity.

5. SparkSession: SparkSession is a new concept introduced in Spark 2.0. It gives users a single, unified entry point to Spark's features. In early Spark versions, SparkContext was the main entry point: since the RDD was the primary API, RDDs were created and manipulated through a SparkContext, and every other API needed its own context. For example, Streaming required a StreamingContext, SQL a SQLContext, and Hive a HiveContext. As the Dataset and DataFrame APIs gradually became the standard, they needed an entry point of their own, so Spark 2.0 introduced SparkSession as that unified entry point.
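A minimal Scala sketch of that unified entry point, assuming a local run and a hypothetical people.json file with name and age fields:

```scala
import org.apache.spark.sql.SparkSession

// One SparkSession replaces the separate SQLContext/HiveContext entry points
val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")               // local testing only; drop this when submitting to a cluster
  .getOrCreate()

// Build a DataFrame from a structured data file (hypothetical path)
val people = spark.read.json("people.json")
people.printSchema()
people.createOrReplaceTempView("people")

// The same session also serves as the distributed SQL query engine
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()
```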

Spark (12): Simple Usage of SparkSQL

谁说胖子不能爱 · Submitted on 2019-12-24 15:31:49
I. The evolution of Spark SQL

Before 1.0: Shark
From 1.1.x: Spark SQL (experimental only)
1.3.x: Spark SQL (official release) + DataFrame
1.5.x: Spark SQL with Project Tungsten
1.6.x: Spark SQL + DataFrame + Dataset (preview)
2.x: Spark SQL + DataFrame + Dataset (official release), plus further optimizations and Structured Streaming (built on Dataset)

Spark on Hive vs. Hive on Spark:
Spark on Hive: Hive acts only as storage; Spark handles SQL parsing, optimization, and execution.
Hive on Spark: Hive handles both storage and SQL parsing/optimization; Spark handles execution.

II. Getting to know SparkSQL

2.1 What is SparkSQL? Spark SQL is a Spark module for processing structured data; the core programming abstraction it provides is the DataFrame.

2.2 What SparkSQL does: it provides a programming abstraction (the DataFrame) and serves as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

2.3 How it runs: Spark SQL is translated into RDDs, which are then submitted to the cluster for execution.
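As a hedged illustration of the "Spark on Hive" arrangement above, a short Scala sketch in which Hive only supplies the metastore and storage while Spark parses, optimizes, and executes the query; the database and table names are made up, and a reachable Hive metastore is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Spark on Hive: Hive provides storage and the metastore only;
// Spark does the SQL parsing, optimization, and execution.
val spark = SparkSession.builder()
  .appName("SparkOnHive")
  .enableHiveSupport()   // needs hive-site.xml (metastore location) on the classpath
  .getOrCreate()

// Hypothetical Hive table; the whole query is planned and run by Spark
spark.sql("SELECT dept, count(*) AS cnt FROM mydb.employees GROUP BY dept").show()
```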

Basic Concepts and Usage of Spark SQL

非 Y 不嫁゛ · Submitted on 2019-12-24 15:31:17
1. Spark SQL overview

1.1 What is Spark SQL: Spark SQL is the Spark module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and serves as a distributed SQL query engine. The relationship between RDD, DataFrame, and Dataset can be seen in the figure below.

1.2 Why learn Spark SQL: Hive converts Hive SQL into MapReduce jobs and submits them to the cluster for execution, which greatly simplifies writing MapReduce programs, but the MapReduce computation model is rather slow. By analogy, Spark SQL converts SQL into RDDs and submits them to the cluster for execution, which is very fast!

2. DataFrames

2.1 What is a DataFrame: Like an RDD, a DataFrame is a distributed data container. A DataFrame, however, is more like a two-dimensional table in a traditional database: besides the data itself, it also records the data's structure, i.e. its schema. Like Hive, DataFrames support nested data types (struct, array, and map). In terms of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Because it resembles the DataFrames of R and Pandas, the Spark DataFrame carries over the development experience of traditional single-machine data analysis nicely. 2.2 …
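A minimal sketch of the schema point above, assuming a hypothetical Person case class: the same records as a plain RDD carry no structure, while converting them to a DataFrame attaches a schema:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DataFrameSchema").master("local[*]").getOrCreate()
import spark.implicits._

// An RDD is just a distributed collection of objects; Spark knows nothing about their fields
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))

// A DataFrame records the structure (schema) alongside the data
val df = rdd.toDF()
df.printSchema()                 // name: string, age: int
df.filter($"age" > 30).show()    // relational-style operation on named columns
```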

Spark GraphX - How can I read from a JSON file in Spark and create a graph from the data?

家住魔仙堡 · Submitted on 2019-12-24 14:17:16
Question: I'm new to Spark and Scala, and I am trying to read a bunch of Twitter data from a JSON file and turn it into a graph where a vertex represents a tweet and edges connect tweets that are re-tweets of the originally posted item. So far I have managed to read from the JSON file and figure out the schema of my RDD. Now I believe I need to somehow take the data from the SchemaRDD object and create an RDD for the vertices and an RDD for the edges. Is this the way to approach this, or is …
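This is not the thread's answer, only a hedged sketch of one way the question could be approached with the DataFrame API and GraphX: read the JSON into a DataFrame, then derive a vertex RDD keyed by tweet id and an edge RDD linking each retweet to its original. The field names id and retweeted_status_id are assumptions about the JSON layout:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TweetGraph").getOrCreate()

// Assumed layout per JSON record: {"id": ..., "text": ..., "retweeted_status_id": ...}
val tweets = spark.read.json("tweets.json")

// Vertices: one per tweet, keyed by its numeric id, carrying the tweet text
val vertices = tweets.select("id", "text").rdd
  .map(r => (r.getLong(0), r.getString(1)))

// Edges: retweet -> original, only for records that actually are retweets
val edges = tweets.select("id", "retweeted_status_id").rdd
  .filter(r => !r.isNullAt(1))
  .map(r => Edge(r.getLong(0), r.getLong(1), "retweet"))

val graph = Graph(vertices, edges)
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")
```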

The reduceByKey function in Spark

丶灬走出姿态 · Submitted on 2019-12-24 13:59:11
Question: I've read somewhere that for operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master. Which means that I have to declare a partitioner, like: val sc = new SparkContext(...) val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...") .partitionBy(new HashPartitioner(100)) // …
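A hedged, self-contained sketch of the pattern the quoted passage describes: hash-partition the pair RDD once and persist it, so a later reduceByKey that keeps the same partitioner needs no further shuffle. The tiny in-memory data set here stands in for the sequenceFile from the question:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PrePartitioned").setMaster("local[*]"))

// Hypothetical pair RDD standing in for sc.sequenceFile[UserID, UserInfo](...)
val events = sc.parallelize(Seq(("u1", 3), ("u2", 1), ("u1", 5), ("u3", 2)))

// Partition by key once and persist, so the partitioning can be reused
val partitioned = events.partitionBy(new HashPartitioner(100)).persist()

// reduceByKey reuses the existing partitioner: values for each key are already
// co-located, so this step adds no extra shuffle
val totals = partitioned.reduceByKey(_ + _)
totals.collect().foreach(println)
```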

Doing reduceByKey on each partition of RDD separately without aggregating results

柔情痞子 · Submitted on 2019-12-24 12:09:59
Question: I have an RDD partitioned across the cluster and I want to run reduceByKey on each partition separately. I don't want the results of reduceByKey from different partitions to be merged together; I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster. The code below does not work, but I want something like this: myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)}) How can I achieve this? Answer 1: You could try something like myPairedRDD.mapPartitions(iter => iter.groupBy(_._1) …
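For what it's worth, a hedged sketch of one way to reduce inside each partition without any shuffle: fold the partition's iterator into a local map inside mapPartitions, so duplicate keys living in different partitions are deliberately kept separate. The small local data set is only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

val sc = new SparkContext(new SparkConf().setAppName("PerPartitionReduce").setMaster("local[*]"))

val myPairedRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 2)

// Reduce each partition locally; results from different partitions are never merged
val perPartition = myPairedRDD.mapPartitions { iter =>
  val acc = mutable.HashMap.empty[String, Int]
  iter.foreach { case (k, v) => acc(k) = acc.getOrElse(k, 0) + v }
  acc.iterator
}

perPartition.collect().foreach(println)
```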

How to do range lookup and search in PySpark

半城伤御伤魂 · Submitted on 2019-12-24 08:59:58
Question: I am trying to write a PySpark function that can do a combined search and lookup of values within a range. The following is a detailed description. I have two data sets. One data set, say D1, is basically a lookup table:

MinValue  MaxValue  Value1  Value2
-----------------------------------
1         1000      0.5     0.6
1001      2000      0.8     0.1
2001      4000      0.2     0.5
4001      9000      0.04    0.06

The other data set, say D2, is a table with millions of records, for example:

ID  InterestsRate  Days
-----------------------…
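The question asks for PySpark; to stay consistent with the Scala used elsewhere in this section, here is a hedged sketch of the same range-lookup idea with the DataFrame API (a non-equi join of the Days column against the [MinValue, MaxValue] interval), which carries over to PySpark almost line for line. The choice of Days as the lookup key is an assumption:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RangeLookup").master("local[*]").getOrCreate()
import spark.implicits._

// D1: the lookup table with [MinValue, MaxValue] ranges
val d1 = Seq(
  (1,    1000, 0.5,  0.6),
  (1001, 2000, 0.8,  0.1),
  (2001, 4000, 0.2,  0.5),
  (4001, 9000, 0.04, 0.06)
).toDF("MinValue", "MaxValue", "Value1", "Value2")

// D2: the large table (a tiny hypothetical sample here)
val d2 = Seq((1, 0.13, 365), (2, 0.20, 1500)).toDF("ID", "InterestsRate", "Days")

// Range (non-equi) join: each D2 row picks up the D1 row whose range contains Days
val joined = d2.join(d1, $"Days" >= $"MinValue" && $"Days" <= $"MaxValue", "left")
joined.show()
```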

java.lang.IllegalArgumentException: requirement failed: Columns not found in Double

纵然是瞬间 · Submitted on 2019-12-24 08:46:56
Question: I am working in Spark. I have many CSV files that contain lines; a line looks like this: 2017,16,16,51,1,1,4,-79.6,-101.90,-98.900 It can contain more or fewer fields, depending on the CSV file. Each file corresponds to a Cassandra table, into which I need to insert all the lines the file contains, so what I basically do is take the line, split its elements, and put them in a List[Double]: sc.stop import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org…
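Not the answer from the thread, just a hedged sketch of the parsing step described above together with one common way of writing row-shaped values via the DataStax connector. The keyspace, table, column names, host, and field layout are all assumptions; the connector generally wants a tuple or case class per row rather than a bare List[Double]:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CsvToCassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical host
val sc = new SparkContext(conf)

// Split each CSV line and parse the numeric fields
val lines = sc.textFile("data/*.csv")
val rows = lines.map { line =>
  val f = line.split(",").map(_.trim)
  // Hypothetical layout: first two fields as Ints, the rest summed as Doubles
  (f(0).toInt, f(1).toInt, f.drop(2).map(_.toDouble).sum)
}

// Hypothetical keyspace/table/columns; each tuple element maps to one column
rows.saveToCassandra("my_ks", "my_table", SomeColumns("year", "week", "total"))
```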

Tuple-size limit in an RDD; reading the RDD throws ArrayIndexOutOfBoundsException

老子叫甜甜 · Submitted on 2019-12-24 07:41:19
Question: I tried converting a DataFrame to an RDD for a table containing 25 columns. I then learned that Scala (up to 2.11.8) limits tuples to a maximum of 22 elements. val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/") rdd: org.apache.spark.rdd.RDD[String] = /user/hive/warehouse/myDB.db/myTable/ MapPartitionsRDD[3] at textFile at <console>:24 Sample data: [2017-02-26, 100052-ACC, 100052, 3260, 1005, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0…
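A hedged sketch of the usual workaround for the 22-element tuple limit: skip tuples altogether and build the DataFrame from Rows plus an explicit StructType. The three columns here are stand-ins for the 25-column table in the question, and the field names are made up:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("WideTable").master("local[*]").getOrCreate()

// Rows have no arity limit, unlike Scala tuples (max 22 elements)
val raw = spark.sparkContext.textFile("/user/hive/warehouse/myDB.db/myTable/")
val rowRDD = raw.map { line =>
  val f = line.split(",").map(_.trim)
  Row(f(0), f(1), f(2).toDouble)        // extend with as many fields as the table has
}

// Explicit schema instead of tuple-based inference
val schema = StructType(Seq(
  StructField("event_date", StringType),
  StructField("account", StringType),
  StructField("amount", DoubleType)
))

val df = spark.createDataFrame(rowRDD, schema)
df.printSchema()
```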

Converting a pipe-delimited file to a Spark DataFrame and then to a CSV file

送分小仙女□ · Submitted on 2019-12-24 06:44:34
Question: I have a CSV file with one single column, and the rows are defined as follows:

123 || food || fruit
123 || food || fruit || orange
123 || food || fruit || apple

I want to create a CSV file with a single column and distinct row values, like:

orange
apple

I tried the following code:

val data = sc.textFile("fruits.csv")
val rows = data.map(_.split("||"))
val rddnew = rows.flatMap( arr => {
  val text = arr(0)
  val words = text.split("||")
  words.map( word => ( word, text ) )
} )

But this code…
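Not the thread's accepted answer, but a likely culprit in the code above: String.split takes a regular expression, and "||" is an empty alternation that splits between every character. Escaping the pipes makes the delimiter literal; a hedged sketch of the distinct-last-field idea (the output path is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PipeSplit").setMaster("local[*]"))

val data = sc.textFile("fruits.csv")

// split takes a regex: "\\|\\|" matches the literal "||" delimiter
val distinctLast = data
  .map(_.split("\\|\\|").map(_.trim))
  .filter(_.length == 4)            // keep only rows that actually have a fourth field
  .map(parts => parts(3))           // e.g. "orange", "apple"
  .distinct()

distinctLast.saveAsTextFile("distinct_fruits")
```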