Spark: JSON files, JDBC operations, and local Hive

Submitted by 雨燕双飞 on 2020-01-15 03:46:17

Spark: JSON files

Background

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be performed with SparkSession.read.json() on either a Dataset[String] or a JSON file.
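Both forms go through the same call. As a minimal sketch of the file-path form (it assumes the SparkSession built in the setup below and a hypothetical JSON file examples/people.json with one JSON object per line):

var peopleDF = sparkSession.read.json("examples/people.json")
// the schema is inferred from the JSON records
peopleDF.printSchema()
peopleDF.show()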

Required setup

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Initialize the SparkSession builder
var sparkSessionBuilder = SparkSession.builder()
// Give the application a name
sparkSessionBuilder.appName("test01")
// Create a SparkConf so that runtime parameters can be set
var conf = new SparkConf()
// Run locally, using all available cores
conf.setMaster("local[*]")
sparkSessionBuilder.config(conf)
// Build the SparkSession
var sparkSession = sparkSessionBuilder.getOrCreate()
// Obtain the SparkContext from the SparkSession
var sparkContext = sparkSession.sparkContext

Main content

Imports (done inside the method)
spark is the variable name of the method's SparkSession parameter; everything else is unchanged.

import spark.implicits._
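A minimal sketch of this pattern (the method name demoJson is made up; org.apache.spark.sql.SparkSession is assumed to be imported): the import sits inside the method body so that the SparkSession parameter is in scope.

def demoJson(spark: SparkSession): Unit = {
  // brings in toDF/toDS, the $"col" syntax, and the common encoders
  import spark.implicits._
  // ... the code from the following sections goes here
}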

Create an RDD

// Create an RDD from the SparkContext (sc is the SparkContext obtained above)
var makeRdd = sc.makeRDD(1 to 5)

Convert the RDD to a DataFrame

var mapRdd = makeRdd.map(t => t * t)
/* To turn an RDD into a DataFrame:
 * a DataFrame has the same structure as a table,
 * so it needs a header (column names);
 *
 * import spark.implicits._ is required;
 * the arguments to toDF are the column aliases
 */
var dF = mapRdd.toDF("square")
println("  dF.printSchema()")
dF.printSchema()
println("  dF.show()")
dF.show()
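toDF also accepts several column aliases when the RDD holds tuples; a short sketch (the column names id and name are made up for illustration):

var pairRdd = sc.makeRDD(Seq((1, "one"), (2, "two"), (3, "three")))
// one alias per tuple element
var pairDF = pairRdd.toDF("id", "name")
pairDF.show()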

Convert JSON data into a Dataset

var jsonStr = """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""
/* Creating a Dataset requires an encoder (used for serialization, see Encoders);
 * with spark.implicits._ in scope the String encoder is supplied implicitly. */
var dataSet = spark.createDataset(jsonStr :: Nil)
/* The Dataset can be used like a DataFrame */
println("==df===printSchema====")
dataSet.printSchema()
// Output:
/*root
 |-- value: string (nullable = true)*/
println("==df===show====")
dataSet.show()
// Output:
/*+--------------------+
|               value|
+--------------------+
|{"name":"Yin","ad...|
+--------------------+*/

Parse the JSON held in the Dataset

/* read.json accepts not only a file path but also a Dataset[String];
 * it parses the JSON objects automatically. */
var jsonDataFrame = spark.read.json(dataSet)
/* The result is a DataFrame */
println("==df===printSchema====")
jsonDataFrame.printSchema()
// Output:
/*root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- name: string (nullable = true)*/
println("==df===show====")
jsonDataFrame.show()
// Output:
/*+----------------+----+
|         address|name|
+----------------+----+
|[Columbus, Ohio]| Yin|
+----------------+----+*/
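Because address was inferred as a struct, its nested fields can be selected with dot notation, for example:

// returns the columns name, city and state ("Yin", "Columbus", "Ohio")
jsonDataFrame.select("name", "address.city", "address.state").show()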

JDBC To Other Databases

Introduction

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD, because the results are returned as a DataFrame that can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python because it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)

Options (property name: meaning):

url: The JDBC URL to connect to. Source-specific connection properties may be specified in the URL, e.g. jdbc:postgresql://localhost/test?user=fred&password=secret
dbtable: The JDBC table that should be read from or written into. Note that when using it in the read path, anything that is valid in a FROM clause of a SQL query can be used; for example, instead of a full table you could also use a subquery in parentheses. It is not allowed to specify the dbtable and query options at the same time.
query: A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause, and Spark will assign an alias to the subquery clause; for example, Spark will issue a query of the form SELECT <columns> FROM (<user_specified_query>) spark_gen_alias to the JDBC source.
driver: The class name of the JDBC driver to use to connect to this URL.
partitionColumn, lowerBound, upperBound: These options must all be specified if any of them is specified; in addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column of the table in question. Notice that lowerBound and upperBound are only used to decide the partition stride, not for filtering the rows of the table, so all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, it is decreased to this limit by calling coalesce(numPartitions) before writing.
queryTimeout: The number of seconds the driver will wait for a Statement object to execute. Zero means there is no limit. In the write path, this option depends on how JDBC drivers implement the API setQueryTimeout; e.g. the H2 JDBC driver checks the timeout of each query instead of an entire JDBC batch. It defaults to 0.
fetchsize: The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.
batchsize: The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.
isolationLevel: The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard transaction isolation levels defined by JDBC's Connection object, with a default of READ_UNCOMMITTED. This option applies only to writing. Please refer to the documentation in java.sql.Connection.
sessionInitStatement: After each database session is opened to the remote DB and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block). Use this to implement session initialization code. Example: option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")
truncate: This is a JDBC writer related option. If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this option allows execution of a TRUNCATE TABLE t CASCADE (in the case of PostgreSQL a TRUNCATE TABLE ONLY t CASCADE is executed to prevent inadvertently truncating descendant tables). This will affect other tables, and thus should be used with care. This option applies only to writing. It defaults to the default cascading truncate behaviour of the JDBC database in question, specified in the isCascadingTruncateTable in each JDBCDialect.
createTableOptions: This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g. CREATE TABLE t (name string) ENGINE=InnoDB). This option applies only to writing.
createTableColumnTypes: The database column data types to use instead of the defaults when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g. "name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid Spark SQL data types. This option applies only to writing.
customSchema: The custom schema to use for reading data from JDBC connectors, for example "id DECIMAL(38, 0), name STRING". You can also specify partial fields, and the others use the default type mapping, for example "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of the JDBC table. Users can specify the corresponding Spark SQL data types instead of using the defaults. This option applies only to reading.
pushDownPredicate: The option to enable or disable predicate push-down into the JDBC data source. The default value is true, in which case Spark will push down filters to the JDBC data source as much as possible. If set to false, no filter will be pushed down to the JDBC data source and all filters will be handled by Spark. Predicate push-down is usually turned off when the predicate filtering is performed faster by Spark than by the JDBC data source.
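As a hedged sketch of the partitioning options above (it assumes the table has a numeric id column; the connection details reuse the MySQL example that follows):

var partitionedReader = spark.read.format("jdbc")
partitionedReader.option("url", "jdbc:mysql://localhost:3306/myfirst?serverTimezone=UTC")
partitionedReader.option("dbtable", "first")
partitionedReader.option("user", "root")
partitionedReader.option("password", "123456")
// partitionColumn must be numeric, date, or timestamp; the bounds set the stride, not a filter
partitionedReader.option("partitionColumn", "id")
partitionedReader.option("lowerBound", "1")
partitionedReader.option("upperBound", "1000")
// numPartitions also caps the number of concurrent JDBC connections
partitionedReader.option("numPartitions", "4")
var partitionedDF = partitionedReader.load()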

Example:

var dataFrameReader = spark.read.format("jdbc")
// The URL, just like a plain JDBC connection string; this example uses MySQL 8.0.13
dataFrameReader.option("url", "jdbc:mysql://localhost:3306/myfirst?serverTimezone=UTC")
// User name
dataFrameReader.option("user", "root")
// Password
dataFrameReader.option("password", "123456")
/*
 * dbtable: the name of the table to read
 * query: a SQL statement whose result set is read instead
 * Only one of the two may be specified, never both,
 * so here dbtable is left commented out and query is used.
 */
//dataFrameReader.option("dbtable", "first")
var sql = """select name,empire from first where empire = '秦'"""
dataFrameReader.option("query", sql)
// Load only after all options have been configured
var dataFrame = dataFrameReader.load()
dataFrame.printSchema()
dataFrame.show()

/* The data can also be written to a database */
var writer = dataFrame.write.format("jdbc")
// Required options:
// URL
writer.option("url", "jdbc:mysql://localhost:3306/myfirst?serverTimezone=UTC")
// Target table name; with the default save mode it must not exist yet
writer.option("dbtable", "test01")
// User name
writer.option("user", "root")
// Password
writer.option("password", "123456")
// Save
writer.save()
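By default save() fails if test01 already exists. A hedged sketch of using SaveMode to change that behaviour (Append adds rows, Overwrite replaces the table):

import org.apache.spark.sql.SaveMode
var appendWriter = dataFrame.write.format("jdbc")
appendWriter.option("url", "jdbc:mysql://localhost:3306/myfirst?serverTimezone=UTC")
appendWriter.option("dbtable", "test01")
appendWriter.option("user", "root")
appendWriter.option("password", "123456")
// mode returns the writer, so the calls can also be chained
appendWriter.mode(SaveMode.Append).save()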

Hive Tables

Spark SQL also supports reading and writing data stored in Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark loads them automatically. Note that these Hive dependencies must also be present on all worker nodes, since they need access to the Hive serialization and deserialization libraries (SerDes) in order to read data stored in Hive.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in conf/.

When working with Hive, you must instantiate a SparkSession with Hive support enabled, which includes connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When hive-site.xml is not provided, the context automatically creates metastore_db in the current directory and creates a warehouse directory configured by spark.sql.warehouse.dir, which defaults to spark-warehouse in the directory from which the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0; use spark.sql.warehouse.dir instead to specify the default location of databases in the warehouse.
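Since the paragraph above recommends spark.sql.warehouse.dir over the deprecated property, here is a minimal sketch of that setting, to be applied to the builder shown in the next section before getOrCreate() (the local path is just an example):

// replaces hive.metastore.warehouse.dir since Spark 2.0.0
builder.config("spark.sql.warehouse.dir", "e:/test/warehouse")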

Required setup

/*
 * SparkSession.builder() returns a Builder
 */
var builder = SparkSession.builder();
/* Set some parameters */
builder.appName("hiveSql");

var conf = new SparkConf();
/* Run locally */
conf.setMaster("local[*]");
/* Pass in the SparkConf */
builder.config(conf);
/* Options can also be set one at a time */
//builder.config(key, value)
/* Set the default warehouse path (spark.sql.warehouse.dir is the preferred key since Spark 2.0.0) */
//builder.config("hive.metastore.warehouse.dir", "e:/test/warehouse");
/* Enable Hive support (essential) */
builder.enableHiveSupport()
/* Initialize the SparkSession */
var spark = builder.getOrCreate();
/* Obtain the SparkContext.
 * Never construct a SparkContext or a SparkSession with new yourself.
 */
var sc = spark.sparkContext;

Example:

println("==da_hive==");
    var dbName = "mySpark" ; 
    /* 创建数据库 */
    //注意是``,而非''
    var sql = """
      create database if not exists """ + 
      "`" + dbName + "`" ; 
    var sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 
    
    /* 切换数据库 */
    sql = """
      use """ + 
      "`" + dbName + "`" ; 
    sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 
    
    /* 执行一个sql语句 */
    sql = """
      create table if not exists src 
      (key int , value string) 
      -- spark要求必须加的
      using hive 
      """
    sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 
    /* 显示有多少张表 */
    sql = """
      show tables
      """
    sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 
    
    /* 插入数据 */
    sql = """
        load data local inpath './kv1.txt' into table src  
        """ ; 
    sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 

// Imports inside the method; the method's parameter is the SparkSession
import spark.implicits._;
/* With the implicits in scope, most encoders no longer need to be passed explicitly */
import spark.sql;
import org.apache.spark.sql.Row // needed for the pattern match on Row below
    
/* Switch to the mySpark database */
var sql = """
  use mySpark
  """
    var sqlDataFrame = spark.sql(sql);
    
/* Query all rows (show prints at most the first 20) */
    sql = """
        select * from src
        """ ; 
    sqlDataFrame = spark.sql(sql);
    println("sql=======" + sql);
    println("sqlDataFrame===printSchema====");
    sqlDataFrame.printSchema() ; 
    println("sqlDataFrame====show===");
    sqlDataFrame.show() ; 
    
/* Count the total number of rows (left commented out) */
//    sql = """
//        select count(*) from src
//        """ ; 
//    sqlDataFrame = spark.sql(sql);
//    println("sql=======" + sql);
//    println("sqlDataFrame===printSchema====");
//    sqlDataFrame.printSchema() ; 
//    println("sqlDataFrame====show===");
//    sqlDataFrame.show() ; 
   
/* A DataFrame is a Dataset[Row],
 * so the map operator receives Row objects
 */
var mapDataSet = sqlDataFrame.map{
  case Row(key: Int, value: String) =>
    s"key: $key, value: $value"
}
println("mapDataSet1===printSchema====");
mapDataSet.printSchema();
println("mapDataSet1====show===");
mapDataSet.show();
    
// Another approach: access the Row columns by index
mapDataSet = sqlDataFrame.map(t => {
  // get returns Any; the string concatenation yields a String
  t.get(0) + "new" + t.get(1)
})
println("mapDataSet2===printSchema====");
mapDataSet.printSchema();
println("mapDataSet2====show===");
mapDataSet.show();
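Instead of pattern matching on Row, the DataFrame can also be converted to a typed Dataset with a case class. A hedged sketch (the case class KV is made up here and must be declared outside the method so that Spark can derive its encoder; column names and types must match the src table):

case class KV(key: Int, value: String)

var typedDs = sqlDataFrame.as[KV]
var typedStrings = typedDs.map(kv => s"key: ${kv.key}, value: ${kv.value}")
typedStrings.show()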