Spark Streaming + Hive

Submitted by ε祈祈猫儿з on 2019-12-24 00:42:31

Question


We are building an application that pulls data from a source system with Flume, passes it through Kafka to Spark Streaming for in-memory processing, converts the processed data into a DataFrame, and loads it into Hive tables. The flow is: Source System -> Flume -> Kafka -> Spark Streaming -> Hive. Is this flow correct, or should we review it?

We take the discrete stream (DStream) and convert it into a DataFrame so we can use SQL-compatible functions. We have 14 Hive tables, and data must be loaded into them according to code type. As the code below shows, we filter the DStream before handing it to a specific foreachRDD, so for 14 code types we would have to filter it 14 times for 14 separate foreachRDD bodies:

val fil_sms = lines.filter(_.startsWith("1|"))
val fil_calls = lines.filter(_.startsWith("7|"))

Can we handle this in a single foreachRDD body? I have tried, but it filters only one kind of line. Am I following the right approach, and can someone help me improve this code for better performance and implementation?

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.permission.FsPermission
import java.net.URI
    object test1 {

      case class cdrsms(col1: String , col2: String , col3: String , col4: String ,
                       col5: String , col6: String , col7: String , col8: String,
                       col9: String ,  col10: String)


      case class cdrcalls(col1: String , col2: String , col3: String , col4: String ,
                        col5: String , col6: String , col7: String , col8: String,
                        col9: String ,  col10: String , col11: String ,col12 : String , col13 : String)

      def main(args: Array[String]) {
        val sparkConf = new SparkConf().setAppName("hwtest").setMaster("local[*]")
        val topicsSet = "hwkf01".split(",").toSet
        val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.2.210:9092")
        val sc = new SparkContext(sparkConf)
        val ssc = new StreamingContext(sc, Seconds(20))
        val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicsSet)
        val lines = messages.map(_._2)
        val fil_sms = lines.filter(_.startsWith("1|"))
        val fil_calls = lines.filter(_.startsWith("7|"))
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._

        fil_sms.foreachRDD(rdd => if (!rdd.isEmpty) {
          //val sms = rdd.filter(_.startsWith("1|"))
          // parse pipe-delimited SMS records into the case class and expose them as a temp table
          rdd.map(_.split('|')).map(p => cdrsms(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9))).toDF().registerTempTable("cdr_sms")
          val tbl1 = sqlContext.sql("SELECT * FROM cdr_sms")
          tbl1.foreach(println)
          sqlContext.sql("insert into table sms select * from cdr_sms")
        })

        fil_calls.foreachRDD(rdd => if (!rdd.isEmpty) {
          // parse pipe-delimited call records into the case class and expose them as a temp table
          rdd.map(_.split('|')).map(p => cdrcalls(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10), p(11), p(12))).toDF().registerTempTable("cdr_calls")
          val tbl1 = sqlContext.sql("SELECT * FROM cdr_calls")
          tbl1.foreach(println)

          sqlContext.sql("insert into table calls select * from cdr_calls")
        })
        ssc.start()
        ssc.awaitTermination()
      }
    }

Answer 1:


While waiting for an expert answer I tried many workarounds and changed my code to the single-foreachRDD version below; please advise whether this is a good way to go. Both versions work and produce the same end result, but I am looking for the best practice to achieve it. I have removed the temporary table creation and now insert directly into the Hive tables, which should save memory and processing time.

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.permission.FsPermission
import java.net.URI

object test1 {

  case class cdrsms(col1: String , col2: String , col3: String , col4: String ,
                   col5: String , col6: String , col7: String , col8: String,
                   col9: String ,  col10: String)


  case class cdrcalls(col1: String , col2: String , col3: String , col4: String ,
                    col5: String , col6: String , col7: String , col8: String,
                    col9: String ,  col10: String , col11: String ,col12 : String , col13 : String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("hwtest").setMaster("local[*]")
    val topicsSet = "hwkf01".split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.2.210:9092")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(20))
    val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2)

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    lines.foreachRDD(rdd => if (!rdd.isEmpty) {
      // split the batch by code type: "1|" records are SMS, "7|" records are calls
      val sms = rdd.filter(_.startsWith("1|"))
      val calls = rdd.filter(_.startsWith("7|"))

      sms.map(_.split('|'))
        .map(p => cdrsms(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9)))
        .toDF()
        .write.mode("append")
        .insertInto("sms_cdr")

      calls.map(_.split('|'))
        .map(p => cdrcalls(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10), p(11), p(12)))
        .toDF()
        .write.mode("append")
        .insertInto("calls_cdr")

    })


    ssc.start()
    ssc.awaitTermination()
  }
}
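
If the same pattern has to cover all 14 code types, copying the filter/parse/insert block 14 times gets unwieldy. Below is a minimal sketch of how the per-type details could be driven from a small lookup list inside one foreachRDD. It reuses lines and sqlContext from the code above and assumes every column is a string; the prefixes, Hive table names, and column counts shown are placeholders, not the real 14 code types.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// one entry per code type: record prefix, target Hive table, number of columns (placeholders)
case class Route(prefix: String, table: String, numCols: Int)

val routes = Seq(
  Route("1|", "sms_cdr", 10),
  Route("7|", "calls_cdr", 13)
  // ... add the remaining code types here
)

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.cache() // cache once so each filter pass reuses the batch instead of recomputing it
    routes.foreach { r =>
      val rows = rdd
        .filter(_.startsWith(r.prefix))
        .map(_.split('|'))
        .filter(_.length >= r.numCols) // drop malformed records instead of failing the batch
        .map(p => Row.fromSeq(p.take(r.numCols).toSeq))
      // build a generic all-string schema col1..colN, mirroring the case classes above
      val schema = StructType((1 to r.numCols).map(i => StructField(s"col$i", StringType)))
      sqlContext.createDataFrame(rows, schema)
        .write.mode("append")
        .insertInto(r.table)
    }
    rdd.unpersist()
  }
}

Caching the batch RDD before the per-type filters avoids recomputing the same Kafka data once per table, and insertInto keeps writing straight to Hive as in the revised code, without the intermediate temp tables.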


Source: https://stackoverflow.com/questions/49585399/spark-streaming-hive
