Error when connecting Spark Structured Streaming + Kafka


Question


I'm trying to connect Spark Structured Streaming 2.4.5 with Kafka, but every time I try, this Data Source Provider error appears. Here are my Scala code and my sbt build:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object streaming_app_demo {
  def main(args: Array[String]): Unit = {

    println("Spark Structured Streaming with Kafka Demo Application Started ...")

    val KAFKA_TOPIC_NAME_CONS = "test"
    val KAFKA_OUTPUT_TOPIC_NAME_CONS = "test"
    val KAFKA_BOOTSTRAP_SERVERS_CONS = "localhost:9092"


    val spark = SparkSession.builder
      .master("local[*]")
      .appName("Spark Structured Streaming with Kafka Demo")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Stream from Kafka
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS)
      .option("subscribe", KAFKA_TOPIC_NAME_CONS)
      .option("startingOffsets", "latest")
      .load()

    val ds = df
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "test2")
      .start()
  }
}

And the error is:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
    at streaming_app_demo$.main(teste.scala:29)
    at streaming_app_demo.main(teste.scala)

And my build.sbt is:

name := "scala_212"

version := "0.1"

scalaVersion := "2.12.11"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5" % "provided"

libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.5.0"

Thank you!


Answer 1:


For Spark Structured Streaming + Kafka, the spark-sql-kafka-0-10 library is required.

You are getting the org.apache.spark.sql.AnalysisException: Failed to find data source: kafka exception because the spark-sql-kafka library is not available on your classpath, so Spark is unable to find org.apache.spark.sql.sources.DataSourceRegister inside the META-INF/services folder.
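
A quick way to verify this is a minimal sketch, assuming only spark-sql is on the classpath: Java's ServiceLoader reads exactly that META-INF/services file, so listing the registered data sources shows whether "kafka" is visible. The object name CheckDataSources here is only illustrative:

import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister
import scala.collection.JavaConverters._

object CheckDataSources {
  def main(args: Array[String]): Unit = {
    // ServiceLoader scans META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    // in every jar on the classpath -- the same lookup Spark's lookupDataSource performs.
    val shortNames = ServiceLoader.load(classOf[DataSourceRegister])
      .asScala.map(_.shortName()).toList
    println(shortNames.mkString(", "))
    // "kafka" appears only when spark-sql-kafka-0-10 is actually on the classpath
    println(s"kafka registered: ${shortNames.contains("kafka")}")
  }
}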

The DataSourceRegister path inside the jar file, for example:

/org/apache/spark/spark-sql-kafka-0-10_2.11/2.2.0/spark-sql-kafka-0-10_2.11-2.2.0.jar!/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
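
Note that in the build.sbt above, spark-sql-kafka-0-10 is marked % "provided", and sbt keeps provided dependencies off the runtime classpath used by sbt run, which by itself reproduces this error when running locally. Assuming you run the job locally rather than submitting it to a cluster, one fix is simply to drop the provided qualifier:

// build.sbt -- keep the Kafka source on the runtime classpath for local runs
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"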

Update

If you are using SBT, try adding the code block below to your build.sbt. This will include the org.apache.spark.sql.sources.DataSourceRegister file in your final jar.

// META-INF discarding, but merge service registrations
assemblyMergeStrategy in assembly := {
  // Merge service files (e.g. DataSourceRegister) line by line instead of dropping them
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  // Discard the rest of META-INF (manifests, signature files) to avoid duplicate-file errors
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // Concatenate Typesafe config files from all jars
  case "application.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}
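
This merge strategy assumes the sbt-assembly plugin is enabled; if it is not yet, a typical project/plugins.sbt entry (the version here is only an example) is:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

Then build the fat jar with sbt assembly and run it with spark-submit. Alternatively, the Integration Guide the error message refers to suggests pulling the connector in at submit time with --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5.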



Source: https://stackoverflow.com/questions/61578464/error-when-connecting-spark-structured-streaming-kafka
