I use HDP-2.6.3.0 with Spark2 package 2.2.0.
I'm trying to write a Kafka consumer using the Structured Streaming API, but I'm getting the following error after submitting the application.
The kafka data source is an external module and is not available to Spark applications by default.
You have to define it as a dependency in your pom.xml (as you have done), but that is just the very first step toward having it available in your Spark application.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
With that dependency in place you have to decide whether you want to create a so-called uber-jar that bundles all the dependencies together (which results in a fairly big jar file and makes the submission time longer) or use the --packages option (or the less flexible --jars) to add the dependency at spark-submit time.
(There are other options like storing the required jars on Hadoop HDFS or using Hadoop distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple)
I'd recommend using --packages first, and only once that works consider the other options.
Use spark-submit --packages to include the spark-sql-kafka-0-10 module as follows.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Include the other command-line options as you wish.
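For reference, a complete invocation could look roughly like the following; the main class and jar file are hypothetical placeholders for your own application.

# --class and the application jar below are placeholders for your own app
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
  --class com.example.KafkaStreamingApp \
  target/kafka-streaming-app-1.0.jar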
Including all the dependencies in a so-called uber-jar may not always work due to how META-INF directories are handled.
For the kafka data source to work (and other data sources in general) you have to ensure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged, not overwritten (as a replace, first, or similar merge strategy would do).
The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister that registers org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
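If you build the uber-jar with Maven, the shade plugin ships a ServicesResourceTransformer that performs exactly this merge of META-INF/services files. A minimal sketch of the plugin configuration (the plugin version is only an example):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- merge META-INF/services files (incl. DataSourceRegister) instead of overwriting them -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

With this in place the shaded jar keeps the kafka entry in DataSourceRegister, so format("kafka") can be resolved at runtime.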