apache-spark

How to distribute data evenly in Kafka producing messages through Spark?

Submitted by 大憨熊 on 2021-02-05 08:10:41

Question: I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others.

+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      |
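One commonly suggested way to even out the load is to give every record an explicit, well-distributed Kafka key, so the default partitioner hashes records across all partitions instead of favouring one. A minimal Scala sketch, assuming a streaming DataFrame named events with an event_id column; the broker address, topic, and checkpoint path are placeholders:

import org.apache.spark.sql.functions._

// Hash a (hopefully unique) field into the Kafka message key so the
// default partitioner spreads records across all partitions.
val toKafka = events.select(
  sha2(col("event_id").cast("string"), 256).alias("key"),
  to_json(struct(col("*"))).alias("value"))

toKafka.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")    // placeholder
  .option("topic", "events")                            // placeholder
  .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
  .start()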

pyspark : Flattening of records coming from input file

Submitted by 一世执手 on 2021-02-05 08:10:35

Question: I have an input CSV file like the one below:

plant_id, system1_id, system2_id, system3_id
A1        s1-111      s2-111      s3-111
A2        s1-222      s2-222      s3-222
A3        s1-333      s2-333      s3-333

I want to flatten the records like this:

plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3

Currently I am able to achieve it by creating a transposed PySpark df for each system column and then
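A single stack() generator avoids building one transposed DataFrame per system column. The sketch below uses Spark's Scala API; the same stack(...) expression should also work inside PySpark's expr(). Here df is assumed to be the DataFrame read from the CSV above.

import org.apache.spark.sql.functions.{col, expr}

// stack(3, ...) emits three rows per input row as (system_id, system_name)
val flattened = df.select(
  col("plant_id"),
  expr("""stack(3,
              system1_id, 'system1',
              system2_id, 'system2',
              system3_id, 'system3') as (system_id, system_name)"""))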

Load XML string from Column in PySpark

Submitted by 点点圈 on 2021-02-05 07:20:25

Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in a first step and reading that file in a second step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values? The following doesn't work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml')
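The spark-xml package can parse the XML column in place, without the intermediate file, via its from_xml/schema_of_xml functions. A sketch using the package's Scala API (the question is about PySpark, so this is only illustrative, and the column name payload is an assumption):

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.functions.col
import spark.implicits._

// Infer a schema from the XML strings once, then parse the column in place.
val tr = spark.read.json("my-file-path")
val xmlSchema = schema_of_xml(tr.select("payload").as[String])
val parsed = tr.withColumn("parsed", from_xml(col("payload"), xmlSchema))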

How do I connect to Hive from spark using Scala on IntelliJ?

Submitted by 痞子三分冷 on 2021-02-05 06:50:52

Question: I am new to Hive and Spark and am trying to figure out a way to access tables in Hive to manipulate and access the data. How can it be done?

Answer 1: In Spark < 2.0:

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val myDataFrame = sqlContext.sql("select * from mydb.mytable")

In later versions of Spark, use SparkSession: SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and
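A minimal sketch of the SparkSession form for Spark 2.0+, assuming Hive support (hive-site.xml and the Hive dependencies) is available on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-example")        // placeholder app name
  .enableHiveSupport()            // wires the session to the Hive metastore
  .getOrCreate()

val myDataFrame = spark.sql("select * from mydb.mytable")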

Spark: Exception in thread “main” org.apache.spark.sql.catalyst.errors.package

Submitted by ╄→尐↘猪︶ㄣ on 2021-02-05 04:43:14

Question: While running my spark-submit code, I get this error when I execute a Scala file which performs joins. I am just curious to know what this TreeNodeException error is. Why do we get this error? Please share your ideas on this TreeNodeException error:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

Answer 1: I encountered this exception when joining dataframes too: Exception in thread "main" org.apache.spark.sql.catalyst.errors.package

NullPointerException in ProtoBuf when Kryo serialization is used with Spark

Submitted by 别来无恙 on 2021-02-04 21:06:44

Question: I am getting the following error in my Spark application when it tries to serialize a protobuf field which is a map of String keys to float values. Kryo serialization is being used in the Spark app.

Caused by: java.lang.NullPointerException
    at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:68)
    at java.util.AbstractList.add(AbstractList.java:108)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
    at com

Reading schema of streaming Dataframe in Spark Structured Streaming [duplicate]

Submitted by 南笙酒味 on 2021-02-04 21:05:17

Question: This question already has an answer here: Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"? (1 answer) Closed 13 days ago.

I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and trying to create a new Spark DF from the nested XML. I'm using the code example described in https://github.com/databricks/spark-xml and in batch mode is running
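Structured Streaming cannot infer a schema from the stream itself, so one hedged approach is to declare the XML schema up front (or derive it once from a static sample of messages) and apply spark-xml's from_xml to the streaming DataFrame. The stream name, column name, and field names below are assumptions:

import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hand-written schema for the nested XML; field names are placeholders.
val xmlSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("timestamp", StringType)))

val parsed = eventHubStream                                   // streaming DataFrame (assumed)
  .select(from_xml(col("body").cast("string"), xmlSchema).alias("event"))
  .select("event.*")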

How to get year and week number aligned for a date

Submitted by 自闭症网瘾萝莉.ら on 2021-02-04 21:00:06

Question: While trying to get the year and week number for a range of dates spanning multiple years, I am running into some issues at the start/end of the year. I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't give consistent results, and I was wondering what the best way is in Spark to make sure those scenarios are handled with a consistent year for the given week number. For example, running: spark.sql(
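Since weekofyear() follows ISO-8601, the matching year is the ISO week-year, i.e. the calendar year of the Thursday in that date's ISO week; 2019-12-31 then pairs as (2020, week 1) rather than (2019, week 1). A sketch in Scala, where df and its dt date column are placeholders:

import org.apache.spark.sql.functions._

// Thursday of dt's ISO week = next Monday strictly after dt, minus 4 days;
// its calendar year is the ISO week-year matching weekofyear(dt).
val withIsoWeek = df
  .withColumn("week", weekofyear(col("dt")))
  .withColumn("week_year", year(date_sub(next_day(col("dt"), "Mon"), 4)))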
