apache-spark

How to distribute data evenly in Kafka producing messages through Spark?

Submitted by 大憨熊 on 2021-02-05 08:10:41

Question: I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others.

+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      |
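One commonly suggested way to even out the load is to give every record an explicit, well-distributed Kafka key, so the default partitioner hashes records across all partitions instead of favouring one. A minimal Scala sketch, assuming a streaming DataFrame named events with an event_id column; the broker address, topic, and checkpoint path are placeholders:

import org.apache.spark.sql.functions._

// Hash a (hopefully unique) field into the Kafka message key so the
// default partitioner spreads records across all partitions.
val toKafka = events.select(
  sha2(col("event_id").cast("string"), 256).alias("key"),
  to_json(struct(col("*"))).alias("value"))

toKafka.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")    // placeholder
  .option("topic", "events")                            // placeholder
  .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
  .start()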

pyspark : Flattening of records coming from input file

Submitted by 一世执手 on 2021-02-05 08:10:35

Question: I have an input CSV file like the one below:

plant_id, system1_id, system2_id, system3_id
A1        s1-111      s2-111      s3-111
A2        s1-222      s2-222      s3-222
A3        s1-333      s2-333      s3-333

I want to flatten the records like this:

plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3

Currently I am able to achieve it by creating a transposed PySpark df for each system column and then
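A single stack() generator avoids building one transposed DataFrame per system column. The sketch below uses Spark's Scala API; the same stack(...) expression should also work inside PySpark's expr(). Here df is assumed to be the DataFrame read from the CSV above.

import org.apache.spark.sql.functions.{col, expr}

// stack(3, ...) emits three rows per input row as (system_id, system_name)
val flattened = df.select(
  col("plant_id"),
  expr("""stack(3,
              system1_id, 'system1',
              system2_id, 'system2',
              system3_id, 'system3') as (system_id, system_name)"""))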

Load XML string from Column in PySpark

Submitted by 点点圈 on 2021-02-05 07:20:25

Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in a first step and reading that file in a second step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values? The following doesn't work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml')
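The spark-xml package can parse the XML column in place, without the intermediate file, via its from_xml/schema_of_xml functions. A sketch using the package's Scala API (the question is about PySpark, so this is only illustrative, and the column name payload is an assumption):

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.functions.col
import spark.implicits._

// Infer a schema from the XML strings once, then parse the column in place.
val tr = spark.read.json("my-file-path")
val xmlSchema = schema_of_xml(tr.select("payload").as[String])
val parsed = tr.withColumn("parsed", from_xml(col("payload"), xmlSchema))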

How do I connect to Hive from spark using Scala on IntelliJ?

Submitted by 痞子三分冷 on 2021-02-05 06:50:52

Question: I am new to Hive and Spark and am trying to figure out a way to access tables in Hive to manipulate and access the data. How can it be done?

Answer 1: In Spark < 2.0:

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val myDataFrame = sqlContext.sql("select * from mydb.mytable")

In later versions of Spark, use SparkSession: SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and
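A minimal sketch of the SparkSession form for Spark 2.0+, assuming Hive support (hive-site.xml and the Hive dependencies) is available on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-example")        // placeholder app name
  .enableHiveSupport()            // wires the session to the Hive metastore
  .getOrCreate()

val myDataFrame = spark.sql("select * from mydb.mytable")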

Spark: Exception in thread “main” org.apache.spark.sql.catalyst.errors.package

Submitted by ╄→尐↘猪︶ㄣ on 2021-02-05 04:43:14

Question: While running my spark-submit code, I get this error when I execute a Scala file which performs joins. I am just curious to know what this TreeNodeException error is. Why do we get this error? Please share your ideas on this TreeNodeException error:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

Answer 1: I encountered this exception when joining dataframes too: Exception in thread "main" org.apache.spark.sql.catalyst.errors.package

NullPointerException in ProtoBuf when Kryo serialization is used with Spark

Submitted by 别来无恙 on 2021-02-04 21:06:44

Question: I am getting the following error in my Spark application when it tries to serialize a protobuf field which is a map of String keys to float values. Kryo serialization is being used in the Spark app.

Caused by: java.lang.NullPointerException
    at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:68)
    at java.util.AbstractList.add(AbstractList.java:108)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
    at com

Reading schema of streaming Dataframe in Spark Structured Streaming [duplicate]

Submitted by 南笙酒味 on 2021-02-04 21:05:17

Question: This question already has an answer here: Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"? (1 answer) Closed 13 days ago.

I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and trying to create a new Spark DF from the nested XML. I'm using the code example described in https://github.com/databricks/spark-xml and in batch mode is running
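Structured Streaming cannot infer a schema from the stream itself, so one hedged approach is to declare the XML schema up front (or derive it once from a static sample of messages) and apply spark-xml's from_xml to the streaming DataFrame. The stream name, column name, and field names below are assumptions:

import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hand-written schema for the nested XML; field names are placeholders.
val xmlSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("timestamp", StringType)))

val parsed = eventHubStream                                   // streaming DataFrame (assumed)
  .select(from_xml(col("body").cast("string"), xmlSchema).alias("event"))
  .select("event.*")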

How to get year and week number aligned for a date

Submitted by 自闭症网瘾萝莉.ら on 2021-02-04 21:00:06

Question: While trying to get the year and week number for a range of dates spanning multiple years, I am running into some issues at the start/end of the year. I understand the logic for the week number and for the year when they run separately. However, when they are combined, in some cases they don't give consistent results, and I was wondering what the best way is in Spark to make sure those scenarios are handled with a consistent year for the given week number. For example, running: spark.sql(
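Since weekofyear() follows ISO-8601, the matching year is the ISO week-year, i.e. the calendar year of the Thursday in that date's ISO week; 2019-12-31 then pairs as (2020, week 1) rather than (2019, week 1). A sketch in Scala, where df and its dt date column are placeholders:

import org.apache.spark.sql.functions._

// Thursday of dt's ISO week = next Monday strictly after dt, minus 4 days;
// its calendar year is the ISO week-year matching weekofyear(dt).
val withIsoWeek = df
  .withColumn("week", weekofyear(col("dt")))
  .withColumn("week_year", year(date_sub(next_day(col("dt"), "Mon"), 4)))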
