apache-spark

How to load tar.gz files in streaming datasets?

Submitted by 冷眼眸甩不掉的悲伤 on 2021-01-01 03:51:34
Question: I would like to stream from tar-gzip (tgz) files that contain my actual CSV data. I already have structured streaming working with Spark 2.2 when the data arrives as plain CSV files, but in reality the data arrives as gzipped CSV files. Is there a way for the trigger fired by structured streaming to decompress the archives before handling the CSV stream? The code I use to process the files is this:

val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option(…
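A minimal sketch of how this is commonly approached, not the poster's solution (the RawData fields, delimiter, and path are assumptions; assumes spark-shell, where spark is predefined): Spark's file sources decompress plain gzip transparently, so *.csv.gz can be streamed as-is, while *.tar.gz archives must be unpacked upstream of the stream.

import org.apache.spark.sql.Encoders

// Hypothetical stand-in for the poster's RawData case class.
case class RawData(id: String, value: Double)

val schema = Encoders.product[RawData].schema

// The CSV file source reads *.csv and *.csv.gz alike; gzip is handled
// transparently. It does not look inside tar archives, so *.tar.gz files
// need to be extracted before they land in the watched directory.
val trackerData = spark
  .readStream
  .schema(schema)
  .option("sep", ",")          // assumed delimiter
  .csv("/data/incoming")       // assumed input directory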

Spark find max of date partitioned column

Submitted by 情到浓时终转凉″ on 2020-12-31 06:24:44
Question: I have a Parquet dataset partitioned in the following way:

data
  /batch_date=2020-01-20
  /batch_date=2020-01-21
  /batch_date=2020-01-22
  /batch_date=2020-01-23
  /batch_date=2020-01-24

Here batch_date, which is the partition column, is of date type. I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is. I could use a simple group-by, something like

df.groupby().agg(max(col('batch_date'))).first()

While this would work, it is a very inefficient way, since…
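A minimal sketch of one commonly suggested alternative (not from the excerpt; the base path is an assumption): list the partition directories with the Hadoop FileSystem API and read only the newest one, so no data files are scanned just to find the maximum.

import org.apache.hadoop.fs.Path

val basePath = new Path("/data")   // assumed root of the partitioned table
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Partition directories are named "batch_date=YYYY-MM-DD"; for ISO dates
// the lexicographic maximum is also the latest date.
val latest = fs.listStatus(basePath)
  .map(_.getPath.getName)
  .filter(_.startsWith("batch_date="))
  .max

val df = spark.read.parquet(s"$basePath/$latest")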

How to get progress of streaming query after awaitTermination?

Submitted by 非 Y 不嫁゛ on 2020-12-31 06:01:09
Question: I am new to Spark and have been reading a few things about monitoring a Spark application. Basically, I want to know how many records the application processed in a given trigger, and the progress of the query. I know lastProgress gives all of those metrics, but when I use awaitTermination together with lastProgress, it always returns null.

val q4s = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()…
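A minimal sketch of the usual workaround (the console sink is a hypothetical stand-in for the poster's sink): awaitTermination() blocks the calling thread, so any lastProgress call placed after it never runs while the query is live; a StreamingQueryListener receives the same metrics on every trigger instead.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // The same metrics lastProgress exposes, delivered per trigger.
    println(s"numInputRows = ${event.progress.numInputRows}")
  }
})

val query = q4s.writeStream.format("console").start()  // hypothetical sink
query.awaitTermination()   // blocking here no longer hides the metrics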

How to do this transformation in SQL/Spark/GraphFrames

Submitted by 北战南征 on 2020-12-31 04:32:48
Question: I have a table containing the following two columns:

Device-Id    Account-Id
d1           a1
d2           a1
d1           a2
d2           a3
d3           a4
d3           a5
d4           a6
d1           a4

Device-Id is the unique ID of the device on which my app is installed, and Account-Id is the ID of a user account. A user can have multiple devices and can create multiple accounts on the same device (e.g. device d1 has accounts a1, a2 and a3 set up). I want to find the unique actual users (each should be represented as a new column with some unique UUID in the generated table) and the…
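A minimal sketch of the graph formulation this usually maps to, not the accepted answer (df stands for the two-column table above; the checkpoint path is an assumption): treat every device and account as a vertex, every table row as an edge, and let each connected component stand for one actual user.

import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

val edges = df
  .withColumnRenamed("Device-Id", "src")
  .withColumnRenamed("Account-Id", "dst")

val vertices = edges.select(col("src").as("id"))
  .union(edges.select(col("dst").as("id")))
  .distinct()

// connectedComponents requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")

val users = GraphFrame(vertices, edges).connectedComponents.run()
// `users` carries a `component` column: one value per actual user,
// which can be mapped to a UUID afterwards.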

Running multiple Spark Kafka Structured Streaming queries in same spark session increasing the offset but showing numInputRows 0

Submitted by 吃可爱长大的小学妹 on 2020-12-30 04:34:58
Question: I have a Spark Structured Streaming job consuming records from a Kafka topic with 2 partitions. Spark job: 2 queries, each consuming from one of the 2 partitions, running in the same Spark session.

val df1 = session.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("assign", "{\"multi-stream1\" : [0]}")
  .option("startingOffsets", latest)
  .option("key.deserializer", classOf[StringDeserializer].getName)
  .option("value.deserializer", classOf[StringDeserializer]…
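A minimal sketch of one frequent cause, offered as a guess rather than a diagnosis of the poster's job (df2, the console sink, and the checkpoint paths are assumptions): when two queries share a checkpointLocation, the committed Kafka offsets keep advancing while each query reports numInputRows = 0, so every query needs its own checkpoint directory. Note also that the Kafka source ignores the key.deserializer/value.deserializer options and always returns binary key and value columns to be cast explicitly.

val q1 = df1.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/query-1")  // one directory per query
  .start()

val q2 = df2.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/query-2")  // never shared with q1
  .start()

session.streams.awaitAnyTermination()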
