apache-spark

How to load tar.gz files in streaming datasets?

Submitted by 冷眼眸甩不掉的悲伤 on 2021-01-01 03:51:34
Question: I would like to stream from tar-gzip (tgz) files that contain my actual CSV data. I already have structured streaming working with Spark 2.2 when the data arrives as plain CSV files, but in reality the data arrives as gzipped CSV files. Is there a way for the trigger fired by structured streaming to decompress the archives before handling the CSV stream? The code I use to process the files is this:

val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option(…
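A minimal sketch of how this is commonly approached, not the poster's solution (the RawData fields, delimiter, and path are assumptions; assumes spark-shell, where spark is predefined): Spark's file sources decompress plain gzip transparently, so *.csv.gz can be streamed as-is, while *.tar.gz archives must be unpacked upstream of the stream.

import org.apache.spark.sql.Encoders

// Hypothetical stand-in for the poster's RawData case class.
case class RawData(id: String, value: Double)

val schema = Encoders.product[RawData].schema

// The CSV file source reads *.csv and *.csv.gz alike; gzip is handled
// transparently. It does not look inside tar archives, so *.tar.gz files
// need to be extracted before they land in the watched directory.
val trackerData = spark
  .readStream
  .schema(schema)
  .option("sep", ",")          // assumed delimiter
  .csv("/data/incoming")       // assumed input directory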

Spark find max of date partitioned column

Submitted by 情到浓时终转凉″ on 2020-12-31 06:24:44
Question: I have a Parquet dataset partitioned in the following way:

data
  /batch_date=2020-01-20
  /batch_date=2020-01-21
  /batch_date=2020-01-22
  /batch_date=2020-01-23
  /batch_date=2020-01-24

Here batch_date, which is the partition column, is of date type. I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is. I could use a simple group-by, something like

df.groupby().agg(max(col('batch_date'))).first()

While this would work, it is a very inefficient way, since…
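A minimal sketch of one commonly suggested alternative (not from the excerpt; the base path is an assumption): list the partition directories with the Hadoop FileSystem API and read only the newest one, so no data files are scanned just to find the maximum.

import org.apache.hadoop.fs.Path

val basePath = new Path("/data")   // assumed root of the partitioned table
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Partition directories are named "batch_date=YYYY-MM-DD"; for ISO dates
// the lexicographic maximum is also the latest date.
val latest = fs.listStatus(basePath)
  .map(_.getPath.getName)
  .filter(_.startsWith("batch_date="))
  .max

val df = spark.read.parquet(s"$basePath/$latest")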

How to get progress of streaming query after awaitTermination?

Submitted by 非 Y 不嫁゛ on 2020-12-31 06:01:09
Question: I am new to Spark and have been reading a few things about monitoring a Spark application. Basically, I want to know how many records the application processed in a given trigger, and the progress of the query. I know lastProgress gives all of those metrics, but when I use awaitTermination together with lastProgress, it always returns null.

val q4s = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()…
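A minimal sketch of the usual workaround (the console sink is a hypothetical stand-in for the poster's sink): awaitTermination() blocks the calling thread, so any lastProgress call placed after it never runs while the query is live; a StreamingQueryListener receives the same metrics on every trigger instead.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // The same metrics lastProgress exposes, delivered per trigger.
    println(s"numInputRows = ${event.progress.numInputRows}")
  }
})

val query = q4s.writeStream.format("console").start()  // hypothetical sink
query.awaitTermination()   // blocking here no longer hides the metrics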

How to do this transformation in SQL/Spark/GraphFrames

Submitted by 北战南征 on 2020-12-31 04:32:48
Question: I have a table containing the following two columns:

Device-Id    Account-Id
d1           a1
d2           a1
d1           a2
d2           a3
d3           a4
d3           a5
d4           a6
d1           a4

Device-Id is the unique ID of the device on which my app is installed, and Account-Id is the ID of a user account. A user can have multiple devices and can create multiple accounts on the same device (e.g. device d1 has accounts a1, a2 and a3 set up). I want to find the unique actual users (each should be represented as a new column with some unique UUID in the generated table) and the…
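A minimal sketch of the graph formulation this usually maps to, not the accepted answer (df stands for the two-column table above; the checkpoint path is an assumption): treat every device and account as a vertex, every table row as an edge, and let each connected component stand for one actual user.

import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

val edges = df
  .withColumnRenamed("Device-Id", "src")
  .withColumnRenamed("Account-Id", "dst")

val vertices = edges.select(col("src").as("id"))
  .union(edges.select(col("dst").as("id")))
  .distinct()

// connectedComponents requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")

val users = GraphFrame(vertices, edges).connectedComponents.run()
// `users` carries a `component` column: one value per actual user,
// which can be mapped to a UUID afterwards.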

Running multiple Spark Kafka Structured Streaming queries in same spark session increasing the offset but showing numInputRows 0

Submitted by 吃可爱长大的小学妹 on 2020-12-30 04:34:58
Question: I have a Spark Structured Streaming job consuming records from a Kafka topic with 2 partitions. Spark job: 2 queries, each consuming from one of the 2 partitions, running in the same Spark session.

val df1 = session.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("assign", "{\"multi-stream1\" : [0]}")
  .option("startingOffsets", latest)
  .option("key.deserializer", classOf[StringDeserializer].getName)
  .option("value.deserializer", classOf[StringDeserializer]…
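A minimal sketch of one frequent cause, offered as a guess rather than a diagnosis of the poster's job (df2, the console sink, and the checkpoint paths are assumptions): when two queries share a checkpointLocation, the committed Kafka offsets keep advancing while each query reports numInputRows = 0, so every query needs its own checkpoint directory. Note also that the Kafka source ignores the key.deserializer/value.deserializer options and always returns binary key and value columns to be cast explicitly.

val q1 = df1.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/query-1")  // one directory per query
  .start()

val q2 = df2.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt/query-2")  // never shared with q1
  .start()

session.streams.awaitAnyTermination()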
