apache-spark

Getting the correct timezone offset using current_timestamp in Apache Spark

 ̄綄美尐妖づ submitted on 2021-01-29 16:18:27
Question: I am new to both Java and Apache Spark and am trying to understand timestamp and timezone usage. I would like all timestamps coming from my Apache Spark DF to be stored in SQL Server in the EST timezone. When I use current_timestamp I get the correct EST time, but the offset I see in the stored data is '+00:00' instead of '-04:00'. Here is a value stored in the database that was passed in from the Spark dataset: 2020-04-07 11:36:23.0220 +00:00. From what I see, current_timestamp …
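
The question is cut off above. As a hedged illustration, and in PySpark rather than the Java API the asker uses, one way to keep EST wall-clock values with an explicit offset is to set the session time zone and format the offset into a string column before writing to SQL Server; the pattern letters and column names below are my own choices, not taken from the post.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Spark stores timestamps without an offset; the session time zone only controls
# how they are rendered (and how they are handed to JDBC drivers).
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

df = spark.range(1).select(
    F.current_timestamp().alias("ts"),
    # "XXX" renders the zone offset, e.g. -04:00 during daylight saving time
    F.date_format(F.current_timestamp(), "yyyy-MM-dd HH:mm:ss.SSS XXX").alias("ts_with_offset"),
)
df.show(truncate=False)
```

Persisting the formatted string (or feeding a SQL Server DATETIMEOFFSET column from it) is what keeps the -04:00 visible; a plain timestamp column simply has no offset to preserve.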

SAS Proc Freq with PySpark (frequency, percent, cumulative frequency, and cumulative percent)

纵然是瞬间 submitted on 2021-01-29 15:29:09
Question: I'm looking for a way to reproduce the SAS Proc Freq output in PySpark. I found code that does exactly what I need, but it is written in Pandas. I want to make sure the solution uses the best of what Spark can offer, since the code will run against massive datasets. In another post (which was also adapted for a StackOverflow answer), I found instructions for computing distributed group-wise cumulative sums in PySpark, but I'm not sure how to adapt them to my goal. Here's an input and output example …
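
The question's input/output example is truncated. Below is a minimal PySpark sketch of the four PROC FREQ statistics, assuming a single analysis column (a placeholder named "category"); the un-partitioned window is acceptable because it only runs over the already-aggregated frequency table.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",), ("c",), ("c",), ("c",)], ["category"])

total = df.count()
freq = df.groupBy("category").count().withColumnRenamed("count", "frequency")

# Cumulative statistics over the (small) aggregated table, ordered by level.
w = Window.orderBy("category").rowsBetween(Window.unboundedPreceding, Window.currentRow)

proc_freq = (freq
             .withColumn("percent", F.col("frequency") / total * 100)
             .withColumn("cumulative_frequency", F.sum("frequency").over(w))
             .withColumn("cumulative_percent", F.sum("frequency").over(w) / total * 100)
             .orderBy("category"))
proc_freq.show()
```

The heavy lifting (the groupBy) stays distributed; only the per-level cumulative step runs on a single partition, and by then the data is one row per level.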

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

核能气质少年 submitted on 2021-01-29 15:11:34
Question: I'm trying to understand Databricks Delta and am planning a POC using Kafka. Basically, the plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps I took:

1. Create a Delta table on Databricks.

    %sql
    CREATE TABLE hazriq_delta_trial2 (
      value STRING
    )
    USING delta
    LOCATION '/delta/hazriq_delta_trial2'

2. Consume data from Kafka.

    import org.apache.spark.sql.types._
    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc" …
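
The Scala snippet is cut off at the Kafka topic. Below is a hedged PySpark sketch of the consume-and-append step, reusing the broker list, topic, and table location from the question; the checkpoint path is my own placeholder, and a Delta-enabled cluster (e.g. Databricks) is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Kafka topic as a stream; values arrive as bytes.
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:port,broker2:port,broker3:port")
            .option("subscribe", "kafkapoc")
            .option("startingOffsets", "latest")
            .load())

# Cast the payload to STRING to match the single column of hazriq_delta_trial2.
values = kafka_df.selectExpr("CAST(value AS STRING) AS value")

query = (values.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoints/kafka")
         .start("/delta/hazriq_delta_trial2"))
query.awaitTermination()
```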

Killing a Spark Streaming job when there is no activity

本秂侑毒 submitted on 2021-01-29 13:40:30
Question: I want to kill my Spark Streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried this:

    var counter = 0
    myDStream.foreachRDD { rdd =>
      if (rdd.count() == 0L) {
        counter = counter + 1
        if (counter == 40) {
          ssc.stop(true, true)
        }
      } else {
        counter = 0
      }
    }

Is there a better way of doing this? How would I make a variable available to all receivers and update it by 1 whenever there is no activity?

Answer 1: Use a NoSQL table like …
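
The answer is cut off above. For comparison, a PySpark sketch of the same counter idea: foreachRDD runs its function on the driver, so an ordinary Python counter works without any shared variable. The socket source and batch interval are stand-ins for the question's receiver, and stopping gracefully from inside a batch mirrors the original code rather than best practice.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, batchDuration=10)
stream = ssc.socketTextStream("localhost", 9999)   # placeholder source

idle_batches = {"count": 0}

def stop_when_idle(time, rdd):
    if rdd.isEmpty():                    # cheaper than rdd.count() == 0
        idle_batches["count"] += 1
        if idle_batches["count"] >= 40:  # 40 consecutive empty batches
            ssc.stop(True, True)         # stop the SparkContext too, gracefully
    else:
        idle_batches["count"] = 0

stream.foreachRDD(stop_when_idle)
ssc.start()
ssc.awaitTermination()
```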

PySpark: How to count the number of elements in each equal-distance interval of an RDD

↘锁芯ラ submitted on 2021-01-29 12:33:11
Question: I have an RDD[Double] that I want to divide into k equal-width intervals, and then count the number of elements in each interval. For example, the RDD is [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10]. As you can see, each element of the RDD falls into one of the intervals, and then I want to count the elements in each interval. Here, there is one element in [0,1), [1,2), [2 …
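
RDD.histogram does almost exactly this: given an integer k it builds k equal-width buckets between the RDD's min and max and counts elements per bucket in one distributed pass. A minimal sketch using the data from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10])
buckets, counts = data.histogram(10)

# buckets -> [0.0, 1.0, ..., 10.0]; counts -> one count per interval.
# All intervals are half-open except the last, which is closed ([9, 10]).
for lo, hi, n in zip(buckets, buckets[1:], counts):
    print(f"[{lo}, {hi}): {n}")
```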

PySpark and time series data: how to smartly avoid overlapping dates?

谁都会走 submitted on 2021-01-29 12:06:08
Question: I have the following sample Spark dataframe:

    import datetime as dt
    import pandas as pd
    import pyspark
    import pyspark.sql.functions as fn
    from pyspark.sql.window import Window

    raw_df = pd.DataFrame([
        (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
        (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
        (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
        (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
        (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18 …
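
The sample data and the intended output are cut off. If the goal is to merge rows whose date ranges overlap within an id, a common gaps-and-islands sketch looks like the following; the column names id, start, end are my assumption, and the rows reuse the 484 entries from the question.

```python
import datetime as dt
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
     (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
     (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40))],
    ["id", "start", "end"])

w = Window.partitionBy("id").orderBy("start")

# A row starts a new island when it begins after every interval seen so far.
flagged = (df
           .withColumn("prev_max_end",
                       F.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
           .withColumn("new_island",
                       F.when(F.col("prev_max_end").isNull()
                              | (F.col("start") > F.col("prev_max_end")), 1).otherwise(0))
           .withColumn("island", F.sum("new_island").over(w)))

# One row per non-overlapping stretch of time per id.
merged = (flagged.groupBy("id", "island")
          .agg(F.min("start").alias("start"), F.max("end").alias("end")))
merged.show(truncate=False)
```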

Spark: Create dataframe with default values

ⅰ亾dé卋堺 submitted on 2021-01-29 11:22:04
Question: Can we set a default value for a field of a dataframe while creating the dataframe? I am creating a Spark dataframe from List<Object[]> rows as:

    List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
    Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);

While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains an object of the org.apache.spark.sql.types.Metadata class. The documentation …
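
The sentence about Metadata is cut off; as far as I know, column Metadata is descriptive annotation only and createDataFrame does not apply defaults. Below is a hedged PySpark sketch (the question itself uses the Java API) of the usual workaround, filling the default right after creation, with made-up column names.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("name", StringType(), True),
                     StructField("score", IntegerType(), True)])
df = spark.createDataFrame([("a", None), ("b", 7)], schema)

# Per-column defaults in one call...
with_defaults = df.na.fill({"score": 0})

# ...or an explicit default expression for a single column.
with_default_expr = df.withColumn("score", F.coalesce(F.col("score"), F.lit(0)))
```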

Best way to get null counts, min and max values of multiple (100+) columns from a PySpark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

        NA_counts  min  max
    A   5          0    100
    B   10         0    120
    C   8          1    99
    D   2          0    500

TIA

Answer 1: You can calculate each metric separately and then union them all, like this:

    nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
    max_cols = [max(col(c)).alias(c) for c in cols]
    min_cols = [min(col(c)).alias(c) for c in cols]
    nulls_df = df …
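
The quoted answer is cut off at the union step. An alternative hedged sketch computes all three metrics for every listed column in a single pass (one agg, then a reshape of the resulting row into the table from the question); the sample data is a small stand-in.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 5, 10), (None, 3, 7, 20), (2, None, None, 30)],
    ["A", "B", "C", "D"])
cols = ["A", "B", "C", "D"]

aggs = []
for c in cols:
    aggs += [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(f"{c}_nulls"),
             F.min(c).alias(f"{c}_min"),
             F.max(c).alias(f"{c}_max")]

row = df.agg(*aggs).first()            # a single job over the data
summary = spark.createDataFrame(
    [(c, row[f"{c}_nulls"], row[f"{c}_min"], row[f"{c}_max"]) for c in cols],
    ["column", "NA_counts", "min", "max"])
summary.show()
```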

Spark GraphFrames: find hierarchy

本小妞迷上赌 submitted on 2021-01-29 10:34:20
Question: I am trying to implement a pretty simple use case. I have two dataframes:

    >>> g.vertices.show(20, False)
    +------------------------+
    |id                      |
    +------------------------+
    |Router_UPDATE_INSERT    |
    |Seq_Unique_Key          |
    |Target_New_Insert       |
    |Target_Existing_Update  |
    |Target_Existing_Insert  |
    |SAMPLE_CUSTOMER         |
    |SAMPLE_CUSTOMER_MASTER  |
    |Sorter_SAMPLE_CUSTOMER  |
    |Sorter_CUSTOMER_MASTER  |
    |Join_Source_Target      |
    |Exp_DetectChanges       |
    |Filter_Unchanged_Records|

Details of the edges:

    >>> g.edges.show(20, False)
    +---------- …
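
The edges output is cut off above. Assuming `vertices` and `edges` are the two dataframes shown (an `id` column for vertices, `src`/`dst` columns for edges) and the graphframes package is installed, here is a hedged sketch of tracing the chain between two nodes with GraphFrames' BFS; the endpoint ids are guesses picked from the vertex list, not taken from the question.

```python
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)

# Breadth-first search returns one row per path, with the intermediate
# vertices and edges exposed as struct columns along the way.
paths = g.bfs(fromExpr="id = 'SAMPLE_CUSTOMER'",
              toExpr="id = 'Target_New_Insert'",
              maxPathLength=10)
paths.show(truncate=False)
```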

Spark Structured Streaming - ways to look up a high-volume non-static dataset?

时间秒杀一切 submitted on 2021-01-29 10:32:34
Question: I want to build a Spark Structured Streaming job that does something like the following (look up a huge non-static dataset):

1. Read from Kafka (JSON records).
2. For each JSON record, get {user_key}.
3. Read from a huge, non-static Phoenix table, filtered by {user_key}.
4. Apply further DF transformations.
5. Write to another Phoenix table.

How can I look up a high-volume, non-static dataset per Kafka message?

Source: https://stackoverflow.com/questions/62421785/spark-structured-streaming-ways-to-lookup-high-volume-non-static-dataset
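
No answer is recorded for this question. Below is a hedged PySpark sketch of one common pattern, foreachBatch with a per-batch re-read of the lookup table, so the non-static Phoenix data is fetched fresh and filtered to only the keys present in each micro-batch. The broker, topic, table, schema, and path names are placeholders, and the phoenix-spark reader/writer options ("table", "zkUrl") are assumptions that vary with the connector version.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("user_key", StringType(), True)])  # hypothetical JSON schema

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:port")
       .option("subscribe", "events")
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("j"))
          .select("j.*"))

def enrich_and_write(batch_df, batch_id):
    # Collect this batch's keys (small) and push them into the Phoenix read,
    # so only the matching slice of the huge table is scanned.
    keys = [r["user_key"] for r in batch_df.select("user_key").distinct().collect()]
    lookup = (spark.read
              .format("org.apache.phoenix.spark")
              .option("table", "HUGE_LOOKUP_TABLE")
              .option("zkUrl", "zookeeper:2181")
              .load()
              .filter(F.col("user_key").isin(keys)))
    enriched = batch_df.join(lookup, "user_key", "left")   # further transformations go here
    (enriched.write
     .format("org.apache.phoenix.spark")
     .option("table", "TARGET_TABLE")
     .option("zkUrl", "zookeeper:2181")
     .mode("overwrite")                                    # the connector upserts rows
     .save())

query = (parsed.writeStream
         .foreachBatch(enrich_and_write)
         .option("checkpointLocation", "/tmp/checkpoints/phoenix-lookup")
         .start())
query.awaitTermination()
```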