apache-spark

Getting the correct timezone offset using current_timestamp in Apache Spark

 ̄綄美尐妖づ submitted on 2021-01-29 16:18:27
Question: I am new to both Java and Apache Spark and am trying to understand timestamp and timezone usage. I would like all timestamps coming from my Apache Spark DF to be stored in SQL Server in the EST timezone. When I use current_timestamp I get the correct EST time, but the offset I see in the stored data is '+00:00' instead of '-04:00'. Here is a value stored in the database that was passed in from the Spark dataset: 2020-04-07 11:36:23.0220 +00:00. From what I see, current_timestamp …
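
The question is cut off above. As a hedged illustration, and in PySpark rather than the Java API the asker uses, one way to keep EST wall-clock values with an explicit offset is to set the session time zone and format the offset into a string column before writing to SQL Server; the pattern letters and column names below are my own choices, not taken from the post.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Spark stores timestamps without an offset; the session time zone only controls
# how they are rendered (and how they are handed to JDBC drivers).
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

df = spark.range(1).select(
    F.current_timestamp().alias("ts"),
    # "XXX" renders the zone offset, e.g. -04:00 during daylight saving time
    F.date_format(F.current_timestamp(), "yyyy-MM-dd HH:mm:ss.SSS XXX").alias("ts_with_offset"),
)
df.show(truncate=False)
```

Persisting the formatted string (or feeding a SQL Server DATETIMEOFFSET column from it) is what keeps the -04:00 visible; a plain timestamp column simply has no offset to preserve.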

SAS Proc Freq with PySpark (frequency, percent, cumulative frequency, and cumulative percent)

纵然是瞬间 submitted on 2021-01-29 15:29:09
Question: I'm looking for a way to reproduce the SAS Proc Freq output in PySpark. I found code that does exactly what I need, but it is written in Pandas. I want to make sure the solution uses the best of what Spark can offer, since the code will run against massive datasets. In another post (which was also adapted for a StackOverflow answer), I found instructions for computing distributed group-wise cumulative sums in PySpark, but I'm not sure how to adapt them to my goal. Here's an input and output example …
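
The question's input/output example is truncated. Below is a minimal PySpark sketch of the four PROC FREQ statistics, assuming a single analysis column (a placeholder named "category"); the un-partitioned window is acceptable because it only runs over the already-aggregated frequency table.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",), ("c",), ("c",), ("c",)], ["category"])

total = df.count()
freq = df.groupBy("category").count().withColumnRenamed("count", "frequency")

# Cumulative statistics over the (small) aggregated table, ordered by level.
w = Window.orderBy("category").rowsBetween(Window.unboundedPreceding, Window.currentRow)

proc_freq = (freq
             .withColumn("percent", F.col("frequency") / total * 100)
             .withColumn("cumulative_frequency", F.sum("frequency").over(w))
             .withColumn("cumulative_percent", F.sum("frequency").over(w) / total * 100)
             .orderBy("category"))
proc_freq.show()
```

The heavy lifting (the groupBy) stays distributed; only the per-level cumulative step runs on a single partition, and by then the data is one row per level.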

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

核能气质少年 submitted on 2021-01-29 15:11:34
Question: I'm trying to understand Databricks Delta and am planning a POC using Kafka. Basically, the plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps I took:

1. Create a Delta table on Databricks.

    %sql
    CREATE TABLE hazriq_delta_trial2 (
      value STRING
    )
    USING delta
    LOCATION '/delta/hazriq_delta_trial2'

2. Consume data from Kafka.

    import org.apache.spark.sql.types._
    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc" …
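
The Scala snippet is cut off at the Kafka topic. Below is a hedged PySpark sketch of the consume-and-append step, reusing the broker list, topic, and table location from the question; the checkpoint path is my own placeholder, and a Delta-enabled cluster (e.g. Databricks) is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Kafka topic as a stream; values arrive as bytes.
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:port,broker2:port,broker3:port")
            .option("subscribe", "kafkapoc")
            .option("startingOffsets", "latest")
            .load())

# Cast the payload to STRING to match the single column of hazriq_delta_trial2.
values = kafka_df.selectExpr("CAST(value AS STRING) AS value")

query = (values.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoints/kafka")
         .start("/delta/hazriq_delta_trial2"))
query.awaitTermination()
```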

Killing a Spark Streaming job when there is no activity

本秂侑毒 submitted on 2021-01-29 13:40:30
Question: I want to kill my Spark Streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried this:

    var counter = 0
    myDStream.foreachRDD { rdd =>
      if (rdd.count() == 0L) {
        counter = counter + 1
        if (counter == 40) {
          ssc.stop(true, true)
        }
      } else {
        counter = 0
      }
    }

Is there a better way of doing this? How would I make a variable available to all receivers and update it by 1 whenever there is no activity?

Answer 1: Use a NoSQL table like …
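
The answer is cut off above. For comparison, a PySpark sketch of the same counter idea: foreachRDD runs its function on the driver, so an ordinary Python counter works without any shared variable. The socket source and batch interval are stand-ins for the question's receiver, and stopping gracefully from inside a batch mirrors the original code rather than best practice.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, batchDuration=10)
stream = ssc.socketTextStream("localhost", 9999)   # placeholder source

idle_batches = {"count": 0}

def stop_when_idle(time, rdd):
    if rdd.isEmpty():                    # cheaper than rdd.count() == 0
        idle_batches["count"] += 1
        if idle_batches["count"] >= 40:  # 40 consecutive empty batches
            ssc.stop(True, True)         # stop the SparkContext too, gracefully
    else:
        idle_batches["count"] = 0

stream.foreachRDD(stop_when_idle)
ssc.start()
ssc.awaitTermination()
```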

PySpark: How to count the number of elements in each equal-distance interval of an RDD

↘锁芯ラ submitted on 2021-01-29 12:33:11
Question: I have an RDD[Double] that I want to divide into k equal-width intervals, and then count the number of elements in each interval. For example, the RDD is [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10]. As you can see, each element of the RDD falls into one of the intervals, and then I want to count the elements in each interval. Here, there is one element in [0,1), [1,2), [2 …
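
RDD.histogram does almost exactly this: given an integer k it builds k equal-width buckets between the RDD's min and max and counts elements per bucket in one distributed pass. A minimal sketch using the data from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10])
buckets, counts = data.histogram(10)

# buckets -> [0.0, 1.0, ..., 10.0]; counts -> one count per interval.
# All intervals are half-open except the last, which is closed ([9, 10]).
for lo, hi, n in zip(buckets, buckets[1:], counts):
    print(f"[{lo}, {hi}): {n}")
```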

PySpark and time series data: how to smartly avoid overlapping dates?

谁都会走 submitted on 2021-01-29 12:06:08
Question: I have the following sample Spark dataframe:

    import datetime as dt
    import pandas as pd
    import pyspark
    import pyspark.sql.functions as fn
    from pyspark.sql.window import Window

    raw_df = pd.DataFrame([
        (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
        (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
        (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
        (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
        (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18 …
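
The sample data and the intended output are cut off. If the goal is to merge rows whose date ranges overlap within an id, a common gaps-and-islands sketch looks like the following; the column names id, start, end are my assumption, and the rows reuse the 484 entries from the question.

```python
import datetime as dt
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(484, dt.datetime(2019, 8, 5, 18, 30), dt.datetime(2019, 8, 9, 18, 40)),
     (484, dt.datetime(2019, 8, 4, 18, 30), dt.datetime(2019, 8, 6, 18, 40)),
     (484, dt.datetime(2019, 8, 2, 18, 30), dt.datetime(2019, 8, 3, 18, 40))],
    ["id", "start", "end"])

w = Window.partitionBy("id").orderBy("start")

# A row starts a new island when it begins after every interval seen so far.
flagged = (df
           .withColumn("prev_max_end",
                       F.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
           .withColumn("new_island",
                       F.when(F.col("prev_max_end").isNull()
                              | (F.col("start") > F.col("prev_max_end")), 1).otherwise(0))
           .withColumn("island", F.sum("new_island").over(w)))

# One row per non-overlapping stretch of time per id.
merged = (flagged.groupBy("id", "island")
          .agg(F.min("start").alias("start"), F.max("end").alias("end")))
merged.show(truncate=False)
```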

Spark: Create dataframe with default values

ⅰ亾dé卋堺 submitted on 2021-01-29 11:22:04
Question: Can we set a default value for a field of a dataframe while creating the dataframe? I am creating a Spark dataframe from List<Object[]> rows as:

    List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
    Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);

While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains an object of the org.apache.spark.sql.types.Metadata class. The documentation …
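
The sentence about Metadata is cut off; as far as I know, column Metadata is descriptive annotation only and createDataFrame does not apply defaults. Below is a hedged PySpark sketch (the question itself uses the Java API) of the usual workaround, filling the default right after creation, with made-up column names.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("name", StringType(), True),
                     StructField("score", IntegerType(), True)])
df = spark.createDataFrame([("a", None), ("b", 7)], schema)

# Per-column defaults in one call...
with_defaults = df.na.fill({"score": 0})

# ...or an explicit default expression for a single column.
with_default_expr = df.withColumn("score", F.coalesce(F.col("score"), F.lit(0)))
```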

Best way to get null counts, min and max values of multiple (100+) columns from a PySpark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

        NA_counts  min  max
    A   5          0    100
    B   10         0    120
    C   8          1    99
    D   2          0    500

TIA

Answer 1: You can calculate each metric separately and then union them all, like this:

    nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
    max_cols = [max(col(c)).alias(c) for c in cols]
    min_cols = [min(col(c)).alias(c) for c in cols]
    nulls_df = df …
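
The quoted answer is cut off at the union step. An alternative hedged sketch computes all three metrics for every listed column in a single pass (one agg, then a reshape of the resulting row into the table from the question); the sample data is a small stand-in.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 5, 10), (None, 3, 7, 20), (2, None, None, 30)],
    ["A", "B", "C", "D"])
cols = ["A", "B", "C", "D"]

aggs = []
for c in cols:
    aggs += [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(f"{c}_nulls"),
             F.min(c).alias(f"{c}_min"),
             F.max(c).alias(f"{c}_max")]

row = df.agg(*aggs).first()            # a single job over the data
summary = spark.createDataFrame(
    [(c, row[f"{c}_nulls"], row[f"{c}_min"], row[f"{c}_max"]) for c in cols],
    ["column", "NA_counts", "min", "max"])
summary.show()
```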

Spark GraphFrames: find hierarchy

本小妞迷上赌 submitted on 2021-01-29 10:34:20
Question: I am trying to implement a pretty simple use case. I have two dataframes:

    >>> g.vertices.show(20, False)
    +------------------------+
    |id                      |
    +------------------------+
    |Router_UPDATE_INSERT    |
    |Seq_Unique_Key          |
    |Target_New_Insert       |
    |Target_Existing_Update  |
    |Target_Existing_Insert  |
    |SAMPLE_CUSTOMER         |
    |SAMPLE_CUSTOMER_MASTER  |
    |Sorter_SAMPLE_CUSTOMER  |
    |Sorter_CUSTOMER_MASTER  |
    |Join_Source_Target      |
    |Exp_DetectChanges       |
    |Filter_Unchanged_Records|

Details of the edges:

    >>> g.edges.show(20, False)
    +---------- …
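
The edges output is cut off above. Assuming `vertices` and `edges` are the two dataframes shown (an `id` column for vertices, `src`/`dst` columns for edges) and the graphframes package is installed, here is a hedged sketch of tracing the chain between two nodes with GraphFrames' BFS; the endpoint ids are guesses picked from the vertex list, not taken from the question.

```python
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)

# Breadth-first search returns one row per path, with the intermediate
# vertices and edges exposed as struct columns along the way.
paths = g.bfs(fromExpr="id = 'SAMPLE_CUSTOMER'",
              toExpr="id = 'Target_New_Insert'",
              maxPathLength=10)
paths.show(truncate=False)
```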

Spark Structured Streaming - ways to look up a high-volume non-static dataset?

时间秒杀一切 submitted on 2021-01-29 10:32:34
Question: I want to build a Spark Structured Streaming job that does something like the following (look up a huge non-static dataset):

1. Read from Kafka (JSON records).
2. For each JSON record, get {user_key}.
3. Read from a huge, non-static Phoenix table, filtered by {user_key}.
4. Apply further DF transformations.
5. Write to another Phoenix table.

How can I look up a high-volume, non-static dataset per Kafka message?

Source: https://stackoverflow.com/questions/62421785/spark-structured-streaming-ways-to-lookup-high-volume-non-static-dataset
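
No answer is recorded for this question. Below is a hedged PySpark sketch of one common pattern, foreachBatch with a per-batch re-read of the lookup table, so the non-static Phoenix data is fetched fresh and filtered to only the keys present in each micro-batch. The broker, topic, table, schema, and path names are placeholders, and the phoenix-spark reader/writer options ("table", "zkUrl") are assumptions that vary with the connector version.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("user_key", StringType(), True)])  # hypothetical JSON schema

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:port")
       .option("subscribe", "events")
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("j"))
          .select("j.*"))

def enrich_and_write(batch_df, batch_id):
    # Collect this batch's keys (small) and push them into the Phoenix read,
    # so only the matching slice of the huge table is scanned.
    keys = [r["user_key"] for r in batch_df.select("user_key").distinct().collect()]
    lookup = (spark.read
              .format("org.apache.phoenix.spark")
              .option("table", "HUGE_LOOKUP_TABLE")
              .option("zkUrl", "zookeeper:2181")
              .load()
              .filter(F.col("user_key").isin(keys)))
    enriched = batch_df.join(lookup, "user_key", "left")   # further transformations go here
    (enriched.write
     .format("org.apache.phoenix.spark")
     .option("table", "TARGET_TABLE")
     .option("zkUrl", "zookeeper:2181")
     .mode("overwrite")                                    # the connector upserts rows
     .save())

query = (parsed.writeStream
         .foreachBatch(enrich_and_write)
         .option("checkpointLocation", "/tmp/checkpoints/phoenix-lookup")
         .start())
query.awaitTermination()
```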