apache-spark-sql

Implement SCD Type 2 in Spark

自闭症网瘾萝莉.ら submitted on 2021-02-18 08:47:47
Question: Trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing 'Existing Data' and the other containing 'New Incoming Data'. Input and expected output are given below. What needs to happen is: All incoming rows should get appended to the existing data. Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows: pk=1, amount = 20 => Row should become 'inactive' & 'endDate' is the 'startDate' of
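
A minimal sketch of one common SCD Type 2 pattern in Spark/Scala. The DataFrame names (existing, incoming) and column names (pk, startDate, endDate, active) are assumptions, as is the constraint of at most one incoming row per pk; the asker's actual schema is not fully shown.

```scala
import org.apache.spark.sql.functions._

// Close out currently-active rows whose key appears in the incoming batch:
// endDate becomes the incoming row's startDate and the row is flagged inactive.
val incomingKeys = incoming.select(col("pk"), col("startDate").as("newStartDate"))

val closedOut = existing
  .join(incomingKeys, Seq("pk"), "left")
  .withColumn("endDate",
    when(col("active") && col("newStartDate").isNotNull, col("newStartDate"))
      .otherwise(col("endDate")))
  .withColumn("active",
    when(col("newStartDate").isNotNull, lit(false)).otherwise(col("active")))
  .drop("newStartDate")

// Incoming rows are appended as the new active versions.
val newVersions = incoming
  .withColumn("endDate", lit(null).cast("date"))
  .withColumn("active", lit(true))

val result = closedOut.unionByName(newVersions)
```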

In spark, is their any alternative for union() function while appending new row?

混江龙づ霸主 submitted on 2021-02-18 08:40:35
Question: In my code, table_df has some columns on which I am doing calculations like min, max, mean, etc., and I want to create new_df with a specified schema new_df_schema. In my logic I have written spark-sql for the calculations, appending each newly generated row to an initially empty new_df so that, at the end, new_df holds all the calculated values for all columns. But the problem is that when there are many columns this leads to a performance issue. Can this be done without using the union() function
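
One way to avoid building one-row DataFrames and union()-ing them is to compute every statistic in a single agg() pass. A sketch, assuming hypothetical numeric columns c1 and c2 in table_df; the reshaping into new_df_schema would still be a separate step.

```scala
import org.apache.spark.sql.functions._

// Build one aggregate expression per (column, statistic) pair and evaluate
// them all in a single job instead of one union() per computed row.
val targetCols = Seq("c1", "c2") // hypothetical column names
val statExprs = targetCols.flatMap { c =>
  Seq(
    min(col(c)).as(s"${c}_min"),
    max(col(c)).as(s"${c}_max"),
    avg(col(c)).as(s"${c}_mean")
  )
}

val stats = table_df.agg(statExprs.head, statExprs.tail: _*)
stats.show()
```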

Calculate a grouped median in pyspark

倖福魔咒の submitted on 2021-02-18 07:55:36
Question: When using pyspark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :) from pyspark import SparkContext from pyspark.sql import SparkSession from pyspark.sql.types import ( StringType, LongType, DoubleType, StructField,
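
Kept in Scala for consistency with the other sketches (the PySpark API mirrors it closely), one approach is the approximate median via percentile_approx, then joining it back per group. The DataFrame name df and columns grp/value are assumptions, not the asker's schema.

```scala
import org.apache.spark.sql.functions._

// percentile_approx(_, 0.5) gives an approximate median per group; an exact
// median would need a different (and more expensive) approach.
val medians = df.groupBy("grp")
  .agg(expr("percentile_approx(value, 0.5)").as("grp_median"))

// Join the per-group median back and take the difference row by row.
val withDiff = df.join(medians, Seq("grp"))
  .withColumn("diff_from_median", col("value") - col("grp_median"))
```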

Should we parallelize a DataFrame like we parallelize a Seq before training

不羁的心 submitted on 2021-02-17 15:36:40
问题 Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html import org.apache.spark.ml.classification.LogisticRegression val training = sparkContext.parallelize(Seq( LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))) val lr = new LogisticRegression() lr.setMaxIter(10).setRegParam(0.01) val model1 = lr.fit(training) Assuming we
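
A DataFrame (or Dataset) is already a distributed structure, so there is nothing to parallelize the way a local Seq is; parallelize() is only the bridge from a local collection to the cluster. A sketch of the Spark 2.x-style equivalent of the snippet above, where the label/features column names follow the usual spark.ml convention:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// createDataFrame already distributes the local Seq; the resulting DataFrame
// needs no further parallelize() call before training.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
```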

In Spark scala, how to check between adjacent rows in a dataframe

只谈情不闲聊 submitted on 2021-02-17 05:52:12
Question: How can I check the dates of the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level. I have the following data after sorting on key and dates: source_Df.show() +-----+--------+------------+------------+ | key | code | begin_dt | end_dt | +-----+--------+------------+------------+ | 10 | ABC | 2018-01-01 | 2018-01-08 | | 10 | BAC | 2018-01-03 | 2018-01-15 | | 10 | CAS | 2018-01-03 | 2018-01-21 | | 20 | AAA | 2017-11-12 | 2018-01-03 | | 20 | DAS | 2018-01-01 |
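
The usual tool for looking at the preceding/next row per key is a window with lag and lead. A sketch over the source_Df shown above; the derived column names (prev_end_dt, overlaps_prev, etc.) and the overlap check are illustrative assumptions.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One window per key, ordered by the begin date, so lag/lead only look at
// adjacent rows for the same key.
val w = Window.partitionBy("key").orderBy("begin_dt")

val withNeighbours = source_Df
  .withColumn("prev_end_dt", lag(col("end_dt"), 1).over(w))
  .withColumn("next_begin_dt", lead(col("begin_dt"), 1).over(w))
  // Example check: does this row start before the previous row has ended?
  .withColumn("overlaps_prev",
    col("prev_end_dt").isNotNull && col("begin_dt") <= col("prev_end_dt"))
```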

Transposing table to given format in spark [duplicate]

六月ゝ 毕业季﹏ submitted on 2021-02-17 05:09:18
Question: This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 days ago. I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below: val df = Seq( ("A", "2016-01-01", "2016-12-01", "0.044999408"), ("A", "2016-01-01", "2016-12-01", "0.0449999426"), ("A", "2016-01-01", "2016-12-01", "0.045999415"), ("B", "2016-01-01", "2016-12-01", "0.0787888909"), ("B", "2016-01-01", "2016-12-01", "0.079779426"), ("B", "2016-01-01", "2016-12
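
As the linked duplicate covers, this kind of reshaping is groupBy().pivot() territory. A minimal sketch against a few of the rows above; the column names and the collect_list aggregation are illustrative guesses, since the desired output format is cut off here.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909")
).toDF("id", "start_dt", "end_dt", "value") // hypothetical column names

// Turn the distinct ids into columns, with one aggregated list of values each.
val pivoted = df.groupBy("start_dt", "end_dt")
  .pivot("id")
  .agg(collect_list(col("value")))

pivoted.show(false)
```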

Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ 2016.3. import sbt.Keys._ import sbt._ object ApplicationBuild extends Build { object Versions { val spark = "1.6.3" } val projectName = "example-spark" val common = Seq( version := "1.0", scalaVersion := "2.11.7" ) val customLibraryDependencies = Seq( "org.apache.spark" %% "spark-core" % Versions.spark % "provided", "org.apache.spark" %% "spark-sql" % Versions.spark % "provided", "org.apache.spark" %% "spark-hive" % Versions.spark % "provided", "org.apache.spark"
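
A NoClassDefFoundError on org/apache/spark/sql/SQLContext when launching from the IDE usually points at the "provided" scope above: the Spark jars are on the compile classpath but not the runtime one. One commonly used workaround (shown in sbt 0.13-era syntax to match the Build.scala above; adjust for newer sbt versions) keeps "provided" for packaging but puts those jars back on the run classpath:

```scala
// In the sbt build definition: run the application with the *compile*
// classpath, which still contains the "provided" Spark dependencies.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```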

How to display (or operate on) objects encoded by Kryo in Spark Dataset?

旧城冷巷雨未停 submitted on 2021-02-16 13:55:08
Question: Say you have this: // assume we handle custom type class MyObj(val i: Int, val j: String) implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj] val ds = spark.createDataset(Seq(new MyObj(1, "a"),new MyObj(2, "b"),new MyObj(3, "c"))) When I do a ds.show, I get: +--------------------+ | value| +--------------------+ |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| +--------------------+ I understand that it's because the contents are encoded into internal
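
With a Kryo encoder the whole object is stored as one binary "value" column, so show() can only print bytes. One way to inspect the contents is to map the objects back onto ordinary columns (covered by the built-in tuple encoders) before displaying, as in this sketch built on the ds from the question:

```scala
import spark.implicits._

// Map each Kryo-encoded MyObj back to an (Int, String) tuple, which Spark
// can encode as named, printable columns.
ds.map(o => (o.i, o.j))
  .toDF("i", "j")
  .show()
```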

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp and the timestamps have to be within a certain offset (5 seconds) of each other in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with PySpark, and it is extremely painful. How do
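
Kept in Scala for consistency with the other sketches (the PySpark condition is written the same way), a minimal version of the 5-second join, reusing the question's dates_df.date and events_df.time columns and assuming they are (or can be cast to) timestamps:

```scala
import org.apache.spark.sql.functions._

// Non-equi join: keep pairs whose timestamps are at most 5 seconds apart.
// Range conditions like this can be expensive on large inputs, since they
// cannot use an ordinary equi-join.
val joined = dates_df.join(
  events_df,
  abs(unix_timestamp(dates_df("date")) - unix_timestamp(events_df("time"))) <= 5
)
```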