apache-spark-sql

Implement SCD Type 2 in Spark

自闭症网瘾萝莉.ら submitted on 2021-02-18 08:47:47
Question: Trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing 'Existing Data' and the other containing 'New Incoming Data'. Input and expected output are given below. What needs to happen is: All incoming rows should get appended to the existing data. Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows: pk=1, amount = 20 => Row should become 'inactive' & 'endDate' is the 'startDate' of
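
A minimal sketch of one common SCD Type 2 pattern in Spark/Scala. The DataFrame names (existing, incoming) and column names (pk, startDate, endDate, active) are assumptions, as is the constraint of at most one incoming row per pk; the asker's actual schema is not fully shown.

```scala
import org.apache.spark.sql.functions._

// Close out currently-active rows whose key appears in the incoming batch:
// endDate becomes the incoming row's startDate and the row is flagged inactive.
val incomingKeys = incoming.select(col("pk"), col("startDate").as("newStartDate"))

val closedOut = existing
  .join(incomingKeys, Seq("pk"), "left")
  .withColumn("endDate",
    when(col("active") && col("newStartDate").isNotNull, col("newStartDate"))
      .otherwise(col("endDate")))
  .withColumn("active",
    when(col("newStartDate").isNotNull, lit(false)).otherwise(col("active")))
  .drop("newStartDate")

// Incoming rows are appended as the new active versions.
val newVersions = incoming
  .withColumn("endDate", lit(null).cast("date"))
  .withColumn("active", lit(true))

val result = closedOut.unionByName(newVersions)
```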

In spark, is their any alternative for union() function while appending new row?

混江龙づ霸主 submitted on 2021-02-18 08:40:35
Question: In my code, table_df has some columns on which I am doing calculations like min, max, mean, etc., and I want to create new_df with a specified schema new_df_schema. In my logic I have written spark-sql for the calculations, appending each newly generated row to an initially empty new_df so that, at the end, new_df holds all the calculated values for all columns. But the problem is that when there are many columns this leads to a performance issue. Can this be done without using the union() function
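
One way to avoid building one-row DataFrames and union()-ing them is to compute every statistic in a single agg() pass. A sketch, assuming hypothetical numeric columns c1 and c2 in table_df; the reshaping into new_df_schema would still be a separate step.

```scala
import org.apache.spark.sql.functions._

// Build one aggregate expression per (column, statistic) pair and evaluate
// them all in a single job instead of one union() per computed row.
val targetCols = Seq("c1", "c2") // hypothetical column names
val statExprs = targetCols.flatMap { c =>
  Seq(
    min(col(c)).as(s"${c}_min"),
    max(col(c)).as(s"${c}_max"),
    avg(col(c)).as(s"${c}_mean")
  )
}

val stats = table_df.agg(statExprs.head, statExprs.tail: _*)
stats.show()
```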

Calculate a grouped median in pyspark

倖福魔咒の submitted on 2021-02-18 07:55:36
Question: When using pyspark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :) from pyspark import SparkContext from pyspark.sql import SparkSession from pyspark.sql.types import ( StringType, LongType, DoubleType, StructField,
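
Kept in Scala for consistency with the other sketches (the PySpark API mirrors it closely), one approach is the approximate median via percentile_approx, then joining it back per group. The DataFrame name df and columns grp/value are assumptions, not the asker's schema.

```scala
import org.apache.spark.sql.functions._

// percentile_approx(_, 0.5) gives an approximate median per group; an exact
// median would need a different (and more expensive) approach.
val medians = df.groupBy("grp")
  .agg(expr("percentile_approx(value, 0.5)").as("grp_median"))

// Join the per-group median back and take the difference row by row.
val withDiff = df.join(medians, Seq("grp"))
  .withColumn("diff_from_median", col("value") - col("grp_median"))
```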

Should we parallelize a DataFrame like we parallelize a Seq before training

不羁的心 submitted on 2021-02-17 15:36:40
问题 Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html import org.apache.spark.ml.classification.LogisticRegression val training = sparkContext.parallelize(Seq( LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))) val lr = new LogisticRegression() lr.setMaxIter(10).setRegParam(0.01) val model1 = lr.fit(training) Assuming we
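
A DataFrame (or Dataset) is already a distributed structure, so there is nothing to parallelize the way a local Seq is; parallelize() is only the bridge from a local collection to the cluster. A sketch of the Spark 2.x-style equivalent of the snippet above, where the label/features column names follow the usual spark.ml convention:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// createDataFrame already distributes the local Seq; the resulting DataFrame
// needs no further parallelize() call before training.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
```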

In Spark scala, how to check between adjacent rows in a dataframe

只谈情不闲聊 submitted on 2021-02-17 05:52:12
Question: How can I check the dates of the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level. I have the following data after sorting on key and dates: source_Df.show() +-----+--------+------------+------------+ | key | code | begin_dt | end_dt | +-----+--------+------------+------------+ | 10 | ABC | 2018-01-01 | 2018-01-08 | | 10 | BAC | 2018-01-03 | 2018-01-15 | | 10 | CAS | 2018-01-03 | 2018-01-21 | | 20 | AAA | 2017-11-12 | 2018-01-03 | | 20 | DAS | 2018-01-01 |
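
The usual tool for looking at the preceding/next row per key is a window with lag and lead. A sketch over the source_Df shown above; the derived column names (prev_end_dt, overlaps_prev, etc.) and the overlap check are illustrative assumptions.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One window per key, ordered by the begin date, so lag/lead only look at
// adjacent rows for the same key.
val w = Window.partitionBy("key").orderBy("begin_dt")

val withNeighbours = source_Df
  .withColumn("prev_end_dt", lag(col("end_dt"), 1).over(w))
  .withColumn("next_begin_dt", lead(col("begin_dt"), 1).over(w))
  // Example check: does this row start before the previous row has ended?
  .withColumn("overlaps_prev",
    col("prev_end_dt").isNotNull && col("begin_dt") <= col("prev_end_dt"))
```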

Transposing table to given format in spark [duplicate]

六月ゝ 毕业季﹏ submitted on 2021-02-17 05:09:18
Question: This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 days ago. I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below: val df = Seq( ("A", "2016-01-01", "2016-12-01", "0.044999408"), ("A", "2016-01-01", "2016-12-01", "0.0449999426"), ("A", "2016-01-01", "2016-12-01", "0.045999415"), ("B", "2016-01-01", "2016-12-01", "0.0787888909"), ("B", "2016-01-01", "2016-12-01", "0.079779426"), ("B", "2016-01-01", "2016-12
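
As the linked duplicate covers, this kind of reshaping is groupBy().pivot() territory. A minimal sketch against a few of the rows above; the column names and the collect_list aggregation are illustrative guesses, since the desired output format is cut off here.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909")
).toDF("id", "start_dt", "end_dt", "value") // hypothetical column names

// Turn the distinct ids into columns, with one aggregated list of values each.
val pivoted = df.groupBy("start_dt", "end_dt")
  .pivot("id")
  .agg(collect_list(col("value")))

pivoted.show(false)
```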

Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ 2016.3. import sbt.Keys._ import sbt._ object ApplicationBuild extends Build { object Versions { val spark = "1.6.3" } val projectName = "example-spark" val common = Seq( version := "1.0", scalaVersion := "2.11.7" ) val customLibraryDependencies = Seq( "org.apache.spark" %% "spark-core" % Versions.spark % "provided", "org.apache.spark" %% "spark-sql" % Versions.spark % "provided", "org.apache.spark" %% "spark-hive" % Versions.spark % "provided", "org.apache.spark"
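
A NoClassDefFoundError on org/apache/spark/sql/SQLContext when launching from the IDE usually points at the "provided" scope above: the Spark jars are on the compile classpath but not the runtime one. One commonly used workaround (shown in sbt 0.13-era syntax to match the Build.scala above; adjust for newer sbt versions) keeps "provided" for packaging but puts those jars back on the run classpath:

```scala
// In the sbt build definition: run the application with the *compile*
// classpath, which still contains the "provided" Spark dependencies.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```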

How to display (or operate on) objects encoded by Kryo in Spark Dataset?

旧城冷巷雨未停 submitted on 2021-02-16 13:55:08
Question: Say you have this: // assume we handle custom type class MyObj(val i: Int, val j: String) implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj] val ds = spark.createDataset(Seq(new MyObj(1, "a"),new MyObj(2, "b"),new MyObj(3, "c"))) When I do a ds.show, I get: +--------------------+ | value| +--------------------+ |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| +--------------------+ I understand that it's because the contents are encoded into internal
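
With a Kryo encoder the whole object is stored as one binary "value" column, so show() can only print bytes. One way to inspect the contents is to map the objects back onto ordinary columns (covered by the built-in tuple encoders) before displaying, as in this sketch built on the ds from the question:

```scala
import spark.implicits._

// Map each Kryo-encoded MyObj back to an (Int, String) tuple, which Spark
// can encode as named, printable columns.
ds.map(o => (o.i, o.j))
  .toDF("i", "j")
  .show()
```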

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp and the timestamps have to be within a certain offset (5 seconds) of each other in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with PySpark, and it is extremely painful. How do
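
Kept in Scala for consistency with the other sketches (the PySpark condition is written the same way), a minimal version of the 5-second join, reusing the question's dates_df.date and events_df.time columns and assuming they are (or can be cast to) timestamps:

```scala
import org.apache.spark.sql.functions._

// Non-equi join: keep pairs whose timestamps are at most 5 seconds apart.
// Range conditions like this can be expensive on large inputs, since they
// cannot use an ordinary equi-join.
val joined = dates_df.join(
  events_df,
  abs(unix_timestamp(dates_df("date")) - unix_timestamp(events_df("time"))) <= 5
)
```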