apache-spark-sql

Remove duplicates from PySpark array column

こ雲淡風輕ζ submitted on 2021-02-08 06:48:50
Question: I have a PySpark DataFrame that contains an ArrayType(StringType()) column. This column contains duplicate strings inside the array which I need to remove. For example, one row entry could look like [milk, bread, milk, toast]. Let's say my dataframe is named df and my column is named arraycol. I need something like: df = df.withColumn("arraycol_without_dupes", F.remove_dupes_from_array("arraycol")) My intuition was that there exists a simple solution to this, but after browsing stackoverflow …
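
Not from the original thread, but one possible approach: Spark 2.4 added the built-in array_distinct function, which removes duplicate elements from an array column. A minimal PySpark sketch, assuming Spark 2.4+ and the column names used in the question:

```python
# Minimal sketch, assuming Spark 2.4+ (array_distinct) and the names from the question.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedupe-array").getOrCreate()
df = spark.createDataFrame([(["milk", "bread", "milk", "toast"],)], ["arraycol"])

# Keep one copy of each element in the array.
df = df.withColumn("arraycol_without_dupes", F.array_distinct("arraycol"))
df.show(truncate=False)
```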

Pyspark: Calculate streak of consecutive observations

走远了吗. submitted on 2021-02-08 06:44:26
Question: I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes. An example: from pyspark.sql import (SparkSession, Window) from pyspark.sql.functions import (to_date, row_number, lead, col) spark = SparkSession.builder.appName('test').getOrCreate() # Create dataframe df = spark.createDataFrame([ ('2018-01-01', 'John', 0, 0), ('2018-01-01', 'Paul …
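
A sketch of one common "gaps and islands" approach, not necessarily the thread's accepted answer; the column names (date, name, value) are assumptions for illustration:

```python
# Sketch: streaks via the difference of two row_number windows (assumed column names).
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaks").getOrCreate()

df = spark.createDataFrame([
    ("2018-01-01", "John", 0),
    ("2018-01-02", "John", 0),
    ("2018-01-03", "John", 1),
    ("2018-01-04", "John", 1),
    ("2018-01-05", "John", 0),
], ["date", "name", "value"])

w_all = Window.partitionBy("name").orderBy("date")
w_val = Window.partitionBy("name", "value").orderBy("date")

# Rows in the same run of identical values share the same row_number difference,
# so "grp" identifies each run; the streak is the row_number inside the run.
df = (df
      .withColumn("grp", F.row_number().over(w_all) - F.row_number().over(w_val))
      .withColumn("streak",
                  F.row_number().over(Window.partitionBy("name", "value", "grp")
                                      .orderBy("date"))))
df.orderBy("date").show()
```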

SPARK - Joining 2 dataframes on values in an array

非 Y 不嫁゛ submitted on 2021-02-08 06:40:54
Question: I can't find an easy and elegant solution to this one. I have a df1 with this column: |-- guitars: array (nullable = true) | |-- element: long (containsNull = true) I have a df2 made of guitars, and an id matching the Long in my df1. root |-- guitarId: long (nullable = true) |-- make: string (nullable = true) |-- model: string (nullable = true) |-- type: string (nullable = true) I want to join my two dfs, obviously, and instead of having an array of long, I want an array of struct …
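
Not the answer from the thread, just a sketch of one common way to do it: explode the array, join on the id, then re-aggregate into an array of structs. Shown in PySpark (the Scala DataFrame API is analogous); the "band" column is a hypothetical key identifying each df1 row, and the sample data is made up.

```python
# Sketch: explode -> join -> collect_list(struct(...)); "band" is a hypothetical row key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("array-join").getOrCreate()

df1 = spark.createDataFrame([("metallica", [1, 3])], ["band", "guitars"])
df2 = spark.createDataFrame(
    [(1, "Gibson", "Explorer", "electric"), (3, "ESP", "KH-2", "electric")],
    ["guitarId", "make", "model", "type"])

result = (df1
          .withColumn("guitarId", F.explode("guitars"))   # one row per guitar id
          .join(df2, "guitarId", "left")                   # attach the guitar details
          .groupBy("band")
          .agg(F.collect_list(F.struct("guitarId", "make", "model", "type"))
               .alias("guitars")))                         # back to an array, now of structs
result.show(truncate=False)
```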

Count number of words in each sentence Spark Dataframes

女生的网名这么多〃 submitted on 2021-02-08 06:25:17
Question: I have a Spark Dataframe where each row has a review. +--------------------+ | reviewText| +--------------------+ |Spiritually and m...| |This is one my mu...| |This book provide...| |I first read THE ...| +--------------------+ I have tried: SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText'))) SplitSentences = SplitSentences.select(SplitSentences.split_sent) Then I created the function: def word_count(text): return len(text.split()) wordcount_udf = udf(lambda x: …
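
A minimal sketch, not the thread's answer: word counts per row can be computed with built-in functions (split plus size) instead of a Python UDF, which is usually faster. The column name reviewText follows the question; the sample rows are made up.

```python
# Sketch: count words per review with built-in split + size (no UDF needed).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("word-count").getOrCreate()
df = spark.createDataFrame([("Spiritually and mentally inspiring",),
                            ("This is one my must-read books",)],
                           ["reviewText"])

# Split on runs of whitespace, then take the length of the resulting array.
df = df.withColumn("word_count", F.size(F.split(F.col("reviewText"), r"\s+")))
df.show(truncate=False)
```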

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

六眼飞鱼酱① submitted on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | N/A| baz| |null| etc| +----+----+ I can selectively replace values like so: val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2")) so that df2 looks like: +----+----+ | a1| a2| +----+----+ | foo| bar| | baz| baz| |null| etc| +----+----+ but why can't I check if it's null, like: val df3 = df2.withColumn("a1", when($"a1" === null, $"a2")) so that I get: +----+----+ | a1| a2| +----+----+ | foo| …
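
Not from the thread, only a sketch of the usual fix: comparing a column with null using === (or == None in PySpark) evaluates to null, so the condition never matches; the null test has to use isNull. Shown here in PySpark; the Scala equivalent of the condition is $"a1".isNull. The .otherwise keeps the original value for the rows that are not null.

```python
# Sketch: use isNull() in the when() condition instead of comparing with null.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-check").getOrCreate()
df = spark.createDataFrame([("foo", "bar"), ("N/A", "baz"), (None, "etc")],
                           ["a1", "a2"])

df3 = df.withColumn("a1", F.when(F.col("a1").isNull(), F.col("a2"))
                           .otherwise(F.col("a1")))
df3.show()
```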

Spark2 session for Cassandra, SQL queries

南楼画角 submitted on 2021-02-08 03:50:12
Question: In Spark 2.0, what is the best way to create a Spark session? In both Spark 2.0 and Cassandra the APIs have been reworked, essentially deprecating SqlContext (and also CassandraSqlContext). So for executing SQL, I can either create a Cassandra Session (com.datastax.driver.core.Session) and use execute(" "), or create a SparkSession (org.apache.spark.sql.SparkSession) and call its sql(String sqlText) method. I don't know the SQL limitations of either; can someone explain?
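
A sketch (not from the thread) of the SparkSession route, shown in PySpark and assuming the spark-cassandra-connector package is on the classpath; the keyspace and table names ("ks", "users") are hypothetical. The idea is to register the Cassandra table as a temp view and then run regular Spark SQL against it.

```python
# Sketch: SparkSession + spark-cassandra-connector, assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-sql")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Load the Cassandra table through the connector and expose it to Spark SQL.
(spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="users")   # hypothetical keyspace/table
      .load()
      .createOrReplaceTempView("users"))

spark.sql("SELECT count(*) FROM users").show()
```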

spark dataframe: explode list column

梦想的初衷 submitted on 2021-02-07 21:32:38
Question: I've got an output from a Spark Aggregator which is List[Character] case class Character(name: String, secondName: String, faculty: String) val charColumn = HPAggregator.toColumn val resultDF = someDF.select(charColumn) So my dataframe looks like: +-----------------------------------------------+ | value | +-----------------------------------------------+ |[[harry, potter, gryffindor],[ron, weasley ... | +-----------------------------------------------+ Now I want to convert it to +------------ …
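
A sketch of the usual pattern, shown in PySpark although the question is in Scala (explode and the "struct.*" selection work the same way there): explode the array of structs into one row per element, then flatten the struct fields into columns. The sample data mirrors the question's schema.

```python
# Sketch: explode an array-of-structs column, then flatten the struct fields.
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.appName("explode-structs").getOrCreate()

# One row whose single "value" column holds an array of structs.
df = spark.createDataFrame([Row(value=[
    Row(name="harry", secondName="potter", faculty="gryffindor"),
    Row(name="ron", secondName="weasley", faculty="gryffindor"),
])])

result = (df
          .select(F.explode("value").alias("character"))  # one row per struct
          .select("character.*"))                         # name, secondName, faculty columns
result.show()
```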

How to execute async operations (i.e. returning a Future) from map/filter/etc.?

十年热恋 submitted on 2021-02-07 20:20:23
Question: I have a DataSet.map operation that needs to pull data in from an external REST API. The REST API client returns a Future[Int]. Is it possible to have the DataSet.map operation somehow await this Future asynchronously? Or will I need to block the thread using Await.result? Or is this just not the done thing... i.e. should I instead try to load the data held by the API into a DataSet of its own, and perform a join? Thanks in advance! EDIT: Different from: Spark job with Async HTTP call
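
The question is about Scala Futures, so this is only a PySpark analogue of the pattern that is commonly suggested: move the external calls into mapPartitions and resolve them in batches per partition, rather than blocking once per record inside map. Here a thread pool stands in for the async client, and fetch_score is a hypothetical placeholder for the real REST call.

```python
# Sketch: batch external calls per partition with mapPartitions + a thread pool.
# fetch_score is a hypothetical stand-in for the REST client.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

def fetch_score(record_id):
    # Placeholder for the real REST call returning an int.
    return record_id * 10

def enrich_partition(rows):
    rows = list(rows)
    # Issue the calls for the whole partition concurrently, then yield results.
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(lambda r: fetch_score(r["id"]), rows))
    for row, score in zip(rows, scores):
        yield (row["id"], score)

spark = SparkSession.builder.appName("async-enrich").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

enriched = df.rdd.mapPartitions(enrich_partition).toDF(["id", "score"])
enriched.show()
```

If the API can export its data in bulk, loading it into its own DataSet and joining, as the question suggests, is usually the simpler and more robust option.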