apache-spark-sql

How can I concatenate the rows in a pyspark dataframe with multiple columns using groupby and aggregate

Submitted by ぃ、小莉子 on 2020-07-10 03:11:13
Question: I have a pyspark dataframe with multiple columns, for example the one below.

from pyspark.sql import Row

l = [('Jack', "a", "p"), ('Jack', "b", "q"), ('Bell', "c", "r"), ('Bell', "d", "s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)

+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|       a|       p|
|Jack|       b|       q|
|Bell|       c|       r|
|Bell|       d|       s|
+----+--------+--------+

Now I want to …
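
The excerpt is cut off, but grouping and concatenating string rows is usually done with collect_list plus concat_ws. A minimal sketch with the Scala DataFrame API (the PySpark functions have the same names); score_card stands in for a DataFrame equivalent to the one built above:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

// One row per name, with each group's letters joined into a single string.
val concatenated = score_card
  .groupBy("name")
  .agg(
    concat_ws(",", collect_list("letters1")).as("letters1"),
    concat_ws(",", collect_list("letters2")).as("letters2")
  )
concatenated.show()

Note that collect_list gives no ordering guarantee after a shuffle; if the order of the concatenated letters matters, sort or window the data first.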

Use Map to replace column values in Spark

Submitted by 情到浓时终转凉″ on 2020-07-06 13:37:48
Question: I have to map a list of columns to another column in a Spark dataset. Think something like this:

val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)

And I have a dataframe like this one:

val df = Seq("foo", "baz").toDF("mov")

So I intend to perform the translation like this:

df.select(
  col("mov"),
  translationMap(col("mov"))
)

but this piece of code throws the following error:

java.util.NoSuchElementException: key not found: movs
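
A plain Scala Map keyed by Column is looked up once on the driver, not per row, so indexing it with col("mov") can never perform a per-row translation. A minimal sketch of one workaround, shipping the mapping as a map literal (typedLit needs Spark 2.2+; a chain of when/otherwise expressions would also work):

import org.apache.spark.sql.functions.{coalesce, col, typedLit}

// The map literal is evaluated per row; unknown keys yield null, so fall back
// to the original value.
val translation = typedLit(Map("foo" -> "bar", "baz" -> "bab"))

df.select(
  col("mov"),
  coalesce(translation(col("mov")), col("mov")).as("translated")
).show()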

Get the last element of a window in Spark 2.1.1

Submitted by 点点圈 on 2020-07-05 04:44:06
Question: I have a dataframe in which I have subcategories, and I want the last element of each of these subcategories.

val windowSpec = Window.partitionBy("name").orderBy("count")

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1), ("A", 2), ("A", 3),
      ("B", 10), ("B", 20), ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()

returns me something strange: …
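
The "strange" output is the default window frame at work: when a window has an orderBy, its frame runs from the start of the partition to the current row, so last("count") simply returns the current row's count. A minimal sketch of one fix, widening the frame to the whole partition (df stands for the name/count DataFrame built above; the frame constants exist since Spark 2.1):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// The frame now covers the entire partition, so last() sees every row in the group.
val fullWindow = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("lastCountOfName", last("count").over(fullWindow)).show()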

How to use GROUPING SETS as an operator/method on a Dataset?

Submitted by ◇◆丶佛笑我妖孽 on 2020-07-05 03:58:30
Question: Is there no function-level GROUPING SETS support in Spark Scala? I have no idea whether this patch was applied to master: https://github.com/apache/spark/pull/5080

I want to do this kind of query with the Scala DataFrame API:

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?

Answer 1: "I want to do this kind of query by scala dataframe api." tl;dr Up to Spark 2.1.0 it is not possible. There are currently no plans to add …
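
Since there is no Dataset operator, the usual workaround is to express the grouping sets in SQL over a temporary view. A minimal sketch (table and column names are made up for illustration; spark is the SparkSession, and sqlContext.sql works the same way on older versions):

// Register the data and fall back to the SQL dialect for GROUPING SETS.
df.createOrReplaceTempView("sales")

val grouped = spark.sql("""
  SELECT region, product, sum(amount) AS total
  FROM sales
  GROUP BY region, product
  GROUPING SETS ((region, product), (region), ())
""")
grouped.show()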

Why is agg() in PySpark only able to summarize one column at a time? [duplicate]

Submitted by 断了今生、忘了曾经 on 2020-07-04 13:49:12
Question: This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (3 answers). Closed 3 years ago.

For the below dataframe

df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

when I try to find min & max I am only getting the min value in the output.

df.agg({'High': 'max', 'High': 'min'}).show()

+---------+
|min(High)|
+---------+
|  2094900|
+---------+

Why can't agg() give both max & min like in Pandas?

Answer 1: As you …
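
The culprit is the Python dict, not agg(): a dict cannot hold the key 'High' twice, so {'High': 'max', 'High': 'min'} collapses to {'High': 'min'} before Spark ever sees it. Passing separate aggregate expressions avoids the collision (F.min('High'), F.max('High') in PySpark); a sketch with the Scala API for consistency with the other examples here:

import org.apache.spark.sql.functions.{max, min}

// Two independent aggregate expressions over the same column.
df.agg(min("High"), max("High")).show()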

Does Spark SQL use Hive Metastore?

Submitted by 蓝咒 on 2020-07-04 07:59:10
Question: I am developing a Spark SQL application and I've got a few questions. I read that Spark SQL uses a Hive metastore under the covers; is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation. I am starting a Spark SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand Spark SQL is much faster than Hive, so I don't see any reason to use Hive. But am I correct?

Answer 1: I read that Spark-SQL uses …
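
For context, whether the Hive metastore is used is a per-session choice rather than a hard dependency. A minimal sketch of how the catalog is selected (assuming a Hive-enabled Spark build):

import org.apache.spark.sql.SparkSession

// Without enableHiveSupport() Spark falls back to its built-in in-memory catalog.
val spark = SparkSession.builder()
  .appName("catalog-check")
  .enableHiveSupport() // drop this line for a pure Spark SQL catalog
  .getOrCreate()

// Reports "hive" or "in-memory".
println(spark.conf.get("spark.sql.catalogImplementation"))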

Spark SQL: How to append new row to dataframe table (from another table)

Submitted by ╄→гoц情女王★ on 2020-07-02 06:33:30
Question: I am using Spark SQL with dataframes. I have an input dataframe, and I would like to append (or insert) its rows to a larger dataframe that has more columns. How would I do that? If this were SQL, I would use INSERT INTO OUTPUT SELECT ... FROM INPUT, but I don't know how to do that with Spark SQL. For concreteness:

var input = sqlContext.createDataFrame(Seq(
  (10L, "Joe Doe", 34),
  (11L, "Jane Doe", 31),
  (12L, "Alice Jones", 25)
)).toDF("id", "name", "age")

var output = sqlContext …
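
The definition of output is cut off above, but the general recipe is the same regardless: DataFrames are immutable, so "insert" means building a new DataFrame that unions the two, after padding the narrower one with nulls for the extra columns. A minimal sketch, assuming output has one additional string column named city (hypothetical):

import org.apache.spark.sql.functions.lit

// Give input the column that output has but input lacks, then stack the rows.
// unionByName needs Spark 2.3+; older versions can use union with matching
// column order.
val padded = input.withColumn("city", lit(null).cast("string"))
val appended = output.unionByName(padded)
appended.show()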

Create a new dataset based on a given operation column

Submitted by 对着背影说爱祢 on 2020-06-30 08:39:12
Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

i.e.

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  3|   x3| null| 15.34567|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Requirement: I …