apache-spark-sql

How can I concatenate the rows in a pyspark dataframe with multiple columns using groupby and aggregate

Submitted by ぃ、小莉子 on 2020-07-10 03:11:13
Question: I have a pyspark dataframe with multiple columns, for example the one below.

from pyspark.sql import Row

l = [('Jack', "a", "p"), ('Jack', "b", "q"), ('Bell', "c", "r"), ('Bell', "d", "s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)

+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|       a|       p|
|Jack|       b|       q|
|Bell|       c|       r|
|Bell|       d|       s|
+----+--------+--------+

Now I want to …
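
The excerpt is cut off, but grouping and concatenating string rows is usually done with collect_list plus concat_ws. A minimal sketch with the Scala DataFrame API (the PySpark functions have the same names); score_card stands in for a DataFrame equivalent to the one built above:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

// One row per name, with each group's letters joined into a single string.
val concatenated = score_card
  .groupBy("name")
  .agg(
    concat_ws(",", collect_list("letters1")).as("letters1"),
    concat_ws(",", collect_list("letters2")).as("letters2")
  )
concatenated.show()

Note that collect_list gives no ordering guarantee after a shuffle; if the order of the concatenated letters matters, sort or window the data first.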

Use Map to replace column values in Spark

Submitted by 情到浓时终转凉″ on 2020-07-06 13:37:48
Question: I have to map a list of columns to another column in a Spark dataset. Think something like this:

val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)

And I have a dataframe like this one:

val df = Seq("foo", "baz").toDF("mov")

So I intend to perform the translation like this:

df.select(
  col("mov"),
  translationMap(col("mov"))
)

but this piece of code throws the following error:

java.util.NoSuchElementException: key not found: movs
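
A plain Scala Map keyed by Column is looked up once on the driver, not per row, so indexing it with col("mov") can never perform a per-row translation. A minimal sketch of one workaround, shipping the mapping as a map literal (typedLit needs Spark 2.2+; a chain of when/otherwise expressions would also work):

import org.apache.spark.sql.functions.{coalesce, col, typedLit}

// The map literal is evaluated per row; unknown keys yield null, so fall back
// to the original value.
val translation = typedLit(Map("foo" -> "bar", "baz" -> "bab"))

df.select(
  col("mov"),
  coalesce(translation(col("mov")), col("mov")).as("translated")
).show()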

Get the last element of a window in Spark 2.1.1

Submitted by 点点圈 on 2020-07-05 04:44:06
Question: I have a dataframe in which I have subcategories, and I want the last element of each of these subcategories.

val windowSpec = Window.partitionBy("name").orderBy("count")

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1), ("A", 2), ("A", 3),
      ("B", 10), ("B", 20), ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()

returns me something strange: …
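
The "strange" output is the default window frame at work: when a window has an orderBy, its frame runs from the start of the partition to the current row, so last("count") simply returns the current row's count. A minimal sketch of one fix, widening the frame to the whole partition (df stands for the name/count DataFrame built above; the frame constants exist since Spark 2.1):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// The frame now covers the entire partition, so last() sees every row in the group.
val fullWindow = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("lastCountOfName", last("count").over(fullWindow)).show()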

How to use GROUPING SETS as an operator/method on a Dataset?

Submitted by ◇◆丶佛笑我妖孽 on 2020-07-05 03:58:30
Question: Is there no function-level GROUPING SETS support in Spark Scala? I have no idea whether this patch was applied to master: https://github.com/apache/spark/pull/5080

I want to do this kind of query with the Scala DataFrame API:

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?

Answer 1: "I want to do this kind of query by scala dataframe api." tl;dr Up to Spark 2.1.0 it is not possible. There are currently no plans to add …
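
Since there is no Dataset operator, the usual workaround is to express the grouping sets in SQL over a temporary view. A minimal sketch (table and column names are made up for illustration; spark is the SparkSession, and sqlContext.sql works the same way on older versions):

// Register the data and fall back to the SQL dialect for GROUPING SETS.
df.createOrReplaceTempView("sales")

val grouped = spark.sql("""
  SELECT region, product, sum(amount) AS total
  FROM sales
  GROUP BY region, product
  GROUPING SETS ((region, product), (region), ())
""")
grouped.show()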

Why is agg() in PySpark only able to summarize one column at a time? [duplicate]

Submitted by 断了今生、忘了曾经 on 2020-07-04 13:49:12
Question: This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (3 answers). Closed 3 years ago.

For the below dataframe

df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

when I try to find min & max I am only getting the min value in the output.

df.agg({'High': 'max', 'High': 'min'}).show()

+---------+
|min(High)|
+---------+
|  2094900|
+---------+

Why can't agg() give both max & min like in Pandas?

Answer 1: As you …
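
The culprit is the Python dict, not agg(): a dict cannot hold the key 'High' twice, so {'High': 'max', 'High': 'min'} collapses to {'High': 'min'} before Spark ever sees it. Passing separate aggregate expressions avoids the collision (F.min('High'), F.max('High') in PySpark); a sketch with the Scala API for consistency with the other examples here:

import org.apache.spark.sql.functions.{max, min}

// Two independent aggregate expressions over the same column.
df.agg(min("High"), max("High")).show()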

Does Spark SQL use Hive Metastore?

Submitted by 蓝咒 on 2020-07-04 07:59:10
Question: I am developing a Spark SQL application and I've got a few questions. I read that Spark SQL uses a Hive metastore under the covers; is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation. I am starting a Spark SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand Spark SQL is much faster than Hive, so I don't see any reason to use Hive. But am I correct?

Answer 1: I read that Spark-SQL uses …
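
For context, whether the Hive metastore is used is a per-session choice rather than a hard dependency. A minimal sketch of how the catalog is selected (assuming a Hive-enabled Spark build):

import org.apache.spark.sql.SparkSession

// Without enableHiveSupport() Spark falls back to its built-in in-memory catalog.
val spark = SparkSession.builder()
  .appName("catalog-check")
  .enableHiveSupport() // drop this line for a pure Spark SQL catalog
  .getOrCreate()

// Reports "hive" or "in-memory".
println(spark.conf.get("spark.sql.catalogImplementation"))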

Spark SQL: How to append new row to dataframe table (from another table)

Submitted by ╄→гoц情女王★ on 2020-07-02 06:33:30
Question: I am using Spark SQL with dataframes. I have an input dataframe, and I would like to append (or insert) its rows to a larger dataframe that has more columns. How would I do that? If this were SQL, I would use INSERT INTO OUTPUT SELECT ... FROM INPUT, but I don't know how to do that with Spark SQL. For concreteness:

var input = sqlContext.createDataFrame(Seq(
  (10L, "Joe Doe", 34),
  (11L, "Jane Doe", 31),
  (12L, "Alice Jones", 25)
)).toDF("id", "name", "age")

var output = sqlContext …
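
The definition of output is cut off above, but the general recipe is the same regardless: DataFrames are immutable, so "insert" means building a new DataFrame that unions the two, after padding the narrower one with nulls for the extra columns. A minimal sketch, assuming output has one additional string column named city (hypothetical):

import org.apache.spark.sql.functions.lit

// Give input the column that output has but input lacks, then stack the rows.
// unionByName needs Spark 2.3+; older versions can use union with matching
// column order.
val padded = input.withColumn("city", lit(null).cast("string"))
val appended = output.unionByName(padded)
appended.show()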

Create a new dataset based on a given operation column

Submitted by 对着背影说爱祢 on 2020-06-30 08:39:12
Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

i.e.

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  3|   x3| null| 15.34567|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Requirement: I …