spark-dataframe

Spark doing exchange of partitions already correctly distributed

扶醉桌前 submitted on 2019-12-23 06:49:33
Question: I am joining two datasets on two columns, and the result is a dataset containing 55 billion rows. After that I have to do some aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark performs an exchange of partitions after the join (which takes too much time with 55 billion rows), even though the data is already correctly distributed, because the aggregation column is unique. I know that the aggregation key is correctly distributed; is there a way to tell this to the Spark application? Answer 1: 1) Go to Spark UI
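
The answer above is cut off, so here is a minimal, hypothetical sketch (the datasets and column names a, b, c are assumptions) of why the extra exchange appears: Spark only reuses the join's hash partitioning when the grouping columns match the join keys, and explain() makes the added Exchange visible.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exchange-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Force a sort-merge join so both sides end up hash-partitioned on the join keys
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Tiny stand-ins for the two large datasets
val dsA = Seq((1, 10, "x"), (2, 20, "y")).toDF("a", "b", "c")
val dsB = Seq((1, 10, 100L), (2, 20, 200L)).toDF("a", "b", "v")

val joined = dsA.join(dsB, Seq("a", "b"))

// Grouping by the join keys can reuse the join's partitioning: no extra Exchange
joined.groupBy("a", "b").count().explain()

// Grouping by another column adds an Exchange, because Spark cannot prove
// the data is already distributed by "c"
joined.groupBy("c").count().explain()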

How to run Spark Application as daemon

两盒软妹~` submitted on 2019-12-23 06:06:29
Question: I have a basic question about running a Spark application. I have a Java client which sends me requests to query data residing in HDFS. The requests arrive as REST API calls over HTTP, and I need to interpret each request, form a Spark SQL query, and return the response to the client. I don't understand how to make my Spark application a daemon that waits for requests and can execute the queries using a pre-instantiated SQL context. Answer 1: You can have a thread that run
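
The answer excerpt is truncated; below is a minimal sketch of one way to do this, assuming the driver itself serves HTTP with the JDK's built-in com.sun.net.httpserver. The port, endpoint path, data path, view name, and the idea of accepting raw SQL in the request body are all assumptions, and there is no validation or security here.

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession

object SparkQueryDaemon {
  def main(args: Array[String]): Unit = {
    // One long-lived session shared by every request
    val spark = SparkSession.builder().appName("spark-query-daemon").getOrCreate()

    // Register whatever data the queries will reference, once, at startup
    spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/query", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Treat the request body as a SQL string and run it on the shared session
        val sql = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
        val json = spark.sql(sql).toJSON.collect().mkString("[", ",", "]")
        val bytes = json.getBytes("UTF-8")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.getResponseBody.close()
      }
    })
    server.start() // non-daemon server threads keep the driver JVM and SparkSession alive
  }
}

Submitted with spark-submit, this application never exits, so it behaves as the "daemon" the question asks about; dedicated tools such as Apache Livy or spark-jobserver solve the same problem more robustly.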

Extract a specific JSON structure from a json string in a Spark Rdd - Scala

随声附和 submitted on 2019-12-23 04:22:42
Question: I have a JSON string such as: {"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1], {"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f
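
The excerpt (and the original answer) is cut off, but as a sketch of one common approach: once each element of the RDD is a single well-formed JSON object (the strings above look like they need the trailing "},1]," fragments cleaned up first), Spark can infer the schema and the nested trackingInfo fields can be selected directly. The cleanup assumption and the chosen fields are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-extract").master("local[*]").getOrCreate()
import spark.implicits._

// One hypothetical, already-cleaned record shaped like the excerpt above
val raw = spark.sparkContext.parallelize(Seq(
  """{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","videoId":80000778,"rank":0},"type":["Play"],"time":527636408955}"""
))

// Spark 2.2+: turn the RDD[String] into a Dataset[String] and let Spark infer the schema
val parsed = spark.read.json(raw.toDS())
val tracking = parsed.select($"sequence", $"trackingInfo.location", $"trackingInfo.videoId")
tracking.show(false)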

Aggregate (Sum) over Window for a list of Columns

。_饼干妹妹 submitted on 2019-12-23 02:44:13
Question: I'm having trouble finding a generic way to calculate the sum (or any aggregate function) over a given window, for a list of columns available in the DataFrame.

val inputDF = spark
  .sparkContext
  .parallelize(
    Seq(
      (1, 2, 1, 30, 100),
      (1, 2, 2, 30, 100),
      (1, 2, 3, 30, 100),
      (11, 21, 1, 30, 100),
      (11, 21, 2, 30, 100),
      (11, 21, 3, 30, 100)
    ), 10)
  .toDF("c1", "c2", "offset", "v1", "v2")

inputDF.show()
+---+---+------+---+---+
| c1| c2|offset| v1| v2|
+---+---+------+---+---+
|  1|  2|     1| 30|100|
|  1|  2|     2| 30|100|
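
The excerpt stops here, but as a sketch of one generic approach (the window definition and output column names are assumptions, since the intended frame is not shown): build one sum(...).over(window) expression per column in the list and add them all in a single select.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().appName("window-sums").master("local[*]").getOrCreate()
import spark.implicits._

val inputDF = Seq(
  (1, 2, 1, 30, 100), (1, 2, 2, 30, 100), (1, 2, 3, 30, 100),
  (11, 21, 1, 30, 100), (11, 21, 2, 30, 100), (11, 21, 3, 30, 100)
).toDF("c1", "c2", "offset", "v1", "v2")

// Assumed window: per (c1, c2) group, ordered by offset
val w = Window.partitionBy("c1", "c2").orderBy("offset")

// One aggregate expression per value column, added in a single select
val valueCols = Seq("v1", "v2")
val sums = valueCols.map(c => sum(col(c)).over(w).alias(s"sum_$c"))
inputDF.select(col("*") +: sums: _*).show()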

Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]

我的梦境 submitted on 2019-12-23 01:50:15
Question: This question already has answers here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). Closed last year. I am trying to use Spark 2.0.2 to convert a JSON file into Parquet. The JSON file comes from an external source and therefore the schema can't be changed before it arrives. The file contains a map of attributes. The attribute names aren't known before I receive the file, and they contain characters that can't be used in Parquet. { "id" : 1, "name"
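
The sample JSON is cut off above. As a sketch of the usual fix (the linked duplicate covers rewriting nested struct schemas; this only handles top-level names, and the sample column name and output path are assumptions): replace the characters Parquet rejects in each column name before writing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-safe-names").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the incoming data: a column name containing characters Parquet rejects
val df = Seq((1, "widget"), (2, "gadget")).toDF("id", "product name;type")

// Replace every character Parquet disallows in column names (space , ; { } ( ) \n \t =)
val safe = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
}
safe.write.mode("overwrite").parquet("/tmp/attributes.parquet")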

Performance of UDAF versus Aggregator in Spark

拈花ヽ惹草 submitted on 2019-12-22 17:10:11
Question: I am trying to write some performance-minded code in Spark and am wondering whether I should write an Aggregator or a User-Defined Aggregate Function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these methods is, or which you should be using for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
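
No benchmark numbers appear in the excerpt, so the honest answer is to measure both on your own data. For reference, a minimal typed Aggregator (a plain sum, chosen only as an illustration) looks like this; timing it against an equivalent UDAF on a realistic dataset is the way to settle the question.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A minimal typed Aggregator: sums Long values
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().appName("aggregator-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 1000000).as[Long]
ds.select(LongSum.toColumn.name("total")).show()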

Pyspark - how to backfill a DataFrame?

拟墨画扇 submitted on 2019-12-22 13:50:24
Question: How can you do the same thing as df.fillna(method='bfill') for a pandas DataFrame with a pyspark.sql.DataFrame? The PySpark DataFrame has the pyspark.sql.DataFrame.fillna method, but it has no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)

Giving

Out[1]:
            data
2017-01-01   1.0
2017-01-02   2
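
The pandas output above is truncated. The question is about PySpark, but to keep the examples in this collection in one language, here is the same idea sketched with the Spark Scala API (the equivalent PySpark call is functions.first(col, ignorenulls=True) over the same window): a backfill is the first non-null value looking forward from each row.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

val spark = SparkSession.builder().appName("backfill").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2017-01-01", Some(1)), ("2017-01-02", Some(2)), ("2017-01-03", Some(3)),
  ("2017-01-04", None), ("2017-01-05", Some(5))
).toDF("date", "data")

// A window with no partitionBy pulls everything onto one partition; fine for a
// sketch, but real data should add partitionBy(...) on some grouping key.
val w = Window.orderBy("date").rowsBetween(Window.currentRow, Window.unboundedFollowing)

// Backfill: the first non-null value at or after the current row
df.withColumn("data_bfill", first(col("data"), ignoreNulls = true).over(w)).show()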

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

元气小坏坏 submitted on 2019-12-22 11:18:51
Question: I have a DataFrame A that contains a column of arrays of strings.

...
 |-- browse: array (nullable = true)
 |    |-- element: string (containsNull = true)
...

For example, three sample rows would be

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|

And another DataFrame B that contains a column of strings

 |-- browsenodeid: string (nullable = true)

Some sample rows for it would be

+------------+
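
The excerpt of DataFrame B is cut off. As a sketch of one common approach, assuming B is small enough to collect to the driver (the sample values below are made up): broadcast B's values as a Set and filter A with a UDF that checks for any overlap. For a large B, exploding the array and joining is the usual alternative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("array-filter").master("local[*]").getOrCreate()
import spark.implicits._

val dfA = Seq(
  ("foo1", Seq("X", "Y", "Z"), "bar1"),
  ("foo2", Seq("K", "L"), "bar2"),
  ("foo3", Seq("M"), "bar3")
).toDF("column1", "browse", "columnN")

val dfB = Seq("K", "Z").toDF("browsenodeid")

// Collect and broadcast the (assumed small) lookup values
val ids = spark.sparkContext.broadcast(dfB.as[String].collect().toSet)

// Keep rows whose browse array contains at least one of the broadcast values
val containsAny = udf((browse: Seq[String]) => browse != null && browse.exists(ids.value.contains))
dfA.filter(containsAny(col("browse"))).show()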

Using Python's reduce() to join multiple PySpark DataFrames

人盡茶涼 submitted on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of
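
The code excerpt is truncated. One way to investigate (sketched here in Scala to stay consistent with the rest of this collection; the DataFrame contents and join key are made up) is to build the same chain of joins both ways and compare the plans with explain(): a fold and a loop issue the same sequence of join calls, so any real difference should show up in the plans or in driver-side settings rather than in reduce() itself.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-join").master("local[*]").getOrCreate()
import spark.implicits._

val joinCols = Seq("id")
val dfs: Seq[DataFrame] = (1 to 5).map { i =>
  Seq((1, i), (2, i * 10)).toDF("id", s"v$i")
}

// Scala analogue of functools.reduce(...)
val folded = dfs.reduce((left, right) => left.join(right, joinCols))

// The explicit loop version
var looped = dfs.head
for (df <- dfs.tail) { looped = looped.join(df, joinCols) }

// Compare the two physical plans
folded.explain()
looped.explain()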

Converting pattern of date in spark dataframe

a 夏天 submitted on 2019-12-22 09:39:23
Question: I have a column of String datatype in a Spark DataFrame (with dates in the yyyy-MM-dd pattern). I want to display the column values in the MM/dd/yyyy pattern. My data is

val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")

df.show()
+-----+----------+----------+-----+
| name| startDate|   endDate|price|
+-----+----------+----------+-----+
|steak|1990-01
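
The show() output is cut off above. As a sketch of one straightforward approach (which columns to convert is an assumption): cast the yyyy-MM-dd strings to dates and let date_format render them as MM/dd/yyyy strings.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().appName("date-pattern").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
).toDF("name", "startDate", "endDate", "price")

// yyyy-MM-dd strings cast cleanly to dates, so date_format can re-render them
df.withColumn("startDate", date_format(col("startDate").cast("date"), "MM/dd/yyyy"))
  .withColumn("endDate", date_format(col("endDate").cast("date"), "MM/dd/yyyy"))
  .show()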