spark-dataframe

Spark doing exchange of partitions already correctly distributed

扶醉桌前 submitted on 2019-12-23 06:49:33
Question: I am joining two datasets on two columns, and the result is a dataset containing 55 billion rows. After that I have to do some aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark performs an exchange of partitions after the join (which takes too much time with 55 billion rows), even though the data is already correctly distributed, because the aggregation column is unique. I know that the aggregation key is correctly distributed; is there a way to tell this to the Spark application? Answer 1: 1) Go to Spark UI
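
The answer above is cut off, so here is a minimal, hypothetical sketch (the datasets and column names a, b, c are assumptions) of why the extra exchange appears: Spark only reuses the join's hash partitioning when the grouping columns match the join keys, and explain() makes the added Exchange visible.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exchange-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Force a sort-merge join so both sides end up hash-partitioned on the join keys
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Tiny stand-ins for the two large datasets
val dsA = Seq((1, 10, "x"), (2, 20, "y")).toDF("a", "b", "c")
val dsB = Seq((1, 10, 100L), (2, 20, 200L)).toDF("a", "b", "v")

val joined = dsA.join(dsB, Seq("a", "b"))

// Grouping by the join keys can reuse the join's partitioning: no extra Exchange
joined.groupBy("a", "b").count().explain()

// Grouping by another column adds an Exchange, because Spark cannot prove
// the data is already distributed by "c"
joined.groupBy("c").count().explain()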

How to run Spark Application as daemon

两盒软妹~` submitted on 2019-12-23 06:06:29
Question: I have a basic question about running a Spark application. I have a Java client which sends me requests to query data residing in HDFS. The requests arrive as REST API calls over HTTP, and I need to interpret each request, form a Spark SQL query, and return the response to the client. I don't understand how to make my Spark application a daemon that waits for requests and can execute the queries using a pre-instantiated SQL context. Answer 1: You can have a thread that run
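
The answer excerpt is truncated; below is a minimal sketch of one way to do this, assuming the driver itself serves HTTP with the JDK's built-in com.sun.net.httpserver. The port, endpoint path, data path, view name, and the idea of accepting raw SQL in the request body are all assumptions, and there is no validation or security here.

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession

object SparkQueryDaemon {
  def main(args: Array[String]): Unit = {
    // One long-lived session shared by every request
    val spark = SparkSession.builder().appName("spark-query-daemon").getOrCreate()

    // Register whatever data the queries will reference, once, at startup
    spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/query", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Treat the request body as a SQL string and run it on the shared session
        val sql = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
        val json = spark.sql(sql).toJSON.collect().mkString("[", ",", "]")
        val bytes = json.getBytes("UTF-8")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.getResponseBody.close()
      }
    })
    server.start() // non-daemon server threads keep the driver JVM and SparkSession alive
  }
}

Submitted with spark-submit, this application never exits, so it behaves as the "daemon" the question asks about; dedicated tools such as Apache Livy or spark-jobserver solve the same problem more robustly.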

Extract a specific JSON structure from a json string in a Spark Rdd - Scala

随声附和 submitted on 2019-12-23 04:22:42
Question: I have a JSON string such as: {"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1], {"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f
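
The excerpt (and the original answer) is cut off, but as a sketch of one common approach: once each element of the RDD is a single well-formed JSON object (the strings above look like they need the trailing "},1]," fragments cleaned up first), Spark can infer the schema and the nested trackingInfo fields can be selected directly. The cleanup assumption and the chosen fields are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-extract").master("local[*]").getOrCreate()
import spark.implicits._

// One hypothetical, already-cleaned record shaped like the excerpt above
val raw = spark.sparkContext.parallelize(Seq(
  """{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","videoId":80000778,"rank":0},"type":["Play"],"time":527636408955}"""
))

// Spark 2.2+: turn the RDD[String] into a Dataset[String] and let Spark infer the schema
val parsed = spark.read.json(raw.toDS())
val tracking = parsed.select($"sequence", $"trackingInfo.location", $"trackingInfo.videoId")
tracking.show(false)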

Aggregate (Sum) over Window for a list of Columns

。_饼干妹妹 submitted on 2019-12-23 02:44:13
Question: I'm having trouble finding a generic way to calculate the sum (or any aggregate function) over a given window, for a list of columns available in the DataFrame.

val inputDF = spark
  .sparkContext
  .parallelize(
    Seq(
      (1, 2, 1, 30, 100),
      (1, 2, 2, 30, 100),
      (1, 2, 3, 30, 100),
      (11, 21, 1, 30, 100),
      (11, 21, 2, 30, 100),
      (11, 21, 3, 30, 100)
    ), 10)
  .toDF("c1", "c2", "offset", "v1", "v2")

inputDF.show()
+---+---+------+---+---+
| c1| c2|offset| v1| v2|
+---+---+------+---+---+
|  1|  2|     1| 30|100|
|  1|  2|     2| 30|100|
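
The excerpt stops here, but as a sketch of one generic approach (the window definition and output column names are assumptions, since the intended frame is not shown): build one sum(...).over(window) expression per column in the list and add them all in a single select.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().appName("window-sums").master("local[*]").getOrCreate()
import spark.implicits._

val inputDF = Seq(
  (1, 2, 1, 30, 100), (1, 2, 2, 30, 100), (1, 2, 3, 30, 100),
  (11, 21, 1, 30, 100), (11, 21, 2, 30, 100), (11, 21, 3, 30, 100)
).toDF("c1", "c2", "offset", "v1", "v2")

// Assumed window: per (c1, c2) group, ordered by offset
val w = Window.partitionBy("c1", "c2").orderBy("offset")

// One aggregate expression per value column, added in a single select
val valueCols = Seq("v1", "v2")
val sums = valueCols.map(c => sum(col(c)).over(w).alias(s"sum_$c"))
inputDF.select(col("*") +: sums: _*).show()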

Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]

我的梦境 submitted on 2019-12-23 01:50:15
Question: This question already has answers here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). Closed last year. I am trying to use Spark 2.0.2 to convert a JSON file into Parquet. The JSON file comes from an external source and therefore the schema can't be changed before it arrives. The file contains a map of attributes. The attribute names aren't known before I receive the file, and they contain characters that can't be used in Parquet. { "id" : 1, "name"
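
The sample JSON is cut off above. As a sketch of the usual fix (the linked duplicate covers rewriting nested struct schemas; this only handles top-level names, and the sample column name and output path are assumptions): replace the characters Parquet rejects in each column name before writing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-safe-names").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the incoming data: a column name containing characters Parquet rejects
val df = Seq((1, "widget"), (2, "gadget")).toDF("id", "product name;type")

// Replace every character Parquet disallows in column names (space , ; { } ( ) \n \t =)
val safe = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
}
safe.write.mode("overwrite").parquet("/tmp/attributes.parquet")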

Performance of UDAF versus Aggregator in Spark

拈花ヽ惹草 submitted on 2019-12-22 17:10:11
Question: I am trying to write some performance-minded code in Spark and am wondering whether I should write an Aggregator or a User-Defined Aggregate Function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these methods is, or which you should be using for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
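
No benchmark numbers appear in the excerpt, so the honest answer is to measure both on your own data. For reference, a minimal typed Aggregator (a plain sum, chosen only as an illustration) looks like this; timing it against an equivalent UDAF on a realistic dataset is the way to settle the question.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A minimal typed Aggregator: sums Long values
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().appName("aggregator-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 1000000).as[Long]
ds.select(LongSum.toColumn.name("total")).show()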

Pyspark - how to backfill a DataFrame?

拟墨画扇 submitted on 2019-12-22 13:50:24
Question: How can you do the same thing as df.fillna(method='bfill') for a pandas DataFrame with a pyspark.sql.DataFrame? The PySpark DataFrame has the pyspark.sql.DataFrame.fillna method, but it has no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)

Giving

Out[1]:
            data
2017-01-01   1.0
2017-01-02   2
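
The pandas output above is truncated. The question is about PySpark, but to keep the examples in this collection in one language, here is the same idea sketched with the Spark Scala API (the equivalent PySpark call is functions.first(col, ignorenulls=True) over the same window): a backfill is the first non-null value looking forward from each row.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

val spark = SparkSession.builder().appName("backfill").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2017-01-01", Some(1)), ("2017-01-02", Some(2)), ("2017-01-03", Some(3)),
  ("2017-01-04", None), ("2017-01-05", Some(5))
).toDF("date", "data")

// A window with no partitionBy pulls everything onto one partition; fine for a
// sketch, but real data should add partitionBy(...) on some grouping key.
val w = Window.orderBy("date").rowsBetween(Window.currentRow, Window.unboundedFollowing)

// Backfill: the first non-null value at or after the current row
df.withColumn("data_bfill", first(col("data"), ignoreNulls = true).over(w)).show()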

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

元气小坏坏 submitted on 2019-12-22 11:18:51
Question: I have a DataFrame A that contains a column of arrays of strings.

...
 |-- browse: array (nullable = true)
 |    |-- element: string (containsNull = true)
...

For example, three sample rows would be

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|

And another DataFrame B that contains a column of strings

 |-- browsenodeid: string (nullable = true)

Some sample rows for it would be

+------------+
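
The excerpt of DataFrame B is cut off. As a sketch of one common approach, assuming B is small enough to collect to the driver (the sample values below are made up): broadcast B's values as a Set and filter A with a UDF that checks for any overlap. For a large B, exploding the array and joining is the usual alternative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("array-filter").master("local[*]").getOrCreate()
import spark.implicits._

val dfA = Seq(
  ("foo1", Seq("X", "Y", "Z"), "bar1"),
  ("foo2", Seq("K", "L"), "bar2"),
  ("foo3", Seq("M"), "bar3")
).toDF("column1", "browse", "columnN")

val dfB = Seq("K", "Z").toDF("browsenodeid")

// Collect and broadcast the (assumed small) lookup values
val ids = spark.sparkContext.broadcast(dfB.as[String].collect().toSet)

// Keep rows whose browse array contains at least one of the broadcast values
val containsAny = udf((browse: Seq[String]) => browse != null && browse.exists(ids.value.contains))
dfA.filter(containsAny(col("browse"))).show()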

Using Python's reduce() to join multiple PySpark DataFrames

人盡茶涼 submitted on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of
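
The code excerpt is truncated. One way to investigate (sketched here in Scala to stay consistent with the rest of this collection; the DataFrame contents and join key are made up) is to build the same chain of joins both ways and compare the plans with explain(): a fold and a loop issue the same sequence of join calls, so any real difference should show up in the plans or in driver-side settings rather than in reduce() itself.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-join").master("local[*]").getOrCreate()
import spark.implicits._

val joinCols = Seq("id")
val dfs: Seq[DataFrame] = (1 to 5).map { i =>
  Seq((1, i), (2, i * 10)).toDF("id", s"v$i")
}

// Scala analogue of functools.reduce(...)
val folded = dfs.reduce((left, right) => left.join(right, joinCols))

// The explicit loop version
var looped = dfs.head
for (df <- dfs.tail) { looped = looped.join(df, joinCols) }

// Compare the two physical plans
folded.explain()
looped.explain()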

Converting pattern of date in spark dataframe

a 夏天 submitted on 2019-12-22 09:39:23
Question: I have a column of String datatype in a Spark DataFrame (with dates in the yyyy-MM-dd pattern). I want to display the column values in the MM/dd/yyyy pattern. My data is

val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")

df.show()
+-----+----------+----------+-----+
| name| startDate|   endDate|price|
+-----+----------+----------+-----+
|steak|1990-01
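
The show() output is cut off above. As a sketch of one straightforward approach (which columns to convert is an assumption): cast the yyyy-MM-dd strings to dates and let date_format render them as MM/dd/yyyy strings.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().appName("date-pattern").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
).toDF("name", "startDate", "endDate", "price")

// yyyy-MM-dd strings cast cleanly to dates, so date_format can re-render them
df.withColumn("startDate", date_format(col("startDate").cast("date"), "MM/dd/yyyy"))
  .withColumn("endDate", date_format(col("endDate").cast("date"), "MM/dd/yyyy"))
  .show()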