pyspark-sql

How to select the last row, and how to access a PySpark dataframe by index?

柔情痞子 submitted on 2019-11-27 06:00:41
Question: From a PySpark SQL dataframe like

    name  age  city
    abc   20   A
    def   30   B

how do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In pandas I can do

    df.tail(1)              # for last row
    df.ix[rowno or index]   # by index
    df.loc[] or df.iloc[]

I am just curious how to access a PySpark dataframe in such ways, or alternative ways. Thanks!

Answer 1: How to get the last row. Long and ugly way which assumes
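A minimal sketch of one common approach, not from the original answer: PySpark DataFrames have no intrinsic row order or positional index, so "last" only makes sense relative to an ordering (here, by age), and positional access can be approximated by zipping in an index on the underlying RDD.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc", 20, "A"), ("def", 30, "B")], ["name", "age", "city"]
)

# "Last" row relative to an explicit ordering (descending age here).
last_row = df.orderBy(F.col("age").desc()).limit(1)
last_row.show()

# Positional access: attach an index via zipWithIndex, then filter on it.
# This is expensive on large data and only a workaround, not a real index.
indexed = df.rdd.zipWithIndex().toDF(["row", "idx"])
indexed.filter(F.col("idx") == 1).select("row.*").show()
```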

How to get max(date) from a given set of data grouped by some fields using pyspark?

南楼画角 submitted on 2019-11-27 04:45:33
Question: I have the data in the dataframe as below:

    datetime             | userId | memberId | value
    2016-04-06 16:36:... | 1234   | 111      | 1
    2016-04-06 17:35:... | 1234   | 222      | 5
    2016-04-06 17:50:... | 1234   | 111      | 8
    2016-04-06 18:36:... | 1234   | 222      | 9
    2016-04-05 16:36:... | 4567   | 111      | 1
    2016-04-06 17:35:... | 4567   | 222      | 5
    2016-04-06 18:50:... | 4567   | 111      | 8
    2016-04-06 19:36:... | 4567   | 222      | 9

I need to find max(datetime), grouped by userId and memberId. When I tried as below:

    df2 = df.groupBy('userId',
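A minimal sketch of the standard approach (assuming df has the columns shown above):

```python
from pyspark.sql import functions as F

# Group by the two keys and take the maximum datetime per group.
df2 = (
    df.groupBy("userId", "memberId")
      .agg(F.max("datetime").alias("max_datetime"))
)
df2.show(truncate=False)
```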

PySpark: compute the row-wise maximum of a subset of columns and add it to an existing dataframe

烂漫一生 submitted on 2019-11-27 04:35:34
Question: I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame. I managed to do this in a very awkward way:

    def add_colmax(df, subset_columns, colnm):
        '''
        Calculate the maximum of the selected "subset_columns" from dataframe df
        for each row; a new column containing the row-wise maximum is added to
        dataframe df.
        df: dataframe. It must contain subset_columns as a subset of its columns
        colnm: Name of the new column containing row-wise maximum of
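A less awkward sketch using greatest(), which computes a row-wise maximum directly; the helper keeps the same signature as the question's function but is otherwise an assumption:

```python
from pyspark.sql import functions as F

def add_colmax(df, subset_columns, colnm):
    """Add column `colnm` holding the row-wise max of `subset_columns`."""
    # greatest() needs at least two columns; it evaluates per row.
    return df.withColumn(colnm, F.greatest(*[F.col(c) for c in subset_columns]))

# Hypothetical usage on a dataframe with columns f1, f2, f3:
# df = add_colmax(df, ["f1", "f2", "f3"], "row_max")
```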

PySpark: modify column values when another column value satisfies a condition

孤人 submitted on 2019-11-27 03:22:06
Question: I have a PySpark Dataframe that has two columns, Id and Rank:

    +---+----+
    | Id|Rank|
    +---+----+
    |  a|   5|
    |  b|   7|
    |  c|   8|
    |  d|   1|
    +---+----+

For each row, I'm looking to replace Id with "other" if Rank is larger than 5. In pseudocode:

    for row in df:
        if row.Rank > 5:
            replace(row.Id, "other")

The result should look like:

    +-----+----+
    |   Id|Rank|
    +-----+----+
    |    a|   5|
    |other|   7|
    |other|   8|
    |    d|   1|
    +-----+----+

Any clue how to achieve this? Thanks!!! To create this Dataframe: df =
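A minimal sketch of the usual conditional-replacement pattern with when/otherwise (assuming df is the dataframe above):

```python
from pyspark.sql import functions as F

# Replace Id with the literal "other" where Rank > 5, otherwise keep Id.
df2 = df.withColumn(
    "Id",
    F.when(F.col("Rank") > 5, F.lit("other")).otherwise(F.col("Id"))
)
df2.show()
```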

SparkSQL on pyspark: how to generate time series?

前提是你 submitted on 2019-11-27 02:21:18
Question: I'm using SparkSQL on pyspark to load some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date. Suppose that my_table contains:

    start      | stop
    -----------+-----------
    2000-01-01 | 2000-01-05
    2012-03-20 | 2012-03-23

In PostgreSQL it's very easy to do that:

    SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table

and it will generate this table:

    dt
    ------------
    2000-01-01
    2000-01-02
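A sketch of one way to do this in Spark 2.4+ (not from the original answer): sequence() builds the date range per row and explode() turns it into one row per date. The name my_table_df stands for the DataFrame loaded from my_table and is an assumption.

```python
from pyspark.sql import functions as F

# Assumes start and stop are DateType columns.
result = (
    my_table_df
    .withColumn("dt", F.explode(F.sequence("start", "stop", F.expr("interval 1 day"))))
    .select("dt")
)
result.show()
```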

PySpark - get row number for each row in a group

本小妞迷上赌 submitted on 2019-11-26 23:16:38
Question: Using pyspark, I'd like to be able to group a spark dataframe, sort within each group, and then provide a row number. So

    Group  Date
    A      2000
    A      2002
    A      2007
    B      1999
    B      2015

would become

    Group  Date  row_num
    A      2000  0
    A      2002  1
    A      2007  2
    B      1999  0
    B      2015  1

Answer 1: Use a window function:

    from pyspark.sql.window import *
    from pyspark.sql.functions import row_number

    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

Answer 2: The accepted solution almost has it right. Here is the solution
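Note that row_number() is 1-based, while the expected output above starts at 0. A sketch that matches the 0-based numbering (an assumption about what the truncated Answer 2 goes on to say) simply subtracts one:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Group").orderBy("Date")

# row_number() starts at 1; subtract 1 to get the 0-based numbering shown above.
df2 = df.withColumn("row_num", F.row_number().over(w) - 1)
df2.show()
```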

How to find the count of null and NaN values for each column in a PySpark dataframe efficiently?

試著忘記壹切 submitted on 2019-11-26 22:20:48
Question:

    import numpy as np

    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
         (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

Expected output: a dataframe with the count of nan/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null, not nan. That's why I have created a new question. I know I can use the isnull() function in Spark to find the number of Null values in a Spark column, but
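A minimal sketch of one common approach: count, per column, the values that are null or NaN in a single pass. isnan() only makes sense for floating-point columns, so the type check below guards it.

```python
from pyspark.sql import functions as F

# One aggregated row: for each column, count rows that are null (or NaN for floats).
counts = df.select([
    F.count(
        F.when(F.isnan(c) | F.col(c).isNull(), c) if t in ("double", "float")
        else F.when(F.col(c).isNull(), c)
    ).alias(c)
    for c, t in df.dtypes
])
counts.show()
```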

More than one hour to execute pyspark.sql.DataFrame.take(4)

折月煮酒 submitted on 2019-11-26 19:05:44
I am running Spark 1.6 on 3 VMs (1x master, 2x slaves), all with 4 cores and 16 GB RAM. I can see the workers registered on the spark-master web UI. I want to retrieve data from my Vertica database to work on it. Since I didn't manage to run complex queries, I tried dummy queries to understand what was happening. We consider here an easy task. My code is:

    df = sqlContext.read.format('jdbc').options(url='xxxx', dbtable='xxx', user='xxxx', password='xxxx').load()
    four = df.take(4)

And the output is (note: I replaced the slave VM IP:Port with @IPSLAVE):

    16/03/08 13:50:41 INFO SparkContext: Starting job: take at <stdin>:1
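One common mitigation (a sketch, not from the original thread) is to push a limited subquery down to the database instead of loading the whole table and letting Spark fetch rows for take(4); the JDBC source accepts a subquery in dbtable. The LIMIT syntax and alias below are assumptions about the target SQL dialect.

```python
# Push the limit down so only 4 rows cross the JDBC connection.
df_small = (
    sqlContext.read.format("jdbc")
    .options(
        url="xxxx",
        dbtable="(SELECT * FROM xxx LIMIT 4) AS tmp",
        user="xxxx",
        password="xxxx",
    )
    .load()
)
four = df_small.collect()
```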

GroupByKey and create lists of values pyspark sql dataframe

允我心安 submitted on 2019-11-26 18:36:17
Question: So I have a spark dataframe that looks like:

    a | b | c
    5 | 2 | 1
    5 | 4 | 3
    2 | 4 | 2
    2 | 3 | 7

And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:

    a | b_list
    5 | (2,4)
    2 | (4,3)

How would I go about doing this with a pyspark sql dataframe? Thank you! :)

Answer 1: Here are the steps to get that Dataframe.

    >>> from pyspark.sql import functions as F
    >>>
    >>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4,
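A minimal sketch of the usual aggregation for this (assuming df is the dataframe above): collect_list() gathers the b values within each group of a.

```python
from pyspark.sql import functions as F

# Group by a and collect the b values of each group into an array column.
result = df.groupBy("a").agg(F.collect_list("b").alias("b_list"))
result.show()
```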

TypeError: Column is not iterable - How to iterate over ArrayType()?

浪尽此生 submitted on 2019-11-26 17:57:41
Consider the following DataFrame:

    +------+-----------------------+
    |type  |names                  |
    +------+-----------------------+
    |person|[john, sam, jane]      |
    |pet   |[whiskers, rover, fido]|
    +------+-----------------------+

Which can be created with the following code:

    import pyspark.sql.functions as f

    data = [
        ('person', ['john', 'sam', 'jane']),
        ('pet', ['whiskers', 'rover', 'fido'])
    ]

    df = sqlCtx.createDataFrame(data, ["type", "names"])
    df.show(truncate=False)

Is there a way to directly modify the ArrayType() column "names" by applying a function to each element, without using a udf? For example, suppose I
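A sketch of one udf-free approach available in Spark 2.4+: the SQL higher-order function transform, invoked through expr, applies a lambda to every array element. The upper() call is just an illustrative element-wise function, and names_upper is a hypothetical output column.

```python
import pyspark.sql.functions as f

# Apply a SQL lambda to each element of the names array, no Python udf needed.
df2 = df.withColumn("names_upper", f.expr("transform(names, x -> upper(x))"))
df2.show(truncate=False)
```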