pyspark-sql

How to select the last row, and how to access a PySpark dataframe by index?

柔情痞子 submitted on 2019-11-27 06:00:41
Question: From a PySpark SQL dataframe like

    name  age  city
    abc   20   A
    def   30   B

how do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In pandas I can do

    df.tail(1)              # for last row
    df.ix[rowno or index]   # by index
    df.loc[] or df.iloc[]

I am just curious how to access a PySpark dataframe in such ways, or alternative ways. Thanks!

Answer 1: How to get the last row. Long and ugly way which assumes
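A minimal sketch of one common approach, not from the original answer: PySpark DataFrames have no intrinsic row order or positional index, so "last" only makes sense relative to an ordering (here, by age), and positional access can be approximated by zipping in an index on the underlying RDD.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc", 20, "A"), ("def", 30, "B")], ["name", "age", "city"]
)

# "Last" row relative to an explicit ordering (descending age here).
last_row = df.orderBy(F.col("age").desc()).limit(1)
last_row.show()

# Positional access: attach an index via zipWithIndex, then filter on it.
# This is expensive on large data and only a workaround, not a real index.
indexed = df.rdd.zipWithIndex().toDF(["row", "idx"])
indexed.filter(F.col("idx") == 1).select("row.*").show()
```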

How to get max(date) from a given set of data grouped by some fields using pyspark?

南楼画角 submitted on 2019-11-27 04:45:33
Question: I have the data in the dataframe as below:

    datetime             | userId | memberId | value
    2016-04-06 16:36:... | 1234   | 111      | 1
    2016-04-06 17:35:... | 1234   | 222      | 5
    2016-04-06 17:50:... | 1234   | 111      | 8
    2016-04-06 18:36:... | 1234   | 222      | 9
    2016-04-05 16:36:... | 4567   | 111      | 1
    2016-04-06 17:35:... | 4567   | 222      | 5
    2016-04-06 18:50:... | 4567   | 111      | 8
    2016-04-06 19:36:... | 4567   | 222      | 9

I need to find max(datetime), grouped by userId and memberId. When I tried as below:

    df2 = df.groupBy('userId',
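A minimal sketch of the standard approach (assuming df has the columns shown above):

```python
from pyspark.sql import functions as F

# Group by the two keys and take the maximum datetime per group.
df2 = (
    df.groupBy("userId", "memberId")
      .agg(F.max("datetime").alias("max_datetime"))
)
df2.show(truncate=False)
```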

PySpark: compute the row-wise maximum of a subset of columns and add it to an existing dataframe

烂漫一生 submitted on 2019-11-27 04:35:34
Question: I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame. I managed to do this in a very awkward way:

    def add_colmax(df, subset_columns, colnm):
        '''
        Calculate the maximum of the selected "subset_columns" from dataframe df
        for each row; a new column containing the row-wise maximum is added to
        dataframe df.
        df: dataframe. It must contain subset_columns as a subset of its columns
        colnm: Name of the new column containing row-wise maximum of
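A less awkward sketch using greatest(), which computes a row-wise maximum directly; the helper keeps the same signature as the question's function but is otherwise an assumption:

```python
from pyspark.sql import functions as F

def add_colmax(df, subset_columns, colnm):
    """Add column `colnm` holding the row-wise max of `subset_columns`."""
    # greatest() needs at least two columns; it evaluates per row.
    return df.withColumn(colnm, F.greatest(*[F.col(c) for c in subset_columns]))

# Hypothetical usage on a dataframe with columns f1, f2, f3:
# df = add_colmax(df, ["f1", "f2", "f3"], "row_max")
```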

PySpark: modify column values when another column value satisfies a condition

孤人 submitted on 2019-11-27 03:22:06
Question: I have a PySpark Dataframe that has two columns, Id and Rank:

    +---+----+
    | Id|Rank|
    +---+----+
    |  a|   5|
    |  b|   7|
    |  c|   8|
    |  d|   1|
    +---+----+

For each row, I'm looking to replace Id with "other" if Rank is larger than 5. In pseudocode:

    for row in df:
        if row.Rank > 5:
            replace(row.Id, "other")

The result should look like:

    +-----+----+
    |   Id|Rank|
    +-----+----+
    |    a|   5|
    |other|   7|
    |other|   8|
    |    d|   1|
    +-----+----+

Any clue how to achieve this? Thanks!!! To create this Dataframe: df =
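A minimal sketch of the usual conditional-replacement pattern with when/otherwise (assuming df is the dataframe above):

```python
from pyspark.sql import functions as F

# Replace Id with the literal "other" where Rank > 5, otherwise keep Id.
df2 = df.withColumn(
    "Id",
    F.when(F.col("Rank") > 5, F.lit("other")).otherwise(F.col("Id"))
)
df2.show()
```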

SparkSQL on pyspark: how to generate time series?

前提是你 submitted on 2019-11-27 02:21:18
Question: I'm using SparkSQL on pyspark to load some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date. Suppose that my_table contains:

    start      | stop
    -----------+-----------
    2000-01-01 | 2000-01-05
    2012-03-20 | 2012-03-23

In PostgreSQL it's very easy to do that:

    SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table

and it will generate this table:

    dt
    ------------
    2000-01-01
    2000-01-02
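A sketch of one way to do this in Spark 2.4+ (not from the original answer): sequence() builds the date range per row and explode() turns it into one row per date. The name my_table_df stands for the DataFrame loaded from my_table and is an assumption.

```python
from pyspark.sql import functions as F

# Assumes start and stop are DateType columns.
result = (
    my_table_df
    .withColumn("dt", F.explode(F.sequence("start", "stop", F.expr("interval 1 day"))))
    .select("dt")
)
result.show()
```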

PySpark - get row number for each row in a group

本小妞迷上赌 submitted on 2019-11-26 23:16:38
Question: Using pyspark, I'd like to be able to group a spark dataframe, sort within each group, and then provide a row number. So

    Group  Date
    A      2000
    A      2002
    A      2007
    B      1999
    B      2015

would become

    Group  Date  row_num
    A      2000  0
    A      2002  1
    A      2007  2
    B      1999  0
    B      2015  1

Answer 1: Use a window function:

    from pyspark.sql.window import *
    from pyspark.sql.functions import row_number

    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

Answer 2: The accepted solution almost has it right. Here is the solution
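Note that row_number() is 1-based, while the expected output above starts at 0. A sketch that matches the 0-based numbering (an assumption about what the truncated Answer 2 goes on to say) simply subtracts one:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Group").orderBy("Date")

# row_number() starts at 1; subtract 1 to get the 0-based numbering shown above.
df2 = df.withColumn("row_num", F.row_number().over(w) - 1)
df2.show()
```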

How to find the count of null and NaN values for each column in a PySpark dataframe efficiently?

試著忘記壹切 submitted on 2019-11-26 22:20:48
Question:

    import numpy as np

    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
         (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

Expected output: a dataframe with the count of nan/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null, not nan. That's why I have created a new question. I know I can use the isnull() function in Spark to find the number of Null values in a Spark column, but
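A minimal sketch of one common approach: count, per column, the values that are null or NaN in a single pass. isnan() only makes sense for floating-point columns, so the type check below guards it.

```python
from pyspark.sql import functions as F

# One aggregated row: for each column, count rows that are null (or NaN for floats).
counts = df.select([
    F.count(
        F.when(F.isnan(c) | F.col(c).isNull(), c) if t in ("double", "float")
        else F.when(F.col(c).isNull(), c)
    ).alias(c)
    for c, t in df.dtypes
])
counts.show()
```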

More than one hour to execute pyspark.sql.DataFrame.take(4)

折月煮酒 submitted on 2019-11-26 19:05:44
I am running Spark 1.6 on 3 VMs (1x master, 2x slaves), all with 4 cores and 16 GB RAM. I can see the workers registered on the spark-master web UI. I want to retrieve data from my Vertica database to work on it. Since I didn't manage to run complex queries, I tried dummy queries to understand what was happening. We consider here an easy task. My code is:

    df = sqlContext.read.format('jdbc').options(url='xxxx', dbtable='xxx', user='xxxx', password='xxxx').load()
    four = df.take(4)

And the output is (note: I replaced the slave VM IP:Port with @IPSLAVE):

    16/03/08 13:50:41 INFO SparkContext: Starting job: take at <stdin>:1
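One common mitigation (a sketch, not from the original thread) is to push a limited subquery down to the database instead of loading the whole table and letting Spark fetch rows for take(4); the JDBC source accepts a subquery in dbtable. The LIMIT syntax and alias below are assumptions about the target SQL dialect.

```python
# Push the limit down so only 4 rows cross the JDBC connection.
df_small = (
    sqlContext.read.format("jdbc")
    .options(
        url="xxxx",
        dbtable="(SELECT * FROM xxx LIMIT 4) AS tmp",
        user="xxxx",
        password="xxxx",
    )
    .load()
)
four = df_small.collect()
```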

GroupByKey and create lists of values pyspark sql dataframe

允我心安 submitted on 2019-11-26 18:36:17
Question: So I have a spark dataframe that looks like:

    a | b | c
    5 | 2 | 1
    5 | 4 | 3
    2 | 4 | 2
    2 | 3 | 7

And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:

    a | b_list
    5 | (2,4)
    2 | (4,3)

How would I go about doing this with a pyspark sql dataframe? Thank you! :)

Answer 1: Here are the steps to get that Dataframe.

    >>> from pyspark.sql import functions as F
    >>>
    >>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4,
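A minimal sketch of the usual aggregation for this (assuming df is the dataframe above): collect_list() gathers the b values within each group of a.

```python
from pyspark.sql import functions as F

# Group by a and collect the b values of each group into an array column.
result = df.groupBy("a").agg(F.collect_list("b").alias("b_list"))
result.show()
```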

TypeError: Column is not iterable - How to iterate over ArrayType()?

浪尽此生 submitted on 2019-11-26 17:57:41
Consider the following DataFrame:

    +------+-----------------------+
    |type  |names                  |
    +------+-----------------------+
    |person|[john, sam, jane]      |
    |pet   |[whiskers, rover, fido]|
    +------+-----------------------+

Which can be created with the following code:

    import pyspark.sql.functions as f

    data = [
        ('person', ['john', 'sam', 'jane']),
        ('pet', ['whiskers', 'rover', 'fido'])
    ]

    df = sqlCtx.createDataFrame(data, ["type", "names"])
    df.show(truncate=False)

Is there a way to directly modify the ArrayType() column "names" by applying a function to each element, without using a udf? For example, suppose I
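A sketch of one udf-free approach available in Spark 2.4+: the SQL higher-order function transform, invoked through expr, applies a lambda to every array element. The upper() call is just an illustrative element-wise function, and names_upper is a hypothetical output column.

```python
import pyspark.sql.functions as f

# Apply a SQL lambda to each element of the names array, no Python udf needed.
df2 = df.withColumn("names_upper", f.expr("transform(names, x -> upper(x))"))
df2.show(truncate=False)
```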