pyspark-sql

Show distinct column values in pyspark dataframe: python

本秂侑毒 submitted on 2019-11-28 16:24:43
Question: Please suggest a pyspark dataframe alternative for Pandas df['col'].unique(). I want to list out all the unique values in a pyspark dataframe column, not the SQL way (registerTempTable, then a SQL query for distinct values). I also don't need groupby -> countDistinct; instead I want to check the distinct VALUES in that column. Answer 1: Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two unique):

    +---+---+
    |  k|  v|
    +---+---+
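A minimal sketch of the non-SQL, non-groupBy approach being asked about (the column name "col" is a placeholder): select the column, deduplicate it, and collect the values to the driver.

    # Collect the distinct values of one column as a plain Python list.
    distinct_values = [row["col"] for row in df.select("col").distinct().collect()]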

Spark SQL security considerations

*爱你&永不变心* submitted on 2019-11-28 14:27:13
What are the security considerations when accepting and executing arbitrary Spark SQL queries? Imagine the following setup: two files on HDFS are registered as tables a_secrets and b_secrets:

    # must only be accessed by clients with access to all of customer a's data
    spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    # must only be accessed by clients with access to all of customer b's data
    spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")

I could secure these two views using simple HDFS file permissions. But say I have the following logical views of
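A hedged aside, not from the original post: temporary views are scoped to the SparkSession that created them, so one coarse isolation boundary is to give each client its own session and register only the views that client may see. This does nothing against queries that read HDFS paths directly, which is why the file permissions still matter.

    # Separate sessions: a view registered in one is not visible in the other.
    client_a = spark.newSession()
    client_a.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    client_b = spark.newSession()
    client_b.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")
    # client_b.sql("SELECT * FROM a_secrets")  # raises AnalysisException: view not found here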

Caching ordered Spark DataFrame creates unwanted job

情到浓时终转凉″ submitted on 2019-11-28 13:22:55
I want to convert an RDD to a DataFrame and cache the result:

    from pyspark.sql import *
    from pyspark.sql.types import *
    import pyspark.sql.functions as fn

    schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
    df = spark.createDataFrame(
        sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4),  # .cache(),
        schema=schema,
        verifySchema=False
    ).orderBy("t")  # .cache()

If you don't use a cache call, no job is generated. If you use cache only after the orderBy, one job is generated for the cache: If you use cache only after the
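A hedged way, not part of the excerpt above, to see what a cache() call did to the query without relying on the Spark UI: inspect the plan and cache status of the DataFrame you cached.

    df.explain()             # an InMemoryTableScan node appears once cache() is in the plan
    print(df.is_cached)      # True once cache()/persist() has been requested
    print(df.storageLevel)   # the storage level that cache() asked for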

Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

折月煮酒 submitted on 2019-11-28 13:01:21
How can I set a schema for a streaming DataFrame in PySpark?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode
    from pyspark.sql.functions import split
    # Import data types
    from pyspark.sql.types import *

    spark = SparkSession\
        .builder\
        .appName("StructuredNetworkWordCount")\
        .getOrCreate()

    # Create DataFrame representing the stream of input lines from connection to localhost:5560
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', '192.168.0.113')\
        .option('port', 5560)\
        .load()

For example, I need a table like:

    Name, lastName, PhoneNumber
    Bob, Dylan,
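The socket source always yields a single string column named value, so the usual pattern (a hedged sketch, assuming comma-separated input like the example row) is to split that column and alias the pieces into the desired columns rather than setting a schema on the source itself.

    from pyspark.sql.functions import split

    fields = split(lines.value, ",")
    people = lines.select(
        fields.getItem(0).alias("Name"),
        fields.getItem(1).alias("lastName"),
        fields.getItem(2).alias("PhoneNumber"),
    )

    query = people.writeStream.outputMode("append").format("console").start()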

Window function is not working on Pyspark sqlcontext

时光毁灭记忆、已成空白 submitted on 2019-11-28 12:18:38
Question: I have a data frame and I want to roll the data up into 7-day windows and apply some aggregation functions. I have a pyspark sql dataframe like:

    | Sale_Date|P_1|P_2|P_3|G_1|G_2|G_3|Total_Sale|Sale_Amt|Promo_Disc_Amt|
    |2013-04-10|  1|  9|  1|  1|  1|  1|         1|   295.0|           0.0|
    |2013-04-11|  1|  9|  1|  1|  1|  1|         3|   567.0|           0.0|
    |2013-04-12|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
    |2013-04-13|  1|  9|  1|  1|  1|  1|         1|   245.0|          20.0|
    |2013-04-14|  1|  9|  1|  1|  1|  1|         1|   245.0|           0.0|
    |2013-04-15|  1|  9|  1|  1|  1|  1|         2|   500.0|200
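A hedged sketch of one way to get a rolling 7-day aggregation with a window function (column names taken from the sample above; casting the date to epoch seconds is needed because rangeBetween operates on numeric ordering):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    day = 86400  # seconds in a day
    w = (Window
         .orderBy(F.col("Sale_Date").cast("timestamp").cast("long"))
         .rangeBetween(-6 * day, 0))

    rolled = (df
              .withColumn("Total_Sale_7d", F.sum("Total_Sale").over(w))
              .withColumn("Sale_Amt_7d", F.sum("Sale_Amt").over(w)))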

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

余生长醉 submitted on 2019-11-28 12:18:26
I have a timestamp dataset, shown below, and I have written a udf in pyspark to process it and return a Map of key values, but I am getting the error message below. Dataset: df_ts_list

    +--------------------+
    |             ts_list|
    +--------------------+
    |[1477411200, 1477...|
    |[1477238400, 1477...|
    |[1477022400, 1477...|
    |[1477224000, 1477...|
    |[1477256400, 1477...|
    |[1477346400, 1476...|
    |[1476986400, 1477...|
    |[1477321200, 1477...|
    |[1477306800, 1477...|
    |[1477062000, 1477...|
    |[1477249200, 1477...|
    |[1477040400, 1477...|
    |[1477090800, 1477...|
    +--------------------+

Pyspark UDF:

    >>> def on
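A hedged note, not taken from the excerpt: this AttributeError most often appears when functions from pyspark.sql.functions, which need the driver-side JVM gateway, are called inside a udf running on executors, where no gateway exists. Using plain Python inside the udf body avoids it; the return type and column handling below are illustrative assumptions, not the poster's code.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import MapType, LongType

    @udf(returnType=MapType(LongType(), LongType()))
    def ts_counts(ts_list):
        # plain Python only: no pyspark.sql.functions calls in here
        counts = {}
        for ts in ts_list:
            counts[ts] = counts.get(ts, 0) + 1
        return counts

    df_ts_list.withColumn("ts_counts", ts_counts("ts_list"))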

Forward Fill New Row to Account for Missing Dates

╄→гoц情女王★ submitted on 2019-11-28 11:27:26
Question: I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the rows with the prior row that maps to the variable in column x. I've seen some solutions to similar problems using Pandas, but ideally I would like to understand how best to approach this with a pyspark UDF. I'd initially thought about something like the following with Pandas but also struggled to implement this to just
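A hedged sketch of the usual non-Pandas route: once a row exists for every hour (for example after joining onto a generated hourly calendar), nulls can be forward filled per aggregator with last(..., ignorenulls=True) over an unbounded-preceding window. The column names hour and value are assumptions for illustration.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = (Window
         .partitionBy("aggregator")
         .orderBy("hour")
         .rowsBetween(Window.unboundedPreceding, 0))

    filled = df.withColumn("value_filled", F.last("value", ignorenulls=True).over(w))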

PySpark: modify column values when another column value satisfies a condition

和自甴很熟 submitted on 2019-11-28 10:03:20
I have a PySpark DataFrame that has two columns, Id and Rank:

    +---+----+
    | Id|Rank|
    +---+----+
    |  a|   5|
    |  b|   7|
    |  c|   8|
    |  d|   1|
    +---+----+

For each row, I'm looking to replace Id with "other" if Rank is larger than 5. If I use pseudocode to explain:

    for row in df:
        if row.Rank > 5:
            replace(row.Id, "other")

The result should look like:

    +-----+----+
    |   Id|Rank|
    +-----+----+
    |    a|   5|
    |other|   7|
    |other|   8|
    |    d|   1|
    +-----+----+

Any clue how to achieve this? Thanks!!! To create this DataFrame:

    df = spark.createDataFrame([('a',5),('b',7),('c',8),('d',1)], ["Id","Rank"])

You can use when and otherwise
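The answer excerpt above points at when/otherwise; a minimal sketch of that pattern:

    from pyspark.sql import functions as F

    result = df.withColumn(
        "Id",
        F.when(F.col("Rank") > 5, "other").otherwise(F.col("Id"))
    )
    result.show()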

SparkSQL on pyspark: how to generate time series?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 08:42:57
I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date. Suppose that my_table contains:

    start      | stop
    -------------------------
    2000-01-01 | 2000-01-05
    2012-03-20 | 2012-03-23

In PostgreSQL it's very easy to do that:

    SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table

and it will generate this table:

    dt
    ------------
    2000-01-01
    2000-01-02
    2000-01-03
    2000-01-04
    2000-01-05
    2012-03-20
    2012-03-21
    2012-03-22
    2012-03-23

but how to do that using
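A hedged equivalent on the Spark side, assuming Spark 2.4 or later, where the sequence function accepts dates and an interval step; explode then flattens each generated array into rows:

    generated = spark.sql("""
        SELECT explode(sequence(start, stop, interval 1 day)) AS dt
        FROM my_table
    """)
    generated.show()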

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

ⅰ亾dé卋堺 submitted on 2019-11-28 07:52:52
I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like SparkSQL to load as many of the files as it finds into the dataframe and return the result without complaining. Is this possible? Mariusz: Yes
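One hedged way to get that behaviour (not necessarily the answer the excerpt goes on to give) is to filter the list down to paths that actually exist before calling read.parquet, using the Hadoop FileSystem API through the SparkContext's private _jvm/_jsc gateways:

    # Keep only the S3 paths that exist, then read the survivors.
    sc = session.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    jvm = sc._jvm

    def path_exists(path):
        p = jvm.org.apache.hadoop.fs.Path(path)
        return p.getFileSystem(hadoop_conf).exists(p)

    existing = [f for f in files if path_exists(f)]
    df = session.read.parquet(*existing)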