pyspark-sql

Show distinct column values in pyspark dataframe: python

本秂侑毒 submitted on 2019-11-28 16:24:43
Question: Please suggest a pyspark dataframe alternative for Pandas df['col'].unique(). I want to list out all the unique values in a pyspark dataframe column, not the SQL way (registerTempTable, then a SQL query for distinct values). I also don't need groupby -> countDistinct; instead I want to check the distinct VALUES in that column. Answer 1: Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two unique):

    +---+---+
    |  k|  v|
    +---+---+
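A minimal sketch of the non-SQL, non-groupBy approach being asked about (the column name "col" is a placeholder): select the column, deduplicate it, and collect the values to the driver.

    # Collect the distinct values of one column as a plain Python list.
    distinct_values = [row["col"] for row in df.select("col").distinct().collect()]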

Spark SQL security considerations

*爱你&永不变心* submitted on 2019-11-28 14:27:13
What are the security considerations when accepting and executing arbitrary Spark SQL queries? Imagine the following setup: two files on HDFS are registered as tables a_secrets and b_secrets:

    # must only be accessed by clients with access to all of customer a's data
    spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    # must only be accessed by clients with access to all of customer b's data
    spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")

I could secure these two views using simple HDFS file permissions. But say I have the following logical views of
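A hedged aside, not from the original post: temporary views are scoped to the SparkSession that created them, so one coarse isolation boundary is to give each client its own session and register only the views that client may see. This does nothing against queries that read HDFS paths directly, which is why the file permissions still matter.

    # Separate sessions: a view registered in one is not visible in the other.
    client_a = spark.newSession()
    client_a.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    client_b = spark.newSession()
    client_b.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")
    # client_b.sql("SELECT * FROM a_secrets")  # raises AnalysisException: view not found here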

Caching ordered Spark DataFrame creates unwanted job

情到浓时终转凉″ submitted on 2019-11-28 13:22:55
I want to convert an RDD to a DataFrame and cache the result:

    from pyspark.sql import *
    from pyspark.sql.types import *
    import pyspark.sql.functions as fn

    schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
    df = spark.createDataFrame(
        sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4),  # .cache(),
        schema=schema,
        verifySchema=False
    ).orderBy("t")  # .cache()

If you don't use a cache call, no job is generated. If you use cache only after the orderBy, one job is generated for the cache: If you use cache only after the
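A hedged way, not part of the excerpt above, to see what a cache() call did to the query without relying on the Spark UI: inspect the plan and cache status of the DataFrame you cached.

    df.explain()             # an InMemoryTableScan node appears once cache() is in the plan
    print(df.is_cached)      # True once cache()/persist() has been requested
    print(df.storageLevel)   # the storage level that cache() asked for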

Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

折月煮酒 submitted on 2019-11-28 13:01:21
How can I set a schema for a streaming DataFrame in PySpark?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode
    from pyspark.sql.functions import split
    # Import data types
    from pyspark.sql.types import *

    spark = SparkSession\
        .builder\
        .appName("StructuredNetworkWordCount")\
        .getOrCreate()

    # Create DataFrame representing the stream of input lines from connection to localhost:5560
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', '192.168.0.113')\
        .option('port', 5560)\
        .load()

For example, I need a table like:

    Name, lastName, PhoneNumber
    Bob, Dylan,
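The socket source always yields a single string column named value, so the usual pattern (a hedged sketch, assuming comma-separated input like the example row) is to split that column and alias the pieces into the desired columns rather than setting a schema on the source itself.

    from pyspark.sql.functions import split

    fields = split(lines.value, ",")
    people = lines.select(
        fields.getItem(0).alias("Name"),
        fields.getItem(1).alias("lastName"),
        fields.getItem(2).alias("PhoneNumber"),
    )

    query = people.writeStream.outputMode("append").format("console").start()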

Window function is not working on Pyspark sqlcontext

时光毁灭记忆、已成空白 submitted on 2019-11-28 12:18:38
Question: I have a data frame and I want to roll the data up into 7-day windows and apply some aggregation functions. I have a pyspark sql dataframe like:

    | Sale_Date|P_1|P_2|P_3|G_1|G_2|G_3|Total_Sale|Sale_Amt|Promo_Disc_Amt|
    |2013-04-10|  1|  9|  1|  1|  1|  1|         1|   295.0|           0.0|
    |2013-04-11|  1|  9|  1|  1|  1|  1|         3|   567.0|           0.0|
    |2013-04-12|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
    |2013-04-13|  1|  9|  1|  1|  1|  1|         1|   245.0|          20.0|
    |2013-04-14|  1|  9|  1|  1|  1|  1|         1|   245.0|           0.0|
    |2013-04-15|  1|  9|  1|  1|  1|  1|         2|   500.0|200
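A hedged sketch of one way to get a rolling 7-day aggregation with a window function (column names taken from the sample above; casting the date to epoch seconds is needed because rangeBetween operates on numeric ordering):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    day = 86400  # seconds in a day
    w = (Window
         .orderBy(F.col("Sale_Date").cast("timestamp").cast("long"))
         .rangeBetween(-6 * day, 0))

    rolled = (df
              .withColumn("Total_Sale_7d", F.sum("Total_Sale").over(w))
              .withColumn("Sale_Amt_7d", F.sum("Sale_Amt").over(w)))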

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

余生长醉 submitted on 2019-11-28 12:18:26
I have a timestamp dataset, shown below, and I have written a udf in pyspark to process it and return a Map of key values, but I am getting the error message below. Dataset: df_ts_list

    +--------------------+
    |             ts_list|
    +--------------------+
    |[1477411200, 1477...|
    |[1477238400, 1477...|
    |[1477022400, 1477...|
    |[1477224000, 1477...|
    |[1477256400, 1477...|
    |[1477346400, 1476...|
    |[1476986400, 1477...|
    |[1477321200, 1477...|
    |[1477306800, 1477...|
    |[1477062000, 1477...|
    |[1477249200, 1477...|
    |[1477040400, 1477...|
    |[1477090800, 1477...|
    +--------------------+

Pyspark UDF:

    >>> def on
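A hedged note, not taken from the excerpt: this AttributeError most often appears when functions from pyspark.sql.functions, which need the driver-side JVM gateway, are called inside a udf running on executors, where no gateway exists. Using plain Python inside the udf body avoids it; the return type and column handling below are illustrative assumptions, not the poster's code.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import MapType, LongType

    @udf(returnType=MapType(LongType(), LongType()))
    def ts_counts(ts_list):
        # plain Python only: no pyspark.sql.functions calls in here
        counts = {}
        for ts in ts_list:
            counts[ts] = counts.get(ts, 0) + 1
        return counts

    df_ts_list.withColumn("ts_counts", ts_counts("ts_list"))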

Forward Fill New Row to Account for Missing Dates

╄→гoц情女王★ submitted on 2019-11-28 11:27:26
Question: I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the rows with the prior row that maps to the variable in column x. I've seen some solutions to similar problems using Pandas, but ideally I would like to understand how best to approach this with a pyspark UDF. I'd initially thought about something like the following with Pandas but also struggled to implement this to just
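A hedged sketch of the usual non-Pandas route: once a row exists for every hour (for example after joining onto a generated hourly calendar), nulls can be forward filled per aggregator with last(..., ignorenulls=True) over an unbounded-preceding window. The column names hour and value are assumptions for illustration.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = (Window
         .partitionBy("aggregator")
         .orderBy("hour")
         .rowsBetween(Window.unboundedPreceding, 0))

    filled = df.withColumn("value_filled", F.last("value", ignorenulls=True).over(w))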

PySpark: modify column values when another column value satisfies a condition

和自甴很熟 submitted on 2019-11-28 10:03:20
I have a PySpark DataFrame that has two columns, Id and Rank:

    +---+----+
    | Id|Rank|
    +---+----+
    |  a|   5|
    |  b|   7|
    |  c|   8|
    |  d|   1|
    +---+----+

For each row, I'm looking to replace Id with "other" if Rank is larger than 5. If I use pseudocode to explain:

    for row in df:
        if row.Rank > 5:
            replace(row.Id, "other")

The result should look like:

    +-----+----+
    |   Id|Rank|
    +-----+----+
    |    a|   5|
    |other|   7|
    |other|   8|
    |    d|   1|
    +-----+----+

Any clue how to achieve this? Thanks!!! To create this DataFrame:

    df = spark.createDataFrame([('a',5),('b',7),('c',8),('d',1)], ["Id","Rank"])

You can use when and otherwise
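The answer excerpt above points at when/otherwise; a minimal sketch of that pattern:

    from pyspark.sql import functions as F

    result = df.withColumn(
        "Id",
        F.when(F.col("Rank") > 5, "other").otherwise(F.col("Id"))
    )
    result.show()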

SparkSQL on pyspark: how to generate time series?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 08:42:57
I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date. Suppose that my_table contains:

    start      | stop
    -------------------------
    2000-01-01 | 2000-01-05
    2012-03-20 | 2012-03-23

In PostgreSQL it's very easy to do that:

    SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table

and it will generate this table:

    dt
    ------------
    2000-01-01
    2000-01-02
    2000-01-03
    2000-01-04
    2000-01-05
    2012-03-20
    2012-03-21
    2012-03-22
    2012-03-23

but how to do that using
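A hedged equivalent on the Spark side, assuming Spark 2.4 or later, where the sequence function accepts dates and an interval step; explode then flattens each generated array into rows:

    generated = spark.sql("""
        SELECT explode(sequence(start, stop, interval 1 day)) AS dt
        FROM my_table
    """)
    generated.show()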

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

ⅰ亾dé卋堺 submitted on 2019-11-28 07:52:52
I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet', 's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like SparkSQL to load as many of the files as it finds into the dataframe and return the result without complaining. Is this possible? Mariusz: Yes
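One hedged way to get that behaviour (not necessarily the answer the excerpt goes on to give) is to filter the list down to paths that actually exist before calling read.parquet, using the Hadoop FileSystem API through the SparkContext's private _jvm/_jsc gateways:

    # Keep only the S3 paths that exist, then read the survivors.
    sc = session.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    jvm = sc._jvm

    def path_exists(path):
        p = jvm.org.apache.hadoop.fs.Path(path)
        return p.getFileSystem(hadoop_conf).exists(p)

    existing = [f for f in files if path_exists(f)]
    df = session.read.parquet(*existing)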