pyspark

Select spark dataframe column with special character in it using selectExpr

Submitted by 夙愿已清 on 2021-01-01 04:29:11
Question: I am in a scenario where my column name is Município, with an accent on the letter í. My selectExpr command is failing because of it. Is there a way to fix it? Basically I have something like the following expression: .selectExpr("...CAST(Município AS string) AS Município..."). What I really want is to keep the column with the same name it arrived with, so that in the future I won't run into this kind of problem on different tables/files. How can I make a Spark DataFrame accept accents or other special characters in column names?
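A minimal sketch of one way this is commonly handled (the DataFrame below is a made-up stand-in, not the asker's data): Spark SQL can usually parse accented identifiers inside selectExpr when they are wrapped in backticks, and the alias keeps the original name intact.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with an accented column name, for illustration only.
df = spark.createDataFrame([(1, "Lisboa")], ["id", "Município"])

# Backticks let selectExpr parse identifiers containing accents or other
# special characters; the alias keeps the original column name unchanged.
result = df.selectExpr("id", "CAST(`Município` AS string) AS `Município`")
result.printSchema()

The same backtick quoting applies to any selectExpr or SQL expression that references the column, so the accented name can be kept end to end.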

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Submitted by 谁都会走 on 2020-12-31 20:17:46
Question (problem statement / root cause): We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark only supports lowercase table column names, and unfortunately all of our source Postgres table column names are in CamelCase and enclosed in double quotes. For example, our source table column name in the Postgres DB is "CreatedDate"; the Spark job query looks for createddate and fails because no such lowercase column exists.
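A hedged sketch of one common workaround (the JDBC URL, schema, table name, and credentials below are placeholders, not details from the post): hand Spark a subquery in which the CamelCase Postgres columns are double-quoted and aliased explicitly, so the names Spark sees are unambiguous.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Quote the CamelCase column inside Postgres itself and alias it; the alias
# is what Spark will use as the DataFrame column name.
query = '(SELECT "CreatedDate" AS created_date FROM public.my_table) AS src'

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/proddb")  # placeholder URL
      .option("dbtable", query)
      .option("user", "glue_user")          # placeholder credentials
      .option("password", "********")
      .option("driver", "org.postgresql.Driver")
      .load())

df.printSchema()

Inside a Glue job the same idea applies: push a custom query to the source instead of reading the table name directly, so the quoted CamelCase names never reach Spark's resolver.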

Spark find max of date partitioned column

Submitted by 情到浓时终转凉″ on 2020-12-31 06:24:44
Question: I have a Parquet dataset partitioned in the following way:

data
  /batch_date=2020-01-20
  /batch_date=2020-01-21
  /batch_date=2020-01-22
  /batch_date=2020-01-23
  /batch_date=2020-01-24

Here batch_date, the partition column, is of date type. I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is. I could use a simple group-by, something like df.groupby().agg(max(col('batch_date'))).first(). While this would work, it is a very inefficient way, since it scans the data just to find the latest partition value.
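A hedged sketch, assuming the dataset sits at the path data and is partitioned only by batch_date: list the partition directories with the Hadoop FileSystem API instead of scanning the Parquet files, pick the lexicographically largest one (ISO dates sort chronologically), and let partition pruning read only that slice.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

base_path = "data"  # assumed location of the partitioned dataset

# List the batch_date=... directories without touching any Parquet files.
jvm_path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path(base_path)
fs = jvm_path.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
partitions = [f.getPath().getName()
              for f in fs.listStatus(jvm_path)
              if f.isDirectory() and f.getPath().getName().startswith("batch_date=")]

# ISO-formatted dates sort lexicographically, so max() is the latest partition.
latest = max(partitions).split("=", 1)[1]

# Filtering on the partition column prunes the read to that single directory.
df_latest = spark.read.parquet(base_path).where(F.col("batch_date") == latest)

If the data is registered as a table, running SHOW PARTITIONS and taking the max over the result is an alternative that avoids the filesystem API entirely.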

PySpark DataFrame: Custom Explode Function

Submitted by [亡魂溺海] on 2020-12-31 05:08:41
Question: How can I implement a custom explode function using UDFs, so that we get extra information on the items? For example, along with the items, I want to have the items' indices. The part I do not know how to do is when a UDF returns multiple values and we need to place those values on separate rows.

Answer 1: If you need a custom explode function, then you need to write a UDF that takes an array and returns an array. For example, for this DataFrame:

df = spark.createDataFrame([(['a', 'b', 'c'],), (['d', 'e'],)], ['array'])
df.show()
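A hedged sketch of where that answer is heading: a UDF that maps the input array to an array of (pos, value) structs, which is then exploded so each element lands on its own row alongside its index. (The built-in posexplode gives plain indices without a UDF; the UDF form is shown because it generalises to arbitrary per-item information.)

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(['a', 'b', 'c'],), (['d', 'e'],)], ['array'])

item_schema = T.ArrayType(T.StructType([
    T.StructField("pos", T.IntegerType()),
    T.StructField("value", T.StringType()),
]))

@F.udf(returnType=item_schema)
def with_index(arr):
    # Attach each element's index; any other per-item info could be added here.
    return [(i, v) for i, v in enumerate(arr or [])]

result = (df
          .withColumn("item", F.explode(with_index("array")))
          .select("array", "item.pos", "item.value"))
result.show()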

How to do this transformation in SQL/Spark/GraphFrames

Submitted by 北战南征 on 2020-12-31 04:32:48
Question: I have a table containing the following two columns:

Device-Id  Account-Id
d1         a1
d2         a1
d1         a2
d2         a3
d3         a4
d3         a5
d4         a6
d1         a4

Device-Id is the unique id of the device on which my app is installed, and Account-Id is the id of a user account. A user can have multiple devices and can create multiple accounts on the same device (e.g. device d1 has accounts a1, a2 and a3 set up). I want to find the unique actual users (each represented as a new column with some unique UUID in the generated table) and the devices and accounts that belong to each of them.
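A hedged sketch with GraphFrames (assumes the graphframes package is on the classpath; snake_case column names stand in for the hyphenated originals): treat devices and accounts as vertices of a bipartite graph, run connected components, and use the component id as the "actual user" identifier.

from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # required by connectedComponents()

# The question's table, with snake_case names to avoid quoting hyphens.
pairs = spark.createDataFrame(
    [("d1", "a1"), ("d2", "a1"), ("d1", "a2"), ("d2", "a3"),
     ("d3", "a4"), ("d3", "a5"), ("d4", "a6"), ("d1", "a4")],
    ["device_id", "account_id"])

# Prefix ids so a device and an account can never collide as graph vertices.
edges = pairs.select(
    F.concat(F.lit("d:"), "device_id").alias("src"),
    F.concat(F.lit("a:"), "account_id").alias("dst"))
vertices = (edges.select(F.col("src").alias("id"))
            .union(edges.select(F.col("dst").alias("id")))
            .distinct())

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # one component per "actual user"

# Attach the component id (a stand-in for the unique user UUID) to every
# device/account pair by joining on the device vertex.
result = (pairs
          .join(components,
                F.concat(F.lit("d:"), pairs["device_id"]) == components["id"])
          .select("device_id", "account_id", F.col("component").alias("user_id")))
result.show()

A generated UUID per distinct component value can replace the numeric component id afterwards if a UUID is strictly required.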

How to check if a particular directory exists in an S3 bucket using pyspark and boto3

Submitted by 喜夏-厌秋 on 2020-12-30 03:47:51
Question: How do I check whether a particular file is present inside a particular directory in my S3 bucket? I use boto3 and tried this code (which doesn't work):

import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
key = 'dootdoot.jpg'
objs = list(bucket.objects.filter(Prefix=key))
if len(objs) > 0 and objs[0].key == key:
    print("Exists!")
else:
    print("Doesn't exist")

Answer 1: Please try the following code. Get the subdirectory (folder) info:

folders = bucket.list("", "/")
for folder in folders:
    print
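A hedged boto3-only sketch (the bucket name and key are the question's placeholders; the directory prefix is made up): list_objects_v2 with the key as Prefix answers whether the exact object exists, and the same call with a trailing-slash prefix and MaxKeys=1 tells you whether a "directory" contains anything, since S3 has no real folders.

import boto3

s3 = boto3.client("s3")
bucket_name = "my-bucket"

def object_exists(key):
    # Exact-key check: true only if an object with exactly this key exists.
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))

def prefix_exists(prefix):
    # "Directory" check: true if at least one object key starts with the prefix.
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

print(object_exists("dootdoot.jpg"))
print(prefix_exists("some/folder/"))  # made-up prefix for illustration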