pyspark

Select spark dataframe column with special character in it using selectExpr

Submitted by 夙愿已清 on 2021-01-01 04:29:11
Question: I am in a scenario where my column name is Município, with an accent on the letter í. My selectExpr command is failing because of it. Is there a way to fix it? Basically I have something like the following expression: .selectExpr("...CAST(Município AS string) AS Município..."). What I really want is to keep the column with the same name it arrived with, so that in the future I won't run into this kind of problem on different tables/files. How can I make a Spark DataFrame accept accents or other special characters in column names?
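A minimal sketch of one way this is commonly handled (the DataFrame below is a made-up stand-in, not the asker's data): Spark SQL can usually parse accented identifiers inside selectExpr when they are wrapped in backticks, and the alias keeps the original name intact.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with an accented column name, for illustration only.
df = spark.createDataFrame([(1, "Lisboa")], ["id", "Município"])

# Backticks let selectExpr parse identifiers containing accents or other
# special characters; the alias keeps the original column name unchanged.
result = df.selectExpr("id", "CAST(`Município` AS string) AS `Município`")
result.printSchema()

The same backtick quoting applies to any selectExpr or SQL expression that references the column, so the accented name can be kept end to end.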

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Submitted by 谁都会走 on 2020-12-31 20:17:46
Question (problem statement / root cause): We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark only supports lowercase table column names, and unfortunately all of our source Postgres table column names are in CamelCase and enclosed in double quotes. For example, our source table column name in the Postgres DB is "CreatedDate"; the Spark job query looks for createddate and fails because no such lowercase column exists.
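A hedged sketch of one common workaround (the JDBC URL, schema, table name, and credentials below are placeholders, not details from the post): hand Spark a subquery in which the CamelCase Postgres columns are double-quoted and aliased explicitly, so the names Spark sees are unambiguous.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Quote the CamelCase column inside Postgres itself and alias it; the alias
# is what Spark will use as the DataFrame column name.
query = '(SELECT "CreatedDate" AS created_date FROM public.my_table) AS src'

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/proddb")  # placeholder URL
      .option("dbtable", query)
      .option("user", "glue_user")          # placeholder credentials
      .option("password", "********")
      .option("driver", "org.postgresql.Driver")
      .load())

df.printSchema()

Inside a Glue job the same idea applies: push a custom query to the source instead of reading the table name directly, so the quoted CamelCase names never reach Spark's resolver.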

Spark find max of date partitioned column

Submitted by 情到浓时终转凉″ on 2020-12-31 06:24:44
Question: I have a Parquet dataset partitioned in the following way:

data
  /batch_date=2020-01-20
  /batch_date=2020-01-21
  /batch_date=2020-01-22
  /batch_date=2020-01-23
  /batch_date=2020-01-24

Here batch_date, the partition column, is of date type. I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is. I could use a simple group-by, something like df.groupby().agg(max(col('batch_date'))).first(). While this would work, it is a very inefficient way, since it scans the data just to find the latest partition value.
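A hedged sketch, assuming the dataset sits at the path data and is partitioned only by batch_date: list the partition directories with the Hadoop FileSystem API instead of scanning the Parquet files, pick the lexicographically largest one (ISO dates sort chronologically), and let partition pruning read only that slice.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

base_path = "data"  # assumed location of the partitioned dataset

# List the batch_date=... directories without touching any Parquet files.
jvm_path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path(base_path)
fs = jvm_path.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
partitions = [f.getPath().getName()
              for f in fs.listStatus(jvm_path)
              if f.isDirectory() and f.getPath().getName().startswith("batch_date=")]

# ISO-formatted dates sort lexicographically, so max() is the latest partition.
latest = max(partitions).split("=", 1)[1]

# Filtering on the partition column prunes the read to that single directory.
df_latest = spark.read.parquet(base_path).where(F.col("batch_date") == latest)

If the data is registered as a table, running SHOW PARTITIONS and taking the max over the result is an alternative that avoids the filesystem API entirely.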

PySpark DataFrame: Custom Explode Function

Submitted by [亡魂溺海] on 2020-12-31 05:08:41
Question: How can I implement a custom explode function using UDFs, so that we get extra information on the items? For example, along with the items, I want to have the items' indices. The part I do not know how to do is when a UDF returns multiple values and we need to place those values on separate rows.

Answer 1: If you need a custom explode function, then you need to write a UDF that takes an array and returns an array. For example, for this DataFrame:

df = spark.createDataFrame([(['a', 'b', 'c'],), (['d', 'e'],)], ['array'])
df.show()
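A hedged sketch of where that answer is heading: a UDF that maps the input array to an array of (pos, value) structs, which is then exploded so each element lands on its own row alongside its index. (The built-in posexplode gives plain indices without a UDF; the UDF form is shown because it generalises to arbitrary per-item information.)

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(['a', 'b', 'c'],), (['d', 'e'],)], ['array'])

item_schema = T.ArrayType(T.StructType([
    T.StructField("pos", T.IntegerType()),
    T.StructField("value", T.StringType()),
]))

@F.udf(returnType=item_schema)
def with_index(arr):
    # Attach each element's index; any other per-item info could be added here.
    return [(i, v) for i, v in enumerate(arr or [])]

result = (df
          .withColumn("item", F.explode(with_index("array")))
          .select("array", "item.pos", "item.value"))
result.show()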

How to do this transformation in SQL/Spark/GraphFrames

Submitted by 北战南征 on 2020-12-31 04:32:48
Question: I have a table containing the following two columns:

Device-Id  Account-Id
d1         a1
d2         a1
d1         a2
d2         a3
d3         a4
d3         a5
d4         a6
d1         a4

Device-Id is the unique id of the device on which my app is installed, and Account-Id is the id of a user account. A user can have multiple devices and can create multiple accounts on the same device (e.g. device d1 has accounts a1, a2 and a3 set up). I want to find the unique actual users (each represented as a new column with some unique UUID in the generated table) and the devices and accounts that belong to each of them.
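A hedged sketch with GraphFrames (assumes the graphframes package is on the classpath; snake_case column names stand in for the hyphenated originals): treat devices and accounts as vertices of a bipartite graph, run connected components, and use the component id as the "actual user" identifier.

from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # required by connectedComponents()

# The question's table, with snake_case names to avoid quoting hyphens.
pairs = spark.createDataFrame(
    [("d1", "a1"), ("d2", "a1"), ("d1", "a2"), ("d2", "a3"),
     ("d3", "a4"), ("d3", "a5"), ("d4", "a6"), ("d1", "a4")],
    ["device_id", "account_id"])

# Prefix ids so a device and an account can never collide as graph vertices.
edges = pairs.select(
    F.concat(F.lit("d:"), "device_id").alias("src"),
    F.concat(F.lit("a:"), "account_id").alias("dst"))
vertices = (edges.select(F.col("src").alias("id"))
            .union(edges.select(F.col("dst").alias("id")))
            .distinct())

g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # one component per "actual user"

# Attach the component id (a stand-in for the unique user UUID) to every
# device/account pair by joining on the device vertex.
result = (pairs
          .join(components,
                F.concat(F.lit("d:"), pairs["device_id"]) == components["id"])
          .select("device_id", "account_id", F.col("component").alias("user_id")))
result.show()

A generated UUID per distinct component value can replace the numeric component id afterwards if a UUID is strictly required.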

How to check if a particular directory exists in an S3 bucket using pyspark and boto3

Submitted by 喜夏-厌秋 on 2020-12-30 03:47:51
Question: How do I check whether a particular file is present inside a particular directory in my S3 bucket? I use boto3 and tried this code (which doesn't work):

import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
key = 'dootdoot.jpg'
objs = list(bucket.objects.filter(Prefix=key))
if len(objs) > 0 and objs[0].key == key:
    print("Exists!")
else:
    print("Doesn't exist")

Answer 1: Please try the following code. Get the subdirectory (folder) info:

folders = bucket.list("", "/")
for folder in folders:
    print
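A hedged boto3-only sketch (the bucket name and key are the question's placeholders; the directory prefix is made up): list_objects_v2 with the key as Prefix answers whether the exact object exists, and the same call with a trailing-slash prefix and MaxKeys=1 tells you whether a "directory" contains anything, since S3 has no real folders.

import boto3

s3 = boto3.client("s3")
bucket_name = "my-bucket"

def object_exists(key):
    # Exact-key check: true only if an object with exactly this key exists.
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))

def prefix_exists(prefix):
    # "Directory" check: true if at least one object key starts with the prefix.
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

print(object_exists("dootdoot.jpg"))
print(prefix_exists("some/folder/"))  # made-up prefix for illustration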