Question
I have a large dataset (5 GB) in JSON form in an S3 bucket. I need to transform the schema of the data and write the transformed data back to S3 using an ETL script.
So I use a crawler to detect the schema, load the data into a PySpark dataframe, and change the schema. Then I iterate over every row in the dataframe, convert it to a dictionary, remove null columns, convert the dictionary to a string, and write it back to S3. Following is the code:
# df is the pyspark dataframe
import json
import boto3

columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
print(cnt)
I have used toLocalIterator. Does the above code execute serially? If so, how can I optimize it? Is there a better approach for this logic?
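For reference, toLocalIterator does make the loop serial: partitions are fetched to the driver one at a time and every S3 put happens there. A common alternative (a sketch, not the poster's code) is to run the null-stripping and the writes on the executors via foreachPartition. The S3 call is stubbed out behind a hypothetical `put_object` callable so the helper stays runnable without AWS credentials:

```python
import json

def strip_nulls(record):
    """Serialize a dict to JSON with None-valued keys removed."""
    return json.dumps({k: v for k, v in record.items() if v is not None})

def write_partition(rows, put_object=None):
    """Process one partition's rows. `put_object` is a hypothetical
    callable standing in for boto3's s3.Object(...).put, so this
    sketch can be exercised without AWS access."""
    written = []
    for row in rows:
        body = strip_nulls(row)
        if put_object is not None:
            put_object(body)
        written.append(body)
    return written

# On a real DataFrame this would run on the executors, not the driver:
# df.rdd.map(lambda r: r.asDict(True)).foreachPartition(
#     lambda rows: write_partition(rows, put_object=my_s3_put))
```

Note that writing one S3 object per row is itself a bottleneck at 5 GB; the answers below avoid the per-row loop entirely.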
Answer 1:
Assuming each row in the dataset is a JSON string:
import json
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())

df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input dataframe already has the columns, and the output should be JSON containing only the non-null columns:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London'),
    ],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

# df.write.format("csv").save("s3://path/to/file/")  # save to S3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
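Since drop_null_cols is plain Python, its behavior can be sanity-checked without Spark (a quick check of the UDF's logic, not part of the original answer):

```python
import json

def drop_null_cols(data):
    # same logic as the UDF above: parse, drop None values, re-serialize
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

print(drop_null_cols('{"name": "John", "age": null, "city": "New York"}'))
# → {"name": "John", "city": "New York"}
```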
Answer 2:
I'd follow the approach below (written in Scala, but it can be implemented in Python with minimal changes):
- Find the dataset count and name it totalCount:

val totalCount = inputDF.count()

- Find count(col) for all the dataframe columns and get a map of fields to their counts. Here the count is computed for every column of the input dataframe. Please note that count(anyCol) returns the number of rows for which the supplied column is non-null; for example, if a column has 10 rows and 5 of the values are null, then count(column) is 5. Fetch the first row as a Map[colName, count(colName)], referred to as fieldToCount:

val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}
// count(col) returns the number of rows for which the supplied column is non-null.
// count of an all-null column is 0
val row = inputDF.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long](inputDF.columns)
- Get the columns to be removed: use the map created in step 2 and mark every column whose count is less than totalCount as a column to be removed.
- Select all the columns with count == totalCount from the input dataframe, and save the processed output DataFrame anywhere, in any format, as per requirement. Please note that this approach removes every column having at least one null value:

val fieldToBool = fieldToCount.mapValues(_ < totalCount)
val keptCols = fieldToBool.filterNot(_._2).keys.toSeq
val processedDF = inputDF.select(keptCols.map(col): _*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform better than the one you currently have.
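The count-based column pruning described above can be sketched in plain Python (pure lists, no Spark; the function name is mine):

```python
def drop_columns_with_nulls(rows, columns):
    """Keep only columns whose non-null count equals the total row
    count, mirroring the count(col) == totalCount test above."""
    total = len(rows)
    counts = {c: sum(1 for r in rows if r.get(c) is not None) for c in columns}
    kept = [c for c in columns if counts[c] == total]
    return [{c: r[c] for c in kept} for r in rows]

rows = [
    {"name": "Ranga", "age": 25, "city": "Hyderabad"},
    {"name": "John", "age": None, "city": "New York"},
    {"name": None, "age": 31, "city": "London"},
]
print(drop_columns_with_nulls(rows, ["name", "age", "city"]))
# only "city" has no nulls, so it is the single surviving column
```

In Spark the counts come from a single aggregation pass over the data, which is why this beats a per-row driver loop.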
Answer 3:
I solved the above problem. We can simply query the dataframe for null values: df = df.filter(df.column.isNotNull()), thereby removing all rows where null is present. So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so 1024 queries in total, which is acceptable since SQL queries are parallelized.
Source: https://stackoverflow.com/questions/62102322/optimize-row-access-and-transformation-in-pyspark