Question
I have a large dataset (5 GB) in JSON form in an S3 bucket. I need to transform the schema of the data and write the transformed data back to S3 using an ETL script.
So I use a crawler to detect the schema, load the data into a PySpark dataframe, and change the schema. Then I iterate over every row in the dataframe, convert it to a dictionary, remove null columns, convert the dictionary to a string, and write it back to S3. Following is the code:
# df is the pyspark dataframe
import json
import boto3

columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
print(cnt)
I have used toLocalIterator. Does the above code execute serially? If so, how can I optimize it? Is there a better approach for this logic?
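For reference, toLocalIterator does make the loop serial: partitions are fetched to the driver one at a time and every S3 put happens there. A common alternative (a sketch, not the poster's code) is to run the null-stripping and the writes on the executors via foreachPartition. The S3 call is stubbed out behind a hypothetical `put_object` callable so the helper stays runnable without AWS credentials:

```python
import json

def strip_nulls(record):
    """Serialize a dict to JSON with None-valued keys removed."""
    return json.dumps({k: v for k, v in record.items() if v is not None})

def write_partition(rows, put_object=None):
    """Process one partition's rows. `put_object` is a hypothetical
    callable standing in for boto3's s3.Object(...).put, so this
    sketch can be exercised without AWS access."""
    written = []
    for row in rows:
        body = strip_nulls(row)
        if put_object is not None:
            put_object(body)
        written.append(body)
    return written

# On a real DataFrame this would run on the executors, not the driver:
# df.rdd.map(lambda r: r.asDict(True)).foreachPartition(
#     lambda rows: write_partition(rows, put_object=my_s3_put))
```

Note that writing one S3 object per row is itself a bottleneck at 5 GB; the answers below avoid the per-row loop entirely.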
Answer 1:
Assuming each row in the dataset is a JSON string:
import json
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())

df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input dataframe already has the columns, and the output should be JSON containing only the non-null columns:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London'),
    ],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

# df.write.format("csv").save("s3://path/to/file/")  # save to S3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
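Since drop_null_cols is plain Python, its behavior can be sanity-checked without Spark (a quick check of the UDF's logic, not part of the original answer):

```python
import json

def drop_null_cols(data):
    # same logic as the UDF above: parse, drop None values, re-serialize
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

print(drop_null_cols('{"name": "John", "age": null, "city": "New York"}'))
# → {"name": "John", "city": "New York"}
```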
Answer 2:
I'd follow the approach below (written in Scala, but it can be implemented in Python with minimal changes):
- Find the dataset count and name it totalCount:

val totalCount = inputDF.count()

- Find count(col) for all the dataframe columns and get a map of fields to their counts. Here the count is computed for every column of the input dataframe. Please note that count(anyCol) returns the number of rows for which the supplied column is non-null; for example, if a column has 10 rows and 5 of the values are null, then count(column) is 5. Fetch the first row as a Map[colName, count(colName)], referred to as fieldToCount:

val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}
// count(col) returns the number of rows for which the supplied column is non-null.
// count of an all-null column is 0
val row = inputDF.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long](inputDF.columns)
- Get the columns to be removed: use the map created in step 2 and mark every column whose count is less than totalCount as a column to be removed.
- Select all the columns with count == totalCount from the input dataframe, and save the processed output DataFrame anywhere, in any format, as per requirement. Please note that this approach removes every column having at least one null value:

val fieldToBool = fieldToCount.mapValues(_ < totalCount)
val keptCols = fieldToBool.filterNot(_._2).keys.toSeq
val processedDF = inputDF.select(keptCols.map(col): _*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform better than the one you currently have.
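The count-based column pruning described above can be sketched in plain Python (pure lists, no Spark; the function name is mine):

```python
def drop_columns_with_nulls(rows, columns):
    """Keep only columns whose non-null count equals the total row
    count, mirroring the count(col) == totalCount test above."""
    total = len(rows)
    counts = {c: sum(1 for r in rows if r.get(c) is not None) for c in columns}
    kept = [c for c in columns if counts[c] == total]
    return [{c: r[c] for c in kept} for r in rows]

rows = [
    {"name": "Ranga", "age": 25, "city": "Hyderabad"},
    {"name": "John", "age": None, "city": "New York"},
    {"name": None, "age": 31, "city": "London"},
]
print(drop_columns_with_nulls(rows, ["name", "age", "city"]))
# only "city" has no nulls, so it is the single surviving column
```

In Spark the counts come from a single aggregation pass over the data, which is why this beats a per-row driver loop.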
Answer 3:
I solved the above problem. We can simply query the dataframe for null values: df = df.filter(df.column.isNotNull()), thereby removing all rows where null is present. So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so 1024 queries in total, which is acceptable since SQL queries are parallelized.
Source: https://stackoverflow.com/questions/62102322/optimize-row-access-and-transformation-in-pyspark