PySpark: clean data within a DataFrame


Question


I have the following file, data.json, which I am trying to clean using PySpark.

{"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
{"positionmessage":{"callsign":       , "name":               , "mmsi": 200,"timestamplast": "2019-08-01T20:00:05Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast":                       }}

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DateType, FloatType, TimestampType
import pyspark.sql.functions as f

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('timestamplast', TimestampType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

file_name = "data.json"
df = spark.read.schema(schema).json(file_name).select("positionmessage.*")
df = df.withColumn("name", f.split(df['name'], '-')[1])  # strips the string "testschip-"
df.show()

The timestamplast column is not parsed correctly, because of the "T" between the date and the time. How do I fix this? Furthermore, I want to do the following operations (see the sketches after the list):

  1. Cast "name" to an integer after the "testschip-" prefix has been removed.
  2. Drop duplicates: when timestamplast is the same for a given name, the duplicate row should be removed from the dataframe.
  3. If the timestamp is missing, the whole row should be removed from the dataframe.
  4. Forward- or backward-fill missing numbers within each "name" group. timestamplast should not be forward/backward filled (duplicates and rows with missing timestamps have already been removed).
  5. Sort by timestamplast within each "name" group (timestamps must increase for a given name).
  6. Add a new column "time_delta" that gives the time difference between successive timestamplast values within each "name" group, with respect to the previous record.
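A minimal sketch of one way to handle the "T" in the timestamps, assuming the same data.json, field names, and spark session as above: read timestamplast as a plain string and parse it explicitly with to_timestamp, so the result does not depend on the JSON reader's default timestampFormat.

from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Read timestamplast as a string first, then convert it to a timestamp.
raw_schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('timestamplast', StringType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

# reuses the `spark` session created above
df = spark.read.schema(raw_schema).json("data.json").select("positionmessage.*")
df = df.withColumn("timestamplast",
                   f.to_timestamp("timestamplast", "yyyy-MM-dd'T'HH:mm:ss'Z'"))

Alternatively, keeping TimestampType in the schema and setting .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'") on the reader may be enough, depending on the Spark version.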
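For steps 1-3, a sketch of one possible approach, assuming the dataframe df produced above with columns callsign, name, timestamplast and mmsi:

from pyspark.sql import functions as f

# 1) keep only the numeric suffix of "name" and cast it to an integer
df = df.withColumn("name", f.split(f.col("name"), "-")[1].cast("int"))

# 2) drop rows where the same name repeats the same timestamplast
df = df.dropDuplicates(["name", "timestamplast"])

# 3) drop rows with a missing timestamp
df = df.dropna(subset=["timestamplast"])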
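For step 4, a sketch of a group-wise forward/backward fill using window functions; it assumes mmsi is the numeric column to fill (any other column except timestamplast could be treated the same way):

from pyspark.sql import Window
from pyspark.sql import functions as f

# forward fill: last non-null value up to and including the current row
w_ffill = (Window.partitionBy("name")
                 .orderBy("timestamplast")
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# backward fill: first non-null value from the current row onwards
w_bfill = (Window.partitionBy("name")
                 .orderBy("timestamplast")
                 .rowsBetween(Window.currentRow, Window.unboundedFollowing))

df = df.withColumn("mmsi", f.last("mmsi", ignorenulls=True).over(w_ffill))
df = df.withColumn("mmsi", f.first("mmsi", ignorenulls=True).over(w_bfill))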
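For steps 5 and 6, a sketch that sorts within each "name" group and computes time_delta with lag. Casting a timestamp to long gives epoch seconds, so the difference of two such casts is the gap in seconds; the first record of each group gets null.

from pyspark.sql import Window
from pyspark.sql import functions as f

# 5) sort by name, then by timestamplast within each name
df = df.orderBy("name", "timestamplast")

# 6) time difference (in seconds) to the previous record of the same name
w = Window.partitionBy("name").orderBy("timestamplast")
df = df.withColumn(
    "time_delta",
    f.col("timestamplast").cast("long") - f.lag("timestamplast").over(w).cast("long"))

df.show()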

Source: https://stackoverflow.com/questions/61899539/pyspark-clean-data-within-dataframe
