pyspark

Pyspark clean data within dataframe

Submitted by 只谈情不闲聊 on 2021-01-29 14:26:52

Question: I have the following file, data.json, which I am trying to clean using PySpark:

{"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
{"positionmessage":{"callsign": , "name": , "mmsi": 200,"timestamplast"
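A minimal cleaning sketch, assuming the goal is to flatten the nested positionmessage struct and drop the malformed records with missing callsign/name; the field names come from the sample above, but the exact cleaning rules are an assumption:

from pyspark.sql.functions import col, to_timestamp

# Read the newline-delimited JSON; malformed lines come back with a null struct.
df = spark.read.json("data.json")

clean = (
    df.select("positionmessage.*")                                   # flatten the nested struct
      .withColumn("timestamplast", to_timestamp(col("timestamplast")))
      .dropna(subset=["callsign", "name"])                           # drop the broken records
)
clean.show(truncate=False)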

Extract Embedded AWS Glue Connection Credentials Using Scala

Submitted by 我的未来我决定 on 2021-01-29 14:17:51

Question: I have a Glue job that reads directly from Redshift, and to do that one has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?

glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
table = spark.read.format(
    'com.databricks.spark.redshift'
).option(
    'url', 'jdbc:redshift://prod
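For reference, a hedged completion of the PySpark pattern shown in the question (a Scala port would call the same Glue GetConnection API through the AWS SDK for Java instead of boto3). The connection name, table and temp dir below are placeholders, and the connection property keys are the ones typically returned for a Glue JDBC connection:

import boto3

glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',   # placeholder connection name
    HidePassword=False,
)
props = response['Connection']['ConnectionProperties']
jdbc_url = props['JDBC_CONNECTION_URL']   # assumed property keys for a JDBC connection
user = props['USERNAME']
password = props['PASSWORD']

table = (
    spark.read.format('com.databricks.spark.redshift')
         .option('url', f'{jdbc_url}?user={user}&password={password}')
         .option('dbtable', 'my_schema.my_table')       # placeholder table
         .option('tempdir', 's3://my-temp-bucket/tmp')  # placeholder S3 temp dir
         .load()
)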

Read CSV file in pyspark with ANSI encoding

Submitted by 荒凉一梦 on 2021-01-29 13:25:54

Question: I am trying to read in a CSV/text file that needs to be read using ANSI encoding, but this is not working. Any ideas?

mainDF = spark.read.format("csv")\
    .option("encoding","ANSI")\
    .option("header","true")\
    .option("maxRowsInMemory",1000)\
    .option("inferSchema","false")\
    .option("delimiter", "¬")\
    .load(path)

java.nio.charset.UnsupportedCharsetException: ANSI

The file is over 5 GB, hence the Spark requirement. I have also tried "ansi" in lower case.

Answer 1: ISO-8859-1 is the same as ANSI
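A minimal sketch of the fix the answer suggests: "ANSI" is not a real charset name, so pass a concrete one such as ISO-8859-1 (or windows-1252, the usual Windows "ANSI" code page) instead:

mainDF = (
    spark.read.format("csv")
         .option("encoding", "ISO-8859-1")   # or "windows-1252", depending on the source system
         .option("header", "true")
         .option("inferSchema", "false")
         .option("delimiter", "¬")
         .load(path)
)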

Pyspark: How to count the number of each equal distance interval in RDD

Submitted by ↘锁芯ラ on 2021-01-29 12:33:11

Question: I have an RDD[Double] and I want to divide it into k equal intervals, then count the number of elements in each interval. For example, the RDD is [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10]. As you can see, each element of the RDD falls into one of the intervals. Then I want to count the number of elements in each interval. Here, there is one element in [0,1), [1,2), [2
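A minimal sketch using RDD.histogram, which, given an integer, splits the [min, max] range into that many equal-width buckets and returns the bucket boundaries together with the per-bucket counts (the last bucket is closed on the right, matching [9,10] above):

rdd = sc.parallelize([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 7.0, 7.0, 10.0])

buckets, counts = rdd.histogram(10)
# buckets -> [0.0, 1.0, 2.0, ..., 10.0]
# counts  -> [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]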

PySpark and time series data: how to smartly avoid overlapping dates?

Submitted by 谁都会走 on 2021-01-29 12:06:08

Question: I have the following sample Spark dataframe:

import datetime as dt
import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18
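A hedged sketch of one common way to merge overlapping date ranges per key with window functions; the column names id, start and end are assumptions, since the dataframe construction above is cut off:

import pyspark.sql.functions as fn
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("start")

merged = (
    df.withColumn("prev_max_end",
                  fn.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
      .withColumn("new_group",                      # 1 when this interval does not overlap any earlier one
                  fn.when(fn.col("prev_max_end").isNull()
                          | (fn.col("start") > fn.col("prev_max_end")), 1).otherwise(0))
      .withColumn("group_id", fn.sum("new_group").over(w))   # running sum labels each merged block
      .groupBy("id", "group_id")
      .agg(fn.min("start").alias("start"), fn.max("end").alias("end"))
)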

parse pyspark column values into new columns

Submitted by 谁说我不能喝 on 2021-01-29 11:31:40

Question: I have a PySpark dataframe like the example df below. It has 3 columns: organization_id, id, and query_builder. The query_builder column contains a string that is similar to a nested dict. I would like to parse the query_builder field into separate columns for the field, operator, and value. I've supplied an example of the desired output below. If need be, I could convert the PySpark dataframe to a pandas dataframe to make it easier. Does anyone have suggestions or recognize the type of data in the
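A hedged sketch assuming the query_builder string is (or can be coerced into) valid JSON with field, operator and value keys; from_json with an explicit schema then splits it into columns. If the string is Python-dict-style rather than JSON, a UDF would be needed instead:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

qb_schema = StructType([
    StructField("field", StringType(), True),
    StructField("operator", StringType(), True),
    StructField("value", StringType(), True),
])

parsed = (
    df.withColumn("qb", from_json(col("query_builder"), qb_schema))
      .select("organization_id", "id",
              col("qb.field").alias("field"),
              col("qb.operator").alias("operator"),
              col("qb.value").alias("value"))
)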

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

Submitted by 狂风中的少年 on 2021-01-29 11:19:31

Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

    NA_counts  min  max
A   5          0    100
B   10         0    120
C   8          1    99
D   2          0    500

TIA

Answer 1: You can calculate each metric separately and then union all, like this:

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]
nulls_df = df
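A hedged completion of the answer's approach (column names from the question; the answer's union step is cut off above, so collecting each metric and assembling the summary locally is shown instead):

from pyspark.sql.functions import col, lit, when, sum as _sum, min as _min, max as _max

cols = ['A', 'B', 'C', 'D']

null_counts = df.select([_sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]).first()
min_vals = df.select([_min(col(c)).alias(c) for c in cols]).first()
max_vals = df.select([_max(col(c)).alias(c) for c in cols]).first()

# One row per column: (name, null count, min, max).
summary = spark.createDataFrame(
    [(c, null_counts[c], min_vals[c], max_vals[c]) for c in cols],
    ["column", "NA_counts", "min", "max"],
)
summary.show()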

How to use Pandas UDF Functionality in pyspark

Submitted by 别等时光非礼了梦想. on 2021-01-29 10:52:12

Question: I have a Spark dataframe with two columns which looks like:

docId                                                          | id
DYSDG6-RTB-91d663dd-949e-45da-94dd-e604b6050cb5-1537142434000  | 91d663dd-949e-45da-94dd-e604b6050cb5
VAVLS7-RTB-8e2c1917-0d6b-419b-a59e-cd4acc255bb7-1537142445000  | 8e2c1917-0d6b-419b-a59e-cd4acc255bb7
VAVLS7-RTB-c818dcde
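A minimal sketch of a scalar pandas UDF; the concrete task is not stated before the excerpt cuts off, so as a stand-in it extracts the embedded UUID from docId (the regex and column names are assumptions based on the sample rows):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

UUID_RE = r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"

@pandas_udf(StringType())          # Spark 3.x style with type hints
def extract_id(doc_id: pd.Series) -> pd.Series:
    # Works on a whole pandas Series per batch instead of row by row.
    return doc_id.str.extract(UUID_RE, expand=False)

df = df.withColumn("extracted_id", extract_id("docId"))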

creating dataframe specific schema : StructField starting with capital letter

Submitted by 拥有回忆 on 2021-01-29 09:58:53

Question: Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context... In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same specific schema). The schema definition looks like this:

myschema_xb = StructType([
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
    StructField("MyIds", ArrayType(
        StructType([
            StructField("_ID
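A minimal sketch of the row-into-empty-dataframe step, assuming a trimmed-down version of the schema above (only the first two fields, with placeholder values); capitalised field names such as _Version are kept exactly as defined in the StructType:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

myschema_xb = StructType([
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
])

empty_df = spark.createDataFrame([], myschema_xb)
row_df = spark.createDataFrame(
    [Row(_xmlns="http://example.com/ns", _Version=1.2)],   # placeholder values
    myschema_xb,
)
result = empty_df.union(row_df)
result.show(truncate=False)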

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

Submitted by 自闭症网瘾萝莉.ら on 2021-01-29 09:48:01

Question: I am using PySpark Streaming to read Kafka data, but it fails:

import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'

sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream
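A hedged sketch of the usual fix for this pair of errors: the Kafka artifact needs the Scala version suffix in its Maven coordinates, and PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created. The _2.11 suffix below is an assumption; it has to match the Scala build of the Spark installation:

import os

# Must be set before any SparkContext exists, otherwise the package is never put on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})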