pyspark

Pyspark clean data within dataframe

Submitted by 只谈情不闲聊 on 2021-01-29 14:26:52

Question: I have the following file, data.json, which I am trying to clean using PySpark:

{"positionmessage":{"callsign": "PPH1", "name": "testschip-10", "mmsi": 100,"timestamplast": "2019-08-01T00:00:08Z"}}
{"positionmessage":{"callsign": "PPH2", "name": "testschip-11", "mmsi": 200,"timestamplast": "2019-08-01T00:00:01Z"}}
{"positionmessage":{"callsign": "PPH3", "name": "testschip-10", "mmsi": 300,"timestamplast": "2019-08-01T00:00:05Z"}}
{"positionmessage":{"callsign": , "name": , "mmsi": 200,"timestamplast"
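A minimal cleaning sketch, assuming the goal is to flatten the nested positionmessage struct and drop the malformed records with missing callsign/name; the field names come from the sample above, but the exact cleaning rules are an assumption:

from pyspark.sql.functions import col, to_timestamp

# Read the newline-delimited JSON; malformed lines come back with a null struct.
df = spark.read.json("data.json")

clean = (
    df.select("positionmessage.*")                                   # flatten the nested struct
      .withColumn("timestamplast", to_timestamp(col("timestamplast")))
      .dropna(subset=["callsign", "name"])                           # drop the broken records
)
clean.show(truncate=False)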

Extract Embedded AWS Glue Connection Credentials Using Scala

Submitted by 我的未来我决定 on 2021-01-29 14:17:51

Question: I have a Glue job that reads directly from Redshift, and to do that one has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?

glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
table = spark.read.format(
    'com.databricks.spark.redshift'
).option(
    'url', 'jdbc:redshift://prod
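For reference, a hedged completion of the PySpark pattern shown in the question (a Scala port would call the same Glue GetConnection API through the AWS SDK for Java instead of boto3). The connection name, table and temp dir below are placeholders, and the connection property keys are the ones typically returned for a Glue JDBC connection:

import boto3

glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',   # placeholder connection name
    HidePassword=False,
)
props = response['Connection']['ConnectionProperties']
jdbc_url = props['JDBC_CONNECTION_URL']   # assumed property keys for a JDBC connection
user = props['USERNAME']
password = props['PASSWORD']

table = (
    spark.read.format('com.databricks.spark.redshift')
         .option('url', f'{jdbc_url}?user={user}&password={password}')
         .option('dbtable', 'my_schema.my_table')       # placeholder table
         .option('tempdir', 's3://my-temp-bucket/tmp')  # placeholder S3 temp dir
         .load()
)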

Read CSV file in pyspark with ANSI encoding

Submitted by 荒凉一梦 on 2021-01-29 13:25:54

Question: I am trying to read in a CSV/text file that needs to be read using ANSI encoding, but this is not working. Any ideas?

mainDF = spark.read.format("csv")\
    .option("encoding","ANSI")\
    .option("header","true")\
    .option("maxRowsInMemory",1000)\
    .option("inferSchema","false")\
    .option("delimiter", "¬")\
    .load(path)

java.nio.charset.UnsupportedCharsetException: ANSI

The file is over 5 GB, hence the Spark requirement. I have also tried "ansi" in lower case.

Answer 1: ISO-8859-1 is the same as ANSI
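A minimal sketch of the fix the answer suggests: "ANSI" is not a real charset name, so pass a concrete one such as ISO-8859-1 (or windows-1252, the usual Windows "ANSI" code page) instead:

mainDF = (
    spark.read.format("csv")
         .option("encoding", "ISO-8859-1")   # or "windows-1252", depending on the source system
         .option("header", "true")
         .option("inferSchema", "false")
         .option("delimiter", "¬")
         .load(path)
)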

Pyspark: How to count the number of each equal distance interval in RDD

Submitted by ↘锁芯ラ on 2021-01-29 12:33:11

Question: I have an RDD[Double] and I want to divide it into k equal intervals, then count the number of elements in each interval. For example, the RDD is [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10]. As you can see, each element of the RDD falls into one of the intervals. Then I want to count the number of elements in each interval. Here, there is one element in [0,1), [1,2), [2
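A minimal sketch using RDD.histogram, which, given an integer, splits the [min, max] range into that many equal-width buckets and returns the bucket boundaries together with the per-bucket counts (the last bucket is closed on the right, matching [9,10] above):

rdd = sc.parallelize([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 7.0, 7.0, 10.0])

buckets, counts = rdd.histogram(10)
# buckets -> [0.0, 1.0, 2.0, ..., 10.0]
# counts  -> [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]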

PySpark and time series data: how to smartly avoid overlapping dates?

Submitted by 谁都会走 on 2021-01-29 12:06:08

Question: I have the following sample Spark dataframe:

import datetime as dt
import pandas as pd
import pyspark
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

raw_df = pd.DataFrame([
    (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
    (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
    (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
    (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
    (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18
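A hedged sketch of one common way to merge overlapping date ranges per key with window functions; the column names id, start and end are assumptions, since the dataframe construction above is cut off:

import pyspark.sql.functions as fn
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("start")

merged = (
    df.withColumn("prev_max_end",
                  fn.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
      .withColumn("new_group",                      # 1 when this interval does not overlap any earlier one
                  fn.when(fn.col("prev_max_end").isNull()
                          | (fn.col("start") > fn.col("prev_max_end")), 1).otherwise(0))
      .withColumn("group_id", fn.sum("new_group").over(w))   # running sum labels each merged block
      .groupBy("id", "group_id")
      .agg(fn.min("start").alias("start"), fn.max("end").alias("end"))
)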

parse pyspark column values into new columns

Submitted by 谁说我不能喝 on 2021-01-29 11:31:40

Question: I have a PySpark dataframe like the example df below. It has 3 columns: organization_id, id, and query_builder. The query_builder column contains a string that is similar to a nested dict. I would like to parse the query_builder field into separate columns for the field, operator, and value. I've supplied an example of the desired output below. If need be, I could convert the PySpark dataframe to a pandas dataframe to make it easier. Does anyone have suggestions or recognize the type of data in the
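A hedged sketch assuming the query_builder string is (or can be coerced into) valid JSON with field, operator and value keys; from_json with an explicit schema then splits it into columns. If the string is Python-dict-style rather than JSON, a UDF would be needed instead:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

qb_schema = StructType([
    StructField("field", StringType(), True),
    StructField("operator", StringType(), True),
    StructField("value", StringType(), True),
])

parsed = (
    df.withColumn("qb", from_json(col("query_builder"), qb_schema))
      .select("organization_id", "id",
              col("qb.field").alias("field"),
              col("qb.operator").alias("operator"),
              col("qb.value").alias("value"))
)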

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

Submitted by 狂风中的少年 on 2021-01-29 11:19:31

Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I am looking for a quick way to get a table/dataframe like:

    NA_counts  min  max
A   5          0    100
B   10         0    120
C   8          1    99
D   2          0    500

TIA

Answer 1: You can calculate each metric separately and then union all, like this:

nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]
nulls_df = df
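A hedged completion of the answer's approach (column names from the question; the answer's union step is cut off above, so collecting each metric and assembling the summary locally is shown instead):

from pyspark.sql.functions import col, lit, when, sum as _sum, min as _min, max as _max

cols = ['A', 'B', 'C', 'D']

null_counts = df.select([_sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]).first()
min_vals = df.select([_min(col(c)).alias(c) for c in cols]).first()
max_vals = df.select([_max(col(c)).alias(c) for c in cols]).first()

# One row per column: (name, null count, min, max).
summary = spark.createDataFrame(
    [(c, null_counts[c], min_vals[c], max_vals[c]) for c in cols],
    ["column", "NA_counts", "min", "max"],
)
summary.show()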

How to use Pandas UDF Functionality in pyspark

Submitted by 别等时光非礼了梦想. on 2021-01-29 10:52:12

Question: I have a Spark dataframe with two columns which looks like:

docId                                                          | id
DYSDG6-RTB-91d663dd-949e-45da-94dd-e604b6050cb5-1537142434000  | 91d663dd-949e-45da-94dd-e604b6050cb5
VAVLS7-RTB-8e2c1917-0d6b-419b-a59e-cd4acc255bb7-1537142445000  | 8e2c1917-0d6b-419b-a59e-cd4acc255bb7
VAVLS7-RTB-c818dcde
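A minimal sketch of a scalar pandas UDF; the concrete task is not stated before the excerpt cuts off, so as a stand-in it extracts the embedded UUID from docId (the regex and column names are assumptions based on the sample rows):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

UUID_RE = r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"

@pandas_udf(StringType())          # Spark 3.x style with type hints
def extract_id(doc_id: pd.Series) -> pd.Series:
    # Works on a whole pandas Series per batch instead of row by row.
    return doc_id.str.extract(UUID_RE, expand=False)

df = df.withColumn("extracted_id", extract_id("docId"))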

creating dataframe specific schema : StructField starting with capital letter

Submitted by 拥有回忆 on 2021-01-29 09:58:53

Question: Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context... In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same specific schema). The schema definition looks like this:

myschema_xb = StructType([
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
    StructField("MyIds", ArrayType(
        StructType([
            StructField("_ID
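A minimal sketch of the row-into-empty-dataframe step, assuming a trimmed-down version of the schema above (only the first two fields, with placeholder values); capitalised field names such as _Version are kept exactly as defined in the StructType:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

myschema_xb = StructType([
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
])

empty_df = spark.createDataFrame([], myschema_xb)
row_df = spark.createDataFrame(
    [Row(_xmlns="http://example.com/ns", _Version=1.2)],   # placeholder values
    myschema_xb,
)
result = empty_df.union(row_df)
result.show(truncate=False)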

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

Submitted by 自闭症网瘾萝莉.ら on 2021-01-29 09:48:01

Question: I am using PySpark Streaming to read Kafka data, but it fails:

import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'

sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream
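A hedged sketch of the usual fix for this pair of errors: the Kafka artifact needs the Scala version suffix in its Maven coordinates, and PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created. The _2.11 suffix below is an assumption; it has to match the Scala build of the Spark installation:

import os

# Must be set before any SparkContext exists, otherwise the package is never put on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})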