pyspark-sql

How to convert empty arrays to nulls?

Submitted by 感情迁移 on 2019-12-05 19:17:35
I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|       []|         []|
|1112| [45, 46]|   [50, 50]|
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which does not work.

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+

For your given dataframe ,
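A minimal sketch of one common approach (not the thread's truncated answer, and assuming the frame above is bound to df): test each array's length with size() and null it out with when/otherwise.

from pyspark.sql import functions as F

# Replace zero-length arrays with null; column names taken from the example above.
for c in ["count(AS)", "count(asdr)"]:
    df = df.withColumn(c, F.when(F.size(df[c]) == 0, None).otherwise(df[c]))

df.show()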

Convert PySpark dataframe column from list to string

Submitted by 一曲冷凌霜 on 2019-12-05 16:34:58
Question: I have this PySpark dataframe:

+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
|1          |[test, test2, test3]|
|2          |[test4, test, test6]|
|3          |[test6, test9, t55o]|
+-----------+--------------------+

and I want to convert the column test_123 to look like this:

+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
|1          |"test,test2,test3"  |
|2          |"test4,test,test6"  |
|3          |"test6,test9,t55o"  |
+-----------+--------------------+

that is, from a list to a string. How can I do it with PySpark?

Answer 1: You can
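A minimal sketch of one common approach (assuming the frame above is df): concat_ws() joins the elements of an array column into a single delimited string.

from pyspark.sql import functions as F

# Join the array elements of test_123 into one comma-separated string per row.
df = df.withColumn("test_123", F.concat_ws(",", F.col("test_123")))
df.show(truncate=False)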

Read in CSV in Pyspark with correct Datatypes

Submitted by 我的未来我决定 on 2019-12-05 16:21:29
When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only contain integers and a timestamp type. To be more specific, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it all the entries are returned as NULL. I use the following to create a custom schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType,
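For reference, a minimal sketch of a schema that matches the sample row above. The NULLs usually come from the date column: "15.11.2005" needs an explicit dateFormat, and the type choices and file path below are assumptions, not the code from the linked question.

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType)

schema = StructType([
    StructField("Customer", IntegerType()),
    StructField("TransDate", DateType()),
    StructField("Quantity", IntegerType()),
    StructField("PurchAmount", DoubleType()),
    StructField("Cost", DoubleType()),
    StructField("TransID", IntegerType()),
    StructField("TransKey", IntegerType()),
])

df = (spark.read
      .option("header", "true")
      .option("dateFormat", "dd.MM.yyyy")  # matches "15.11.2005"
      .schema(schema)
      .csv("path/to/file.csv"))            # hypothetical path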

ValueError: Cannot convert column into bool

Submitted by 扶醉桌前 on 2019-12-05 11:19:22
I'm trying to build a new column on a dataframe as below:

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()

But I get:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
  File "<stdin>", line 38, in <module>
  File "<stdin>", line 36, in calc_dif
  File "/usr/hdp/current/spark2-client/python/pyspark/sql
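The error is raised because a plain Python function is being evaluated against Column objects. A minimal sketch of the two usual fixes (an assumption, not necessarily the thread's accepted answer):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Option 1: wrap the function in a UDF so it runs on each row's values.
calc_dif_udf = F.udf(lambda x, y: x - y if (x > y) and (x == 1) else None, LongType())
dfNew = df.withColumn("calc", calc_dif_udf(df["_1"], df["_2"]))

# Option 2: express the same condition with when/otherwise, no Python UDF needed.
dfNew = df.withColumn(
    "calc",
    F.when((df["_1"] > df["_2"]) & (df["_1"] == 1), df["_1"] - df["_2"]))

dfNew.show()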

How to read streaming data in XML format from Kafka?

Submitted by 流过昼夜 on 2019-12-05 10:27:54
I am trying to read XML data from a Kafka topic using Spark Structured Streaming. I tried the Databricks spark-xml package, but I got an error saying that the package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming? My current code:

df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()

The error:

py4j.protocol.Py4JJavaError: An error occurred while calling o33
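A minimal sketch of one workaround (my assumption, not the thread's answer): keep the source format as "kafka", cast the message value to a string, and parse the XML in a Python UDF. The "subscribe" option and the SomeChild element name below are hypothetical.

import xml.etree.ElementTree as ET
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "test")
       .load()
       .selectExpr("CAST(value AS STRING) AS xml"))

def extract_child(xml_str):
    # Pull a single child element out of each <MainElement> message.
    try:
        return ET.fromstring(xml_str).findtext("SomeChild")
    except ET.ParseError:
        return None

extract_child_udf = F.udf(extract_child, StringType())
parsed = raw.withColumn("SomeChild", extract_child_udf(F.col("xml")))
# A writeStream sink is still needed to start the query.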

How to list all databases and tables in AWS Glue Catalog?

Submitted by 孤街醉人 on 2019-12-05 09:50:22
I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in the gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.

What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing from awsglue.context with no success. I spent several hours trying to find some information about the CatalogConnection class but haven't found anything (even in the aws-glue-libs repository, https://github.com/awslabs/aws-glue-libs ).

In my case
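A minimal sketch of one workaround (an assumption on my part, not something the Glue docs or the thread necessarily recommend): query the Glue Data Catalog directly with boto3 instead of going through the Spark contexts.

import boto3

# Assumes the endpoint's IAM role allows glue:GetDatabases / glue:GetTables.
glue = boto3.client("glue")

# List every database and table registered in the Glue Data Catalog.
for db in glue.get_databases()["DatabaseList"]:
    for tbl in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        print(db["Name"], tbl["Name"])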

Apache Spark OutOfMemoryError (HeapSpace)

Submitted by 天涯浪子 on 2019-12-05 09:29:21
I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If that is the case, the entire column is set to missing (null) for that group.

df = spark.read.parquet('path/to/parquet/')

check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():
    total = (df
             .groupBy('groupID').count()
             .toDF('groupID', 'n_total')
             )
    missing = (df
               .where(F.col(col).isNull())
               .groupBy('groupID').count()
               .toDF(
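A minimal sketch of one way to avoid running two groupBy jobs per column (my assumption about a fix, with hypothetical column names): compute every null fraction in a single aggregation and join it back once.

from pyspark.sql import functions as F

check_columns = ['col1', 'col2']  # hypothetical subset of the 8 columns

# One aggregation: the average of the 0/1 null indicator is the null fraction per group.
fractions = df.groupBy('groupID').agg(
    *[F.avg(F.col(c).isNull().cast('double')).alias(c + '_null_frac')
      for c in check_columns])

result = df.join(fractions, on='groupID')
for c in check_columns:
    result = result.withColumn(
        c, F.when(F.col(c + '_null_frac') > 0.5, None).otherwise(F.col(c)))
result = result.drop(*[c + '_null_frac' for c in check_columns])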

Passing Array to Python Spark Lit Function

Submitted by ↘锁芯ラ on 2019-12-04 23:34:18
Let's say I have a numpy array a that contains the numbers 1-10, so a is [1 2 3 4 5 6 7 8 9 10]. I also have a Python Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job, so I do the following:

df = df.withColumn("NewColumn", F.lit(a))

This doesn't work. The error is "Unsupported literal type class java.util.ArrayList". However, if I try just one element of the array, as follows, it works:

df = df.withColumn("NewColumn", F.lit(a[0]))

Is there a way I can do what I'm trying? I've been working on the task I want to complete for days and
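A minimal sketch of one common workaround (an assumption, not the thread's answer): build the literal array element by element, converting each numpy scalar to a plain Python int first, since lit() cannot handle numpy types directly.

from pyspark.sql import functions as F

# array() of individual lit() values avoids the "Unsupported literal type" error.
df = df.withColumn("NewColumn", F.array([F.lit(int(x)) for x in a]))
df.show(truncate=False)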

How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?

Submitted by 落爺英雄遲暮 on 2019-12-04 18:33:26
Here my DataFrame looks like this:

+----------------+-------------+
|   Business_Date|         Code|
+----------------+-------------+
|1539129600000000|          BSD|
|1539129600000000|          BTN|
|1539129600000000|          BVI|
|1539129600000000|          BWP|
|1539129600000000|          BYB|
+----------------+-------------+

I want to convert the Business_Date column from bigint to a timestamp value while loading the data into a Hive table. How can I do this?

You can use pyspark.sql.functions.from_unixtime(), which converts a number of seconds from the unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in
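A minimal sketch along those lines (assuming, from the 16-digit sample values, that Business_Date holds microseconds since the epoch, so it has to be divided down to seconds first):

from pyspark.sql import functions as F

df = df.withColumn(
    "Business_Date",
    F.from_unixtime((F.col("Business_Date") / 1000000).cast("bigint")).cast("timestamp"))
df.show(truncate=False)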

how to store grouped data into json in pyspark

Submitted by 我的未来我决定 on 2019-12-04 15:01:21
I am new to PySpark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key, where my key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840), PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA },
 { PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:6659079, PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND
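A minimal sketch of one way to produce that shape (an assumption based on the column names visible in the example; to_json over an array of structs needs a reasonably recent Spark, 2.2 or later): concatenate the key, collect each group's rows into structs, and serialize the collected list as JSON.

from pyspark.sql import functions as F

grouped = (df
    .withColumn("key", F.concat(F.col("div_nbr").cast("string"),
                                F.col("cust_nbr").cast("string")))
    .groupBy("key")
    .agg(F.to_json(F.collect_list(F.struct(
        "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
        "PROD_BRND", "PACK_SIZE", "QTY_UOM"))).alias("json_value")))

grouped.show(truncate=False)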