pyspark-sql

How to convert empty arrays to nulls?

Submitted by 感情迁移 on 2019-12-05 19:17:35
I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|       []|         []|
|1112| [45, 46]|   [50, 50]|
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which does not work.

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+

For your given dataframe ,
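A minimal sketch of one common approach (not the thread's truncated answer, and assuming the frame above is bound to df): test each array's length with size() and null it out with when/otherwise.

from pyspark.sql import functions as F

# Replace zero-length arrays with null; column names taken from the example above.
for c in ["count(AS)", "count(asdr)"]:
    df = df.withColumn(c, F.when(F.size(df[c]) == 0, None).otherwise(df[c]))

df.show()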

Convert PySpark dataframe column from list to string

Submitted by 一曲冷凌霜 on 2019-12-05 16:34:58
Question: I have this PySpark dataframe:

+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
|1          |[test, test2, test3]|
|2          |[test4, test, test6]|
|3          |[test6, test9, t55o]|
+-----------+--------------------+

and I want to convert the column test_123 to look like this:

+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
|1          |"test,test2,test3"  |
|2          |"test4,test,test6"  |
|3          |"test6,test9,t55o"  |
+-----------+--------------------+

that is, from a list to a string. How can I do it with PySpark?

Answer 1: You can
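A minimal sketch of one common approach (assuming the frame above is df): concat_ws() joins the elements of an array column into a single delimited string.

from pyspark.sql import functions as F

# Join the array elements of test_123 into one comma-separated string per row.
df = df.withColumn("test_123", F.concat_ws(",", F.col("test_123")))
df.show(truncate=False)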

Read in CSV in Pyspark with correct Datatypes

Submitted by 我的未来我决定 on 2019-12-05 16:21:29
When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only contain integers and a timestamp type. To be more specific, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it all the entries are returned as NULL. I use the following to create a custom schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType,
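For reference, a minimal sketch of a schema that matches the sample row above. The NULLs usually come from the date column: "15.11.2005" needs an explicit dateFormat, and the type choices and file path below are assumptions, not the code from the linked question.

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType)

schema = StructType([
    StructField("Customer", IntegerType()),
    StructField("TransDate", DateType()),
    StructField("Quantity", IntegerType()),
    StructField("PurchAmount", DoubleType()),
    StructField("Cost", DoubleType()),
    StructField("TransID", IntegerType()),
    StructField("TransKey", IntegerType()),
])

df = (spark.read
      .option("header", "true")
      .option("dateFormat", "dd.MM.yyyy")  # matches "15.11.2005"
      .schema(schema)
      .csv("path/to/file.csv"))            # hypothetical path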

ValueError: Cannot convert column into bool

Submitted by 扶醉桌前 on 2019-12-05 11:19:22
I'm trying to build a new column on a dataframe as below:

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()

But I get:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
  File "<stdin>", line 38, in <module>
  File "<stdin>", line 36, in calc_dif
  File "/usr/hdp/current/spark2-client/python/pyspark/sql
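The error is raised because a plain Python function is being evaluated against Column objects. A minimal sketch of the two usual fixes (an assumption, not necessarily the thread's accepted answer):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Option 1: wrap the function in a UDF so it runs on each row's values.
calc_dif_udf = F.udf(lambda x, y: x - y if (x > y) and (x == 1) else None, LongType())
dfNew = df.withColumn("calc", calc_dif_udf(df["_1"], df["_2"]))

# Option 2: express the same condition with when/otherwise, no Python UDF needed.
dfNew = df.withColumn(
    "calc",
    F.when((df["_1"] > df["_2"]) & (df["_1"] == 1), df["_1"] - df["_2"]))

dfNew.show()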

How to read streaming data in XML format from Kafka?

Submitted by 流过昼夜 on 2019-12-05 10:27:54
I am trying to read XML data from a Kafka topic using Spark Structured Streaming. I tried the Databricks spark-xml package, but I got an error saying that the package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming? My current code:

df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()

The error:

py4j.protocol.Py4JJavaError: An error occurred while calling o33
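A minimal sketch of one workaround (my assumption, not the thread's answer): keep the source format as "kafka", cast the message value to a string, and parse the XML in a Python UDF. The "subscribe" option and the SomeChild element name below are hypothetical.

import xml.etree.ElementTree as ET
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "test")
       .load()
       .selectExpr("CAST(value AS STRING) AS xml"))

def extract_child(xml_str):
    # Pull a single child element out of each <MainElement> message.
    try:
        return ET.fromstring(xml_str).findtext("SomeChild")
    except ET.ParseError:
        return None

extract_child_udf = F.udf(extract_child, StringType())
parsed = raw.withColumn("SomeChild", extract_child_udf(F.col("xml")))
# A writeStream sink is still needed to start the query.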

How to list all databases and tables in AWS Glue Catalog?

Submitted by 孤街醉人 on 2019-12-05 09:50:22
I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in the gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.

What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing from awsglue.context with no success. I spent several hours trying to find some information about the CatalogConnection class but haven't found anything (even in the aws-glue-libs repository, https://github.com/awslabs/aws-glue-libs ).

In my case
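A minimal sketch of one workaround (an assumption on my part, not something the Glue docs or the thread necessarily recommend): query the Glue Data Catalog directly with boto3 instead of going through the Spark contexts.

import boto3

# Assumes the endpoint's IAM role allows glue:GetDatabases / glue:GetTables.
glue = boto3.client("glue")

# List every database and table registered in the Glue Data Catalog.
for db in glue.get_databases()["DatabaseList"]:
    for tbl in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        print(db["Name"], tbl["Name"])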

Apache Spark OutOfMemoryError (HeapSpace)

Submitted by 天涯浪子 on 2019-12-05 09:29:21
I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If that is the case, the entire column is set to missing (null) for that group.

df = spark.read.parquet('path/to/parquet/')

check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():
    total = (df
             .groupBy('groupID').count()
             .toDF('groupID', 'n_total')
             )
    missing = (df
               .where(F.col(col).isNull())
               .groupBy('groupID').count()
               .toDF(
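A minimal sketch of one way to avoid running two groupBy jobs per column (my assumption about a fix, with hypothetical column names): compute every null fraction in a single aggregation and join it back once.

from pyspark.sql import functions as F

check_columns = ['col1', 'col2']  # hypothetical subset of the 8 columns

# One aggregation: the average of the 0/1 null indicator is the null fraction per group.
fractions = df.groupBy('groupID').agg(
    *[F.avg(F.col(c).isNull().cast('double')).alias(c + '_null_frac')
      for c in check_columns])

result = df.join(fractions, on='groupID')
for c in check_columns:
    result = result.withColumn(
        c, F.when(F.col(c + '_null_frac') > 0.5, None).otherwise(F.col(c)))
result = result.drop(*[c + '_null_frac' for c in check_columns])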

Passing Array to Python Spark Lit Function

Submitted by ↘锁芯ラ on 2019-12-04 23:34:18
Let's say I have a numpy array a that contains the numbers 1-10, so a is [1 2 3 4 5 6 7 8 9 10]. I also have a Python Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job, so I do the following:

df = df.withColumn("NewColumn", F.lit(a))

This doesn't work. The error is "Unsupported literal type class java.util.ArrayList". However, if I try just one element of the array, as follows, it works:

df = df.withColumn("NewColumn", F.lit(a[0]))

Is there a way I can do what I'm trying? I've been working on the task I want to complete for days and
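A minimal sketch of one common workaround (an assumption, not the thread's answer): build the literal array element by element, converting each numpy scalar to a plain Python int first, since lit() cannot handle numpy types directly.

from pyspark.sql import functions as F

# array() of individual lit() values avoids the "Unsupported literal type" error.
df = df.withColumn("NewColumn", F.array([F.lit(int(x)) for x in a]))
df.show(truncate=False)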

How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?

Submitted by 落爺英雄遲暮 on 2019-12-04 18:33:26
Here my DataFrame looks like this:

+----------------+-------------+
|   Business_Date|         Code|
+----------------+-------------+
|1539129600000000|          BSD|
|1539129600000000|          BTN|
|1539129600000000|          BVI|
|1539129600000000|          BWP|
|1539129600000000|          BYB|
+----------------+-------------+

I want to convert the Business_Date column from bigint to a timestamp value while loading the data into a Hive table. How can I do this?

You can use pyspark.sql.functions.from_unixtime(), which converts a number of seconds from the unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in
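A minimal sketch along those lines (assuming, from the 16-digit sample values, that Business_Date holds microseconds since the epoch, so it has to be divided down to seconds first):

from pyspark.sql import functions as F

df = df.withColumn(
    "Business_Date",
    F.from_unixtime((F.col("Business_Date") / 1000000).cast("bigint")).cast("timestamp"))
df.show(truncate=False)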

how to store grouped data into json in pyspark

Submitted by 我的未来我决定 on 2019-12-04 15:01:21
I am new to PySpark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key, where my key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840), PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA },
 { PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:6659079, PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND
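A minimal sketch of one way to produce that shape (an assumption based on the column names visible in the example; to_json over an array of structs needs a reasonably recent Spark, 2.2 or later): concatenate the key, collect each group's rows into structs, and serialize the collected list as JSON.

from pyspark.sql import functions as F

grouped = (df
    .withColumn("key", F.concat(F.col("div_nbr").cast("string"),
                                F.col("cust_nbr").cast("string")))
    .groupBy("key")
    .agg(F.to_json(F.collect_list(F.struct(
        "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
        "PROD_BRND", "PACK_SIZE", "QTY_UOM"))).alias("json_value")))

grouped.show(truncate=False)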