pyspark-sql

How to transform JSON strings in columns of dataframe in PySpark?

Submitted by ↘锁芯ラ on 2019-12-07 17:36:52
Question: I have a PySpark dataframe as shown below:

+--------------------+---+
|                 _c0|_c1|
+--------------------+---+
|     {"object":"F...|  0|
|     {"object":"F...|  1|
|     {"object":"F...|  2|
|     {"object":"E...|  3|
|     {"object":"F...|  4|
|     {"object":"F...|  5|
|     {"object":"F...|  6|
|     {"object":"S...|  7|
|     {"object":"F...|  8|
+--------------------+---+

The column _c0 contains a string in dictionary form:

'{"object":"F","time":"2019-07-18T15:08:16.143Z","values":[0.22124142944812775,0.2147877812385559,0.16713131964206696,0.3102800250053406,0…
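A minimal sketch of one common approach, operating on the df shown above and assuming the field names visible in the string prefix (object, time, values) and that values holds doubles:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

# Schema assumed from the visible part of the JSON string.
schema = StructType([
    StructField("object", StringType()),
    StructField("time", StringType()),
    StructField("values", ArrayType(DoubleType())),
])

# Parse the JSON string column into a struct, then pull out individual fields.
parsed = df.withColumn("parsed", F.from_json(F.col("_c0"), schema))
result = parsed.select(
    F.col("_c1"),
    F.col("parsed.object").alias("object"),
    F.col("parsed.time").alias("time"),
    F.col("parsed.values").alias("values"),
)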

How to convert empty arrays to nulls?

Submitted by 白昼怎懂夜的黑 on 2019-12-07 12:10:21
Question: I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|       []|         []|
|1112| [45, 46]|   [50, 50]|
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which does not work:

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
…
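One hedged way to get the expected output, assuming the column names shown in the question; df.na.fill only fills null/NaN scalar columns, so it cannot touch empty arrays:

from pyspark.sql import functions as F

# Replace empty arrays with null by checking the array size explicitly.
result = df.withColumn(
    "count(AS)",
    F.when(F.size(F.col("count(AS)")) == 0, F.lit(None)).otherwise(F.col("count(AS)")),
).withColumn(
    "count(asdr)",
    F.when(F.size(F.col("count(asdr)")) == 0, F.lit(None)).otherwise(F.col("count(asdr)")),
)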

How to read streaming data in XML format from Kafka?

Submitted by 笑着哭i on 2019-12-07 06:56:26
Question: I am trying to read XML data from a Kafka topic using Spark Structured Streaming. I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming? My current code:

df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option…
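A minimal sketch of one possible workaround, not the spark-xml reader itself: read the Kafka value as a string and parse it with a plain Python UDF. The topic name test and the element name SomeField are placeholders:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import xml.etree.ElementTree as ET

# Read the raw Kafka records as a streaming source.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "test")
       .load())

# Parse each XML payload in an ordinary UDF; "SomeField" is a placeholder element.
def extract_field(xml_string):
    try:
        return ET.fromstring(xml_string).findtext("SomeField")
    except ET.ParseError:
        return None

extract_field_udf = F.udf(extract_field, StringType())

parsed = (raw.selectExpr("CAST(value AS STRING) AS xml")
             .withColumn("some_field", extract_field_udf(F.col("xml"))))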

How to list all databases and tables in AWS Glue Catalog?

Submitted by 不打扰是莪最后的温柔 on 2019-12-07 06:45:03
Question: I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in the gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work. What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing from awsglue.context with no success.

Answer 1: I spent several hours trying to find some info about the CatalogConnection class but haven…
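A minimal sketch using the boto3 Glue client rather than the CatalogConnection class; the region name is a placeholder:

import boto3

# List every database and table registered in the Glue Data Catalog.
glue = boto3.client("glue", region_name="us-east-1")

for page in glue.get_paginator("get_databases").paginate():
    for db in page["DatabaseList"]:
        db_name = db["Name"]
        print("Database:", db_name)
        for table_page in glue.get_paginator("get_tables").paginate(DatabaseName=db_name):
            for table in table_page["TableList"]:
                print("  Table:", table["Name"])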

ValueError: Cannot convert column into bool

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-07 05:56:41
Question: I'm trying to build a new column on a dataframe as below:

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()

But I get:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
  File "<stdin>", line 38, in…
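The if statement forces the Column objects into booleans, which raises the ValueError. Two hedged alternatives: wrap the function as a UDF, or rewrite the condition with Column expressions.

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

# Option 1: register the Python function as a UDF so it runs row by row.
calc_dif_udf = F.udf(lambda x, y: x - y if (x > y and x == 1) else None, LongType())
dfNew = df.withColumn("calc", calc_dif_udf(df["_1"], df["_2"]))

# Option 2: express the same condition directly with Column expressions;
# when() without otherwise() yields null when the condition is false.
dfNew2 = df.withColumn(
    "calc",
    F.when((df["_1"] > df["_2"]) & (df["_1"] == 1), df["_1"] - df["_2"]),
)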

null value and countDistinct with spark dataframe

Submitted by 百般思念 on 2019-12-07 03:36:08
Question: I have a very simple dataframe:

df = spark.createDataFrame([(None, 1, 3), (2, 1, 3), (2, 1, 3)], ['a', 'b', 'c'])

+----+---+---+
|   a|  b|  c|
+----+---+---+
|null|  1|  3|
|   2|  1|  3|
|   2|  1|  3|
+----+---+---+

When I apply countDistinct on this dataframe, I find different results depending on the method.

First method:

df.distinct().count()
2

It's the result I expect: the last 2 rows are identical, but the first one is distinct from the other 2 (because of the null value).

Second method:

import pyspark.sql…
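A short sketch illustrating the difference being asked about: countDistinct skips rows that contain a null in any of its arguments, while distinct().count() treats the null row as its own distinct value.

from pyspark.sql import functions as F

df = spark.createDataFrame([(None, 1, 3), (2, 1, 3), (2, 1, 3)], ['a', 'b', 'c'])

# distinct().count() keeps the null row as a separate distinct row -> 2.
print(df.distinct().count())

# countDistinct drops any row with a null among its arguments -> 1.
df.select(F.countDistinct('a', 'b', 'c')).show()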

use SQL inside AWS Glue pySpark script

Submitted by 独自空忆成欢 on 2019-12-07 01:50:40
Question: I want to use AWS Glue to convert some CSV data to ORC. The ETL job I created generated the following PySpark script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'…
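A minimal sketch of how SQL is commonly mixed into such a script, continuing from the generated boilerplate above; the database, table, column, and S3 path names are placeholders:

from awsglue.dynamicframe import DynamicFrame

# Load the source table from the catalog (names are placeholders).
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_csv_table")

# Convert to a Spark DataFrame, expose it to SQL, and query it.
dynamic_frame.toDF().createOrReplaceTempView("my_csv_table")
filtered = spark.sql("SELECT * FROM my_csv_table WHERE some_column IS NOT NULL")

# Wrap the result back into a DynamicFrame before writing it out as ORC.
result = DynamicFrame.fromDF(filtered, glueContext, "result")
glueContext.write_dynamic_frame.from_options(
    frame=result,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/orc-output/"},
    format="orc",
)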

Passing Array to Python Spark Lit Function

Submitted by ♀尐吖头ヾ on 2019-12-06 17:57:14
Question: Let's say I have a numpy array a that contains the numbers 1-10, so a is [1 2 3 4 5 6 7 8 9 10]. Now, I also have a Python Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job, so I do the following:

df = df.withColumn("NewColumn", F.lit(a))

This doesn't work. The error is "Unsupported literal type class java.util.ArrayList". Now, if I try just one element of the array, as follows, it works:

df = df.withColumn("NewColumn", F.lit(a[0]))

Is…
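One hedged workaround, reusing the df from the question: build the array column element by element, converting each numpy scalar to a plain Python int first.

import numpy as np
from pyspark.sql import functions as F

a = np.arange(1, 11)

# Construct an array column from per-element literals; casting to int avoids
# handing numpy scalar types to F.lit.
df = df.withColumn("NewColumn", F.array([F.lit(int(x)) for x in a]))

# Depending on the Spark version, F.lit may or may not accept a Python list
# directly; the element-wise F.array construction works across versions.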

Run Pyspark and Kafka in Jupyter Notebook

Submitted by 时光毁灭记忆、已成空白 on 2019-12-06 16:47:33
I could run this example in the terminal. My terminal command is:

bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 examples/src/main/python/sql/streaming/structured_kafka_wordcount.py localhost:9092 subscribe test

Now I want to run it in a Jupyter Python notebook. I tried to follow this (I could run the code in the link), but in my case it failed. The following is my code:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
from pyspark.sql import SparkSession
from pyspark.sql.functions import…
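A minimal sketch of the usual ordering, assuming no SparkSession is already running in the notebook kernel when the environment variable is set; the app name is a placeholder.

import os

# Set the submit args before any SparkSession (and its JVM) is created;
# setting them after a session already exists has no effect.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredKafkaWordCount").getOrCreate()

# Mirror the spark-submit example: subscribe to the "test" topic on localhost.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "test")
         .load()
         .selectExpr("CAST(value AS STRING)"))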

pyspark use dataframe inside udf

Submitted by ぐ巨炮叔叔 on 2019-12-06 15:56:16
I have two dataframes. df1:

+---+---+----------+
|  n|val| distances|
+---+---+----------+
|  1|  1|0.27308652|
|  2|  1|0.24969208|
|  3|  1|0.21314497|
+---+---+----------+

and df2:

+---+---+----------+
| x1| x2|         w|
+---+---+----------+
|  1|  2|0.03103427|
|  1|  4|0.19012526|
|  1| 10|0.26805446|
|  1|  8|0.26825935|
+---+---+----------+

I want to add a new column to df1 called gamma, which will contain the sum of the w value from df2 when df1.n == df2.x1 OR df1.n == df2.x2. I tried to use a udf, but apparently selecting from the different dataframe will not work, because values should be determined before…
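A hedged sketch that avoids a UDF entirely: join df2 onto df1 with the OR condition, aggregate w per n, and join the sums back.

from pyspark.sql import functions as F

# Left join keeps rows of df1 that match nothing in df2 (their gamma stays null).
matches = df1.join(
    df2, (df1["n"] == df2["x1"]) | (df1["n"] == df2["x2"]), "left"
)

# Sum w per n, then attach the result to df1 as the gamma column.
gamma = matches.groupBy("n").agg(F.sum("w").alias("gamma"))
result = df1.join(gamma, on="n", how="left")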