apache-spark-sql

Pyspark explode json string

Submitted by 拜拜、爱过 on 2021-01-29 08:04:04
Question: Input_dataframe:

    id  | name  | collection
    111 | aaaaa | {"1":{"city":"city_1","state":"state_1","country":"country_1"}, "2":{"city":"city_2","state":"state_2","country":"country_2"}, "3":{"city":"city_3","state":"state_3","country":"country_3"} }
    222 | bbbbb | {"1":{"city":"city_1","state":"state_1","country":"country_1"}, "2":{"city":"city_2","state":"state_2","country":"country_2"}, "3":{"city":"city_3","state":"state_3","country":"country_3"} }

Here:

    id         ==> string
    name       ==> string
    collection ==> string
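Each collection value is a map-shaped JSON string. A minimal PySpark sketch of one common approach, not taken from the original thread: parse collection with from_json into a map<string, struct> and explode the map. The schema, the one-row stand-in dataframe, and names such as inner and exploded are illustrative assumptions.

    # Hedged sketch: assumes every "collection" value follows the
    # {"<key>": {"city":..., "state":..., "country":...}} shape shown above.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("explode-json-string").getOrCreate()

    df = spark.createDataFrame(
        [("111", "aaaaa",
          '{"1":{"city":"city_1","state":"state_1","country":"country_1"},'
          '"2":{"city":"city_2","state":"state_2","country":"country_2"}}')],
        ["id", "name", "collection"],
    )

    # Target type for the JSON string: map<string, struct<city, state, country>>.
    inner = StructType([
        StructField("city", StringType()),
        StructField("state", StringType()),
        StructField("country", StringType()),
    ])

    exploded = (
        df.withColumn("collection", F.from_json("collection", MapType(StringType(), inner)))
          .select("id", "name", F.explode("collection").alias("key", "value"))
          .select("id", "name", "key", "value.city", "value.state", "value.country")
    )
    exploded.show(truncate=False)

Each map entry becomes its own row, so id 111 ends up with one row per key ("1", "2", ...) and the city/state/country fields as top-level columns.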

How to store nested custom objects in Spark Dataset?

Submitted by a 夏天 on 2021-01-29 07:48:20
Question: This question is a follow-up of "How to store custom objects in Dataset?". Spark version: 3.0.1. A non-nested custom type is achievable:

    import spark.implicits._
    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)

    implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
    val d = spark.createDataset(Seq(new AnObj(1, "a")))

    d.printSchema
    root
     |-- value: binary (nullable = true)

However, if the custom type is nested inside a product type (i.e. a case class), it …

Select few columns from nested array of struct from a Dataframe in Scala

Submitted by 99封情书 on 2021-01-29 06:50:28
Question: I have a dataframe with an array of structs, and inside that another array of structs. Is there an easy way to select a few of the structs in the main array and also a few in the nested array without disturbing the structure of the entire dataframe?

SIMPLE INPUT:

    -MainArray
    ---StructCol1
    ---StructCol2
    ---StructCol3
    ---SubArray
    ------SubArrayStruct4
    ------SubArrayStruct5
    ------SubArrayStruct6

SIMPLE OUTPUT:

    -MainArray
    ---StructCol1
    ---StructCol2
    ---SubArray
    ------SubArrayStruct4
    ------SubArrayStruct5

The source …
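The question is asked in Scala, but the technique translates directly; here is a hedged PySpark sketch using the transform higher-order function (Spark 2.4+) to rebuild both arrays with only the wanted fields. The tiny integer-typed dataframe is a stand-in for the real schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("prune-nested-structs").getOrCreate()

    # Stand-in data matching the SIMPLE INPUT layout above (field types are assumed).
    df = spark.createDataFrame(
        [([(1, 2, 3, [(4, 5, 6)])],)],
        "MainArray array<struct<StructCol1:int, StructCol2:int, StructCol3:int, "
        "SubArray:array<struct<SubArrayStruct4:int, SubArrayStruct5:int, SubArrayStruct6:int>>>>",
    )

    # Rebuild each struct keeping only the wanted fields; the nested array is
    # rebuilt the same way, so the array-of-struct structure survives intact.
    pruned = df.withColumn(
        "MainArray",
        F.expr("""
            transform(MainArray, m -> struct(
                m.StructCol1 as StructCol1,
                m.StructCol2 as StructCol2,
                transform(m.SubArray, s -> struct(
                    s.SubArrayStruct4 as SubArrayStruct4,
                    s.SubArrayStruct5 as SubArrayStruct5
                )) as SubArray
            ))
        """),
    )
    pruned.printSchema()

printSchema() should show the SIMPLE OUTPUT shape: StructCol3 and SubArrayStruct6 are gone, everything else keeps its position.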

Pyspark: dynamically generate condition for when() clause during runtime

Submitted by 孤人 on 2021-01-29 06:37:29
Question: I have read a CSV file into a pyspark dataframe. If I apply conditions in a when() clause, it works fine when the conditions are given before runtime.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions
    from pyspark.sql.functions import col

    sc = SparkContext('local', 'example')
    sql_sc = SQLContext(sc)
    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header

    # Sample content of csv file
    # col1,value
    # 1,aa
    …
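A minimal sketch of one way to build the when() chain at runtime, assuming the conditions arrive as (column, operator, value, label) tuples; the rules list, OPS table, and build_when helper are illustrative names, not from the post.

    import operator

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
    df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

    OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}

    # Rules known only at runtime: (source column, operator, comparison value, label).
    rules = [("col1", "==", 1, "one"), ("col1", ">", 2, "big")]

    def build_when(rules, default=None):
        """Fold the rule list into a single chained when()/otherwise() Column."""
        expr = None
        for col_name, op, value, label in rules:
            cond = OPS[op](F.col(col_name), value)
            expr = F.when(cond, label) if expr is None else expr.when(cond, label)
        return expr.otherwise(default) if expr is not None else F.lit(default)

    df.withColumn("label", build_when(rules, default="other")).show()

Because each rule only produces Column objects, the same helper works whether the rules come from a config file, a database table, or user input.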

How to read a CSV file with multiple delimiter in spark

Submitted by 本小妞迷上赌 on 2021-01-29 06:00:07
Question: I am trying to read a CSV file using Spark 1.6:

    s.no|Name$id|designation|salry
    1   |abc$12 |xxx        |yyy

    val df = spark.read.format("csv")
      .option("header","true")
      .option("delimiter","|")
      .load("path")

If I also add "$" as a delimiter, it throws an error saying only one delimiter is permitted.

Answer 1: You can apply the operation once the dataframe is created after reading it from the source with the primary delimiter (I am referring to "|" as the primary delimiter for better understanding). You can do something like below: sc is the SparkSession …
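A hedged PySpark sketch of the idea the answer describes: read with the primary delimiter "|", then split the combined "Name$id" column on the secondary delimiter "$". The in-memory dataframe stands in for the CSV read, and s_no replaces the dotted s.no header purely for convenience.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("multi-delimiter-csv").getOrCreate()

    # Stand-in for:
    # spark.read.format("csv").option("header","true").option("delimiter","|").load("path")
    df = spark.createDataFrame([("1", "abc$12", "xxx", "yyy")],
                               ["s_no", "Name$id", "designation", "salry"])

    # Handle the secondary delimiter after the read: split "Name$id" on "$"
    # (escaped, since split() takes a regex) and drop the combined column.
    parts = F.split(F.col("Name$id"), r"\$")
    df2 = (df.withColumn("Name", parts.getItem(0))
             .withColumn("id", parts.getItem(1))
             .drop("Name$id"))
    df2.show()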

Dividing dataframes in pyspark

Submitted by 不想你离开。 on 2021-01-29 05:33:59
Question: Following up on this question and its dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas, I used the line of code:

    teste_2 = (value/value.groupby(level=0).sum())

In pyspark I tried several solutions; the first one was:

    df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

    TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was:

    df_2 = (df.filter …
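Element-wise arithmetic between two dataframes is not defined in Spark, which is what the TypeError is saying. A minimal sketch of the usual substitute for pandas' value / value.groupby(...).sum(): compute the per-group sum with a window and divide. The column names age, count, and share are assumptions based on the snippet above.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("divide-by-group-sum").getOrCreate()
    df = spark.createDataFrame([(20, 2), (20, 6), (30, 4)], ["age", "count"])

    # Per-row division by the group total, without leaving the original dataframe.
    w = Window.partitionBy("age")
    df_2 = df.withColumn("share", F.col("count") / F.sum("count").over(w))
    df_2.show()

A groupBy().agg() plus a join back on "age" gives the same result; the window version just avoids the explicit join.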

pyspark-strange behavior of count function inside agg

Submitted by ぃ、小莉子 on 2021-01-29 02:52:43
Question: I am using Spark 2.4.0 and am observing a strange behavior while using the count function to aggregate.

    from pyspark.sql import functions as F
    tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
    tst.show()
    +----+----+
    |col1|col2|
    +----+----+
    |   1|   2|
    |   1|   5|
    |   2|null|
    |   2|   3|
    |   3|null|
    |   3|null|
    +----+----+

    tst.groupby('col1').agg(F.count('col2')).show()
    +----+-----------+
    |col1|count(col2)|
    +----+-----------+
    |   1|          2|
    |   3|          0|
    |   2|          1|
    +----+-----------+
    …
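For context, count() over a column skips nulls, which is why groups 2 and 3 report 1 and 0 rather than 2; counting the rows themselves needs count("*") (or count(lit(1))). A short sketch reproducing both results on the same data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("count-vs-nulls").getOrCreate()
    tst = spark.createDataFrame(
        [(1, 2), (1, 5), (2, None), (2, 3), (3, None), (3, None)],
        schema=["col1", "col2"],
    )

    tst.groupby("col1").agg(
        F.count("col2").alias("non_null_col2"),  # nulls skipped: 2, 1, 0
        F.count("*").alias("all_rows"),          # every row counted: 2, 2, 2
    ).show()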

add parent column name as prefix to avoid ambiguity

Submitted by ≯℡__Kan透↙ on 2021-01-28 21:59:16
Question: Check the code below. It generates a dataframe with ambiguity if duplicate keys are present. How should we modify the code to add the parent column name as a prefix? Another column with JSON data has been added.

    scala> val df = Seq(
      (77, "email1", """{"key1":38,"key3":39}""", """{"name":"aaa","age":10}"""),
      (78, "email2", """{"key1":38,"key4":39}""", """{"name":"bbb","age":20}"""),
      (178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""", """{"name":"ccc","age":30}"""),
      (179, …
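The original is Scala, but the prefixing idea can be sketched in PySpark: parse each JSON column, then re-alias every nested field as "<parentColumn>_<field>" so duplicate keys stay unambiguous. The column names (json_data, person), the fixed schemas, and the prefixed() helper are all illustrative assumptions; the excerpt does not show the real names.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("prefix-parent-column").getOrCreate()

    df = spark.createDataFrame(
        [(77, "email1", '{"key1":38,"key3":39}', '{"name":"aaa","age":10}')],
        ["id", "email", "json_data", "person"],
    )

    # Fixed schemas for illustration; the original post derives them from the data.
    parsed = (df
              .withColumn("json_data", F.from_json("json_data", "key1 string, key3 string, key4 string"))
              .withColumn("person", F.from_json("person", "name string, age int")))

    def prefixed(frame, parent):
        """One aliased column per struct field, named <parent>_<field>."""
        fields = frame.schema[parent].dataType.fieldNames()
        return [F.col(parent + "." + f).alias(parent + "_" + f) for f in fields]

    result = parsed.select("id", "email",
                           *prefixed(parsed, "json_data"),
                           *prefixed(parsed, "person"))
    result.show()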
