apache-spark-sql

Pyspark explode json string

Submitted by 拜拜、爱过 on 2021-01-29 08:04:04
Question: Input_dataframe:

    id  | name  | collection
    111 | aaaaa | {"1":{"city":"city_1","state":"state_1","country":"country_1"}, "2":{"city":"city_2","state":"state_2","country":"country_2"}, "3":{"city":"city_3","state":"state_3","country":"country_3"} }
    222 | bbbbb | {"1":{"city":"city_1","state":"state_1","country":"country_1"}, "2":{"city":"city_2","state":"state_2","country":"country_2"}, "3":{"city":"city_3","state":"state_3","country":"country_3"} }

Here:

    id         ==> string
    name       ==> string
    collection ==> string
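Each collection value is a map-shaped JSON string. A minimal PySpark sketch of one common approach, not taken from the original thread: parse collection with from_json into a map<string, struct> and explode the map. The schema, the one-row stand-in dataframe, and names such as inner and exploded are illustrative assumptions.

    # Hedged sketch: assumes every "collection" value follows the
    # {"<key>": {"city":..., "state":..., "country":...}} shape shown above.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("explode-json-string").getOrCreate()

    df = spark.createDataFrame(
        [("111", "aaaaa",
          '{"1":{"city":"city_1","state":"state_1","country":"country_1"},'
          '"2":{"city":"city_2","state":"state_2","country":"country_2"}}')],
        ["id", "name", "collection"],
    )

    # Target type for the JSON string: map<string, struct<city, state, country>>.
    inner = StructType([
        StructField("city", StringType()),
        StructField("state", StringType()),
        StructField("country", StringType()),
    ])

    exploded = (
        df.withColumn("collection", F.from_json("collection", MapType(StringType(), inner)))
          .select("id", "name", F.explode("collection").alias("key", "value"))
          .select("id", "name", "key", "value.city", "value.state", "value.country")
    )
    exploded.show(truncate=False)

Each map entry becomes its own row, so id 111 ends up with one row per key ("1", "2", ...) and the city/state/country fields as top-level columns.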

How to store nested custom objects in Spark Dataset?

Submitted by a 夏天 on 2021-01-29 07:48:20
Question: This question is a follow-up of "How to store custom objects in Dataset?". Spark version: 3.0.1. A non-nested custom type is achievable:

    import spark.implicits._
    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)

    implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
    val d = spark.createDataset(Seq(new AnObj(1, "a")))

    d.printSchema
    root
     |-- value: binary (nullable = true)

However, if the custom type is nested inside a product type (i.e. a case class), it …

Select few columns from nested array of struct from a Dataframe in Scala

Submitted by 99封情书 on 2021-01-29 06:50:28
Question: I have a dataframe with an array of structs, and inside that another array of structs. Is there an easy way to select a few of the structs in the main array and also a few in the nested array without disturbing the structure of the entire dataframe?

SIMPLE INPUT:

    -MainArray
    ---StructCol1
    ---StructCol2
    ---StructCol3
    ---SubArray
    ------SubArrayStruct4
    ------SubArrayStruct5
    ------SubArrayStruct6

SIMPLE OUTPUT:

    -MainArray
    ---StructCol1
    ---StructCol2
    ---SubArray
    ------SubArrayStruct4
    ------SubArrayStruct5

The source …
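The question is asked in Scala, but the technique translates directly; here is a hedged PySpark sketch using the transform higher-order function (Spark 2.4+) to rebuild both arrays with only the wanted fields. The tiny integer-typed dataframe is a stand-in for the real schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("prune-nested-structs").getOrCreate()

    # Stand-in data matching the SIMPLE INPUT layout above (field types are assumed).
    df = spark.createDataFrame(
        [([(1, 2, 3, [(4, 5, 6)])],)],
        "MainArray array<struct<StructCol1:int, StructCol2:int, StructCol3:int, "
        "SubArray:array<struct<SubArrayStruct4:int, SubArrayStruct5:int, SubArrayStruct6:int>>>>",
    )

    # Rebuild each struct keeping only the wanted fields; the nested array is
    # rebuilt the same way, so the array-of-struct structure survives intact.
    pruned = df.withColumn(
        "MainArray",
        F.expr("""
            transform(MainArray, m -> struct(
                m.StructCol1 as StructCol1,
                m.StructCol2 as StructCol2,
                transform(m.SubArray, s -> struct(
                    s.SubArrayStruct4 as SubArrayStruct4,
                    s.SubArrayStruct5 as SubArrayStruct5
                )) as SubArray
            ))
        """),
    )
    pruned.printSchema()

printSchema() should show the SIMPLE OUTPUT shape: StructCol3 and SubArrayStruct6 are gone, everything else keeps its position.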

Pyspark: dynamically generate condition for when() clause during runtime

Submitted by 孤人 on 2021-01-29 06:37:29
Question: I have read a CSV file into a pyspark dataframe. If I apply conditions in a when() clause, it works fine when the conditions are given before runtime.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions
    from pyspark.sql.functions import col

    sc = SparkContext('local', 'example')
    sql_sc = SQLContext(sc)
    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header

    # Sample content of csv file
    # col1,value
    # 1,aa
    …
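A minimal sketch of one way to build the when() chain at runtime, assuming the conditions arrive as (column, operator, value, label) tuples; the rules list, OPS table, and build_when helper are illustrative names, not from the post.

    import operator

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
    df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

    OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}

    # Rules known only at runtime: (source column, operator, comparison value, label).
    rules = [("col1", "==", 1, "one"), ("col1", ">", 2, "big")]

    def build_when(rules, default=None):
        """Fold the rule list into a single chained when()/otherwise() Column."""
        expr = None
        for col_name, op, value, label in rules:
            cond = OPS[op](F.col(col_name), value)
            expr = F.when(cond, label) if expr is None else expr.when(cond, label)
        return expr.otherwise(default) if expr is not None else F.lit(default)

    df.withColumn("label", build_when(rules, default="other")).show()

Because each rule only produces Column objects, the same helper works whether the rules come from a config file, a database table, or user input.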

How to read a CSV file with multiple delimiter in spark

Submitted by 本小妞迷上赌 on 2021-01-29 06:00:07
Question: I am trying to read a CSV file using Spark 1.6:

    s.no|Name$id|designation|salry
    1   |abc$12 |xxx        |yyy

    val df = spark.read.format("csv")
      .option("header","true")
      .option("delimiter","|")
      .load("path")

If I also add "$" as a delimiter, it throws an error saying only one delimiter is permitted.

Answer 1: You can apply the operation once the dataframe is created after reading it from the source with the primary delimiter (I am referring to "|" as the primary delimiter for better understanding). You can do something like below: sc is the SparkSession …
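A hedged PySpark sketch of the idea the answer describes: read with the primary delimiter "|", then split the combined "Name$id" column on the secondary delimiter "$". The in-memory dataframe stands in for the CSV read, and s_no replaces the dotted s.no header purely for convenience.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("multi-delimiter-csv").getOrCreate()

    # Stand-in for:
    # spark.read.format("csv").option("header","true").option("delimiter","|").load("path")
    df = spark.createDataFrame([("1", "abc$12", "xxx", "yyy")],
                               ["s_no", "Name$id", "designation", "salry"])

    # Handle the secondary delimiter after the read: split "Name$id" on "$"
    # (escaped, since split() takes a regex) and drop the combined column.
    parts = F.split(F.col("Name$id"), r"\$")
    df2 = (df.withColumn("Name", parts.getItem(0))
             .withColumn("id", parts.getItem(1))
             .drop("Name$id"))
    df2.show()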

Dividing dataframes in pyspark

Submitted by 不想你离开。 on 2021-01-29 05:33:59
Question: Following up on this question and its dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas, I used the line of code:

    teste_2 = (value/value.groupby(level=0).sum())

In pyspark I tried several solutions; the first one was:

    df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

    TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was:

    df_2 = (df.filter …
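Element-wise arithmetic between two dataframes is not defined in Spark, which is what the TypeError is saying. A minimal sketch of the usual substitute for pandas' value / value.groupby(...).sum(): compute the per-group sum with a window and divide. The column names age, count, and share are assumptions based on the snippet above.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("divide-by-group-sum").getOrCreate()
    df = spark.createDataFrame([(20, 2), (20, 6), (30, 4)], ["age", "count"])

    # Per-row division by the group total, without leaving the original dataframe.
    w = Window.partitionBy("age")
    df_2 = df.withColumn("share", F.col("count") / F.sum("count").over(w))
    df_2.show()

A groupBy().agg() plus a join back on "age" gives the same result; the window version just avoids the explicit join.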

pyspark-strange behavior of count function inside agg

Submitted by ぃ、小莉子 on 2021-01-29 02:52:43
Question: I am using Spark 2.4.0 and am observing a strange behavior while using the count function to aggregate.

    from pyspark.sql import functions as F
    tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
    tst.show()
    +----+----+
    |col1|col2|
    +----+----+
    |   1|   2|
    |   1|   5|
    |   2|null|
    |   2|   3|
    |   3|null|
    |   3|null|
    +----+----+

    tst.groupby('col1').agg(F.count('col2')).show()
    +----+-----------+
    |col1|count(col2)|
    +----+-----------+
    |   1|          2|
    |   3|          0|
    |   2|          1|
    +----+-----------+
    …
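For context, count() over a column skips nulls, which is why groups 2 and 3 report 1 and 0 rather than 2; counting the rows themselves needs count("*") (or count(lit(1))). A short sketch reproducing both results on the same data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("count-vs-nulls").getOrCreate()
    tst = spark.createDataFrame(
        [(1, 2), (1, 5), (2, None), (2, 3), (3, None), (3, None)],
        schema=["col1", "col2"],
    )

    tst.groupby("col1").agg(
        F.count("col2").alias("non_null_col2"),  # nulls skipped: 2, 1, 0
        F.count("*").alias("all_rows"),          # every row counted: 2, 2, 2
    ).show()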

add parent column name as prefix to avoid ambiguity

Submitted by ≯℡__Kan透↙ on 2021-01-28 21:59:16
Question: Check the code below. It generates a dataframe with ambiguity if duplicate keys are present. How should we modify the code to add the parent column name as a prefix? Another column with JSON data has been added.

    scala> val df = Seq(
      (77, "email1", """{"key1":38,"key3":39}""", """{"name":"aaa","age":10}"""),
      (78, "email2", """{"key1":38,"key4":39}""", """{"name":"bbb","age":20}"""),
      (178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""", """{"name":"ccc","age":30}"""),
      (179, …
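The original is Scala, but the prefixing idea can be sketched in PySpark: parse each JSON column, then re-alias every nested field as "<parentColumn>_<field>" so duplicate keys stay unambiguous. The column names (json_data, person), the fixed schemas, and the prefixed() helper are all illustrative assumptions; the excerpt does not show the real names.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("prefix-parent-column").getOrCreate()

    df = spark.createDataFrame(
        [(77, "email1", '{"key1":38,"key3":39}', '{"name":"aaa","age":10}')],
        ["id", "email", "json_data", "person"],
    )

    # Fixed schemas for illustration; the original post derives them from the data.
    parsed = (df
              .withColumn("json_data", F.from_json("json_data", "key1 string, key3 string, key4 string"))
              .withColumn("person", F.from_json("person", "name string, age int")))

    def prefixed(frame, parent):
        """One aliased column per struct field, named <parent>_<field>."""
        fields = frame.schema[parent].dataType.fieldNames()
        return [F.col(parent + "." + f).alias(parent + "_" + f) for f in fields]

    result = parsed.select("id", "email",
                           *prefixed(parsed, "json_data"),
                           *prefixed(parsed, "person"))
    result.show()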
