spark-dataframe

Add columns to a dataframe dynamically, with column names taken from the elements of a List

Submitted by 别等时光非礼了梦想 on 2021-02-08 08:06:03
Question: I have a List of N elements, e.g. val check = List("a","b","c","d"), where N can be any number of elements. I have a dataframe with a single column called "value". Based on the contents of "value" I need to create N columns, with the column names taken from the elements of the list and the column contents as substring(x,y). Please consider substring(x,y) where x and y are numbers derived from some metadata. I have tried all the ways I could think of, such as withColumn and selectExpr, and nothing works. Below are the different pieces of code I tried.
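
The excerpt cuts off before the attempted code, but a common pattern for this requirement is to fold the list over the dataframe, adding one column per element. A minimal sketch, assuming a dataframe named df with a "value" column; the offsets map is purely illustrative and would really come from the metadata:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.substring

val check = List("a", "b", "c", "d")
// Hypothetical (x, y) positions per column name, standing in for the metadata lookup.
val offsets = Map("a" -> (1, 2), "b" -> (3, 2), "c" -> (5, 2), "d" -> (7, 2))

// Fold over the list, adding one substring column per element.
val result: DataFrame = check.foldLeft(df) { (acc, name) =>
  val (x, y) = offsets(name)
  acc.withColumn(name, substring(acc("value"), x, y))
}

Because foldLeft threads the intermediate dataframe through each step, the number of added columns simply tracks the length of the list, with no hard-coded column names.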

Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

Submitted by 假装没事ソ on 2021-02-08 07:57:43
Question: I have a dataframe with a key column and a column which holds an array of structs. The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can have any number of null values. I would like to create a final dataframe in which the array contains none of the null values, using Spark 1.6. An example would be: Key . Value 1010
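
Spark 1.6 has no built-in higher-order function for filtering array elements, so the usual workaround is a UDF. A minimal sketch, assuming the dataframe is called df and the struct has exactly the name/age fields shown in the schema; the array of structs reaches the UDF as a Seq of Rows, and a case class is used as the return type so Spark can infer the result schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Desc(name: String, age: Long)

// Drop the null elements and rebuild each surviving struct as a case class instance.
val dropNulls = udf { (xs: Seq[Row]) =>
  xs.filter(_ != null).map(r => Desc(r.getAs[String]("name"), r.getAs[Long]("age")))
}

val cleaned = df.withColumn("desc", dropNulls(df("desc")))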

Change Decimal Precision of all 'Double type' Columns in a Spark Dataframe

Submitted by 最后都变了- on 2021-02-08 07:24:35
Question: I have a Spark DataFrame, let's say 'df'. I run the following simple aggregation on this DataFrame:

df.groupBy().sum()

Upon doing so, I get the following exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any way I can fix this? My guess is that if I can decrease the decimal precision of all the columns of double type in df, it would solve the problem. Source: https://stackoverflow.com/questions/46462377/change-decimal-precision
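
The question talks about double-type columns, but the exception itself comes from DecimalType arithmetic: Spark reserves 10 extra digits of precision for a decimal sum, and 38 is the hard maximum. One way to act on the excerpt's own guess is to cast the offending columns down before aggregating. A minimal sketch, assuming the problem columns are DecimalType and that precision 28 (so 28 + 10 <= 38) is acceptable for the data; the numbers are illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DecimalType

// Cast every DecimalType column down to precision 28, keeping each column's scale.
def lowerDecimalPrecision(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, f) =>
    f.dataType match {
      case d: DecimalType =>
        acc.withColumn(f.name, acc(f.name).cast(DecimalType(28, math.min(d.scale, 28))))
      case _ => acc
    }
  }

val summed = lowerDecimalPrecision(df).groupBy().sum()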

Json file to pyspark dataframe

Submitted by 99封情书 on 2021-02-08 06:14:09
Question: I'm trying to work with a JSON file in a Spark (pyspark) environment. Problem: unable to convert the JSON to the expected format in a PySpark dataframe. 1st input data set: https://health.data.ny.gov/api/views/cnih-y5dw/rows.json In this file the metadata is defined at the start of the file under the tag "meta", followed by the data under the tag "data". FYI, the steps taken to download the data from the web to the local drive: 1. I downloaded the file to my local drive. 2. Then pushed it to HDFS; from there I'm reading it into Spark
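
The general approach (sketched here in Scala for consistency with the other snippets; the same reader option and functions exist in pyspark) is to read the file as a single multi-line JSON document and then explode the top-level "data" array. This assumes Spark 2.2+ for the multiLine option; the path and the field index are placeholders, not values taken from the question:

import org.apache.spark.sql.functions.{col, explode}

// rows.json is one large JSON object, so it must be read with the multiLine option.
val raw = spark.read.option("multiLine", "true").json("hdfs:///path/to/rows.json")

// One output row per element of the "data" array; each element is itself an array,
// so individual fields are picked out by position (index 8 is only an illustration).
val records = raw.select(explode(col("data")).as("rec"))
val df = records.select(col("rec").getItem(8).as("some_field"))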

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

Submitted by 六眼飞鱼酱① on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+

I can selectively replace values like so:

val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))

so that df2 looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+

but why can't I check if it's null, like:

val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))

so that I get:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
| etc| etc|
+----+----+
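
The reason the second version never matches is SQL three-valued logic: comparing a column to null with === evaluates to NULL, which is never true, so the when branch cannot fire. The standard fix is the isNull test (or the null-safe equality operator <=>). A short sketch, with an explicit otherwise so the non-null values are preserved:

import org.apache.spark.sql.functions.{col, when}

// Replace a1 with a2 only where a1 is null; keep a1 everywhere else.
val df3 = df2.withColumn("a1", when(col("a1").isNull, col("a2")).otherwise(col("a1")))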

How to check if a DataFrame was already cached/persisted before?

Submitted by 旧巷老猫 on 2021-02-07 09:57:38
Question: For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but DataFrame does not seem to expose anything similar. Any ideas?

Answer 1: You can check whether a DataFrame is cached or not using the Catalog (org.apache.spark.sql.catalog.Catalog), which was introduced in Spark 2. Code example:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()

val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
// interacting with
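
The answer's snippet is cut off at this point; a hedged continuation of the idea, reusing sparkSession and the "sales" view defined above:

// The Catalog reports cache status by table/view name.
sparkSession.catalog.cacheTable("sales")
val isCached: Boolean = sparkSession.catalog.isCached("sales") // true once the view's plan is cached

Later Spark versions (2.1+) also expose a storageLevel method directly on Dataset/DataFrame, which is closer to the RDD-style getStorageLevel the question asks about.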

Delete functionality with spark sql dataframe

Submitted by 泪湿孤枕 on 2021-02-07 08:47:36
Question: I have a requirement to load/delete specific records from a Postgres DB for my Spark application. For loading, I am using a Spark dataframe in the format below:

sqlContext.read.format("jdbc").options(Map(
  "url" -> "postgres url",
  "user" -> "user",
  "password" -> "xxxxxx",
  "table" -> "(select * from employee where emp_id > 1000) as filtered_emp"
)).load()

To delete the data, I am writing direct SQL instead of using dataframes:

delete from employee where emp_id > 1000

The question is, is there
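
The excerpt ends mid-question, but as far as the DataFrame API goes there is no DELETE operation; the JDBC data source only reads and writes (append/overwrite) tables. The usual pattern is therefore exactly what the question describes: issue the delete over a plain JDBC connection. A minimal sketch, with the URL and credentials as placeholders:

import java.sql.DriverManager

// Open a direct JDBC connection, run the delete, and close everything afterwards.
val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "xxxxxx")
try {
  val stmt = conn.createStatement()
  try stmt.executeUpdate("delete from employee where emp_id > 1000")
  finally stmt.close()
} finally {
  conn.close()
}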