spark-dataframe

Add columns to a dataframe dynamically, with column names taken from the elements of a List

Submitted by 别等时光非礼了梦想 on 2021-02-08 08:06:03
Question: I have a List of N elements, e.g. val check = List("a","b","c","d"), where N can be any number of elements. I have a dataframe with a single column called "value". Based on the contents of "value" I need to create N columns, with the column names taken from the elements of the list and the column contents as substring(x,y). Please consider substring(x,y) where x and y are numbers derived from some metadata. I have tried all the ways I could think of, such as withColumn and selectExpr, and nothing works. Below are the different pieces of code I tried.
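
The excerpt cuts off before the attempted code, but a common pattern for this requirement is to fold the list over the dataframe, adding one column per element. A minimal sketch, assuming a dataframe named df with a "value" column; the offsets map is purely illustrative and would really come from the metadata:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.substring

val check = List("a", "b", "c", "d")
// Hypothetical (x, y) positions per column name, standing in for the metadata lookup.
val offsets = Map("a" -> (1, 2), "b" -> (3, 2), "c" -> (5, 2), "d" -> (7, 2))

// Fold over the list, adding one substring column per element.
val result: DataFrame = check.foldLeft(df) { (acc, name) =>
  val (x, y) = offsets(name)
  acc.withColumn(name, substring(acc("value"), x, y))
}

Because foldLeft threads the intermediate dataframe through each step, the number of added columns simply tracks the length of the list, with no hard-coded column names.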

Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

Submitted by 假装没事ソ on 2021-02-08 07:57:43
Question: I have a dataframe with a key column and a column which holds an array of structs. The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can have any number of null values. I would like to create a final dataframe in which the array contains none of the null values, using Spark 1.6. An example would be: Key . Value 1010
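
Spark 1.6 has no built-in higher-order function for filtering array elements, so the usual workaround is a UDF. A minimal sketch, assuming the dataframe is called df and the struct has exactly the name/age fields shown in the schema; the array of structs reaches the UDF as a Seq of Rows, and a case class is used as the return type so Spark can infer the result schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Desc(name: String, age: Long)

// Drop the null elements and rebuild each surviving struct as a case class instance.
val dropNulls = udf { (xs: Seq[Row]) =>
  xs.filter(_ != null).map(r => Desc(r.getAs[String]("name"), r.getAs[Long]("age")))
}

val cleaned = df.withColumn("desc", dropNulls(df("desc")))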

Change Decimal Precision of all 'Double type' Columns in a Spark Dataframe

Submitted by 最后都变了- on 2021-02-08 07:24:35
Question: I have a Spark DataFrame, let's say 'df'. I run the following simple aggregation on this DataFrame:

df.groupBy().sum()

Upon doing so, I get the following exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any way I can fix this? My guess is that if I can decrease the decimal precision of all the columns of double type in df, it would solve the problem. Source: https://stackoverflow.com/questions/46462377/change-decimal-precision
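
The question talks about double-type columns, but the exception itself comes from DecimalType arithmetic: Spark reserves 10 extra digits of precision for a decimal sum, and 38 is the hard maximum. One way to act on the excerpt's own guess is to cast the offending columns down before aggregating. A minimal sketch, assuming the problem columns are DecimalType and that precision 28 (so 28 + 10 <= 38) is acceptable for the data; the numbers are illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DecimalType

// Cast every DecimalType column down to precision 28, keeping each column's scale.
def lowerDecimalPrecision(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, f) =>
    f.dataType match {
      case d: DecimalType =>
        acc.withColumn(f.name, acc(f.name).cast(DecimalType(28, math.min(d.scale, 28))))
      case _ => acc
    }
  }

val summed = lowerDecimalPrecision(df).groupBy().sum()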

Json file to pyspark dataframe

Submitted by 99封情书 on 2021-02-08 06:14:09
Question: I'm trying to work with a JSON file in a Spark (pyspark) environment. Problem: unable to convert the JSON to the expected format in a PySpark dataframe. 1st input data set: https://health.data.ny.gov/api/views/cnih-y5dw/rows.json In this file the metadata is defined at the start of the file under the tag "meta", followed by the data under the tag "data". FYI, the steps taken to download the data from the web to the local drive: 1. I downloaded the file to my local drive. 2. Then pushed it to HDFS; from there I'm reading it into Spark
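
The general approach (sketched here in Scala for consistency with the other snippets; the same reader option and functions exist in pyspark) is to read the file as a single multi-line JSON document and then explode the top-level "data" array. This assumes Spark 2.2+ for the multiLine option; the path and the field index are placeholders, not values taken from the question:

import org.apache.spark.sql.functions.{col, explode}

// rows.json is one large JSON object, so it must be read with the multiLine option.
val raw = spark.read.option("multiLine", "true").json("hdfs:///path/to/rows.json")

// One output row per element of the "data" array; each element is itself an array,
// so individual fields are picked out by position (index 8 is only an illustration).
val records = raw.select(explode(col("data")).as("rec"))
val df = records.select(col("rec").getItem(8).as("some_field"))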

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

Submitted by 六眼飞鱼酱① on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+

I can selectively replace values like so:

val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))

so that df2 looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+

but why can't I check if it's null, like:

val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))

so that I get:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
| etc| etc|
+----+----+
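
The reason the second version never matches is SQL three-valued logic: comparing a column to null with === evaluates to NULL, which is never true, so the when branch cannot fire. The standard fix is the isNull test (or the null-safe equality operator <=>). A short sketch, with an explicit otherwise so the non-null values are preserved:

import org.apache.spark.sql.functions.{col, when}

// Replace a1 with a2 only where a1 is null; keep a1 everywhere else.
val df3 = df2.withColumn("a1", when(col("a1").isNull, col("a2")).otherwise(col("a1")))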

How to check if a DataFrame was already cached/persisted before?

Submitted by 旧巷老猫 on 2021-02-07 09:57:38
Question: For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but DataFrame does not seem to expose anything similar. Any ideas?

Answer 1: You can check whether a DataFrame is cached or not using the Catalog (org.apache.spark.sql.catalog.Catalog), which was introduced in Spark 2. Code example:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()

val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
// interacting with
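
The answer's snippet is cut off at this point; a hedged continuation of the idea, reusing sparkSession and the "sales" view defined above:

// The Catalog reports cache status by table/view name.
sparkSession.catalog.cacheTable("sales")
val isCached: Boolean = sparkSession.catalog.isCached("sales") // true once the view's plan is cached

Later Spark versions (2.1+) also expose a storageLevel method directly on Dataset/DataFrame, which is closer to the RDD-style getStorageLevel the question asks about.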

Delete functionality with spark sql dataframe

Submitted by 泪湿孤枕 on 2021-02-07 08:47:36
Question: I have a requirement to load/delete specific records from a Postgres DB for my Spark application. For loading, I am using a Spark dataframe in the format below:

sqlContext.read.format("jdbc").options(Map(
  "url" -> "postgres url",
  "user" -> "user",
  "password" -> "xxxxxx",
  "table" -> "(select * from employee where emp_id > 1000) as filtered_emp"
)).load()

To delete the data, I am writing direct SQL instead of using dataframes:

delete from employee where emp_id > 1000

The question is, is there
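
The excerpt ends mid-question, but as far as the DataFrame API goes there is no DELETE operation; the JDBC data source only reads and writes (append/overwrite) tables. The usual pattern is therefore exactly what the question describes: issue the delete over a plain JDBC connection. A minimal sketch, with the URL and credentials as placeholders:

import java.sql.DriverManager

// Open a direct JDBC connection, run the delete, and close everything afterwards.
val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "xxxxxx")
try {
  val stmt = conn.createStatement()
  try stmt.executeUpdate("delete from employee where emp_id > 1000")
  finally stmt.close()
} finally {
  conn.close()
}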