spark-dataframe

How to merge all part files in a folder created by a Spark data frame and rename as the folder name in Scala

Submitted by ♀尐吖头ヾ on 2019-12-24 18:31:53
Question: Hi, I have output from my Spark data frame that creates a folder structure and many part files. Now I have to merge all the part files inside each folder and rename that single file to the folder path name. This is how I do the partitioning: df.write.partitionBy("DataPartition","PartitionYear") .format("csv") .option("nullValue", "") .option("header", "true") .option("codec", "gzip") .save("hdfs:///user/zeppelin/FinancialLineItem/output") It creates a folder structure like this hdfs:///user/zeppelin
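A common way to get one file per folder, named after the folder, is to re-run the write with .coalesce(1) inside each partition and then rename the single part file through the Hadoop FileSystem API. The following Scala sketch illustrates that idea; only the output path comes from the question, while the spark session (as in spark-shell/Zeppelin), the two-level directory walk, and the ".csv.gz" suffix are assumptions.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Assumes the DataFrame was written with .coalesce(1) so each leaf partition
    // directory (DataPartition=.../PartitionYear=...) holds exactly one part file.
    val outputRoot = new Path("hdfs:///user/zeppelin/FinancialLineItem/output")
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    fs.listStatus(outputRoot).filter(_.isDirectory).foreach { level1 =>
      fs.listStatus(level1.getPath).filter(_.isDirectory).foreach { level2 =>
        fs.listStatus(level2.getPath)
          .filter(_.getPath.getName.startsWith("part-"))
          .headOption
          .foreach { part =>
            // Rename the part file after its parent folder, e.g. PartitionYear=2017.csv.gz
            fs.rename(part.getPath, new Path(level2.getPath, level2.getPath.getName + ".csv.gz"))
          }
      }
    }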

How to normalize an array column in a dataframe

Submitted by 删除回忆录丶 on 2019-12-24 18:23:19
Question: I'm using Spark 2.2 and I want to normalize each value in a fixed-size array. Input: {"values": [1,2,3,4]} Output: {"values": [0.25, 0.5, 0.75, 1]} For now I'm using a UDF: val f = udf { (l: Seq[Double]) => val max = l.max; l.map(_ / max) } Is there a way to avoid the UDF (and the associated performance penalty)? Answer 1: Let's say that the number of elements in each array is n: val n: Int Then import org.apache.spark.sql.functions._ df .withColumn("max", greatest((0 until n).map(i => col("values")(i)): _*))
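Completing that UDF-free approach, a minimal Scala sketch (assuming the fixed length n = 4 and the column name "values" from the sample input) rebuilds the array by dividing each element by the computed maximum:

    import org.apache.spark.sql.functions.{array, col, greatest}

    // Fixed array length, known up front (assumption taken from the sample record).
    val n = 4

    val normalized = df
      .withColumn("max", greatest((0 until n).map(i => col("values")(i)): _*))
      .withColumn("values", array((0 until n).map(i => col("values")(i) / col("max")): _*))
      .drop("max")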

Convert an RDD to a DataFrame in Spark using Scala

Submitted by Deadly on 2019-12-24 16:19:13
Question: I have textRDD: org.apache.spark.rdd.RDD[(String, String)] and I would like to convert it to a DataFrame. The columns correspond to the title and content of each page (row). Answer 1: Use toDF(), providing the column names if you have them. val textDF = textRDD.toDF("title": String, "content": String) textDF: org.apache.spark.sql.DataFrame = [title: string, content: string] or val textDF = textRDD.toDF() textDF: org.apache.spark.sql.DataFrame = [_1: string, _2: string] The shell auto-imports (I am using
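For completeness, a self-contained sketch of the same conversion in a compiled application, where the implicits that provide toDF must be imported explicitly (the shell does this automatically, as the answer notes); the sample rows here are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // Pairs of (title, content), as described in the question.
    val textRDD = spark.sparkContext.parallelize(Seq(("Page A", "content of A"), ("Page B", "content of B")))
    val textDF = textRDD.toDF("title", "content")
    textDF.printSchema()   // title: string, content: string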

How to use a different window specification per column value?

Submitted by 吃可爱长大的小学妹 on 2019-12-24 09:48:14
Question: This is my partitionBy condition, which I need to change based on a column value from the data frame: val windowSpec = Window.partitionBy("col1", "clo2","clo3").orderBy($"Col5".desc) If the value of one of the columns (col6) in the data frame is I, then the condition above applies; but when the value of that column (col6) changes to O, then the condition below applies: val windowSpec = Window.partitionBy("col1","clo3").orderBy($"Col5".desc) How can I implement this on the Spark data frame? So it is like for each record
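One hedged way to get a different window per value of col6 is to define both window specs, evaluate a window function over each, and pick per row with when/otherwise. The sketch below uses row_number as an example window function; the column names (including the "clo2"/"clo3" spellings) are taken from the question:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number, when}

    val windowI = Window.partitionBy("col1", "clo2", "clo3").orderBy(col("Col5").desc)
    val windowO = Window.partitionBy("col1", "clo3").orderBy(col("Col5").desc)

    // Rows with col6 = "I" are ranked over the three-column window,
    // rows with col6 = "O" over the two-column window.
    val ranked = df.withColumn(
      "rank",
      when(col("col6") === "I", row_number().over(windowI))
        .otherwise(row_number().over(windowO))
    )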

PySpark: DataFrame - Convert Struct to Array

Submitted by 和自甴很熟 on 2019-12-24 07:53:01
Question: I have a dataframe with the following structure: root |-- index: long (nullable = true) |-- text: string (nullable = true) |-- topicDistribution: struct (nullable = true) | |-- type: long (nullable = true) | |-- values: array (nullable = true) | | |-- element: double (containsNull = true) |-- wiki_index: string (nullable = true) I need to change it to: root |-- index: long (nullable = true) |-- text: string (nullable = true) |-- topicDistribution: array (nullable = true) | |-- element: double
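Since the wanted array of doubles already exists as the struct's "values" field, one minimal sketch is to replace the struct column with that nested field. Shown here in Scala; the same getField call is available on the PySpark Column API:

    import org.apache.spark.sql.functions.col

    // topicDistribution becomes array<double>, dropping the struct's "type" field.
    val flattened = df.withColumn("topicDistribution", col("topicDistribution").getField("values"))
    flattened.printSchema()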

Tuple size limit in RDD; reading the RDD throws ArrayIndexOutOfBoundsException

Submitted by 老子叫甜甜 on 2019-12-24 07:41:19
Question: I tried a DF/RDD conversion for a table containing 25 columns. I then learned that Scala (until 2.11.8) has a limitation of at most 22 elements in a tuple. val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/") rdd: org.apache.spark.rdd.RDD[String] = /user/hive/warehouse/myDB.db/myTable/ MapPartitionsRDD[3] at textFile at <console>:24 Sample data: [2017-02-26, 100052-ACC, 100052, 3260, 1005, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0
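A common workaround for the 22-element tuple limit is to skip tuples entirely and build Row objects plus an explicit schema. In the sketch below only the HDFS path comes from the question; the generated column names, the string types, and the comma delimiter are assumptions, and sc/spark are the shell's SparkContext and SparkSession:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/")

    // 25 string columns named col1..col25 (placeholder names).
    val schema = StructType((1 to 25).map(i => StructField(s"col$i", StringType, nullable = true)))

    // split with limit -1 keeps trailing empty fields, so short-looking rows
    // still produce all 25 values instead of a shorter array.
    val rowRDD = rdd.map(_.split(",", -1).map(_.trim)).map(fields => Row(fields: _*))

    val df = spark.createDataFrame(rowRDD, schema)
    df.printSchema()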

Summing n columns in Spark in Java using dataframes

Submitted by 不想你离开。 on 2019-12-24 07:25:17
Question: String[] col = {"a","b","c"} Data: id a b c d e 101 1 1 1 1 1 102 2 2 2 2 2 103 3 3 3 3 3 Expected output: id with the sum of the columns named in the string array id (a+b+c) 101 3 102 6 103 9 How do I do this using dataframes? Answer 1: If you are using Java you can do the following: import org.apache.spark.SparkConf; import org.apache.spark.SparkContext; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.types
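The same idea expressed in Scala (the quoted answer itself is Java): fold the requested column names into a single sum expression. The column names come from the question; the equivalent Column arithmetic is also available through the Java Dataset API:

    import org.apache.spark.sql.functions.col

    val cols = Seq("a", "b", "c")

    // Build one Column expression a + b + c and select it next to id.
    val sumExpr = cols.map(col).reduce(_ + _)
    val result = df.select(col("id"), sumExpr.as("sum"))
    result.show()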

Partition a Spark DataFrame based on a specific column and dump the content of each partition to a CSV

Submitted by 元气小坏坏 on 2019-12-24 05:02:08
Question: I'm using the Spark 1.6.2 Java APIs to load some data into a DataFrame DF1 that looks like: Key Value A v1 A v2 B v3 A v4 Now I need to partition DF1 based on a subset of values in column "Key" and dump each partition to a CSV file (using spark-csv). Desired output: A.csv Key Value A v1 A v2 A v4 B.csv Key Value B v3 At the moment I'm building a HashMap (myList) containing the subset of values that I need to filter on, and then iterating through it, filtering on a different Key each iteration
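A hedged sketch of that loop, kept in Spark 1.6 syntax with the spark-csv writer: filter once per key and coalesce to a single part file, which still has to be renamed afterwards to obtain a literal A.csv/B.csv. The key list and output root are illustrative, and df1 stands for the DF1 from the question:

    import org.apache.spark.sql.functions.col

    val keys = Seq("A", "B")   // the subset of Key values to dump

    keys.foreach { k =>
      df1.filter(col("Key") === k)
        .coalesce(1)                              // one part file per key
        .write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save(s"/tmp/output/$k")
    }

On Spark 2.x, the built-in csv source together with write.partitionBy("Key") can replace the explicit loop, at the cost of directory names like Key=A/ instead of flat files.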

ERROR Executor: Exception in task 0.0 in stage 6.0 (Spark Scala)?

Submitted by 蓝咒 on 2019-12-24 03:15:10
Question: I have a JSON file like the one below. {"name":"method2","name1":"test","parameter1":"C:/Users/test/Desktop/Online.csv","parameter2": 1.0} I am loading the JSON file: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df = sqlContext.read.json("C:/Users/test/Desktop/data.json") val df1=df.select($"name",$"parameter1",$"parameter2").toDF() df1.show() I have 3 functions like the one below: def method1(P1:String, P2:Double) { val data = spark.read.option("header", true).csv(P1).toDF() val rs= data
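A frequent cause of "Exception in task ..." in this pattern is calling a method that itself uses spark.read from inside executor-side code (for example a map or foreach over df1), since the SparkSession is only usable on the driver. One hedged sketch: collect the handful of parameter rows to the driver and dispatch there. The dispatch on "name" is an assumption based on the JSON shown; only method1's signature appears in the question:

    // Runs on the driver; df1 holds only a few parameter rows, so collect() is cheap.
    df1.collect().foreach { row =>
      val name = row.getAs[String]("name")
      val p1   = row.getAs[String]("parameter1")
      val p2   = row.getAs[Double]("parameter2")
      if (name == "method1") method1(p1, p2)   // hypothetical dispatch to the poster's methods
    }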