spark-dataframe

Flatten Nested Spark Dataframe

Submitted by 别来无恙 on 2019-12-03 03:24:55
Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a DataFrame with different nested types (e.g. StructType, ArrayType, MapType, etc.). Say I have a schema like:

    StructType(List(StructField(field1,...), StructField(field2,...),
      ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))

I'm looking to adapt this into a flat table with a structure like:

    field1
    field2
    nested_array.nested_field1
    nested_array.nested_field2

FYI,
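One generic approach (not from the truncated post) is to walk the schema recursively and alias nested struct fields with dotted names. A minimal Scala sketch, assuming ArrayType/MapType columns are handled separately (e.g. with explode) before or after this step:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StructField, StructType}

    // Recursively collect select expressions, flattening StructType fields
    // into dotted column names; arrays and maps are passed through untouched.
    def flattenStructs(schema: StructType, prefix: String = ""): Seq[Column] =
      schema.fields.toSeq.flatMap {
        case StructField(name, inner: StructType, _, _) =>
          flattenStructs(inner, s"$prefix$name.")
        case StructField(name, _, _, _) =>
          Seq(col(prefix + name).alias(prefix + name))
      }

    def flattenDf(df: DataFrame): DataFrame = df.select(flattenStructs(df.schema): _*)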

Group By, Rank and aggregate spark data frame using pyspark

Submitted by 送分小仙女□ on 2019-12-03 03:04:33
I have a dataframe that looks like:

    A    B    C
    ---------------
    A1   B1   0.8
    A1   B2   0.55
    A1   B3   0.43
    A2   B1   0.7
    A2   B2   0.5
    A2   B3   0.5
    A3   B1   0.2
    A3   B2   0.3
    A3   B3   0.4

How do I convert the column 'C' to the relative rank (higher score -> better rank) per column A? Expected output:

    A    B    Rank
    ---------------
    A1   B1   1
    A1   B2   2
    A1   B3   3
    A2   B1   1
    A2   B2   2
    A2   B3   2
    A3   B1   3
    A3   B2   2
    A3   B3   1

The ultimate state I want to reach is to aggregate column B and store the ranks for each A. Example:

    B    Ranks
    B1   [1,1,3]
    B2   [2,2,2]
    B3   [3,2,1]

Add rank:

    from pyspark.sql.functions import *
    from pyspark.sql.window import Window
    ranked = df
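The excerpt's PySpark code is cut off; a hedged Scala sketch of the same idea, assuming df is the dataframe from the question: dense_rank over a window partitioned by A and ordered by C descending, then collect_list to gather the ranks per B:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, collect_list, dense_rank}

    // Rank C within each A group, highest C first.
    val w = Window.partitionBy("A").orderBy(col("C").desc)
    val ranked = df.withColumn("Rank", dense_rank().over(w))

    // Gather the ranks per B; note collect_list gives no ordering guarantee
    // across the A groups, so sort first if the order [A1, A2, A3] matters.
    val ranksPerB = ranked.groupBy("B").agg(collect_list("Rank").alias("Ranks"))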

Read from a hive table and write back to it using spark sql

Submitted by 我们两清 on 2019-12-03 00:17:28
I am reading a Hive table using Spark SQL and assigning it to a Scala val:

    val x = sqlContext.sql("select * from some_table")

Then I am doing some processing with the dataframe x and finally coming up with a dataframe y, which has the exact schema as the table some_table. Finally I am trying to insert overwrite the y dataframe to the same Hive table some_table:

    y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")

Then I am getting the error:

    org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from

I tried creating an insert sql
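One commonly suggested workaround is to break the read/overwrite cycle by materializing the result somewhere else first; a sketch under that assumption, with some_table_staging as a placeholder name:

    import org.apache.spark.sql.SaveMode

    // 1. Materialize the transformed data into a staging table.
    y.write.mode(SaveMode.Overwrite).saveAsTable("some_table_staging")

    // 2. Read the staging table back and overwrite the original table,
    //    so the plan no longer reads from and writes to the same table.
    sqlContext.table("some_table_staging")
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("some_table")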

Pyspark: filter dataframe by regex with string formatting?

Submitted by 故事扮演 on 2019-12-02 23:22:30
I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but was wondering whether the following is a "best practice" for using %s in the desired condition, as follows:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx
    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword like %s" % substr)
    # dk should contain rows with keyword values such as "Arizona is hot."

Note I'm
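Since the pattern is a regex rather than a SQL LIKE pattern, rlike is usually a better fit than like with %-wildcards. A hedged Scala sketch of that route, with the path as a placeholder:

    import org.apache.spark.sql.functions.col

    val inputPath = "s3://some-bucket/some-prefix/"   // placeholder
    val myExpr = "Arizona.*hot"                       // a regular expression

    val dx = sqlContext.read.parquet(inputPath)       // "keyword" is a field in dx

    // rlike applies a regex; like only understands the % and _ wildcards.
    val dk = dx.filter(col("keyword").rlike(myExpr))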

How to convert DataFrame to Dataset in Apache Spark in Java?

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-02 23:08:13
I can convert a DataFrame to a Dataset in Scala very easily:

    case class Person(name:String, age:Long)
    val df = ctx.read.json("/tmp/persons.json")
    val ds = df.as[Person]
    ds.printSchema

but in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is:

    DataFrame df = ctx.read().json(logFile);
    Encoder<Person> encoder = new Encoder<>();
    Dataset<Person> ds = new Dataset<Person>(ctx,df.logicalPlan(),encoder);
    ds.printSchema();

but the compiler says:

    Error:(23, 27) java: org.apache.spark.sql.Encoder is abstract; cannot be instantiated

Edited (solution): solution based on @Leet
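The usual Java-side answer involves the Encoders.bean factory (Encoders.bean(Person.class)) rather than instantiating Encoder directly. A Scala sketch of the same API, assuming ctx is the SQLContext from the question and modeling Person as a Java-style bean, which is what Encoders.bean expects:

    import org.apache.spark.sql.Encoders
    import scala.beans.BeanProperty

    // Encoders.bean needs a bean-style class: a no-arg constructor
    // plus getters/setters (generated here by @BeanProperty).
    class Person extends Serializable {
      @BeanProperty var name: String = _
      @BeanProperty var age: Long = 0L
    }

    val df = ctx.read.json("/tmp/persons.json")
    val ds = df.as[Person](Encoders.bean(classOf[Person]))
    ds.printSchema()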

Fetching distinct values on a column using Spark DataFrame

Submitted by 删除回忆录丶 on 2019-12-02 21:56:23
Using Spark version 1.6.1, I need to fetch distinct values of a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger. I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below; is there a better approach?

    import sqlContext.implicits._
    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
    preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
      val applicationId = x.getAs[String](ApplicationId)
      val
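A hedged sketch of keeping the work on the executors instead of collecting 50+ million values to the driver (the column name and the toUpperCase transformation are placeholders): keep the distinct values as a Dataset and apply the per-value logic with map, collecting only if the final result is small:

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._

    // Distinct values stay distributed instead of landing on the driver.
    val distinctIds = preProcessedData
      .select(col("ApplicationId"))
      .distinct()
      .as[String]

    // Placeholder per-value transformation, executed on the executors.
    val transformed = distinctIds.map(applicationId => applicationId.toUpperCase)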

Extracting tag attributes from xml using sparkxml

Submitted by 天涯浪子 on 2019-12-02 21:13:28
Question: I am loading an XML file using com.databricks.spark.xml and I want to read a tag attribute using the SQL context.

XML:

    <Receipt>
      <Sale>
        <DepartmentID>PR</DepartmentID>
        <Tax TaxExempt="false" TaxRate="10.25"/>
      </Sale>
    </Receipt>

I loaded the file with:

    val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Receipt").load("/home/user/sale.xml");
    df.registerTempTable("SPtable");

Printing the schema:

    root
     |-- Sale: array (nullable = true)
     |    |-- element: struct (containsNull =
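A hedged sketch of the usual pattern with spark-xml, assuming its default attributePrefix of "_" (attributes then surface as struct fields such as _TaxRate): explode the Sale array and select the attribute fields of the Tax struct:

    import org.apache.spark.sql.functions.explode

    // One row per Sale element, then reach into the Tax struct;
    // tag attributes carry the "_" prefix by default.
    val sales = df.select(explode(df("Sale")).alias("Sale"))
    val taxes = sales.select(
      sales("Sale.DepartmentID").alias("DepartmentID"),
      sales("Sale.Tax._TaxRate").alias("TaxRate"),
      sales("Sale.Tax._TaxExempt").alias("TaxExempt")
    )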

Compute size of Spark dataframe - SizeEstimator gives unexpected results

Submitted by 落花浮王杯 on 2019-12-02 21:02:48
I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically. The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
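SizeEstimator measures the driver-side object graph it is handed, which is why calling it on the DataFrame reference itself gives surprising numbers. A hedged, pragmatic sketch: measure a small collected sample and scale by the row count; the sample fraction and the 128 MB per-partition target are arbitrary placeholders, and the figure reflects deserialized in-memory size rather than on-disk Parquet size:

    import org.apache.spark.util.SizeEstimator

    val totalRows   = df.count()
    val sampleRows  = df.sample(withReplacement = false, fraction = 0.001).collect()
    val bytesPerRow =
      if (sampleRows.nonEmpty) SizeEstimator.estimate(sampleRows) / sampleRows.length else 0L
    val approxSizeBytes = bytesPerRow * totalRows

    // Derive a partition count from the estimate, e.g. ~128 MB per partition.
    val targetPartitionBytes = 128L * 1024 * 1024
    val numPartitions = math.max(1, (approxSizeBytes / targetPartitionBytes).toInt)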

Is Spark SQL UDAF (user defined aggregate function) available in the Python API?

Submitted by 雨燕双飞 on 2019-12-02 20:56:18
As of Spark 1.5.0 it seems possible to write your own UDAFs for custom aggregations on DataFrames: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs. It is, however, unclear to me whether this functionality is supported in the Python API.

You cannot define a Python UDAF in Spark 1.5.0-2.0.0. There is a JIRA tracking this feature request: https://issues.apache.org/jira/browse/SPARK-10915, resolved with goal "later", so it probably won't happen anytime soon. You can use a Scala UDAF from PySpark - it is described in Spark: How to map Python with Scala or Java User Defined
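The workaround the answer points at is to define the aggregate in Scala, register it under a SQL name, and call it from PySpark through SQL or expr. A minimal Scala sketch with a made-up sum-of-squares aggregate:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class SumOfSquares extends UserDefinedAggregateFunction {
      def inputSchema: StructType  = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("acc", DoubleType) :: Nil)
      def dataType: DataType       = DoubleType
      def deterministic: Boolean   = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0) * input.getDouble(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Double = buffer.getDouble(0)
    }

    // Register by name so PySpark can call it, e.g.
    // sqlContext.sql("SELECT sum_of_squares(C) FROM some_table")
    sqlContext.udf.register("sum_of_squares", new SumOfSquares)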

How to tune spark job on EMR to write huge data quickly on S3

Submitted by 浪尽此生 on 2019-12-02 19:16:37
I have a Spark job where I am doing an outer join between two data frames. The size of the first data frame is 260 GB; the file format is text files split into 2200 files, and the size of the second data frame is 2 GB. Writing the data frame output, which is about 260 GB, into S3 takes a very long time: more than 2 hours, after which I cancelled it because I was being charged heavily on EMR. Here is my cluster info:

    emr-5.9.0
    Master: m3.2xlarge
    Core: r4.16xlarge, 10 machines (each machine has 64 vCore, 488 GiB memory, EBS storage: 100 GiB)

This is my cluster config that I am setting: capacity-scheduler yarn
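A hedged sketch of the usual knobs for this pattern (not the asker's actual fix; the partition count, format, and path are placeholders): repartition the joined result so each S3 output task writes a reasonably sized object, and prefer a compressed, splittable format over plain text:

    // Assumes dfLarge (~260 GB) and dfSmall (~2 GB) share a join key column "key".
    val joined = dfLarge.join(dfSmall, Seq("key"), "outer")

    joined
      .repartition(2000)                        // aim for roughly 100-200 MB per output file
      .write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("s3://some-bucket/some-prefix/") // placeholder output path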