spark-dataframe

Flatten Nested Spark Dataframe

Submitted by 别来无恙 on 2019-12-03 03:24:55
Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a DataFrame with different nested types (e.g. StructType, ArrayType, MapType, etc.). Say I have a schema like:

    StructType(List(StructField(field1,...), StructField(field2,...),
      ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))

I'm looking to adapt this into a flat table with a structure like:

    field1
    field2
    nested_array.nested_field1
    nested_array.nested_field2

FYI,
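One generic approach (not from the truncated post) is to walk the schema recursively and alias nested struct fields with dotted names. A minimal Scala sketch, assuming ArrayType/MapType columns are handled separately (e.g. with explode) before or after this step:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StructField, StructType}

    // Recursively collect select expressions, flattening StructType fields
    // into dotted column names; arrays and maps are passed through untouched.
    def flattenStructs(schema: StructType, prefix: String = ""): Seq[Column] =
      schema.fields.toSeq.flatMap {
        case StructField(name, inner: StructType, _, _) =>
          flattenStructs(inner, s"$prefix$name.")
        case StructField(name, _, _, _) =>
          Seq(col(prefix + name).alias(prefix + name))
      }

    def flattenDf(df: DataFrame): DataFrame = df.select(flattenStructs(df.schema): _*)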

Group By, Rank and aggregate spark data frame using pyspark

Submitted by 送分小仙女□ on 2019-12-03 03:04:33
I have a dataframe that looks like:

    A    B    C
    ---------------
    A1   B1   0.8
    A1   B2   0.55
    A1   B3   0.43
    A2   B1   0.7
    A2   B2   0.5
    A2   B3   0.5
    A3   B1   0.2
    A3   B2   0.3
    A3   B3   0.4

How do I convert the column 'C' to the relative rank (higher score -> better rank) per column A? Expected output:

    A    B    Rank
    ---------------
    A1   B1   1
    A1   B2   2
    A1   B3   3
    A2   B1   1
    A2   B2   2
    A2   B3   2
    A3   B1   3
    A3   B2   2
    A3   B3   1

The ultimate state I want to reach is to aggregate column B and store the ranks for each A. Example:

    B    Ranks
    B1   [1,1,3]
    B2   [2,2,2]
    B3   [3,2,1]

Add rank:

    from pyspark.sql.functions import *
    from pyspark.sql.window import Window
    ranked = df
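The excerpt's PySpark code is cut off; a hedged Scala sketch of the same idea, assuming df is the dataframe from the question: dense_rank over a window partitioned by A and ordered by C descending, then collect_list to gather the ranks per B:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, collect_list, dense_rank}

    // Rank C within each A group, highest C first.
    val w = Window.partitionBy("A").orderBy(col("C").desc)
    val ranked = df.withColumn("Rank", dense_rank().over(w))

    // Gather the ranks per B; note collect_list gives no ordering guarantee
    // across the A groups, so sort first if the order [A1, A2, A3] matters.
    val ranksPerB = ranked.groupBy("B").agg(collect_list("Rank").alias("Ranks"))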

Read from a hive table and write back to it using spark sql

Submitted by 我们两清 on 2019-12-03 00:17:28
I am reading a Hive table using Spark SQL and assigning it to a Scala val:

    val x = sqlContext.sql("select * from some_table")

Then I am doing some processing with the dataframe x and finally coming up with a dataframe y, which has the exact schema as the table some_table. Finally I am trying to insert overwrite the y dataframe to the same Hive table some_table:

    y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")

Then I am getting the error:

    org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from

I tried creating an insert sql
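One commonly suggested workaround is to break the read/overwrite cycle by materializing the result somewhere else first; a sketch under that assumption, with some_table_staging as a placeholder name:

    import org.apache.spark.sql.SaveMode

    // 1. Materialize the transformed data into a staging table.
    y.write.mode(SaveMode.Overwrite).saveAsTable("some_table_staging")

    // 2. Read the staging table back and overwrite the original table,
    //    so the plan no longer reads from and writes to the same table.
    sqlContext.table("some_table_staging")
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("some_table")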

Pyspark: filter dataframe by regex with string formatting?

Submitted by 故事扮演 on 2019-12-02 23:22:30
I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but was wondering whether the following is a "best practice" for using %s in the desired condition, as follows:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx
    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword like %s" % substr)
    # dk should contain rows with keyword values such as "Arizona is hot."

Note I'm
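Since the pattern is a regex rather than a SQL LIKE pattern, rlike is usually a better fit than like with %-wildcards. A hedged Scala sketch of that route, with the path as a placeholder:

    import org.apache.spark.sql.functions.col

    val inputPath = "s3://some-bucket/some-prefix/"   // placeholder
    val myExpr = "Arizona.*hot"                       // a regular expression

    val dx = sqlContext.read.parquet(inputPath)       // "keyword" is a field in dx

    // rlike applies a regex; like only understands the % and _ wildcards.
    val dk = dx.filter(col("keyword").rlike(myExpr))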

How to convert DataFrame to Dataset in Apache Spark in Java?

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-02 23:08:13
I can convert a DataFrame to a Dataset in Scala very easily:

    case class Person(name:String, age:Long)
    val df = ctx.read.json("/tmp/persons.json")
    val ds = df.as[Person]
    ds.printSchema

but in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is:

    DataFrame df = ctx.read().json(logFile);
    Encoder<Person> encoder = new Encoder<>();
    Dataset<Person> ds = new Dataset<Person>(ctx,df.logicalPlan(),encoder);
    ds.printSchema();

but the compiler says:

    Error:(23, 27) java: org.apache.spark.sql.Encoder is abstract; cannot be instantiated

Edited (solution): solution based on @Leet
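The usual Java-side answer involves the Encoders.bean factory (Encoders.bean(Person.class)) rather than instantiating Encoder directly. A Scala sketch of the same API, assuming ctx is the SQLContext from the question and modeling Person as a Java-style bean, which is what Encoders.bean expects:

    import org.apache.spark.sql.Encoders
    import scala.beans.BeanProperty

    // Encoders.bean needs a bean-style class: a no-arg constructor
    // plus getters/setters (generated here by @BeanProperty).
    class Person extends Serializable {
      @BeanProperty var name: String = _
      @BeanProperty var age: Long = 0L
    }

    val df = ctx.read.json("/tmp/persons.json")
    val ds = df.as[Person](Encoders.bean(classOf[Person]))
    ds.printSchema()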

Fetching distinct values on a column using Spark DataFrame

Submitted by 删除回忆录丶 on 2019-12-02 21:56:23
Using Spark version 1.6.1, I need to fetch distinct values of a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger. I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below; is there a better approach?

    import sqlContext.implicits._
    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
    preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
      val applicationId = x.getAs[String](ApplicationId)
      val
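A hedged sketch of keeping the work on the executors instead of collecting 50+ million values to the driver (the column name and the toUpperCase transformation are placeholders): keep the distinct values as a Dataset and apply the per-value logic with map, collecting only if the final result is small:

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._

    // Distinct values stay distributed instead of landing on the driver.
    val distinctIds = preProcessedData
      .select(col("ApplicationId"))
      .distinct()
      .as[String]

    // Placeholder per-value transformation, executed on the executors.
    val transformed = distinctIds.map(applicationId => applicationId.toUpperCase)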

Extracting tag attributes from xml using sparkxml

Submitted by 天涯浪子 on 2019-12-02 21:13:28
Question: I am loading an XML file using com.databricks.spark.xml and I want to read a tag attribute using the SQL context.

XML:

    <Receipt>
      <Sale>
        <DepartmentID>PR</DepartmentID>
        <Tax TaxExempt="false" TaxRate="10.25"/>
      </Sale>
    </Receipt>

I loaded the file with:

    val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Receipt").load("/home/user/sale.xml");
    df.registerTempTable("SPtable");

Printing the schema:

    root
     |-- Sale: array (nullable = true)
     |    |-- element: struct (containsNull =
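A hedged sketch of the usual pattern with spark-xml, assuming its default attributePrefix of "_" (attributes then surface as struct fields such as _TaxRate): explode the Sale array and select the attribute fields of the Tax struct:

    import org.apache.spark.sql.functions.explode

    // One row per Sale element, then reach into the Tax struct;
    // tag attributes carry the "_" prefix by default.
    val sales = df.select(explode(df("Sale")).alias("Sale"))
    val taxes = sales.select(
      sales("Sale.DepartmentID").alias("DepartmentID"),
      sales("Sale.Tax._TaxRate").alias("TaxRate"),
      sales("Sale.Tax._TaxExempt").alias("TaxExempt")
    )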

Compute size of Spark dataframe - SizeEstimator gives unexpected results

Submitted by 落花浮王杯 on 2019-12-02 21:02:48
I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically. The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
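SizeEstimator measures the driver-side object graph it is handed, which is why calling it on the DataFrame reference itself gives surprising numbers. A hedged, pragmatic sketch: measure a small collected sample and scale by the row count; the sample fraction and the 128 MB per-partition target are arbitrary placeholders, and the figure reflects deserialized in-memory size rather than on-disk Parquet size:

    import org.apache.spark.util.SizeEstimator

    val totalRows   = df.count()
    val sampleRows  = df.sample(withReplacement = false, fraction = 0.001).collect()
    val bytesPerRow =
      if (sampleRows.nonEmpty) SizeEstimator.estimate(sampleRows) / sampleRows.length else 0L
    val approxSizeBytes = bytesPerRow * totalRows

    // Derive a partition count from the estimate, e.g. ~128 MB per partition.
    val targetPartitionBytes = 128L * 1024 * 1024
    val numPartitions = math.max(1, (approxSizeBytes / targetPartitionBytes).toInt)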

Is Spark SQL UDAF (user defined aggregate function) available in the Python API?

Submitted by 雨燕双飞 on 2019-12-02 20:56:18
As of Spark 1.5.0 it seems possible to write your own UDAFs for custom aggregations on DataFrames: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs. It is, however, unclear to me whether this functionality is supported in the Python API.

You cannot define a Python UDAF in Spark 1.5.0-2.0.0. There is a JIRA tracking this feature request: https://issues.apache.org/jira/browse/SPARK-10915, resolved with goal "later", so it probably won't happen anytime soon. You can use a Scala UDAF from PySpark - it is described in Spark: How to map Python with Scala or Java User Defined
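The workaround the answer points at is to define the aggregate in Scala, register it under a SQL name, and call it from PySpark through SQL or expr. A minimal Scala sketch with a made-up sum-of-squares aggregate:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class SumOfSquares extends UserDefinedAggregateFunction {
      def inputSchema: StructType  = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("acc", DoubleType) :: Nil)
      def dataType: DataType       = DoubleType
      def deterministic: Boolean   = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0) * input.getDouble(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Double = buffer.getDouble(0)
    }

    // Register by name so PySpark can call it, e.g.
    // sqlContext.sql("SELECT sum_of_squares(C) FROM some_table")
    sqlContext.udf.register("sum_of_squares", new SumOfSquares)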

How to tune spark job on EMR to write huge data quickly on S3

Submitted by 浪尽此生 on 2019-12-02 19:16:37
I have a Spark job where I am doing an outer join between two data frames. The size of the first data frame is 260 GB; the file format is text files split into 2200 files, and the size of the second data frame is 2 GB. Writing the data frame output, which is about 260 GB, into S3 takes a very long time: more than 2 hours, after which I cancelled it because I was being charged heavily on EMR. Here is my cluster info:

    emr-5.9.0
    Master: m3.2xlarge
    Core: r4.16xlarge, 10 machines (each machine has 64 vCore, 488 GiB memory, EBS storage: 100 GiB)

This is my cluster config that I am setting: capacity-scheduler yarn
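A hedged sketch of the usual knobs for this pattern (not the asker's actual fix; the partition count, format, and path are placeholders): repartition the joined result so each S3 output task writes a reasonably sized object, and prefer a compressed, splittable format over plain text:

    // Assumes dfLarge (~260 GB) and dfSmall (~2 GB) share a join key column "key".
    val joined = dfLarge.join(dfSmall, Seq("key"), "outer")

    joined
      .repartition(2000)                        // aim for roughly 100-200 MB per output file
      .write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("s3://some-bucket/some-prefix/") // placeholder output path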