spark-dataframe

SparkR: dplyr-style split-apply-combine on DataFrame

你。 Submitted on 2019-12-21 06:05:28

Question: Under the previous RDD paradigm, I could specify a key and then map an operation onto the RDD elements corresponding to each key. I don't see a clear way to do this with a DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

new.df <- old.df %>% group_by("column1") %>% do(myfunc(.))

I currently have a large SparkR DataFrame of the form:

timestamp              value  id
2015-09-01 05:00:00.0  1.132  24
2015-09-01 05:10:00.0  null   24
2015-09-01 05:20:00.0  1.129  24
2015-09-01 05
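
The SparkR API the question is working with does not show such an operation here. Purely as an illustrative sketch of the same split-apply-combine idea, this is how the pattern looks in the Scala Dataset API of later Spark versions; the case class, the sample rows, and the per-group mean standing in for myfunc are all made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-apply-combine").getOrCreate()
import spark.implicits._

case class Reading(id: Int, value: Double)

val readings = Seq(Reading(24, 1.132), Reading(24, 1.129), Reading(7, 0.5)).toDS()

// Split by key, apply a function to each group's rows, combine the results.
val perGroup = readings
  .groupByKey(_.id)
  .mapGroups { (id, rows) =>
    val values = rows.map(_.value).toSeq
    (id, values.sum / values.size)   // stand-in for myfunc applied to one group
  }
  .toDF("id", "mean_value")

perGroup.show()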

How to filter data using window functions in spark

给你一囗甜甜゛ Submitted on 2019-12-21 05:40:48

Question: I have the following data:

rowid uid time code
1     1   5    a
2     1   6    b
3     1   7    c
4     2   8    a
5     2   9    c
6     2   9    c
7     2   10   c
8     2   11   a
9     2   12   c

Now I want to filter the data so that rows 6 and 7 are removed: for a particular uid I want to keep just one row with the value 'c' in code. The expected data should be:

rowid uid time code
1     1   5    a
2     1   6    b
3     1   7    c
4     2   8    a
5     2   9    c
8     2   11   a
9     2   12   c

I'm using a window function, something like this:

val window = Window.partitionBy("uid").orderBy("time")
val
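
The snippet above is cut off after the window definition. A hedged sketch of one lag-based filter that yields the expected output, assuming the question's DataFrame is called df and that ties on time are broken by rowid:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep a 'c' row only when the previous row of the same uid is not also 'c'.
val w = Window.partitionBy("uid").orderBy("time", "rowid")

val filtered = df
  .withColumn("prev_code", lag("code", 1).over(w))
  .filter(
    col("code") =!= "c" ||
    col("prev_code").isNull ||
    col("prev_code") =!= "c")
  .drop("prev_code")

filtered.show()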

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

痴心易碎 Submitted on 2019-12-21 05:34:12

Question: I read a Parquet file from HDFS:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)

class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

collect(AppDF)
.....error: arguments imply differing number of rows: 46021, 39175, 62744, 27137

head(AppDF)
.....error: arguments imply differing number of rows: 36,
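
The question is truncated before any answer, but every column in the schema is binary, which often means the Parquet file stores strings without string metadata. As an assumption about this particular case rather than a confirmed fix, one workaround is to have Spark SQL interpret Parquet binary columns as strings before collecting; in Scala the idea looks roughly like this:

// Sketch only: treat Parquet BINARY columns as strings when reading.
// Whether this resolves the collect()/head() error above is an assumption.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val appDF = sqlContext.read.parquet("hdfs://part_2015")
appDF.printSchema()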

What is the difference between sort and orderBy functions in Spark

别来无恙 Submitted on 2019-12-20 18:03:47

Question: What is the difference between sort and orderBy on a Spark DataFrame?

scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)

The commands below produce the same result:

zips.sort(desc("pop")).show
zips.orderBy(desc("pop")).show

Answer 1: orderBy is just an alias for the sort function. From the Spark documentation:

/** *
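
A quick way to check the alias claim for yourself: if orderBy simply delegates to sort, the two calls should produce identical plans.

// Compare the plans; they should match if orderBy is only an alias for sort.
zips.sort(desc("pop")).explain()
zips.orderBy(desc("pop")).explain()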

Why does Apache Spark read unnecessary Parquet columns within nested structures?

巧了我就是萌 Submitted on 2019-12-20 17:19:26

Question: My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Create a
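
The POC itself is cut off above. As a hedged reconstruction of the kind of experiment being described (the schema, path, and field names below are invented), one can write a small nested struct to Parquet and then check which columns the scan requests when only a single leaf field is selected:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Invented nested schema standing in for the POC's structure.
val nestedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("nested", StructType(Seq(
    StructField("a", StringType),
    StructField("b", StringType))))))

val rows = spark.sparkContext.parallelize(Seq(Row(1L, Row("x", "y"))))
spark.createDataFrame(rows, nestedSchema)
  .write.mode("overwrite").parquet("/tmp/nested_poc")

// Select only nested.a; the INFO logs and the physical plan show whether the
// Parquet reader requests just that leaf or the entire "nested" struct.
spark.read.parquet("/tmp/nested_poc").select("nested.a").explain()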

Read from a hive table and write back to it using spark sql

僤鯓⒐⒋嵵緔 Submitted on 2019-12-20 11:06:30

Question: I am reading a Hive table using Spark SQL and assigning it to a Scala val:

val x = sqlContext.sql("select * from some_table")

Then I do some processing on the dataframe x and finally come up with a dataframe y, which has the exact same schema as the table some_table. Finally I try to insert-overwrite the y dataframe into the same Hive table some_table:

y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")

Then I get the error:

org.apache.spark.sql
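
The error text is truncated, but a commonly suggested workaround for "read and overwrite the same table" situations, offered here only as a sketch (the staging table name is made up), is to materialize the result elsewhere first and then overwrite the original from that copy:

import org.apache.spark.sql.SaveMode

// Write the derived dataframe to a staging table, then overwrite the source table from it.
y.write.mode(SaveMode.Overwrite).saveAsTable("some_table_staging")

val staged = sqlContext.table("some_table_staging")
staged.write.mode(SaveMode.Overwrite).insertInto("some_table")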

How to convert DataFrame to Dataset in Apache Spark in Java?

放肆的年华 Submitted on 2019-12-20 10:33:05

Question: I can convert a DataFrame to a Dataset in Scala very easily:

case class Person(name: String, age: Long)
val df = ctx.read.json("/tmp/persons.json")
val ds = df.as[Person]
ds.printSchema

but in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is:

DataFrame df = ctx.read().json(logFile);
Encoder<Person> encoder = new Encoder<>();
Dataset<Person> ds = new Dataset<Person>(ctx, df.logicalPlan(), encoder);
ds.printSchema();

but the compiler says:

Error:(23, 27) java: org
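
The compiler error is cut off, but the point the Scala one-liner hides is that df.as[Person] takes an Encoder supplied implicitly; an Encoder is obtained from the Encoders factory rather than instantiated with new. A sketch of the same conversion in Scala with the encoder written out explicitly makes that dependency visible (in the Java API the same factory plays this role, e.g. Encoders.bean):

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Long)

// Equivalent to df.as[Person], but with the implicit Encoder spelled out.
val personEncoder: Encoder[Person] = Encoders.product[Person]
val ds = df.as[Person](personEncoder)
ds.printSchema()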

Is Spark SQL UDAF (user defined aggregate function) available in the Python API?

不羁的心 Submitted on 2019-12-20 09:59:08

Question: As of Spark 1.5.0 it seems possible to write your own UDAFs for custom aggregations on DataFrames: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs. It is, however, unclear to me whether this functionality is supported in the Python API.

Answer 1: You cannot define a Python UDAF in Spark 1.5.0-2.0.0. There is a JIRA tracking this feature request: https://issues.apache.org/jira/browse/SPARK-10915, resolved with goal "later", so it probably won't happen anytime soon.
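
One workaround that is often mentioned, sketched here rather than quoted from the answer, is to implement the UDAF in Scala, register it with the SQL function registry, and call it by name from the Python API through SQL:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Minimal illustrative UDAF: a plain sum over a double column.
class SumUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0.0 }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = b1.getDouble(0) + b2.getDouble(0)
  }
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// After registration the function is callable from SQL, including from PySpark
// via sqlContext.sql("select my_sum(value) from some_table").
sqlContext.udf.register("my_sum", new SumUDAF)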

SparkSQL : Can I explode two different variables in the same query?

情到浓时终转凉″ Submitted on 2019-12-20 09:47:27

Question: I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I want to explode another field, "colors", so the final output would be the Cartesian product of names and colors. So I did:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got the error:

Only one generator allowed per select but Generate and and Explode found.;

Does anyone have any idea? I can actually make it work by doing
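
The working version is cut off above. Two common ways around the "only one generator per select" restriction, shown only as a sketch (using the question's table and column names; written against the Scala sqlContext, but the SQL is the same from Python), are to chain the explodes across two selects or to use LATERAL VIEW:

// Option 1: explode in two steps via an intermediate temp table.
val step1 = sqlContext.sql("select explode(names) as name, colors from data")
step1.registerTempTable("data_names")
val result = sqlContext.sql("select name, explode(colors) as color from data_names")

// Option 2: LATERAL VIEW permits several generators in one query
// (assuming the SQL dialect in use supports LATERAL VIEW).
val result2 = sqlContext.sql(
  """select name, color
    |from data
    |lateral view explode(names) n as name
    |lateral view explode(colors) c as color""".stripMargin)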