spark-dataframe

SparkR: dplyr-style split-apply-combine on DataFrame

你。 Submitted on 2019-12-21 06:05:28

Question: Under the previous RDD paradigm, I could specify a key and then map an operation onto the RDD elements corresponding to each key. I don't see a clear way to do this with a DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

new.df <- old.df %>% group_by("column1") %>% do(myfunc(.))

I currently have a large SparkR DataFrame of the form:

timestamp              value  id
2015-09-01 05:00:00.0  1.132  24
2015-09-01 05:10:00.0  null   24
2015-09-01 05:20:00.0  1.129  24
2015-09-01 05
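
The SparkR API the question is working with does not show such an operation here. Purely as an illustrative sketch of the same split-apply-combine idea, this is how the pattern looks in the Scala Dataset API of later Spark versions; the case class, the sample rows, and the per-group mean standing in for myfunc are all made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-apply-combine").getOrCreate()
import spark.implicits._

case class Reading(id: Int, value: Double)

val readings = Seq(Reading(24, 1.132), Reading(24, 1.129), Reading(7, 0.5)).toDS()

// Split by key, apply a function to each group's rows, combine the results.
val perGroup = readings
  .groupByKey(_.id)
  .mapGroups { (id, rows) =>
    val values = rows.map(_.value).toSeq
    (id, values.sum / values.size)   // stand-in for myfunc applied to one group
  }
  .toDF("id", "mean_value")

perGroup.show()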

How to filter data using window functions in spark

给你一囗甜甜゛ Submitted on 2019-12-21 05:40:48

Question: I have the following data:

rowid uid time code
1     1   5    a
2     1   6    b
3     1   7    c
4     2   8    a
5     2   9    c
6     2   9    c
7     2   10   c
8     2   11   a
9     2   12   c

Now I want to filter the data so that rows 6 and 7 are removed: for a particular uid I want to keep just one row with the value 'c' in code. The expected data should be:

rowid uid time code
1     1   5    a
2     1   6    b
3     1   7    c
4     2   8    a
5     2   9    c
8     2   11   a
9     2   12   c

I'm using a window function, something like this:

val window = Window.partitionBy("uid").orderBy("time")
val
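
The snippet above is cut off after the window definition. A hedged sketch of one lag-based filter that yields the expected output, assuming the question's DataFrame is called df and that ties on time are broken by rowid:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep a 'c' row only when the previous row of the same uid is not also 'c'.
val w = Window.partitionBy("uid").orderBy("time", "rowid")

val filtered = df
  .withColumn("prev_code", lag("code", 1).over(w))
  .filter(
    col("code") =!= "c" ||
    col("prev_code").isNull ||
    col("prev_code") =!= "c")
  .drop("prev_code")

filtered.show()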

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

痴心易碎 Submitted on 2019-12-21 05:34:12

Question: I read a Parquet file from HDFS:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)

class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

collect(AppDF)
.....error: arguments imply differing number of rows: 46021, 39175, 62744, 27137

head(AppDF)
.....error: arguments imply differing number of rows: 36,
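
The question is truncated before any answer, but every column in the schema is binary, which often means the Parquet file stores strings without string metadata. As an assumption about this particular case rather than a confirmed fix, one workaround is to have Spark SQL interpret Parquet binary columns as strings before collecting; in Scala the idea looks roughly like this:

// Sketch only: treat Parquet BINARY columns as strings when reading.
// Whether this resolves the collect()/head() error above is an assumption.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val appDF = sqlContext.read.parquet("hdfs://part_2015")
appDF.printSchema()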

What is the difference between sort and orderBy functions in Spark

别来无恙 Submitted on 2019-12-20 18:03:47

Question: What is the difference between sort and orderBy on a Spark DataFrame?

scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)

The commands below produce the same result:

zips.sort(desc("pop")).show
zips.orderBy(desc("pop")).show

Answer 1: orderBy is just an alias for the sort function. From the Spark documentation:

/** *
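
A quick way to check the alias claim for yourself: if orderBy simply delegates to sort, the two calls should produce identical plans.

// Compare the plans; they should match if orderBy is only an alias for sort.
zips.sort(desc("pop")).explain()
zips.orderBy(desc("pop")).explain()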

Why does Apache Spark read unnecessary Parquet columns within nested structures?

巧了我就是萌 Submitted on 2019-12-20 17:19:26

Question: My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Create a
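
The POC itself is cut off above. As a hedged reconstruction of the kind of experiment being described (the schema, path, and field names below are invented), one can write a small nested struct to Parquet and then check which columns the scan requests when only a single leaf field is selected:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Invented nested schema standing in for the POC's structure.
val nestedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("nested", StructType(Seq(
    StructField("a", StringType),
    StructField("b", StringType))))))

val rows = spark.sparkContext.parallelize(Seq(Row(1L, Row("x", "y"))))
spark.createDataFrame(rows, nestedSchema)
  .write.mode("overwrite").parquet("/tmp/nested_poc")

// Select only nested.a; the INFO logs and the physical plan show whether the
// Parquet reader requests just that leaf or the entire "nested" struct.
spark.read.parquet("/tmp/nested_poc").select("nested.a").explain()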

Read from a hive table and write back to it using spark sql

僤鯓⒐⒋嵵緔 Submitted on 2019-12-20 11:06:30

Question: I am reading a Hive table using Spark SQL and assigning it to a Scala val:

val x = sqlContext.sql("select * from some_table")

Then I do some processing on the dataframe x and finally come up with a dataframe y, which has the exact same schema as the table some_table. Finally I try to insert-overwrite the y dataframe into the same Hive table some_table:

y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")

Then I get the error:

org.apache.spark.sql
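
The error text is truncated, but a commonly suggested workaround for "read and overwrite the same table" situations, offered here only as a sketch (the staging table name is made up), is to materialize the result elsewhere first and then overwrite the original from that copy:

import org.apache.spark.sql.SaveMode

// Write the derived dataframe to a staging table, then overwrite the source table from it.
y.write.mode(SaveMode.Overwrite).saveAsTable("some_table_staging")

val staged = sqlContext.table("some_table_staging")
staged.write.mode(SaveMode.Overwrite).insertInto("some_table")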

How to convert DataFrame to Dataset in Apache Spark in Java?

放肆的年华 Submitted on 2019-12-20 10:33:05

Question: I can convert a DataFrame to a Dataset in Scala very easily:

case class Person(name: String, age: Long)
val df = ctx.read.json("/tmp/persons.json")
val ds = df.as[Person]
ds.printSchema

but in the Java version I don't know how to convert a DataFrame to a Dataset. Any idea? My attempt is:

DataFrame df = ctx.read().json(logFile);
Encoder<Person> encoder = new Encoder<>();
Dataset<Person> ds = new Dataset<Person>(ctx, df.logicalPlan(), encoder);
ds.printSchema();

but the compiler says:

Error:(23, 27) java: org
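
The compiler error is cut off, but the point the Scala one-liner hides is that df.as[Person] takes an Encoder supplied implicitly; an Encoder is obtained from the Encoders factory rather than instantiated with new. A sketch of the same conversion in Scala with the encoder written out explicitly makes that dependency visible (in the Java API the same factory plays this role, e.g. Encoders.bean):

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Long)

// Equivalent to df.as[Person], but with the implicit Encoder spelled out.
val personEncoder: Encoder[Person] = Encoders.product[Person]
val ds = df.as[Person](personEncoder)
ds.printSchema()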

Is Spark SQL UDAF (user defined aggregate function) available in the Python API?

不羁的心 Submitted on 2019-12-20 09:59:08

Question: As of Spark 1.5.0 it seems possible to write your own UDAFs for custom aggregations on DataFrames: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs. It is, however, unclear to me whether this functionality is supported in the Python API.

Answer 1: You cannot define a Python UDAF in Spark 1.5.0-2.0.0. There is a JIRA tracking this feature request: https://issues.apache.org/jira/browse/SPARK-10915, resolved with goal "later", so it probably won't happen anytime soon.
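
One workaround that is often mentioned, sketched here rather than quoted from the answer, is to implement the UDAF in Scala, register it with the SQL function registry, and call it by name from the Python API through SQL:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Minimal illustrative UDAF: a plain sum over a double column.
class SumUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0.0 }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = b1.getDouble(0) + b2.getDouble(0)
  }
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// After registration the function is callable from SQL, including from PySpark
// via sqlContext.sql("select my_sum(value) from some_table").
sqlContext.udf.register("my_sum", new SumUDAF)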

SparkSQL : Can I explode two different variables in the same query?

情到浓时终转凉″ Submitted on 2019-12-20 09:47:27

Question: I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I want to explode another field, "colors", so the final output would be the Cartesian product of names and colors. So I did:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got the error:

Only one generator allowed per select but Generate and and Explode found.;

Does anyone have any idea? I can actually make it work by doing
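
The working version is cut off above. Two common ways around the "only one generator per select" restriction, shown only as a sketch (using the question's table and column names; written against the Scala sqlContext, but the SQL is the same from Python), are to chain the explodes across two selects or to use LATERAL VIEW:

// Option 1: explode in two steps via an intermediate temp table.
val step1 = sqlContext.sql("select explode(names) as name, colors from data")
step1.registerTempTable("data_names")
val result = sqlContext.sql("select name, explode(colors) as color from data_names")

// Option 2: LATERAL VIEW permits several generators in one query
// (assuming the SQL dialect in use supports LATERAL VIEW).
val result2 = sqlContext.sql(
  """select name, color
    |from data
    |lateral view explode(names) n as name
    |lateral view explode(colors) c as color""".stripMargin)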