spark-dataframe

pyspark - create DataFrame Grouping columns in map type structure

∥☆過路亽.° submitted on 2019-12-04 04:22:41
My DataFrame has the following structure:

-------------------------
| Brand | type | amount |
-------------------------
| B     | a    | 10     |
| B     | b    | 20     |
| C     | c    | 30     |
-------------------------

I want to reduce the number of rows by grouping type and amount into a single column of type Map, so that Brand is unique and MAP_type_AMOUNT holds a key/value pair for each type/amount combination.

I think Spark SQL might have some functions to help with this, or do I have to get the RDD underlying the DataFrame and make my "own" conversion to a map type?

Expected:

-------------------------
| Brand | MAP
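A minimal Scala sketch of the general idea (the question itself is PySpark, so treat this as the Scala equivalent; it assumes Spark 2.4+ for map_from_entries):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

val spark = SparkSession.builder().appName("map-agg-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("B", "a", 10), ("B", "b", 20), ("C", "c", 30))
  .toDF("Brand", "type", "amount")

// Collect the (type, amount) pairs per Brand and fold them into one MapType column.
val result = df
  .groupBy("Brand")
  .agg(map_from_entries(collect_list(struct($"type", $"amount"))).as("MAP_type_AMOUNT"))

result.show(false)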

Is it better for Spark to select from Hive or select from a file?

依然范特西╮ submitted on 2019-12-04 04:08:15
I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?

Mike: tl;dr: I would read it straight from the parquet files. I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row x 100-column table, some timings I've recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")
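A rough way to reproduce such a comparison yourself (only a sketch: the timing helper and the paths/table name are placeholders, and an action such as count() is needed to actually force the read):

def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

// Force a full scan of each source and compare wall-clock times.
val fileCount = time("read from parquet files") {
  sqlContext.read.parquet("/path/to/parquets/*.parquet").count()
}
val hiveCount = time("read via Hive metastore") {
  sqlContext.table("db.table").count()
}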

How to select a same-size stratified sample from a dataframe in Apache Spark?

别说谁变了你拦得住时间么 submitted on 2019-12-04 03:55:11
I have a dataframe in Spark 2, as shown below, where users have between 50 and several thousand posts. I would like to create a new dataframe that keeps all the users from the original dataframe but with only 5 randomly sampled posts for each user.

+--------+--------------+--------------------+
| user_id|       post_id|                text|
+--------+--------------+--------------------+
|67778705|44783131591473|some text...........|
|67778705|44783134580755|some text...........|
|67778705|44783136367108|some text...........|
|67778705|44783136970669|some text...........|
|67778705|44783138143396|some text...........|
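One common approach (a sketch assuming Spark 2.x and a DataFrame named df with the columns shown) is a window function that ranks each user's posts in random order and keeps at most 5 per user:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

// Rank each user's posts in a random order, then keep the first 5 per user.
// Users with fewer than 5 posts simply keep all of them.
val w = Window.partitionBy("user_id").orderBy(rand(42L))

val sampled = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") <= 5)
  .drop("rn")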

Error while exploding a struct column in Spark

独自空忆成欢 submitted on 2019-12-04 03:26:47
Question: I have a dataframe whose schema looks like this:

event: struct (nullable = true)
 |    | event_category: string (nullable = true)
 |    | event_name: string (nullable = true)
 |    | properties: struct (nullable = true)
 |    |    | ErrorCode: string (nullable = true)
 |    |    | ErrorDescription: string (nullable = true)

I am trying to explode the struct column properties using the following code:

df_json.withColumn("event_properties", explode($"event.properties"))

But it is throwing the following exception: cannot
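For what it's worth, explode applies to array and map columns, not to structs; a minimal sketch of the usual workaround (assuming the df_json schema above) flattens the struct's fields instead:

// explode() works on ArrayType / MapType columns; a StructType has a fixed set of
// fields, so they can simply be selected (the star expansion flattens the struct).
val flattened = df_json.select(
  "event.event_category",
  "event.event_name",
  "event.properties.*"   // promotes ErrorCode and ErrorDescription to top-level columns
)
flattened.printSchema()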

How to calculate Percentile of column in a DataFrame in spark?

馋奶兔 submitted on 2019-12-04 03:00:56
I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way:

hiveContext.sql("select percentile_approx(Open_Rate, 0.10) from myTable")

But I want to do it using a Spark DataFrame for performance reasons. Sample data set:

|User ID|Open_Rate|
-------------------
|A1     |10.3     |
|B1     |4.04     |
|C1     |21.7     |
|D1     |18.6     |

I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want to do something like this:

df.select($"id
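A sketch of one way to get that bucketing (assuming Spark 2.0+; ntile splits rows into n equal-sized buckets over an ordering, and approxQuantile returns the approximate cut points):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

// Assign each user to a decile: 1 = lowest 10% of Open_Rate, 10 = highest 10%.
// Note: an unpartitioned window pulls all rows into a single partition,
// which is fine for small data but does not scale.
val byOpenRate = Window.orderBy(col("Open_Rate"))
val withDecile = df.withColumn("decile", ntile(10).over(byOpenRate))

// Count how many users land in each decile bucket.
withDecile.groupBy("decile").count().orderBy("decile").show()

// Alternatively, get approximate percentile cut points directly (Spark 2.0+):
val cuts = df.stat.approxQuantile("Open_Rate", Array(0.1, 0.2, 0.5), 0.01)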

Which is more efficient: DataFrame, RDD, or HiveQL?

主宰稳场 submitted on 2019-12-04 02:58:12
I am a newbie to Apache Spark. My job is to read two CSV files, select some specific columns from them, merge them, aggregate them and write the result into a single CSV file. For example:

CSV1: name, age, department_id
CSV2: department_id, department_name, location

I want to get a third CSV file with name, age, department_name. I am loading both CSVs into dataframes, and then I am able to get the third dataframe using several DataFrame methods (join, select, filter, drop). I am also able to do the same using several RDD.map() calls, and I am also able to do the same by executing HiveQL through a HiveContext. I want to
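As a rough sketch of the DataFrame route (assuming Spark 2.x with the built-in CSV reader; the paths, the header option, and the coalesce(1) for a single output file are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-join-sketch").getOrCreate()

val people = spark.read.option("header", "true").csv("/path/to/csv1")
  .select("name", "age", "department_id")
val departments = spark.read.option("header", "true").csv("/path/to/csv2")
  .select("department_id", "department_name")

// Join on the shared key and keep only the requested columns.
val joined = people
  .join(departments, Seq("department_id"))
  .select("name", "age", "department_name")

// coalesce(1) forces a single output part file; drop it for large results.
joined.coalesce(1).write.option("header", "true").csv("/path/to/output")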

How to filter one spark dataframe against another dataframe

孤者浪人 submitted on 2019-12-04 02:00:59
I'm trying to filter one dataframe against another:

scala> val df1 = sc.parallelize((1 to 100).map(a => (s"user $a", a * 0.123, a))).toDF("name", "score", "user_id")
scala> val df2 = sc.parallelize(List(2, 3, 4, 5, 6)).toDF("valid_id")

Now I want to filter df1 and get back a dataframe that contains all the rows in df1 where user_id is in df2("valid_id"). In other words, I want all the rows in df1 where user_id is either 2, 3, 4, 5 or 6.

scala> df1.select("user_id").filter($"user_id" in df2("valid_id"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark
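One common pattern for this kind of filtering (a sketch using the df1 and df2 defined above) is a left semi join, which keeps only the rows of df1 whose user_id appears in df2 and adds no columns from df2:

val filtered = df1.join(df2, df1("user_id") === df2("valid_id"), "leftsemi")
filtered.show()

// If df2 is small, broadcasting it avoids shuffling df1:
// import org.apache.spark.sql.functions.broadcast
// df1.join(broadcast(df2), df1("user_id") === df2("valid_id"), "leftsemi")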

Replacing whitespace in all column names in a Spark DataFrame

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-04 00:59:17
Question: I have a Spark dataframe with whitespace in some of the column names, which has to be replaced with underscores. I know a single column can be renamed using withColumnRenamed() in Spark SQL, but to rename n columns this function has to be chained n times (to my knowledge). To automate this, I have tried:

val old_names = df.columns // contains an array of the old column names
val new_names = old_names.map { x => if (x.contains(" ")) x.replaceAll("\\s", "_") else x } // array of new
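A sketch of how to avoid chaining withColumnRenamed n times: toDF accepts a full list of new column names, so the mapped array above can be applied in one call:

// Rename every column in one pass.
val new_names = df.columns.map(_.replaceAll("\\s", "_"))
val renamed = df.toDF(new_names: _*)

// Equivalent approach, folding withColumnRenamed over the columns:
// val renamed = df.columns.foldLeft(df) { (acc, c) =>
//   acc.withColumnRenamed(c, c.replaceAll("\\s", "_"))
// }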

Pyspark dataframe LIKE operator

二次信任 submitted on 2019-12-03 23:38:06
What is the equivalent in PySpark of the LIKE operator? For example, I would like to do:

SELECT * FROM table WHERE column LIKE "*somestring*";

I'm looking for something easy like this (but this is not working):

df.select('column').where(col('column').like("*s*")).show()

braj: You can use the where and col functions to do this. where filters the data based on a condition (here, whether a column is like '%string%'); col('col_name') refers to the column, and like is the operator:

df.where(col('col1').like("%string%")).show()

Using Spark 2.0.0 onwards, the following also

How to save a DataFrame as compressed (gzipped) CSV?

好久不见. submitted on 2019-12-03 23:37:41
Question: I use Spark 1.6.0 and Scala. I want to save a DataFrame in compressed CSV format. Here is what I have so far (assume I already have df and sc as the SparkContext):

// set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type",
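A sketch of the usual approach: in Spark 2.x the built-in CSV writer takes a compression option directly, and on 1.6 the external spark-csv package (which 1.6 needs for CSV output anyway) accepts a codec option on the writer, rather than Hadoop properties set on the conf after the context has started. The output path is a placeholder.

// Spark 2.x built-in CSV source:
df.write
  .option("header", "true")
  .option("compression", "gzip")
  .csv("/path/to/output")

// Spark 1.6 with the spark-csv package (com.databricks:spark-csv):
// df.write
//   .format("com.databricks.spark.csv")
//   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
//   .save("/path/to/output")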