spark-dataframe

Spark Dataframe - Windowing Function - Lag & Lead for Insert & Update output

丶灬走出姿态 submitted on 2019-12-04 14:25:07
Question: I need to perform the below operation on DataFrames using the windowing functions lag and lead. For each key, I need to perform the following inserts and updates in the final output. Insert conditions: 1. By default, the LAYER_NO=0 record needs to be written to the output. 2. If there is any change in the value of COL1, COL2, or COL3 with respect to its previous record, then that record needs to be written to the output. Example: for key_1 with layer_no=2, there is a change of value from 400 to 600 in COL3. Update condition: 1.
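As a rough illustration, the change-detection part can be expressed with lag over a window partitioned by key. This is only a sketch based on the column names visible in the excerpt (KEY, LAYER_NO, COL1-COL3); everything else is assumed, since the question is cut off before the update conditions:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("KEY").orderBy("LAYER_NO")

// Keep LAYER_NO = 0 unconditionally; otherwise keep a row only if COL1, COL2
// or COL3 differs from the previous record of the same key. (The first row per
// key is assumed to be LAYER_NO = 0, so the null returned by lag there is moot.)
val inserts = df
  .withColumn("changed",
    col("LAYER_NO") === 0 ||
    col("COL1") =!= lag("COL1", 1).over(w) ||
    col("COL2") =!= lag("COL2", 1).over(w) ||
    col("COL3") =!= lag("COL3", 1).over(w))
  .filter(col("changed"))
  .drop("changed")
```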

Spark Streaming: How can I add more partitions to my DStream?

痞子三分冷 submitted on 2019-12-04 14:21:43
Question: I have a Spark Streaming app which looks like this: val message = KafkaUtils.createStream(...).map(_._2) message.foreachRDD( rdd => { if (!rdd.isEmpty){ val kafkaDF = sqlContext.read.json(rdd) kafkaDF.foreachPartition( i =>{ createConnection() i.foreach( row =>{ connection.sendToTable() } ) closeConnection() } ) I run it on a YARN cluster using spark-submit --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 5.... When I try to log kafkaDF.rdd
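The question is truncated, but for the parallelism side of it one option is DStream.repartition, which redistributes each batch's records across more partitions before the per-partition work runs. A minimal sketch, assuming `message` is the DStream[String] from the snippet and 30 is just an example value:

```scala
import org.apache.spark.streaming.dstream.DStream

// Redistribute each micro-batch across more partitions so the downstream
// foreachPartition work runs with higher parallelism.
def withMorePartitions(message: DStream[String], numPartitions: Int = 30): DStream[String] =
  message.repartition(numPartitions)
```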

How To Push a Spark Dataframe to Elastic Search (Pyspark)

萝らか妹 submitted on 2019-12-04 13:34:45
Question: Beginner Elasticsearch question here. What is the workflow, or what are the steps, for pushing a Spark DataFrame to Elasticsearch? From my research, I believe I need to use the spark.newAPIHadoopFile() method. However, digging through the Elasticsearch documentation and other Stack Overflow Q&As, I am still a little confused about what format the arguments need to be in and why. NOTE that I am using PySpark, this is a new table to ES (no index already exists), and the df is 5 columns (2 string types, 2 long types, and 1 list of
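For DataFrames, the usual route is the elasticsearch-hadoop Spark SQL data source rather than newAPIHadoopFile. A sketch in Scala (the PySpark df.write chain has the same shape); the index name, host, and connector coordinates are placeholders, not taken from the question:

```scala
// Requires the elasticsearch-hadoop connector on the classpath, e.g.
// --packages org.elasticsearch:elasticsearch-spark-20_2.11:<version>
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost")
  .option("es.port", "9200")
  .mode("append")
  .save("myindex/mytype")   // "index/type" target; by default ES auto-creates missing indices
```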

How to insert Spark DataFrame to Hive Internal table?

允我心安 submitted on 2019-12-04 12:58:16
What's the right way to insert a DataFrame into a Hive internal table in append mode? It seems we can either write the DF to Hive directly using the "saveAsTable" method, or store the DF in a temp table and then use a query. df.write().mode("append").saveAsTable("tableName") OR df.registerTempTable("temptable") sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable") Will the second approach append the records or overwrite them? Is there any other way to effectively write the DF to a Hive internal table? Answer: df.saveAsTable("tableName", "append") is deprecated. Instead you should use the second approach.
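For reference, a short sketch of the two append paths discussed above, assuming a Hive-enabled SQLContext/SparkSession and an illustrative table name. Note that CREATE TABLE IF NOT EXISTS ... AS SELECT is a no-op once the table exists, so on subsequent runs it neither appends nor overwrites:

```scala
// Append via the DataFrameWriter; creates the managed table on the first run,
// adds rows on later runs.
df.write.mode("append").saveAsTable("mydb.my_table")

// Or, if the table is already defined, append by column position:
df.write.mode("append").insertInto("mydb.my_table")
```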

find mean and corr of 10,000 columns in pyspark Dataframe

冷暖自知 submitted on 2019-12-04 12:50:30
I have a DF with 10K columns and 70 million rows. I want to calculate the mean and corr of the 10K columns. I wrote the code below, but it won't work due to the 64 KB generated-code size issue ( https://issues.apache.org/jira/browse/SPARK-16845 )
Data:
region dept week sal val1 val2 val3 ... val10000
US CS 1 1 2 1 1 ... 2
US CS 2 1.5 2 3 1 ... 2
US CS 3 1 2 2 2.1 2
US ELE 1 1.1 2 2 2.1 2
US ELE 2 2.1 2 2 2.1 2
US ELE 3 1 2 1 2 .... 2
UE CS 1 2 2 1 2 .... 2
Code 1:
aggList = [func.mean(col) for col in df.columns] #exclude keys
df2 = df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', col).alias(col)
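A common workaround for the 64 KB codegen limit is to run the aggregation over the value columns in batches and join the per-batch results back on the grouping keys. A Scala sketch of that idea (the batch size of 300 and the key-column names are illustrative; the same pattern works with list slicing in PySpark):

```scala
import org.apache.spark.sql.functions._

val keyCols = Seq("region", "dept")
val valueCols = df.columns.filterNot(keyCols.contains)

// Aggregate at most 300 columns per job to stay under the generated-code limit,
// then join the partial results back together on the grouping keys.
val means = valueCols.grouped(300).map { batch =>
  val aggs = batch.map(c => avg(col(c)).alias(c))
  df.groupBy(keyCols.map(col): _*).agg(aggs.head, aggs.tail: _*)
}.reduce(_.join(_, keyCols))
```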

How to modify a Spark Dataframe with a complex nested structure?

冷暖自知 submitted on 2019-12-04 12:22:28
Question: I have a complex DataFrame structure and would like to null out a column easily. I've created implicit classes that wire up functionality and easily address flat (2D) DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example, I have a schema defined as: StructType( StructField(name,StringType,true), StructField(data,ArrayType( StructType( StructField(name,StringType,true), StructField(values, MapType(StringType,StringType,true), true) ),
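The schema is cut off, but on Spark 2.4+ one way to null out a field nested inside an array of structs is to rebuild the array with the transform higher-order function. A sketch that follows the field names visible in the excerpt (name, data, values); anything beyond that is an assumption:

```scala
import org.apache.spark.sql.functions.expr

// Rebuild each element of `data`, keeping `name` and replacing `values`
// with a typed null, rather than trying to mutate the nested field in place.
val result = df.withColumn(
  "data",
  expr("transform(data, x -> named_struct('name', x.name, 'values', cast(null as map<string,string>)))")
)
```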

Spark DataSet filter performance

扶醉桌前 submitted on 2019-12-04 11:52:09
Question: I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different. The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading CSV data and mapping it to a case class. val df = spark.read.csv(csvFile).as[FireIncident] A filter on UnitId = 'B02' should return 47,980 rows. I tested three ways, as below: 1) Use a typed column (~ 500 ms on localhost): df.where($"UnitID" === "B02").count() 2)
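For context, the performance gap in comparisons like this usually comes down to where the predicate is evaluated. A small sketch of the two extremes, assuming ds: Dataset[FireIncident] as in the excerpt:

```scala
import spark.implicits._

// 1) Untyped Column expression: evaluated inside the SQL engine with codegen,
//    no per-row deserialization into FireIncident objects.
ds.where($"UnitID" === "B02").count()

// 2) Typed filter with a Scala lambda: each row is first deserialized into a
//    FireIncident, so the same predicate carries extra object-creation cost.
ds.filter(_.UnitID == "B02").count()
```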

Joining a large and a ginormous spark dataframe

不想你离开。 submitted on 2019-12-04 11:34:37
Question: I have two dataframes: df1 has 6 million rows, df2 has 1 billion. I have tried the standard df1.join(df2,df1("id")<=>df2("id2")), but I run out of memory. df1 is too large to be put into a broadcast join. I have even tried a Bloom filter, but it was also too large to fit in a broadcast and still be useful. The only thing I have tried that doesn't error out is to break df1 into 300,000-row chunks and join with df2 in a foreach loop. But this takes an order of magnitude longer than it probably
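One mitigation worth sketching: raise the shuffle parallelism and hash-partition both sides on the join key, so the sort-merge join spreads the billion-row side over many small tasks instead of a few huge ones. The numbers below are illustrative assumptions, not a recommendation:

```scala
// More shuffle partitions -> smaller per-task state during the sort-merge join.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

val joined = df1.repartition(2000, df1("id"))
  .join(df2.repartition(2000, df2("id2")), df1("id") <=> df2("id2"))
```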

Cannot resolve column (numeric column name) in Spark Dataframe

扶醉桌前 submitted on 2019-12-04 11:27:53
Question: This is my data: scala> data.printSchema root |-- 1.0: string (nullable = true) |-- 2.0: string (nullable = true) |-- 3.0: string (nullable = true) This doesn't work :( scala> data.select("2.0").show Exception: org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];; 'Project ['2.0] +- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617] +- LocalRelation [_1#5608, _2#5609, _3#5610] ... Try this at home (I'm running on the shell v
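The usual explanation is that a dot in a column name is parsed as struct-field access (column "2" with nested field "0"), so the literal name has to be escaped with backticks. A small sketch, not taken from the truncated excerpt:

```scala
// Backticks make Spark treat "2.0" as a literal column name
// rather than a nested-field reference.
data.select("`2.0`").show()
```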

Randomly shuffle column in Spark RDD or dataframe

五迷三道 submitted on 2019-12-04 11:19:40
Question: Is there any way I can shuffle a column of an RDD or dataframe such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task. Answer 1: What about selecting the column to shuffle, ordering it with orderBy(rand), and zipping it by index onto the existing dataframe? import org.apache.spark.sql.functions.rand def addIndex(df: DataFrame) = spark.createDataFrame( // Add index df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)}, // Create
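The answer's code is cut off, but a sketch of how that zip-with-index pattern typically completes is below; the column name someColumn and the helper column _index are placeholders, not taken from the original answer:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.{LongType, StructField}

// Attach a stable row index to any DataFrame (assumes an active SparkSession `spark`).
def addIndex(df: DataFrame): DataFrame = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  df.schema.add(StructField("_index", LongType, nullable = false))
)

// Shuffle one column independently, then stitch it back on by index.
val shuffledCol = addIndex(df.select("someColumn").orderBy(rand()))
val result = addIndex(df.drop("someColumn"))
  .join(shuffledCol, Seq("_index"))
  .drop("_index")
```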