apache-spark-sql

Speed up InMemoryFileIndex for a Spark SQL job with a large number of input files

倖福魔咒の submitted on 2020-08-07 07:47:25
Question: I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files. It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex. There are no logs, very low network usage, and almost no CPU usage during this time. Here's a sample of what I see in the std output: 24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,
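The excerpt cuts off before any answer, but a minimal sketch of one commonly suggested mitigation follows (an assumption, not something stated in the question): the listing step that builds the InMemoryFileIndex runs serially on the driver unless Spark's parallel partition discovery kicks in, so those settings can be tuned when tens of thousands of paths are involved. The values shown are illustrative only.

    import org.apache.spark.sql.SparkSession

    // Sketch: distribute the file-status listing across executors instead of
    // doing it on the driver. Both config keys exist in Spark 2.x+; the
    // chosen values are illustrative, not a recommendation.
    val spark = SparkSession.builder()
      .appName("many-input-files")
      // switch to distributed listing once more than this many paths are passed
      .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
      // upper bound on the parallelism used for that distributed listing
      .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
      .getOrCreate()

    // inputPaths is a hypothetical Seq[String] holding the 70,000+ file paths
    // val df = spark.read.parquet(inputPaths: _*)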

How to efficiently map over a DF and use a combination of outputs?

ⅰ亾dé卋堺 submitted on 2020-08-06 06:56:21
Question: Given a DF, let's say I have 3 classes, each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations). What is the best way to get a resulting df that contains the original df A and the 3 added columns? val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2") def addCol(df: DataFrame): DataFrame = { df.withColumn("method1", col("num1")/col("num2")) } def addCol(df: DataFrame): DataFrame = { df.withColumn("method2
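A minimal sketch of one way to combine the outputs, assuming each of the three classes exposes its own addCol(df) method (the object names and the second and third calculations are hypothetical): chain the methods with DataFrame.transform so every step appends its column to the running result.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical stand-ins for the three classes mentioned in the question.
    object Method1 { def addCol(df: DataFrame): DataFrame = df.withColumn("method1", col("num1") / col("num2")) }
    object Method2 { def addCol(df: DataFrame): DataFrame = df.withColumn("method2", col("num1") * col("num2")) }
    object Method3 { def addCol(df: DataFrame): DataFrame = df.withColumn("method3", col("num1") + col("num2")) }

    // df is the two-column DataFrame defined in the question.
    val result = df
      .transform(Method1.addCol)
      .transform(Method2.addCol)
      .transform(Method3.addCol)
    // result keeps num1 and num2 and carries all three added columns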

Spark expression: rename the column list after aggregation

余生长醉 submitted on 2020-08-05 07:14:37
Question: I have written the code below to group and aggregate the columns: val gmList = List("gc1","gc2","gc3") val aList = List("val1","val2","val3","val4","val5") val cype = "first" val exprs = aList.map((_ -> cype)).toMap df.groupBy(gmList.map(col): _*).agg(exprs).show but this creates columns with the aggregation name appended to every column name, as shown below, so I want to alias that name (first(val1) -> val1) and keep this code generic as part of exprs +----------+----------+-------------+------------
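Since the goal is to get first(val1) back as plain val1 while keeping the column list generic, a minimal sketch of one way to do it (assumed, not taken from an answer in the excerpt) is to build aliased Column expressions instead of the Map-based agg:

    import org.apache.spark.sql.functions.{col, first}

    // Build one aliased aggregate per column in aList, so the output column
    // keeps its original name instead of first(...).
    val aggExprs = aList.map(c => first(col(c)).alias(c))

    df.groupBy(gmList.map(col): _*)
      .agg(aggExprs.head, aggExprs.tail: _*)
      .show()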

How can I add multiple columns to a Spark DataFrame efficiently?

微笑、不失礼 submitted on 2020-08-03 07:08:22
Question: I have a set of column names and need to add those columns to an existing dataframe, which is also very large. I need to add all the columns from the set to the dataframe with StringType and a default null value. I am following the approach below, but I found that when the number of columns and the dataframe size are large, this affects my performance. Is there any better way to do this in Spark? Note: number of columns: ~500 import sparkSession.sqlContext.implicits._ var df = Seq( (1, "James"), (2, "Michael"),
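A minimal sketch of one common alternative (an assumption, not quoted from an answer): replace the per-column withColumn loop with a single select, so only one projection is added to the plan regardless of how many columns are appended. newColumns stands in for the question's set of ~500 names.

    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StringType

    // Hypothetical set of new column names; the real set has ~500 entries.
    val newColumns: Set[String] = Set("extra1", "extra2", "extra3")

    // One select: all existing columns plus each new column as a null String.
    val withNulls = df.select(
      df.columns.map(col) ++
        newColumns.toSeq.map(name => lit(null).cast(StringType).as(name)): _*
    )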

How to return rows with null values in a PySpark dataframe?

£可爱£侵袭症+ submitted on 2020-07-30 06:11:06
Question: I am trying to get the rows with null values from a PySpark dataframe. In pandas, I can achieve this using isnull() on the dataframe: df = df[df.isnull().any(axis=1)] But in the case of PySpark, when I run the command below it shows an AttributeError: df.filter(df.isNull()) AttributeError: 'DataFrame' object has no attribute 'isNull'. How can I get the rows with null values without checking each column? Answer 1: You can filter the rows with where, reduce and a list comprehension. For example,
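A minimal sketch of the approach the answer describes (where plus reduce over a list comprehension), keeping any row in which at least one column is null; df is the questioner's DataFrame:

    from functools import reduce
    from pyspark.sql.functions import col

    # OR together one isNull() predicate per column, then filter on the result.
    rows_with_nulls = df.where(
        reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
    )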

Copy current row, modify it and add a new row in Spark

耗尽温柔 submitted on 2020-07-30 04:25:55
Question: I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to copy the current row and create another row, modifying a few columns' data. How can this be achieved in Spark SQL? Ex: Given val data = List( ("20", "score", "school", 14 ,12), ("21", "score", "school", 13 , 13), ("22", "rate", "school", 11 ,14) ) val df = data.toDF("id", "code", "entity", "value1","value2") Current Output +---+-----+------+------+------+ | id| code|entity|value1|value2| +---+-----+------+------+------
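A minimal sketch of one way to do this (an assumption, since the excerpt stops before the expected output): derive the modified copies from the original rows and union them back onto the DataFrame. Which rows are copied and how value1/value2 are changed is hypothetical here.

    import org.apache.spark.sql.functions.{col, lit}

    // Copy every "score" row and tweak its value columns.
    val copies = df
      .filter(col("code") === "score")
      .withColumn("value1", col("value1") + lit(1))
      .withColumn("value2", col("value2") + lit(1))

    // Same schema on both sides, so union simply appends the modified rows.
    val result = df.union(copies)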

Dataframe lookup and optimization

天大地大妈咪最大 submitted on 2020-07-25 03:48:11
Question: I am using spark-sql 2.4.3 with Java. I have the scenario below: val data = List( ("20", "score", "school", 14 ,12), ("21", "score", "school", 13 , 13), ("22", "rate", "school", 11 ,14), ("23", "score", "school", 11 ,14), ("24", "rate", "school", 12 ,12), ("25", "score", "school", 11 ,14) ) val df = data.toDF("id", "code", "entity", "value1","value2") df.show //this lookup data is populated from the DB. val ll = List( ("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14) ) val codeValudeDf = ll.toDF( "code"
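The excerpt is cut off before the lookup DataFrame's second column name and the desired output, so the sketch below is an assumption: it renames the lookup columns positionally, broadcasts the small lookup table, and joins it against value1 rather than collecting it to the driver.

    import org.apache.spark.sql.functions.broadcast

    // Assumed column names for the lookup table built from the DB list.
    val lookup = codeValudeDf.toDF("lookup_code", "lookup_value")

    // The lookup data is tiny, so broadcast it to avoid a shuffle.
    val joined = df.join(
      broadcast(lookup),
      df("value1") === lookup("lookup_value"),
      "left"
    )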