apache-spark-sql

Speed up InMemoryFileIndex for a Spark SQL job with a large number of input files

倖福魔咒の submitted on 2020-08-07 07:47:25
Question: I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files. It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex. There are no logs, very low network usage, and almost no CPU usage during this time. Here's a sample of what I see in the std output: 24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,
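The excerpt cuts off before any answer, but a minimal sketch of one commonly suggested mitigation follows (an assumption, not something stated in the question): the listing step that builds the InMemoryFileIndex runs serially on the driver unless Spark's parallel partition discovery kicks in, so those settings can be tuned when tens of thousands of paths are involved. The values shown are illustrative only.

    import org.apache.spark.sql.SparkSession

    // Sketch: distribute the file-status listing across executors instead of
    // doing it on the driver. Both config keys exist in Spark 2.x+; the
    // chosen values are illustrative, not a recommendation.
    val spark = SparkSession.builder()
      .appName("many-input-files")
      // switch to distributed listing once more than this many paths are passed
      .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
      // upper bound on the parallelism used for that distributed listing
      .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
      .getOrCreate()

    // inputPaths is a hypothetical Seq[String] holding the 70,000+ file paths
    // val df = spark.read.parquet(inputPaths: _*)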

How to efficiently map over a DF and use a combination of outputs?

ⅰ亾dé卋堺 submitted on 2020-08-06 06:56:21
Question: Given a DF, let's say I have 3 classes, each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations). What is the best way to get a resulting df that contains the original df A and the 3 added columns? val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2") def addCol(df: DataFrame): DataFrame = { df.withColumn("method1", col("num1")/col("num2")) } def addCol(df: DataFrame): DataFrame = { df.withColumn("method2
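A minimal sketch of one way to combine the outputs, assuming each of the three classes exposes its own addCol(df) method (the object names and the second and third calculations are hypothetical): chain the methods with DataFrame.transform so every step appends its column to the running result.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical stand-ins for the three classes mentioned in the question.
    object Method1 { def addCol(df: DataFrame): DataFrame = df.withColumn("method1", col("num1") / col("num2")) }
    object Method2 { def addCol(df: DataFrame): DataFrame = df.withColumn("method2", col("num1") * col("num2")) }
    object Method3 { def addCol(df: DataFrame): DataFrame = df.withColumn("method3", col("num1") + col("num2")) }

    // df is the two-column DataFrame defined in the question.
    val result = df
      .transform(Method1.addCol)
      .transform(Method2.addCol)
      .transform(Method3.addCol)
    // result keeps num1 and num2 and carries all three added columns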

Spark expression: rename the column list after aggregation

余生长醉 submitted on 2020-08-05 07:14:37
Question: I have written the code below to group and aggregate the columns: val gmList = List("gc1","gc2","gc3") val aList = List("val1","val2","val3","val4","val5") val cype = "first" val exprs = aList.map((_ -> cype)).toMap df.groupBy(gmList.map(col): _*).agg(exprs).show but this creates columns with the aggregation name appended to every column name, as shown below, so I want to alias that name (first(val1) -> val1) and keep this code generic as part of exprs +----------+----------+-------------+------------
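Since the goal is to get first(val1) back as plain val1 while keeping the column list generic, a minimal sketch of one way to do it (assumed, not taken from an answer in the excerpt) is to build aliased Column expressions instead of the Map-based agg:

    import org.apache.spark.sql.functions.{col, first}

    // Build one aliased aggregate per column in aList, so the output column
    // keeps its original name instead of first(...).
    val aggExprs = aList.map(c => first(col(c)).alias(c))

    df.groupBy(gmList.map(col): _*)
      .agg(aggExprs.head, aggExprs.tail: _*)
      .show()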

How can I add multiple columns to a Spark DataFrame efficiently?

微笑、不失礼 submitted on 2020-08-03 07:08:22
Question: I have a set of column names and need to add those columns to an existing dataframe, which is also very large. I need to add all the columns from the set to the dataframe with StringType and a default null value. I am following the approach below, but I found that when the number of columns and the dataframe size are large, this affects my performance. Is there any better way to do this in Spark? Note: number of columns: ~500 import sparkSession.sqlContext.implicits._ var df = Seq( (1, "James"), (2, "Michael"),
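A minimal sketch of one common alternative (an assumption, not quoted from an answer): replace the per-column withColumn loop with a single select, so only one projection is added to the plan regardless of how many columns are appended. newColumns stands in for the question's set of ~500 names.

    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StringType

    // Hypothetical set of new column names; the real set has ~500 entries.
    val newColumns: Set[String] = Set("extra1", "extra2", "extra3")

    // One select: all existing columns plus each new column as a null String.
    val withNulls = df.select(
      df.columns.map(col) ++
        newColumns.toSeq.map(name => lit(null).cast(StringType).as(name)): _*
    )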

How to return rows with null values in a PySpark dataframe?

£可爱£侵袭症+ submitted on 2020-07-30 06:11:06
Question: I am trying to get the rows with null values from a PySpark dataframe. In pandas, I can achieve this using isnull() on the dataframe: df = df[df.isnull().any(axis=1)] But in the case of PySpark, when I run the command below it shows an AttributeError: df.filter(df.isNull()) AttributeError: 'DataFrame' object has no attribute 'isNull'. How can I get the rows with null values without checking each column? Answer 1: You can filter the rows with where, reduce and a list comprehension. For example,
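A minimal sketch of the approach the answer describes (where plus reduce over a list comprehension), keeping any row in which at least one column is null; df is the questioner's DataFrame:

    from functools import reduce
    from pyspark.sql.functions import col

    # OR together one isNull() predicate per column, then filter on the result.
    rows_with_nulls = df.where(
        reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
    )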

Copy current row, modify it and add a new row in Spark

耗尽温柔 submitted on 2020-07-30 04:25:55
Question: I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to copy the current row and create another row, modifying a few columns' data. How can this be achieved in Spark SQL? Ex: Given val data = List( ("20", "score", "school", 14 ,12), ("21", "score", "school", 13 , 13), ("22", "rate", "school", 11 ,14) ) val df = data.toDF("id", "code", "entity", "value1","value2") Current Output +---+-----+------+------+------+ | id| code|entity|value1|value2| +---+-----+------+------+------
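A minimal sketch of one way to do this (an assumption, since the excerpt stops before the expected output): derive the modified copies from the original rows and union them back onto the DataFrame. Which rows are copied and how value1/value2 are changed is hypothetical here.

    import org.apache.spark.sql.functions.{col, lit}

    // Copy every "score" row and tweak its value columns.
    val copies = df
      .filter(col("code") === "score")
      .withColumn("value1", col("value1") + lit(1))
      .withColumn("value2", col("value2") + lit(1))

    // Same schema on both sides, so union simply appends the modified rows.
    val result = df.union(copies)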

Dataframe lookup and optimization

天大地大妈咪最大 submitted on 2020-07-25 03:48:11
Question: I am using spark-sql 2.4.3 with Java. I have the scenario below: val data = List( ("20", "score", "school", 14 ,12), ("21", "score", "school", 13 , 13), ("22", "rate", "school", 11 ,14), ("23", "score", "school", 11 ,14), ("24", "rate", "school", 12 ,12), ("25", "score", "school", 11 ,14) ) val df = data.toDF("id", "code", "entity", "value1","value2") df.show //this lookup data is populated from the DB. val ll = List( ("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14) ) val codeValudeDf = ll.toDF( "code"
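The excerpt is cut off before the lookup DataFrame's second column name and the desired output, so the sketch below is an assumption: it renames the lookup columns positionally, broadcasts the small lookup table, and joins it against value1 rather than collecting it to the driver.

    import org.apache.spark.sql.functions.broadcast

    // Assumed column names for the lookup table built from the DB list.
    val lookup = codeValudeDf.toDF("lookup_code", "lookup_value")

    // The lookup data is tiny, so broadcast it to avoid a shuffle.
    val joined = df.join(
      broadcast(lookup),
      df("value1") === lookup("lookup_value"),
      "left"
    )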