apache-spark-sql

DataFrame lookup and optimization

半世苍凉 submitted on 2020-07-25 03:45:18
Question: I am using spark-sql-2.4.3v with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate", "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show

// this lookup data is populated from the DB.
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF("code"
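The question body is truncated above, but for a small lookup table pulled from a database the usual optimization is a broadcast join, so the large DataFrame is never shuffled. A minimal PySpark sketch under that assumption (the original code is Scala, and the exact join condition on value1 is guessed, since the question is cut off):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lookup-example").getOrCreate()

df = spark.createDataFrame(
    [("20", "score", "school", 14, 12),
     ("21", "score", "school", 13, 13),
     ("22", "rate", "school", 11, 14)],
    ["id", "code", "entity", "value1", "value2"])

code_value_df = spark.createDataFrame(
    [("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14)],
    ["code", "value"])

# Broadcast the small lookup table so every executor joins against a
# local copy and the large DataFrame is not shuffled.
result = (df.join(F.broadcast(code_value_df),
                  df["value1"] == code_value_df["value"], "left")
            .select(df["id"], df["entity"],
                    code_value_df["code"].alias("value1_code")))
result.show()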

Spark: combine multiple rows into a single row based on a specific column without a groupBy operation

a 夏天 submitted on 2020-07-20 04:31:06
Question: I have a Spark data frame like below with 7k columns.

+---+----+----+----+----+----+----+
| id|   1|   2|   3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
|  2|null|null|null| 102| 202| 302|
|  4|null|null|null| 104| 204| 304|
|  1|null|null|null| 101| 201| 301|
|  3|null|null|null| 103| 203| 303|
|  1|  11|  21|  31|null|null|null|
|  2|  12|  22|  32|null|null|null|
|  4|  14|  24|  34|null|null|null|
|  3|  13|  23|  33|null|null|null|
+---+----+----+----+----+----+----+

I wanted to transform the data frame like
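The expected output is cut off above, but the target shape appears to be one merged row per id. For context, the standard aggregation-based merge (which the question is explicitly trying to avoid because of the groupBy) looks like the sketch below; the numeric column names from the question are renamed c1..c3 here:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("merge-split-rows").getOrCreate()

df = spark.createDataFrame(
    [(2, None, None, None, 102, 202, 302),
     (2, 12, 22, 32, None, None, None),
     (1, None, None, None, 101, 201, 301),
     (1, 11, 21, 31, None, None, None)],
    "id int, c1 int, c2 int, c3 int, sf_1 int, sf_2 int, sf_3 int")

# For every non-key column, keep the first non-null value per id; the
# aggregation list is built programmatically, which matters with 7k columns.
value_cols = [c for c in df.columns if c != "id"]
merged = df.groupBy("id").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in value_cols])
merged.show()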

Create Spark DataFrame from Pandas DataFrame

笑着哭i submitted on 2020-07-18 21:09:09
Question: I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

import pandas as pd

pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

Up to this point everything is OK. The output is:

root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

spark_df.show()

This is the result:

An error occurred while calling o158.collectToPython.
: org.apache
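The stack trace is cut off above, so the root cause is unclear (with this setup it is often an executor-side Python or environment mismatch rather than the conversion itself). A sketch that at least rules out schema inference issues by passing an explicit schema when converting from pandas (it uses a SparkSession rather than the question's sqlContext):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})

# An explicit schema avoids type inference on the pandas columns.
schema = StructType([StructField("Letters", StringType(), True)])
spark_df = spark.createDataFrame(pandas_df, schema=schema)
spark_df.show()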

How to create a udf in PySpark which returns an array of strings?

不打扰是莪最后的温柔 submitted on 2020-07-17 07:24:18
Question: I have a udf which returns a list of strings. This should not be too hard. I pass in the datatype when executing the udf since it returns an array of strings: ArrayType(StringType). Now, somehow this is not working. The dataframe I'm operating on is df_subsets_concat and looks like this:

df_subsets_concat.show(3, False)

+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is from
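The question's UDF code is cut off, but a working pattern for an array-of-strings UDF is below; a frequent gotcha is passing the instantiated ArrayType(StringType()) rather than the bare classes. The lambda body here is made up purely for illustration:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("array-udf").getOrCreate()

df_subsets_concat = spark.createDataFrame(
    [("oculunt",), ("predistposed",), ("incredulous",)], ["col1"])

# The Python function returns a plain list of strings; the declared
# return type must be ArrayType(StringType()).
make_variants = F.udf(lambda s: [s, s.upper()], ArrayType(StringType()))

df_subsets_concat.withColumn("variants", make_variants("col1")).show(truncate=False)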

Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

陌路散爱 submitted on 2020-07-17 05:50:10
Question: I have two dataframes, df1 and df2, and I want to join these tables many times on a high-cardinality field called visitor_id. I would like to perform only one initial shuffle and have all the joins take place without shuffling/exchanging data between Spark executors. To do so, I have created another column called visitor_partition that consistently assigns each visitor_id a random value between [0, 1000). I have used a custom partitioner to ensure that df1 and df2 are exactly partitioned
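The question is cut off before the custom-partitioner details. A commonly used alternative for repeated shuffle-free joins is to persist both sides as bucketed, sorted tables on the join key; a sketch under that assumption (table names, bucket count, and sample data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed-join").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["visitor_id", "x"])
df2 = spark.createDataFrame([(1, "c"), (2, "d")], ["visitor_id", "y"])

# Persist both sides bucketed and sorted on the join key; later joins on
# that key can then reuse the bucketing instead of shuffling.
for name, df in [("df1_bucketed", df1), ("df2_bucketed", df2)]:
    (df.write.mode("overwrite")
       .bucketBy(8, "visitor_id").sortBy("visitor_id")
       .saveAsTable(name))

joined = spark.table("df1_bucketed").join(spark.table("df2_bucketed"), "visitor_id")
joined.explain()  # with matching bucket counts, the plan should show no Exchange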

Using Scala classes as UDF with pyspark

試著忘記壹切 submitted on 2020-07-16 00:45:13
Question: I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):

package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0

  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}

Now I'm attempting to use this class from
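The question is truncated before the PySpark side. Since SomeFun implements org.apache.spark.sql.api.java.UDF1, it can be registered from PySpark with spark.udf.registerJavaFunction, provided the compiled jar is on the classpath (for example via spark-submit --jars; the jar name is not shown here). A sketch:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("scala-udf-from-pyspark").getOrCreate()

# Register the Scala UDF1 implementation under a SQL function name.
spark.udf.registerJavaFunction("some_fun", "mwe.SomeFun", IntegerType())

df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
df.selectExpr("some_fun(value) AS result").show()

Note that prop is state inside each executor JVM's copy of the UDF, so its value is not shared across executors.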

How to drop rows with too many NULL values?

对着背影说爱祢 submitted on 2020-07-15 05:23:07
Question: I want to do some preprocessing on my data, and I want to drop the rows that are sparse (for some threshold value). For example, if I have a dataframe with 10 features and a row has 8 null values, then I want to drop it. I found some related topics, but I cannot find any useful information for my purpose. stackoverflow.com/questions/3473778/count-number-of-nulls-in-a-row Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically. I cannot
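The question continues past the cut-off, but a fully automatic, column-count-based version of this is already built in: DataFrame.dropna(thresh=N) drops rows with fewer than N non-null values, and N can be derived from len(df.columns). A small sketch where the allowed number of nulls per row is a parameter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-sparse-rows").getOrCreate()

df = spark.createDataFrame(
    [(1, None, None, None), (2, 20, 30, 40)],
    "id int, f1 int, f2 int, f3 int")

# Drop any row with more than `max_nulls` nulls, whatever the column count.
max_nulls = 2
cleaned = df.dropna(thresh=len(df.columns) - max_nulls)
cleaned.show()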

Why do I see multiple Spark installation directories?

北城余情 submitted on 2020-07-10 10:27:08
Question: I am working on an Ubuntu server which has Spark installed on it. I don't have sudo access to this server, so under my directory I created a new virtual environment where I installed pyspark. When I type the command below:

whereis spark-shell

# see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell
/home/abcd/.pyenv/shims/spark-shell2.cmd
/home/abcd/.pyenv/shims/spark-shell.cmd
/home/abcd/
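The question is cut off, but the two groups of paths usually mean two separate installations: the system-wide Spark under /opt and the pip-installed pyspark package that pyenv exposes through its shims. A quick diagnostic (run inside the virtual environment) to see which one a given Python session actually uses:

import os
import pyspark
from pyspark.sql import SparkSession

print("pyspark package location:", os.path.dirname(pyspark.__file__))
print("pyspark version:", pyspark.__version__)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

spark = SparkSession.builder.appName("which-spark").getOrCreate()
print("Spark version in use:", spark.version)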

Type Casting Large number of Struct Fields to String using Pyspark

≯℡__Kan透↙ submitted on 2020-07-10 07:40:09
Question: I have a pyspark df whose schema looks like this:

root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109:
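The schema listing is truncated, but with this many fields the usual approach is to rebuild the struct programmatically from its own schema, casting each field to string. A sketch on a three-field stand-in (the real company struct has many more numbered long fields):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("cast-struct-fields").getOrCreate()

# Tiny stand-in for the real dataframe.
inner = StructType([StructField(n, LongType(), True) for n in ["0", "1", "10"]])
schema = StructType([StructField("company", inner, True)])
df = spark.createDataFrame([((1, 2, 3),)], schema)

# Build the cast list from the struct's own schema, then reassemble it.
fields = df.schema["company"].dataType.fields
casted = [F.col("company")[f.name].cast("string").alias(f.name) for f in fields]
result = df.withColumn("company", F.struct(*casted))
result.printSchema()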