apache-spark-sql

DataFrame lookup and optimization

半世苍凉 submitted on 2020-07-25 03:45:18
Question: I am using spark-sql-2.4.3v with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate", "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show

// this lookup data is populated from the DB.
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF("code"
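The question body is truncated above, but for a small lookup table pulled from a database the usual optimization is a broadcast join, so the large DataFrame is never shuffled. A minimal PySpark sketch under that assumption (the original code is Scala, and the exact join condition on value1 is guessed, since the question is cut off):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lookup-example").getOrCreate()

df = spark.createDataFrame(
    [("20", "score", "school", 14, 12),
     ("21", "score", "school", 13, 13),
     ("22", "rate", "school", 11, 14)],
    ["id", "code", "entity", "value1", "value2"])

code_value_df = spark.createDataFrame(
    [("aaaa", 11), ("aaa", 12), ("aa", 13), ("a", 14)],
    ["code", "value"])

# Broadcast the small lookup table so every executor joins against a
# local copy and the large DataFrame is not shuffled.
result = (df.join(F.broadcast(code_value_df),
                  df["value1"] == code_value_df["value"], "left")
            .select(df["id"], df["entity"],
                    code_value_df["code"].alias("value1_code")))
result.show()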

Spark: combine multiple rows into a single row based on a specific column without a groupBy operation

a 夏天 submitted on 2020-07-20 04:31:06
Question: I have a Spark data frame like below with 7k columns.

+---+----+----+----+----+----+----+
| id|   1|   2|   3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
|  2|null|null|null| 102| 202| 302|
|  4|null|null|null| 104| 204| 304|
|  1|null|null|null| 101| 201| 301|
|  3|null|null|null| 103| 203| 303|
|  1|  11|  21|  31|null|null|null|
|  2|  12|  22|  32|null|null|null|
|  4|  14|  24|  34|null|null|null|
|  3|  13|  23|  33|null|null|null|
+---+----+----+----+----+----+----+

I wanted to transform the data frame like
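The expected output is cut off above, but the target shape appears to be one merged row per id. For context, the standard aggregation-based merge (which the question is explicitly trying to avoid because of the groupBy) looks like the sketch below; the numeric column names from the question are renamed c1..c3 here:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("merge-split-rows").getOrCreate()

df = spark.createDataFrame(
    [(2, None, None, None, 102, 202, 302),
     (2, 12, 22, 32, None, None, None),
     (1, None, None, None, 101, 201, 301),
     (1, 11, 21, 31, None, None, None)],
    "id int, c1 int, c2 int, c3 int, sf_1 int, sf_2 int, sf_3 int")

# For every non-key column, keep the first non-null value per id; the
# aggregation list is built programmatically, which matters with 7k columns.
value_cols = [c for c in df.columns if c != "id"]
merged = df.groupBy("id").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in value_cols])
merged.show()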

Create Spark DataFrame from Pandas DataFrame

笑着哭i submitted on 2020-07-18 21:09:09
Question: I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

import pandas as pd

pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

Up to this point everything is OK. The output is:

root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

spark_df.show()

This is the result:

An error occurred while calling o158.collectToPython.
: org.apache
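The stack trace is cut off above, so the root cause is unclear (with this setup it is often an executor-side Python or environment mismatch rather than the conversion itself). A sketch that at least rules out schema inference issues by passing an explicit schema when converting from pandas (it uses a SparkSession rather than the question's sqlContext):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})

# An explicit schema avoids type inference on the pandas columns.
schema = StructType([StructField("Letters", StringType(), True)])
spark_df = spark.createDataFrame(pandas_df, schema=schema)
spark_df.show()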

How to create a udf in PySpark which returns an array of strings?

不打扰是莪最后的温柔 submitted on 2020-07-17 07:24:18
Question: I have a udf which returns a list of strings. This should not be too hard. I pass in the datatype when executing the udf since it returns an array of strings: ArrayType(StringType). Now, somehow this is not working. The dataframe I'm operating on is df_subsets_concat and looks like this:

df_subsets_concat.show(3, False)

+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is from
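The question's UDF code is cut off, but a working pattern for an array-of-strings UDF is below; a frequent gotcha is passing the instantiated ArrayType(StringType()) rather than the bare classes. The lambda body here is made up purely for illustration:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("array-udf").getOrCreate()

df_subsets_concat = spark.createDataFrame(
    [("oculunt",), ("predistposed",), ("incredulous",)], ["col1"])

# The Python function returns a plain list of strings; the declared
# return type must be ArrayType(StringType()).
make_variants = F.udf(lambda s: [s, s.upper()], ArrayType(StringType()))

df_subsets_concat.withColumn("variants", make_variants("col1")).show(truncate=False)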

Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

陌路散爱 submitted on 2020-07-17 05:50:10
Question: I have two dataframes, df1 and df2, and I want to join these tables many times on a high-cardinality field called visitor_id. I would like to perform only one initial shuffle and have all the joins take place without shuffling/exchanging data between Spark executors. To do so, I have created another column called visitor_partition that consistently assigns each visitor_id a random value between [0, 1000). I have used a custom partitioner to ensure that df1 and df2 are exactly partitioned
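The question is cut off before the custom-partitioner details. A commonly used alternative for repeated shuffle-free joins is to persist both sides as bucketed, sorted tables on the join key; a sketch under that assumption (table names, bucket count, and sample data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed-join").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["visitor_id", "x"])
df2 = spark.createDataFrame([(1, "c"), (2, "d")], ["visitor_id", "y"])

# Persist both sides bucketed and sorted on the join key; later joins on
# that key can then reuse the bucketing instead of shuffling.
for name, df in [("df1_bucketed", df1), ("df2_bucketed", df2)]:
    (df.write.mode("overwrite")
       .bucketBy(8, "visitor_id").sortBy("visitor_id")
       .saveAsTable(name))

joined = spark.table("df1_bucketed").join(spark.table("df2_bucketed"), "visitor_id")
joined.explain()  # with matching bucket counts, the plan should show no Exchange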

Using Scala classes as UDF with pyspark

試著忘記壹切 submitted on 2020-07-16 00:45:13
Question: I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):

package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0

  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}

Now I'm attempting to use this class from
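The question is truncated before the PySpark side. Since SomeFun implements org.apache.spark.sql.api.java.UDF1, it can be registered from PySpark with spark.udf.registerJavaFunction, provided the compiled jar is on the classpath (for example via spark-submit --jars; the jar name is not shown here). A sketch:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("scala-udf-from-pyspark").getOrCreate()

# Register the Scala UDF1 implementation under a SQL function name.
spark.udf.registerJavaFunction("some_fun", "mwe.SomeFun", IntegerType())

df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
df.selectExpr("some_fun(value) AS result").show()

Note that prop is state inside each executor JVM's copy of the UDF, so its value is not shared across executors.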

How to drop rows with too many NULL values?

对着背影说爱祢 submitted on 2020-07-15 05:23:07
Question: I want to do some preprocessing on my data, and I want to drop the rows that are sparse (for some threshold value). For example, if I have a dataframe with 10 features and a row has 8 null values, then I want to drop it. I found some related topics, but I cannot find any useful information for my purpose. stackoverflow.com/questions/3473778/count-number-of-nulls-in-a-row Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically. I cannot
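The question continues past the cut-off, but a fully automatic, column-count-based version of this is already built in: DataFrame.dropna(thresh=N) drops rows with fewer than N non-null values, and N can be derived from len(df.columns). A small sketch where the allowed number of nulls per row is a parameter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-sparse-rows").getOrCreate()

df = spark.createDataFrame(
    [(1, None, None, None), (2, 20, 30, 40)],
    "id int, f1 int, f2 int, f3 int")

# Drop any row with more than `max_nulls` nulls, whatever the column count.
max_nulls = 2
cleaned = df.dropna(thresh=len(df.columns) - max_nulls)
cleaned.show()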

Why do I see multiple Spark installation directories?

北城余情 submitted on 2020-07-10 10:27:08
Question: I am working on an Ubuntu server which has Spark installed on it. I don't have sudo access to this server, so under my directory I created a new virtual environment where I installed pyspark. When I type the command below:

whereis spark-shell

# see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell
/home/abcd/.pyenv/shims/spark-shell2.cmd
/home/abcd/.pyenv/shims/spark-shell.cmd
/home/abcd/
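The question is cut off, but the two groups of paths usually mean two separate installations: the system-wide Spark under /opt and the pip-installed pyspark package that pyenv exposes through its shims. A quick diagnostic (run inside the virtual environment) to see which one a given Python session actually uses:

import os
import pyspark
from pyspark.sql import SparkSession

print("pyspark package location:", os.path.dirname(pyspark.__file__))
print("pyspark version:", pyspark.__version__)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

spark = SparkSession.builder.appName("which-spark").getOrCreate()
print("Spark version in use:", spark.version)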

Type Casting Large number of Struct Fields to String using Pyspark

≯℡__Kan透↙ submitted on 2020-07-10 07:40:09
Question: I have a pyspark df whose schema looks like this:

root
 |-- company: struct (nullable = true)
 |    |-- 0: long (nullable = true)
 |    |-- 1: long (nullable = true)
 |    |-- 10: long (nullable = true)
 |    |-- 100: long (nullable = true)
 |    |-- 101: long (nullable = true)
 |    |-- 102: long (nullable = true)
 |    |-- 103: long (nullable = true)
 |    |-- 104: long (nullable = true)
 |    |-- 105: long (nullable = true)
 |    |-- 106: long (nullable = true)
 |    |-- 107: long (nullable = true)
 |    |-- 108: long (nullable = true)
 |    |-- 109:
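The schema listing is truncated, but with this many fields the usual approach is to rebuild the struct programmatically from its own schema, casting each field to string. A sketch on a three-field stand-in (the real company struct has many more numbered long fields):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("cast-struct-fields").getOrCreate()

# Tiny stand-in for the real dataframe.
inner = StructType([StructField(n, LongType(), True) for n in ["0", "1", "10"]])
schema = StructType([StructField("company", inner, True)])
df = spark.createDataFrame([((1, 2, 3),)], schema)

# Build the cast list from the struct's own schema, then reassemble it.
fields = df.schema["company"].dataType.fields
casted = [F.col("company")[f.name].cast("string").alias(f.name) for f in fields]
result = df.withColumn("company", F.struct(*casted))
result.printSchema()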