pyspark

Convert a standard Python key-value dictionary list to a PySpark DataFrame

Submitted by 一曲冷凌霜 on 2020-04-07 13:51:09
Question: Suppose I have a list of Python dictionaries (key-value pairs) whose keys correspond to the column names of a table. For the list below, how can I convert it into a PySpark DataFrame with the two columns arg1 and arg2? [{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}] How can I use the following construct to do it? df = sc.parallelize([ ... ]).toDF() Where do arg1 and arg2 go in the code above (...)? Answer 1: Old way: sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", …
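
A minimal sketch of the technique (my own code, not the truncated answer's; assumes a running SparkSession): the column names come from the dictionary keys, so arg1/arg2 never need to be placed explicitly.

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = [{"arg1": "", "arg2": ""},
            {"arg1": "", "arg2": ""},
            {"arg1": "", "arg2": ""}]

    # Modern approach: infer the schema directly from the dicts.
    df = spark.createDataFrame(data)

    # RDD-based variant close to the construct in the question; converting each
    # dict to a Row avoids the deprecated "dicts straight into toDF()" path.
    df2 = spark.sparkContext.parallelize(data).map(lambda d: Row(**d)).toDF()

    df.show()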

Finding the number of users associated with the Hive database [closed]

Submitted by 99封情书 on 2020-04-07 10:31:10
Question: (Closed: this question needs details or clarity and is not currently accepting answers.) Please let me know how to find the number of users assigned to the databases in Hive. Source: https://stackoverflow.com/questions/60998388/finding-the-number-of-users-associated-with-the-hive-database

How to sort values before concatenating text columns in PySpark

Submitted by 纵然是瞬间 on 2020-04-07 08:00:13
Question: I need help converting the code below to PySpark (or PySpark SQL) code. df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1) It basically adds a new column named full_name that concatenates the values of the columns first and last in sorted order. I have written the code below, but I don't know how to sort the columns' text values. df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last'))) Answer 1: From Spark 2.4+ we can use array…
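
A minimal sketch in the spirit of the truncated answer, assuming Spark 2.4+ (where array_sort and array_join are built in): put the two values in an array, sort it, then join with an underscore, mirroring the pandas apply above.

    from pyspark.sql import functions as F

    df = df.withColumn(
        "full_name",
        F.array_join(F.array_sort(F.array(F.col("first"), F.col("last"))), "_")
    )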

Getting the last value of a group in Spark

Submitted by 给你一囗甜甜゛ on 2020-04-07 03:44:12
Question: I have a SparkR DataFrame built as shown below: #Create R data.frame custId <- c(rep(1001, 5), rep(1002, 3), 1003) date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01') desc <- c('New','New','Good','New','Bad','New','Good','Good','New') newcust <- c(1,1,0,1,0,1,0,0,1) df <- data.frame(custId, date, desc, newcust) #Create SparkR DataFrame df <- createDataFrame(df) display(df) custId | date | desc | newcust …
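
The question is about SparkR, but since the rest of this page is PySpark, here is a hedged PySpark sketch of one common approach, assuming df is an equivalent PySpark DataFrame and that the goal is the most recent row per custId when ordered by date: rank rows within each group with a window function and keep the first one.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Newest date first within each customer group.
    w = Window.partitionBy("custId").orderBy(F.col("date").desc())

    last_per_cust = (df.withColumn("rn", F.row_number().over(w))
                       .filter(F.col("rn") == 1)
                       .drop("rn"))
    last_per_cust.show()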

How to count unique IDs after groupBy in PySpark

Submitted by て烟熏妆下的殇ゞ on 2020-04-05 15:41:49
Question: I'm using the following code to aggregate students per year. The goal is to know the total number of students for each year. from pyspark.sql.functions import col import pyspark.sql.functions as fn gr = Df2.groupby(['Year']) df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year')) The result is shown in the linked screenshot (students by year). The problem I discovered is that many IDs are repeated, so the result is wrong and far too large. I want to aggregate the students by year and count the total…
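
A minimal sketch of the usual fix (my own code, not a quoted answer): count distinct Student_IDs per year so that repeated IDs are only counted once.

    import pyspark.sql.functions as fn

    df_grouped = (Df2.groupBy("Year")
                     .agg(fn.countDistinct("Student_ID").alias("total_student_by_year")))
    df_grouped.show()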

How to extract floats from vector columns in PySpark?

Submitted by 风格不统一 on 2020-03-28 06:40:25
Question: My Spark DataFrame has data in the following format: printSchema() shows that each column is of type vector. I tried to get the values out of [ and ] using the code below (for one column, col1): from pyspark.sql.functions import udf from pyspark.sql.types import FloatType firstelement=udf(lambda v:float(v[0]),FloatType()) df.select(firstelement('col1')).show() However, how can I apply it to all columns of df? Answer 1: 1. Extract the first element of a single vector column: To get the first…
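
A hedged sketch that simply extends the question's own UDF to every column, assuming all columns of df really are vector columns as the schema suggests; on Spark 3.0+, pyspark.ml.functions.vector_to_array is a UDF-free alternative.

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import FloatType

    # Extract the first float of each vector, keeping the original column names.
    firstelement = udf(lambda v: float(v[0]), FloatType())

    df_floats = df.select([firstelement(col(c)).alias(c) for c in df.columns])
    df_floats.show()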

PySpark: Output of OneHotEncoder looks odd [duplicate]

Submitted by こ雲淡風輕ζ on 2020-03-25 18:23:16
Question: (Duplicate: this question already has an answer at "Spark ML VectorAssembler returns strange output"; closed 2 years ago.) The Spark documentation contains a PySpark example for its OneHotEncoder: from pyspark.ml.feature import OneHotEncoder, StringIndexer df = spark.createDataFrame([ (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c") ], ["id", "category"]) stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex") model = stringIndexer.fit(df) indexed = model…
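
A hedged, self-contained completion of a docs-style example (my code, assuming Spark 3.x, where OneHotEncoder is an estimator with fit/transform). The "odd" output is just Spark's sparse-vector notation: (size, [indices], [values]).

    from pyspark.ml.feature import OneHotEncoder, StringIndexer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
        ["id", "category"])

    # Map category strings to numeric indices (most frequent value gets index 0).
    indexed = (StringIndexer(inputCol="category", outputCol="categoryIndex")
               .fit(df).transform(df))

    # One-hot encode the index; dropLast=True (the default) omits the last index.
    encoded = (OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])
               .fit(indexed).transform(indexed))

    encoded.show(truncate=False)
    # e.g. categoryVec = (2,[0],[1.0]) is a 2-element vector with 1.0 at index 0;
    # the rarest category ("b", index 2) is dropped and shows up as (2,[],[]).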