pyspark

Pyspark create dictionary within groupby

Submitted by 陌路散爱 on 2020-06-27 17:10:22
Question: Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example:

    import pyspark
    from pyspark.sql import Row
    import pyspark.sql.functions as F

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)

    toy_data = spark.createDataFrame([
        Row(id=1, key='a', value="123"),
        Row(id=1, key='b', value="234"),
        Row(id=1, key='c', value="345"),
        Row(id=2, key='a', value="12"),
        Row(id=2, key='x', value="23"),
        Row(id=2, key='y', value="123")])

    toy_data.show()
    +---+---+----- …
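The excerpt ends at the sample output; a hedged sketch of one way this is commonly done on Spark 2.4+ (not taken from the original answer, which is missing here): collect each group's key/value pairs into structs and fold them into a MapType column with map_from_entries.

    import pyspark.sql.functions as F

    # One map per id: gather (key, value) structs, then convert the
    # collected list into a single MapType column.
    result = toy_data.groupBy("id").agg(
        F.map_from_entries(
            F.collect_list(F.struct("key", "value"))
        ).alias("key_value_map"))

    result.show(truncate=False)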

Pyspark UDF AttributeError: 'NoneType' object has no attribute '_jvm'

Submitted by 情到浓时终转凉″ on 2020-06-27 17:01:05
Question: I have a udf function:

    @staticmethod
    @F.udf("array<int>")
    def create_users_array(val):
        """ Takes column of ints, returns column of arrays containing ints. """
        return [val for _ in range(val)]

I call it like so:

    df.withColumn("myArray", create_users_array(df["myNumber"]))

I pass it a dataframe column of integers, and it returns an array of that integer, e.g. 4 --> [4, 4, 4, 4]. It was working until we upgraded from Python 2.7 and upgraded our EMR version (which I believe uses Pyspark 2.3). Anyone …
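A hedged note, since the answer is missing from this excerpt: this AttributeError typically means F.udf ran before any SparkContext existed (for example, as a decorator evaluated at class-definition or import time), so pyspark's internal JVM gateway was still None. A minimal sketch of the usual workaround, wrapping a plain function only after the session is up:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    def create_users_array(val):
        """Takes an int, returns a list repeating that int val times."""
        return [val for _ in range(val)]

    spark = SparkSession.builder.getOrCreate()

    # Wrap the plain function only now, after the SparkContext (and the
    # JVM gateway behind F.udf) exists.
    create_users_array_udf = F.udf(create_users_array, "array<int>")

    df = spark.createDataFrame([(4,)], ["myNumber"])
    df.withColumn("myArray", create_users_array_udf(df["myNumber"])).show()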

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

Submitted by [亡魂溺海] on 2020-06-27 17:00:29
Question: I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), zero is scored. The DataFrames look like this:

    df1.select("ID", "MeaningfulWords").show …
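The excerpt stops at the sample data; a hedged sketch of one standard approach, assuming df2 has columns word and score (those names are guesses, not from the original): explode the token list, left-join the dictionary, score missing words as zero, then re-aggregate per record.

    import pyspark.sql.functions as F

    # One row per (record, token).
    exploded = df1.select("ID", F.explode("MeaningfulWords").alias("word"))

    # Left join against the dictionary; tokens missing from df2 get null,
    # which coalesce turns into the required zero.
    scored = (exploded
              .join(df2, on="word", how="left")
              .withColumn("score", F.coalesce(F.col("score"), F.lit(0.0))))

    # Re-aggregate per record: list of scores plus their mean.
    result = scored.groupBy("ID").agg(
        F.collect_list("score").alias("scores"),
        F.avg("score").alias("mean_score"))

    result.show(truncate=False)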

In spark, is it possible to reuse a DataFrame's execution plan to apply it to different data sources

Submitted by 会有一股神秘感。 on 2020-06-27 12:45:30
Question: I have a somewhat complex pyspark pipeline which takes 20 minutes to come up with an execution plan. Since I have to execute the same pipeline multiple times with a different DataFrame as the source, I'm wondering whether there is any option to avoid building the execution plan every time: build the execution plan once and reuse it with different source data?

Answer 1: There is a way to do what you ask, but it requires an advanced understanding of Spark internals. Spark plans are simply trees of objects. These trees are …
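The answer above is cut off where it starts describing plan-tree manipulation. Short of rewriting logical plans, a common lighter-weight pattern is to factor the pipeline into a function of its source DataFrame; Spark still re-analyzes the plan for each source, but the transformation code is defined once. A minimal sketch (build_pipeline and the column names are hypothetical):

    import pyspark.sql.functions as F
    from pyspark.sql import DataFrame

    def build_pipeline(source: DataFrame) -> DataFrame:
        """Apply one fixed transformation chain to any source DataFrame.
        Note: Spark still builds a fresh plan for each source."""
        return (source
                .filter(F.col("value").isNotNull())
                .withColumn("value_doubled", F.col("value") * 2))

    # Reuse with different sources:
    # result_a = build_pipeline(df_a)
    # result_b = build_pipeline(df_b)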

java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from database in pyspark

Submitted by 元气小坏坏 on 2020-06-27 06:30:09
Question: I'm very new to pyspark/Apache Spark. I need to fetch multiple tables from a database on a server, each containing around 120 million rows or more, for analysis, and I should be able to perform computations on the data. I am running pyspark on a server acting as both master and slave, with 7.45 GB of RAM. I have installed the JDBC driver, and this is the code that I've used:

    from pyspark.sql import SQLContext
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc …
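The code excerpt is truncated, but heap errors on large JDBC reads usually mean the whole table is being pulled through a single connection (or collected to the driver). A hedged sketch of the usual remedy, with placeholder URL, table, credentials, and id bounds: partition the read over a numeric column and set a moderate fetch size.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Connection details are placeholders; lowerBound/upperBound would
    # normally come from SELECT MIN(id), MAX(id) on the source table.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://host:3306/db")
          .option("dbtable", "big_table")
          .option("user", "user")
          .option("password", "password")
          .option("partitionColumn", "id")    # numeric column to split on
          .option("lowerBound", "1")
          .option("upperBound", "120000000")
          .option("numPartitions", "100")     # ~1.2M rows per partition
          .option("fetchsize", "10000")       # rows per DB round trip
          .load())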

Modify a struct column in spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2020-06-27 04:17:13
Question: I have a pyspark dataframe which contains a column "student" as follows:

    "student" : {
        "name" : "kaleem",
        "rollno" : "12"
    }

The schema for this column in the dataframe is:

    structType(List(
        name: String,
        rollno: String))

I need to modify this column to:

    "student" : {
        "student_details" : {
            "name" : "kaleem",
            "rollno" : "12"
        }
    }

so that its schema in the dataframe becomes:

    structType(List(
        student_details: structType(List(
            name: String,
            rollno: String))))

How to do this in Spark?

Answer 1: Use the named_struct function to …
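The answer breaks off after naming named_struct; a hedged pyspark equivalent of the same idea is to wrap the existing struct one level deeper with struct and an alias:

    import pyspark.sql.functions as F

    # Replace "student" with a new outer struct whose single field,
    # student_details, is the original struct.
    df2 = df.withColumn(
        "student",
        F.struct(F.col("student").alias("student_details")))

    df2.printSchema()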

Processing Data on Spark Structured Streaming before outputting to the console

Submitted by 萝らか妹 on 2020-06-26 09:57:07
Question: I'll try to keep it simple. I periodically read some data from a Kafka producer and output the following using Spark Structured Streaming. I have data that outputs like this:

    +------------------------------------------+-------------------+--------------+-----------------+
    |window                                    |timestamp          |Online_Emp    |Available_Emp    |
    +------------------------------------------+-------------------+--------------+-----------------+
    |[2017-12-31 16:01:00, 2017-12-31 16:02:00]|2017-12-31 16:01:27|1             |0                |
    |[2017 …
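The question is cut off at the sample output; a hedged sketch of the kind of windowed Kafka-to-console query that yields output shaped like the above (topic name, servers, and the value schema are placeholders):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Placeholder Kafka source; the real value payload would be parsed
    # according to its actual schema instead of a bare string cast.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "employees")
              .load()
              .selectExpr("CAST(value AS STRING) AS emp_id", "timestamp"))

    # One-minute tumbling windows, with a watermark so old state is dropped.
    counts = (events
              .withWatermark("timestamp", "2 minutes")
              .groupBy(F.window("timestamp", "1 minute"))
              .agg(F.approx_count_distinct("emp_id").alias("Online_Emp")))

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .option("truncate", "false")
             .start())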

pyspark to_timestamp does not include milliseconds

Submitted by 瘦欲@ on 2020-06-26 04:02:43
Question: I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152? I have looked at the documentation and followed SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function. This is my dataframe:

    +--------------------------+
    |updated_date              |
    +--------------------------+
    |2019-01-04 11:09:21.152815|
    +--------------------------+

I use the millisecond format without any …
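The question is truncated, but a widely reported workaround for Spark versions where to_timestamp drops the fractional seconds is to cast the string column to timestamp directly, which keeps sub-second precision; a hedged sketch:

    import pyspark.sql.functions as F

    # A plain cast preserves the fractional seconds that to_timestamp
    # truncates on some Spark 2.x versions.
    df2 = df.withColumn("updated_ts", F.col("updated_date").cast("timestamp"))

    # For display with exactly three fractional digits, trim the string form:
    # "2019-01-04 11:09:21.152815" -> "2019-01-04 11:09:21.152"
    df2 = df2.withColumn(
        "updated_ms",
        F.substring(F.col("updated_ts").cast("string"), 1, 23))

    df2.show(truncate=False)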

pyspark create dictionary from data in two columns

Submitted by 大兔子大兔子 on 2020-06-25 10:05:46
Question: I have a pyspark dataframe with two columns:

    [Row(zip_code='58542', dma='MIN'),
     Row(zip_code='58701', dma='MIN'),
     Row(zip_code='57632', dma='MIN'),
     Row(zip_code='58734', dma='MIN')]

How can I make a key:value pair out of the data inside the columns? E.g.:

    {
      "58542": "MIN",
      "58701": "MIN",
      ...
    }

I would like to avoid using collect for performance reasons. I've tried a few things but can't seem to get just the values.

Answer 1: As Ankin says, you can use a MapType for this:

    import pyspark
    from …
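The answer is cut off after its imports; a hedged sketch of the MapType idea on Spark 2.4+, keeping all the pairs in a single map column on the executors (a plain Python dict would still need one small collect at the end):

    import pyspark.sql.functions as F

    # Fold every (zip_code, dma) pair into one MapType column.
    mapped = df.agg(
        F.map_from_entries(
            F.collect_list(F.struct("zip_code", "dma"))
        ).alias("zip_to_dma"))

    mapped.show(truncate=False)

    # If a Python dict is ultimately required:
    # zip_dict = mapped.first()["zip_to_dma"]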