pyspark

Custom function over pyspark dataframe

Submitted by 孤人 on 2021-01-28 07:00:30
Question: I'm trying to apply a custom function over the rows of a pyspark dataframe. The function takes the row and two other vectors of the same dimension, and outputs the sum of the values of the third vector for each value in the row that matches the second vector.

import pandas as pd
import numpy as np

Function:

def V_sum(row, b, c):
    return float(np.sum(c[row == b]))

What I want to achieve is simple with pandas:

pd_df = pd.DataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], columns=['t1', 't2',
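The excerpt above is cut off, but row-wise functions like V_sum are usually ported to PySpark by wrapping them in a UDF applied to an array of the feature columns. A minimal sketch of that idea, assuming b and c are plain Python lists with one entry per feature column (the toy data and vector values here are illustrative, not taken from the original post):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the pandas example in the question.
df = spark.createDataFrame(
    [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 0)],
    ["t1", "t2", "t3", "t4"],
)

b = [1, 0, 1, 0]          # second vector (assumed values)
c = [2.0, 3.0, 5.0, 7.0]  # third vector, the values to sum (assumed values)

@F.udf(returnType=DoubleType())
def v_sum(row):
    # The UDF receives the array column as a Python list of column values;
    # sum the entries of c at positions where the row matches b.
    row = np.array(row)
    return float(np.sum(np.array(c)[row == np.array(b)]))

result = df.withColumn("V_sum", v_sum(F.array("t1", "t2", "t3", "t4")))
result.show()
```

For larger data, a pandas_udf over the same array column would cut the per-row Python overhead, but the plain UDF keeps the sketch closest to the original function.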

Filter pyspark dataframe based on time difference between two columns

Submitted by ぃ、小莉子 on 2021-01-28 06:42:41
Question: I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType. I would like to filter this dataframe to the rows where the time difference between these two columns is less than one hour. I'm currently trying to do this like so:

examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1))

But this fails with the following error message:

org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '(
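A common workaround for this mismatch (not necessarily the exact answer given in the original thread) is to do the comparison in epoch seconds, so the arithmetic stays inside Spark SQL instead of mixing a Python timedelta into a Column expression. With data being the dataframe from the question:

```python
from pyspark.sql import functions as F

# Casting a timestamp to long yields epoch seconds, so the difference between
# the two columns becomes plain integer arithmetic that Spark can resolve.
diff_seconds = F.col("tstamp").cast("long") - F.col("date").cast("long")

# Keep rows where the two timestamps are less than one hour apart.
examples = data.filter(diff_seconds < 3600)
```

If the difference can be negative, wrapping it in F.abs(...) before the comparison keeps the filter symmetric.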

Converting latitude and longitude to UTM coordinates in pyspark

Submitted by 旧时模样 on 2021-01-28 06:10:43
Question: I have a dataframe containing longitude and latitude coordinates for each point. I want to convert the geographical coordinates of each point to UTM coordinates. I tried to use the utm module (https://pypi.org/project/utm/):

import utm
df = df.withColumn('UTM', utm.from_latlon(fn.col('lat'), fn.col('lon')))

but I obtain this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-8b21f98738ca> in <module>()
----> 1
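utm.from_latlon expects plain numbers (or NumPy arrays), so passing Column objects raises this kind of error. The usual way around it is to wrap the call in a UDF so it runs per row. A sketch, keeping the fn alias from the question and assuming lat and lon are double columns (the struct field names are my own choice):

```python
import utm
from pyspark.sql import functions as fn
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

# utm.from_latlon returns (easting, northing, zone_number, zone_letter),
# so the UDF returns a struct with those four fields.
utm_schema = StructType([
    StructField("easting", DoubleType()),
    StructField("northing", DoubleType()),
    StructField("zone_number", IntegerType()),
    StructField("zone_letter", StringType()),
])

@fn.udf(returnType=utm_schema)
def to_utm(lat, lon):
    # Skip rows with missing coordinates rather than failing the whole job.
    if lat is None or lon is None:
        return None
    easting, northing, zone_number, zone_letter = utm.from_latlon(lat, lon)
    return float(easting), float(northing), int(zone_number), zone_letter

df = df.withColumn("UTM", to_utm(fn.col("lat"), fn.col("lon")))
```

Keeping the zone number and letter alongside easting/northing leaves the projection unambiguous when points span several UTM zones.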

Databricks Connect 6.4 not able to communicate with server anymore

Submitted by 浪尽此生 on 2021-01-28 06:10:19
Question: I am running PyCharm on my MacBook.
Client settings: Python Interpreter -> Python 3.7 (dtabricks-connect-6.4)
Cluster settings: Databricks Runtime Version -> 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
It worked well for months but suddenly, without any updates made, I can't run my Python script from PyCharm against the Databricks cluster anymore. The error is ...

Caused by: `java.lang.IllegalArgumentException: The cluster is running server version `dbr-6.4` but this client only supports Set(dbr
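A frequent cause of this message (offered as a guess, not a confirmed diagnosis of this particular case) is that the cluster image or the local databricks-connect package drifted so the two no longer match. A quick sanity check is to compare the installed client version against the runtime shown in the cluster settings:

```python
# Rough sanity check: compare the locally installed databricks-connect client
# version against the cluster runtime ("6.4" here, taken from the cluster
# settings quoted in the question).
import pkg_resources

expected_runtime = "6.4"
client_version = pkg_resources.get_distribution("databricks-connect").version
print("databricks-connect client:", client_version)

if not client_version.startswith(expected_runtime):
    print("Client/runtime mismatch: reinstall a matching client, "
          "e.g. pip install -U 'databricks-connect==6.4.*'")
```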

PySpark - iterate rows of a Data Frame

Submitted by 十年热恋 on 2021-01-28 06:05:50
Question: I need to iterate the rows of a pyspark.sql.dataframe.DataFrame. I have done it in pandas in the past with the function iterrows(), but I need to find something similar for pyspark without using pandas. If I do for row in myDF: it iterates over the columns. Thanks.

Answer 1: You can use the select method to operate on your dataframe using a user-defined function, something like this:

columns = header.columns
my_udf = F.udf(lambda data: "do what ever you want here ", StringType())
myDF.select(*[my
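If the rows really have to be visited one by one on the driver (rather than transformed with a UDF as in the answer above), collect() and toLocalIterator() are the usual tools. A small sketch, with myDF being the dataframe from the question and the column name and do_something placeholder being hypothetical:

```python
# toLocalIterator() streams the rows partition by partition, which is safer
# than collect() when the dataframe is large.
for row in myDF.toLocalIterator():
    print(row["some_column"])  # a Row can be indexed by column name

# For small dataframes, pulling everything to the driver at once is simpler.
for row in myDF.collect():
    do_something(row)
```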

Adding a List element as a column to existing pyspark dataframe

Submitted by 拜拜、爱过 on 2021-01-28 06:02:19
Question: I have a list lists=[0,1,2,3,5,6,7]. The order is not sequential. I have a pyspark dataframe with 9 columns.

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|
|2019-02-01 05:29:17|     NaN|     NaN|
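The excerpt is cut off, but a common way to attach a Python list as a new column is to turn the list into a small dataframe keyed by position and join it on the dataframe's existing index column. A sketch assuming index runs from 0 with one list value per row, and with df standing for the dataframe shown above (the column name list_value is my own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lists = [0, 1, 2, 3, 5, 6, 7]

# One row per list element, keyed by its position in the list.
list_df = spark.createDataFrame(
    [(i, v) for i, v in enumerate(lists)],
    ["index", "list_value"],
)

# Join on the existing `index` column to line list values up with rows.
df_with_list = df.join(list_df, on="index", how="left")
```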

Spark task runs on only one executor

Submitted by 本小妞迷上赌 on 2021-01-28 06:01:32
Question: Hello everyone. First and foremost, I'm aware of the existence of this thread, "Task is running on only one executor in spark". However, this is not my case, as I'm using repartition(n) on my dataframe. Basically, I'm loading a DataFrame by fetching data from an Elasticsearch index through Spark as follows:

spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
es
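One thing worth checking in this situation (a general diagnostic, not necessarily the resolution of the original thread) is how many partitions the Elasticsearch read actually produces: the connector creates roughly one partition per shard, and repartition() only spreads out the stages after the shuffle it introduces. A sketch, with the host and index names as placeholders:

```python
# Read from Elasticsearch via the elasticsearch-hadoop connector and inspect
# the partitioning; a single-shard index yields a single input partition.
es_df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host:9200")   # placeholder host
    .load("my-index")                     # placeholder index name
)
print("partitions after read:", es_df.rdd.getNumPartitions())

# Spread the downstream work across more tasks (and hence more executors).
es_df = es_df.repartition(16)
print("partitions after repartition:", es_df.rdd.getNumPartitions())
```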

cannot resolve column due to data type mismatch PySpark

Submitted by 妖精的绣舞 on 2021-01-28 05:11:16
Question: Error being faced in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;
'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]
+- Relation[result_parameters#517,result_set#518] json
"

Data structure:

-- result_set: struct (nullable = true)
 |    |-- currency:
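The error means trackers resolved to an array, so indexing it with the string 'token' fails (array indices must be integers). A common way to pull out the nested field is to explode the arrays and then select the struct field. This is only a sketch, since the full schema is cut off above: the nesting of dates and trackers is assumed, and df stands for the dataframe read from the JSON source.

```python
from pyspark.sql import functions as F

# Explode the nested arrays step by step, then read the struct field directly.
tokens = (
    df
    .select(F.explode("result_set.dates").alias("date_entry"))
    .select(F.explode("date_entry.trackers").alias("tracker"))
    .select(F.col("tracker.token").alias("token"))
)
tokens.show(truncate=False)
```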

How can I get a distinct RDD of dicts in PySpark?

Submitted by 人走茶凉 on 2021-01-28 04:10:20
Question: I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call rdd.distinct(), PySpark gives me the following error:

TypeError: unhashable type: 'dict'
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD
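Since distinct() has to hash its elements and dict is unhashable, the usual trick is to map each dictionary to a hashable, order-independent key, deduplicate on that key, and then recover the dicts. A sketch assuming the dictionary values are themselves hashable (strings, numbers, tuples), with rdd being the RDD from the question:

```python
# frozenset(d.items()) is hashable and ignores key order, so identical dicts
# collapse onto the same key; reduceByKey keeps one representative per key.
distinct_rdd = (
    rdd
    .map(lambda d: (frozenset(d.items()), d))
    .reduceByKey(lambda a, b: a)
    .values()
)
```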