pyspark

Custom function over pyspark dataframe

Submitted by 孤人 on 2021-01-28 07:00:30
Question: I'm trying to apply a custom function over the rows of a pyspark dataframe. The function takes the row and two other vectors of the same dimension, and outputs the sum of the values of the third vector for each value in the row that matches the second vector.

import pandas as pd
import numpy as np

Function:

def V_sum(row, b, c):
    return float(np.sum(c[row == b]))

What I want to achieve is simple with pandas:

pd_df = pd.DataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], columns=['t1', 't2',
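The excerpt above is cut off, but row-wise functions like V_sum are usually ported to PySpark by wrapping them in a UDF applied to an array of the feature columns. A minimal sketch of that idea, assuming b and c are plain Python lists with one entry per feature column (the toy data and vector values here are illustrative, not taken from the original post):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the pandas example in the question.
df = spark.createDataFrame(
    [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 0)],
    ["t1", "t2", "t3", "t4"],
)

b = [1, 0, 1, 0]          # second vector (assumed values)
c = [2.0, 3.0, 5.0, 7.0]  # third vector, the values to sum (assumed values)

@F.udf(returnType=DoubleType())
def v_sum(row):
    # The UDF receives the array column as a Python list of column values;
    # sum the entries of c at positions where the row matches b.
    row = np.array(row)
    return float(np.sum(np.array(c)[row == np.array(b)]))

result = df.withColumn("V_sum", v_sum(F.array("t1", "t2", "t3", "t4")))
result.show()
```

For larger data, a pandas_udf over the same array column would cut the per-row Python overhead, but the plain UDF keeps the sketch closest to the original function.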

Filter pyspark dataframe based on time difference between two columns

Submitted by ぃ、小莉子 on 2021-01-28 06:42:41
Question: I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType. I would like to filter this dataframe to the rows where the time difference between these two columns is less than one hour. I'm currently trying to do this like so:

examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1))

But this fails with the following error message:

org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '(
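A common workaround for this mismatch (not necessarily the exact answer given in the original thread) is to do the comparison in epoch seconds, so the arithmetic stays inside Spark SQL instead of mixing a Python timedelta into a Column expression. With data being the dataframe from the question:

```python
from pyspark.sql import functions as F

# Casting a timestamp to long yields epoch seconds, so the difference between
# the two columns becomes plain integer arithmetic that Spark can resolve.
diff_seconds = F.col("tstamp").cast("long") - F.col("date").cast("long")

# Keep rows where the two timestamps are less than one hour apart.
examples = data.filter(diff_seconds < 3600)
```

If the difference can be negative, wrapping it in F.abs(...) before the comparison keeps the filter symmetric.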

Converting latitude and longitude to UTM coordinates in pyspark

Submitted by 旧时模样 on 2021-01-28 06:10:43
Question: I have a dataframe containing longitude and latitude coordinates for each point. I want to convert the geographical coordinates of each point to UTM coordinates. I tried to use the utm module (https://pypi.org/project/utm/):

import utm
df = df.withColumn('UTM', utm.from_latlon(fn.col('lat'), fn.col('lon')))

but I obtain this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-8b21f98738ca> in <module>()
----> 1
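utm.from_latlon expects plain numbers (or NumPy arrays), so passing Column objects raises this kind of error. The usual way around it is to wrap the call in a UDF so it runs per row. A sketch, keeping the fn alias from the question and assuming lat and lon are double columns (the struct field names are my own choice):

```python
import utm
from pyspark.sql import functions as fn
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

# utm.from_latlon returns (easting, northing, zone_number, zone_letter),
# so the UDF returns a struct with those four fields.
utm_schema = StructType([
    StructField("easting", DoubleType()),
    StructField("northing", DoubleType()),
    StructField("zone_number", IntegerType()),
    StructField("zone_letter", StringType()),
])

@fn.udf(returnType=utm_schema)
def to_utm(lat, lon):
    # Skip rows with missing coordinates rather than failing the whole job.
    if lat is None or lon is None:
        return None
    easting, northing, zone_number, zone_letter = utm.from_latlon(lat, lon)
    return float(easting), float(northing), int(zone_number), zone_letter

df = df.withColumn("UTM", to_utm(fn.col("lat"), fn.col("lon")))
```

Keeping the zone number and letter alongside easting/northing leaves the projection unambiguous when points span several UTM zones.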

Databricks Connect 6.4 not able to communicate with server anymore

Submitted by 浪尽此生 on 2021-01-28 06:10:19
Question: I am running PyCharm on my MacBook.
Client settings: Python Interpreter -> Python 3.7 (dtabricks-connect-6.4)
Cluster settings: Databricks Runtime Version -> 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
It worked well for months but suddenly, without any updates made, I can't run my Python script from PyCharm against the Databricks cluster anymore. The error is ...

Caused by: `java.lang.IllegalArgumentException: The cluster is running server version `dbr-6.4` but this client only supports Set(dbr
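A frequent cause of this message (offered as a guess, not a confirmed diagnosis of this particular case) is that the cluster image or the local databricks-connect package drifted so the two no longer match. A quick sanity check is to compare the installed client version against the runtime shown in the cluster settings:

```python
# Rough sanity check: compare the locally installed databricks-connect client
# version against the cluster runtime ("6.4" here, taken from the cluster
# settings quoted in the question).
import pkg_resources

expected_runtime = "6.4"
client_version = pkg_resources.get_distribution("databricks-connect").version
print("databricks-connect client:", client_version)

if not client_version.startswith(expected_runtime):
    print("Client/runtime mismatch: reinstall a matching client, "
          "e.g. pip install -U 'databricks-connect==6.4.*'")
```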

PySpark - iterate rows of a Data Frame

Submitted by 十年热恋 on 2021-01-28 06:05:50
Question: I need to iterate the rows of a pyspark.sql.dataframe.DataFrame. I have done it in pandas in the past with the function iterrows(), but I need to find something similar for pyspark without using pandas. If I do for row in myDF: it iterates over the columns. Thanks.

Answer 1: You can use the select method to operate on your dataframe using a user-defined function, something like this:

columns = header.columns
my_udf = F.udf(lambda data: "do what ever you want here ", StringType())
myDF.select(*[my
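If the rows really have to be visited one by one on the driver (rather than transformed with a UDF as in the answer above), collect() and toLocalIterator() are the usual tools. A small sketch, with myDF being the dataframe from the question and the column name and do_something placeholder being hypothetical:

```python
# toLocalIterator() streams the rows partition by partition, which is safer
# than collect() when the dataframe is large.
for row in myDF.toLocalIterator():
    print(row["some_column"])  # a Row can be indexed by column name

# For small dataframes, pulling everything to the driver at once is simpler.
for row in myDF.collect():
    do_something(row)
```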

Adding a List element as a column to existing pyspark dataframe

Submitted by 拜拜、爱过 on 2021-01-28 06:02:19
Question: I have a list lists=[0,1,2,3,5,6,7]. The order is not sequential. I have a pyspark dataframe with 9 columns.

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|
|2019-02-01 05:29:17|     NaN|     NaN|
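The excerpt is cut off, but a common way to attach a Python list as a new column is to turn the list into a small dataframe keyed by position and join it on the dataframe's existing index column. A sketch assuming index runs from 0 with one list value per row, and with df standing for the dataframe shown above (the column name list_value is my own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lists = [0, 1, 2, 3, 5, 6, 7]

# One row per list element, keyed by its position in the list.
list_df = spark.createDataFrame(
    [(i, v) for i, v in enumerate(lists)],
    ["index", "list_value"],
)

# Join on the existing `index` column to line list values up with rows.
df_with_list = df.join(list_df, on="index", how="left")
```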

Spark task runs on only one executor

Submitted by 本小妞迷上赌 on 2021-01-28 06:01:32
Question: Hello everyone. First and foremost, I'm aware of the existence of this thread, "Task is running on only one executor in spark". However, this is not my case, as I'm using repartition(n) on my dataframe. Basically, I'm loading a DataFrame by fetching data from an Elasticsearch index through Spark as follows:

spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
es
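One thing worth checking in this situation (a general diagnostic, not necessarily the resolution of the original thread) is how many partitions the Elasticsearch read actually produces: the connector creates roughly one partition per shard, and repartition() only spreads out the stages after the shuffle it introduces. A sketch, with the host and index names as placeholders:

```python
# Read from Elasticsearch via the elasticsearch-hadoop connector and inspect
# the partitioning; a single-shard index yields a single input partition.
es_df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host:9200")   # placeholder host
    .load("my-index")                     # placeholder index name
)
print("partitions after read:", es_df.rdd.getNumPartitions())

# Spread the downstream work across more tasks (and hence more executors).
es_df = es_df.repartition(16)
print("partitions after repartition:", es_df.rdd.getNumPartitions())
```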

cannot resolve column due to data type mismatch PySpark

Submitted by 妖精的绣舞 on 2021-01-28 05:11:16
Question: Error being faced in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;
'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]
+- Relation[result_parameters#517,result_set#518] json
"

Data structure:

-- result_set: struct (nullable = true)
 |    |-- currency:
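The error means trackers resolved to an array, so indexing it with the string 'token' fails (array indices must be integers). A common way to pull out the nested field is to explode the arrays and then select the struct field. This is only a sketch, since the full schema is cut off above: the nesting of dates and trackers is assumed, and df stands for the dataframe read from the JSON source.

```python
from pyspark.sql import functions as F

# Explode the nested arrays step by step, then read the struct field directly.
tokens = (
    df
    .select(F.explode("result_set.dates").alias("date_entry"))
    .select(F.explode("date_entry.trackers").alias("tracker"))
    .select(F.col("tracker.token").alias("token"))
)
tokens.show(truncate=False)
```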

How can I get a distinct RDD of dicts in PySpark?

Submitted by 人走茶凉 on 2021-01-28 04:10:20
Question: I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call rdd.distinct(), PySpark gives me the following error:

TypeError: unhashable type: 'dict'
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD
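Since distinct() has to hash its elements and dict is unhashable, the usual trick is to map each dictionary to a hashable, order-independent key, deduplicate on that key, and then recover the dicts. A sketch assuming the dictionary values are themselves hashable (strings, numbers, tuples), with rdd being the RDD from the question:

```python
# frozenset(d.items()) is hashable and ignores key order, so identical dicts
# collapse onto the same key; reduceByKey keeps one representative per key.
distinct_rdd = (
    rdd
    .map(lambda d: (frozenset(d.items()), d))
    .reduceByKey(lambda a, b: a)
    .values()
)
```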