pyspark

Python pandas_udf spark error

Submitted by こ雲淡風輕ζ on 2020-07-05 10:36:08
Question: I started playing around with Spark locally and found this weird issue.

1) pip install pyspark==2.3.1
2) In the pyspark shell:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
    sp_df = spark.createDataFrame(df)

    @pandas_udf('long', PandasUDFType.SCALAR)
    def pandas_plus_one(v):
        return v + 1

    sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()

Taking this example from here: https://databricks.com/blog/2017/10/30

Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

Submitted by 半世苍凉 on 2020-07-05 10:11:07
Question: I have spent days trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector. When I submit my Spark job without a dependency using sparkctl create sparkjob.yaml ... with the .yaml file below, it works like a charm.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-job
      namespace: my-namespace
    spec:
      type: Python
      pythonVersion: "3"
      hadoopConf:

Pyspark - Aggregation on multiple columns

Submitted by 雨燕双飞 on 2020-07-05 06:51:08
Question: I have data like below (filename: babynames.csv):

    year name    percent  sex
    1880 John    0.081541 boy
    1880 William 0.080511 boy
    1880 James   0.050057 boy

I need to sort the input based on year and sex, and I want the output aggregated like below (this output is to be assigned to a new RDD):

    year sex avg(percentage) count(rows)
    1880 boy 0.070703        3

I am not sure how to proceed after the following step in pyspark. Need your help on this:

    testrdd = sc.textFile("babynames.csv");
    rows = testrdd.map(lambda y:y
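A DataFrame-based sketch of the requested aggregation, assuming the column layout shown above; the header and delimiter options are assumptions about the file, not details given in the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the file; adjust header/delimiter options to match the actual CSV.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("babynames.csv"))

    # Average percent and row count per (year, sex), sorted by the grouping keys.
    result = (df.groupBy("year", "sex")
              .agg(F.avg("percent").alias("avg_percent"),
                   F.count("*").alias("row_count"))
              .orderBy("year", "sex"))

    result.show()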

How to get the schema definition from a dataframe in PySpark?

Submitted by 对着背影说爱祢 on 2020-07-05 02:39:09
Question: In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:

    Schema = StructType([
        StructField("temperature", DoubleType(), True),
        StructField("temperature_unit", StringType(), True),
        StructField("humidity", DoubleType(), True),
        StructField("humidity_unit", StringType(), True),
        StructField("pressure", DoubleType(), True),
        StructField("pressure_unit", StringType(), True)
    ])

For some data sources it is possible to infer the schema from the data source and get
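A minimal sketch of pulling the schema definition back out of an existing DataFrame; the toy DataFrame and the JSON round-trip are illustrative assumptions, not part of the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType
    import json

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(21.5, "C")], ["temperature", "temperature_unit"])

    df.printSchema()      # human-readable tree
    schema = df.schema    # StructType object, reusable via spark.read.schema(schema)

    # Round-trip the schema definition through JSON so it can be stored and rebuilt.
    schema_json = df.schema.json()
    restored = StructType.fromJson(json.loads(schema_json))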

Why agg() in PySpark is only able to summarize one column at a time? [duplicate]

Submitted by 断了今生、忘了曾经 on 2020-07-04 13:49:12
Question: This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (3 answers). Closed 3 years ago.

For the dataframe below:

    df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

When I try to find min & max I am only getting the min value in the output:

    df.agg({'High': 'max', 'High': 'min'}).show()
    +-----------+
    | min(High) |
    +-----------+
    |   2094900 |
    +-----------+

Why can't agg() give both max & min like in Pandas?

Answer 1: As you
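The dict form collapses because a Python dict can hold only one value per key, so {'High': 'max', 'High': 'min'} evaluates to a single entry. A sketch of one common workaround, passing column expressions instead of a dict (the alias names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('Alice', 4.300), ('Bob', 7.677)], ['name', 'High'])

    # Duplicate dict keys overwrite each other, so only the last aggregation
    # survives; separate column expressions avoid the problem.
    df.agg(F.min('High').alias('min_high'),
           F.max('High').alias('max_high')).show()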

Spark Streaming: Kafka group id not permitted in Spark Structured Streaming

Submitted by 天涯浪子 on 2020-07-03 08:09:06
Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka. However, the current version of Spark is 2.1.0, which does not allow me to set the group id as a parameter and generates a unique id for each query. But the Kafka connection uses group-based authorization, which requires a pre-set group id. Hence, is there any workaround to establish the connection without updating Spark to 2.2, since my team does not want that? My code:

    if __name__ == "__main__":
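For context, a sketch of the kind of Kafka reader the truncated code presumably sets up; the broker address, topic name, and app name are placeholders, not values from the question. In these Spark versions the Structured Streaming Kafka source rejects a user-supplied kafka.group.id and manages its own consumer group:

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

        # Requires the spark-sql-kafka package on the classpath.
        # Spark 2.x generates its own consumer group id; passing
        # "kafka.group.id" here is rejected by the Kafka source.
        stream = (spark.readStream
                  .format("kafka")
                  .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
                  .option("subscribe", "my-topic")                   # placeholder
                  .load())

        query = (stream.selectExpr("CAST(value AS STRING)")
                 .writeStream
                 .format("console")
                 .start())
        query.awaitTermination()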

Error including a column in a join between spark dataframes

Submitted by 我怕爱的太早我们不能终老 on 2020-06-29 06:42:21
Question: I have a join between cleanDF and sentiment_df using array_contains that works fine (from solution 61687997). I need to include in the Result df a new column ('Year') from cleanDF. This is the join:

    from pyspark.sql import functions

    Result = cleanDF.join(sentiment_df, expr("""array_contains(MeaningfulWords,word)"""), how='left')\
        .groupBy("ID")\
        .agg(first("MeaningfulWords").alias("MeaningfulWords")\
            ,collect_list("score").alias("ScoreList")\
            ,mean("score").alias("MeanScore"))

This is the
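A sketch of one way to carry the extra column through the aggregation, using cleanDF and sentiment_df as defined in the question; first("Year") assumes Year has a single value per ID, which is an assumption on my part (grouping by both ID and Year would be the alternative):

    from pyspark.sql.functions import expr, first, collect_list, mean

    # Same join as in the question, with Year carried through the aggregation.
    Result = (cleanDF.join(sentiment_df,
                           expr("array_contains(MeaningfulWords, word)"),
                           how='left')
              .groupBy("ID")
              .agg(first("MeaningfulWords").alias("MeaningfulWords"),
                   collect_list("score").alias("ScoreList"),
                   mean("score").alias("MeanScore"),
                   first("Year").alias("Year")))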

Parsing Nested JSON into a Spark DataFrame Using PySpark

Submitted by 陌路散爱 on 2020-06-29 05:44:49
Question: I would really love some help with parsing nested JSON data using PySpark-SQL. The data has the following schema (blank spaces are edits for confidentiality purposes...):

    root
     |-- location_info: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- restaurant_type: string (nullable = true)
     |    |    |
     |    |    |
     |    |    |-- other_data: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- other_data_1 string (nullable = true)
     |    |    |    |    |-- other_data_2
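A sketch of the usual approach for a schema shaped like this: explode each array level, then select the nested struct fields with dot notation. The df name, file path, and selected field names are placeholders based on the schema fragment above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("restaurants.json")  # placeholder path

    # Explode the outer array, then the inner one, and pull out nested fields.
    flat = (df
            .select(explode(col("location_info")).alias("loc"))
            .select(col("loc.restaurant_type").alias("restaurant_type"),
                    explode(col("loc.other_data")).alias("od"))
            .select("restaurant_type",
                    col("od.other_data_1").alias("other_data_1")))

    flat.show(truncate=False)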

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

Submitted by 試著忘記壹切 on 2020-06-29 05:04:43
Question: I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?

Answer 1: Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To save to a string, not a file, call to_csv with path_or_buf=None. Then send the string in an API call. From the to_csv() documentation: Parameters path_or
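A minimal sketch of that answer's approach; the toy DataFrame, the requests client, and the API URL are illustrative assumptions. Note that toPandas() collects everything to the driver, so this only suits data that fits in driver memory:

    from pyspark.sql import SparkSession
    import requests  # assumed HTTP client for the API call

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # path_or_buf=None makes to_csv() return the CSV as a string instead of
    # writing a file; everything stays in memory on the driver.
    csv_string = df.toPandas().to_csv(path_or_buf=None, index=False)

    # Send the in-memory CSV to the API (URL is a placeholder).
    requests.post("https://example.com/upload", data=csv_string,
                  headers={"Content-Type": "text/csv"})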