pyspark

S3Guard and Parquet magic committer for S3A on EMR 6.x

六眼飞鱼酱① Posted on 2020-12-27 07:12:39
Question: We are using CDH 5.13 with Spark 2.3.0 and S3Guard. After running the same job on EMR 5.x / 6.x with the same resources, we got a 5-20x performance degradation. According to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html, the default committer (since EMR 5.20) is not suitable for S3A. We tested EMR-5.15.1 and got the same results as on Hadoop. If I try to use the magic committer, I get py4j.protocol.Py4JJavaError: An error occurred while calling o72.save. : java
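The excerpt cuts off before showing any configuration. For context, enabling the S3A magic committer from PySpark typically involves settings along the following lines. This is a hedged sketch, not the poster's setup: it assumes a Hadoop 3.x build with Spark's cloud-integration (spark-hadoop-cloud) classes on the classpath, and the bucket path is purely illustrative.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Select the S3A "magic" committer and allow magic paths under the bucket.
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        # Route Spark's commit protocol through Hadoop's PathOutputCommitter
        # (requires the spark-hadoop-cloud module).
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate()
    )

    # Illustrative write; s3a://my-bucket/output is a placeholder path.
    # df.write.mode("overwrite").parquet("s3a://my-bucket/output")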

ipython is not recognized as an internal or external command (pyspark)

删除回忆录丶 Posted on 2020-12-26 07:44:47
Question: I have installed the Spark release spark-2.2.0-bin-hadoop2.7. I'm using Windows 10 and my Java version is 1.8.0_144. I have set my environment variables: SPARK_HOME = D:\spark-2.2.0-bin-hadoop2.7, HADOOP_HOME = D:\Hadoop (where I put bin\winutils.exe), PYSPARK_DRIVER_PYTHON = ipython, PYSPARK_DRIVER_PYTHON_OPTS = notebook, and Path is D:\spark-2.2.0-bin-hadoop2.7\bin. When I launch pyspark from the command line I get this error: ipython is not recognized as an internal or external command. I tried also to set
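The excerpt stops mid-sentence, so the poster's eventual fix is unknown. The usual cause is that the ipython executable is not on the PATH seen by the pyspark launcher. One hedged workaround, not from the question, is to bypass the launcher (and therefore the PYSPARK_DRIVER_PYTHON lookup) and start Spark from an ordinary IPython/Jupyter session using the optional findspark package:

    import os

    # Same paths the question sets; adjust to your install.
    os.environ["SPARK_HOME"] = r"D:\spark-2.2.0-bin-hadoop2.7"
    os.environ["HADOOP_HOME"] = r"D:\Hadoop"

    import findspark   # assumes: pip install findspark
    findspark.init()   # adds SPARK_HOME's python/ directories to sys.path

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()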

Pyspark: How to add ten days to existing date column

五迷三道 Posted on 2020-12-26 06:22:40
Question: I have a DataFrame in PySpark with a date column called "report_date". I want to create a new column called "report_date_10" that is 10 days after the original report_date column. Below is the code I tried: df_dc["report_date_10"] = df_dc["report_date"] + timedelta(days=10) This is the error I got: AttributeError: 'datetime.timedelta' object has no attribute '_get_object_id' Help! Thanks. Answer 1: It seems you are using the pandas syntax for adding a column; for Spark, you need to use withColumn
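The answer is cut off after withColumn. A minimal sketch of that approach, assuming df_dc already exists and report_date is a date (or date-castable) column, uses the built-in date_add function:

    from pyspark.sql import functions as F

    # Add 10 days to report_date as a new column instead of pandas-style assignment.
    df_dc = df_dc.withColumn("report_date_10", F.date_add(F.col("report_date"), 10))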

PySpark: Union of all the dataframes in a Python dictionary

不打扰是莪最后的温柔 Posted on 2020-12-26 05:02:51
Question: I have a dictionary my_dict_of_df which contains a variable number of DataFrames each time my program runs. I want to create a new DataFrame that is a union of all these DataFrames. My DataFrames look like my_dict_of_df["df_1"], my_dict_of_df["df_2"], and so on... How do I union all these DataFrames? Answer 1: Consulted the solution given here, thanks to @pault. from functools import reduce from pyspark.sql import DataFrame def union_all(*dfs): return reduce(DataFrame.union, dfs) df1 = sqlContext
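Put together, the reduce-based helper from the answer can be applied directly to the dictionary's values. A small sketch, assuming all DataFrames in my_dict_of_df share the same schema and column order:

    from functools import reduce
    from pyspark.sql import DataFrame

    def union_all(*dfs):
        # Chain DataFrame.union pairwise across all inputs.
        return reduce(DataFrame.union, dfs)

    combined = union_all(*my_dict_of_df.values())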

Pyspark: reshape data without aggregation

泪湿孤枕 Posted on 2020-12-26 05:00:28
Question: I want to reshape my data from 4x3 to 2x2 in PySpark without aggregating. My current output is the following: columns = ['FAULTY', 'value_HIGH', 'count'] vals = [ (1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12) ] What I want is a contingency table with the second column turned into two new binary columns (value_HIGH_1, value_HIGH_0) holding the values from the count column, meaning: columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0'] vals = [ (1, 21, 141), (0, 12, 140) ] Answer 1: You can use pivot with
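The answer breaks off at "pivot with". A hedged sketch of that idea is below; first() is used only because each (FAULTY, value_HIGH) pair has exactly one row, so nothing is actually aggregated:

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [(1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12)],
        ['FAULTY', 'value_HIGH', 'count'])

    result = (df.groupBy('FAULTY')
                .pivot('value_HIGH', [1, 0])
                .agg(F.first('count'))
                .withColumnRenamed('1', 'value_HIGH_1')
                .withColumnRenamed('0', 'value_HIGH_0'))
    # result rows: (1, 21, 141) and (0, 12, 140)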

Filling Missing sales value with zero and calculate 3 month average in PySpark

我的未来我决定 Posted on 2020-12-26 04:31:33
Question: I want to add the missing values with zero sales and calculate a 3-month average in PySpark.

My input:

product specialty date      sales
A       pharma    1/3/2019  50
A       pharma    1/4/2019  60
A       pharma    1/5/2019  70
A       pharma    1/8/2019  80
A       ENT       1/8/2019  50
A       ENT       1/9/2019  65
A       ENT       1/11/2019 40

My output:

product specialty date      sales 3month_avg_sales
A       pharma    1/3/2019  50    16.67
A       pharma    1/4/2019  60    36.67
A       pharma    1/5/2019  70    60
A       pharma    1/6/2019  0     43.33
A       pharma    1/7/2019  0     23.33
A       pharma    1/8/2019  80    26.67
A       ENT       1/8/2019  50    16
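No answer is included in this excerpt. A hedged sketch of one way to produce the output above, assuming the dates are day/month/year strings, Spark 2.4+ (for the sequence function), and that the expected column always divides by 3 (which matches the first rows, e.g. 50 / 3 = 16.67):

    from pyspark.sql import functions as F, Window

    # df is the input DataFrame (product, specialty, date, sales) from the question.
    df = df.withColumn('date', F.to_date('date', 'd/M/yyyy'))

    # Build every month between each group's first and last date.
    calendar = (df.groupBy('product', 'specialty')
                  .agg(F.min('date').alias('start'), F.max('date').alias('end'))
                  .withColumn('date', F.explode(F.expr(
                      "sequence(start, end, interval 1 month)")))
                  .drop('start', 'end'))

    # Left-join the real sales back in and treat missing months as zero sales.
    filled = (calendar.join(df, ['product', 'specialty', 'date'], 'left')
                      .fillna(0, subset=['sales']))

    # Rolling sum over the current and two preceding months, always divided by 3.
    w = (Window.partitionBy('product', 'specialty')
               .orderBy('date')
               .rowsBetween(-2, 0))
    result = filled.withColumn('3month_avg_sales',
                               F.round(F.sum('sales').over(w) / 3, 2))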

How to rename my JSON generated by pyspark?

那年仲夏 Posted on 2020-12-25 05:48:05
Question: When I write my JSON file with dataframe.coalesce(1).write.format('json') in PySpark, I'm not able to change the name of the file in the partition. I'm writing my JSON like this: dataframe.coalesce(1).write.format('json').mode('overwrite').save('path') but I'm not able to change the name of the file in the partition. I want the path to look like /folder/my_name.json, where 'my_name.json' is a JSON file. Answer 1: In Spark we can't control the name of the file written to the directory. First write the data to the HDFS
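The answer is cut off after "write the data to the HDFS". A hedged sketch of the usual continuation (write to a temporary directory, then rename the single part file through Hadoop's FileSystem API) is below; the temporary directory and target name are placeholders, not from the original answer:

    # Write a single part file into a temporary directory.
    dataframe.coalesce(1).write.mode('overwrite').format('json').save('/folder/_tmp_json')

    # Rename the part-* file to the desired name using the Hadoop FileSystem API.
    sc = spark.sparkContext
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

    src_dir = hadoop.fs.Path('/folder/_tmp_json')
    part_file = [f.getPath() for f in fs.listStatus(src_dir)
                 if f.getPath().getName().startswith('part-')][0]
    fs.rename(part_file, hadoop.fs.Path('/folder/my_name.json'))

    # Optionally remove the leftover temporary directory (and its _SUCCESS file).
    fs.delete(src_dir, True)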