pyspark

S3Guard and Parquet magic committer for S3A on EMR 6.x

六眼飞鱼酱① Posted on 2020-12-27 07:12:39
Question: We are using CDH 5.13 with Spark 2.3.0 and S3Guard. After running the same job on EMR 5.x / 6.x with the same resources, we got a 5-20x performance degradation. According to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html, the default committer (since EMR 5.20) is not suitable for S3A. We tested EMR-5.15.1 and got the same results as on Hadoop. If I try to use the magic committer, I get py4j.protocol.Py4JJavaError: An error occurred while calling o72.save. : java
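The excerpt cuts off before showing any configuration. For context, enabling the S3A magic committer from PySpark typically involves settings along the following lines. This is a hedged sketch, not the poster's setup: it assumes a Hadoop 3.x build with Spark's cloud-integration (spark-hadoop-cloud) classes on the classpath, and the bucket path is purely illustrative.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Select the S3A "magic" committer and allow magic paths under the bucket.
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        # Route Spark's commit protocol through Hadoop's PathOutputCommitter
        # (requires the spark-hadoop-cloud module).
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate()
    )

    # Illustrative write; s3a://my-bucket/output is a placeholder path.
    # df.write.mode("overwrite").parquet("s3a://my-bucket/output")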

ipython is not recognized as an internal or external command (pyspark)

删除回忆录丶 Posted on 2020-12-26 07:44:47
Question: I have installed the Spark release spark-2.2.0-bin-hadoop2.7. I'm using Windows 10 and my Java version is 1.8.0_144. I have set my environment variables: SPARK_HOME = D:\spark-2.2.0-bin-hadoop2.7, HADOOP_HOME = D:\Hadoop (where I put bin\winutils.exe), PYSPARK_DRIVER_PYTHON = ipython, PYSPARK_DRIVER_PYTHON_OPTS = notebook, and Path is D:\spark-2.2.0-bin-hadoop2.7\bin. When I launch pyspark from the command line I get this error: ipython is not recognized as an internal or external command. I tried also to set
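The excerpt stops mid-sentence, so the poster's eventual fix is unknown. The usual cause is that the ipython executable is not on the PATH seen by the pyspark launcher. One hedged workaround, not from the question, is to bypass the launcher (and therefore the PYSPARK_DRIVER_PYTHON lookup) and start Spark from an ordinary IPython/Jupyter session using the optional findspark package:

    import os

    # Same paths the question sets; adjust to your install.
    os.environ["SPARK_HOME"] = r"D:\spark-2.2.0-bin-hadoop2.7"
    os.environ["HADOOP_HOME"] = r"D:\Hadoop"

    import findspark   # assumes: pip install findspark
    findspark.init()   # adds SPARK_HOME's python/ directories to sys.path

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()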

Pyspark: How to add ten days to existing date column

五迷三道 Posted on 2020-12-26 06:22:40
Question: I have a DataFrame in PySpark with a date column called "report_date". I want to create a new column called "report_date_10" that is 10 days after the original report_date column. Below is the code I tried: df_dc["report_date_10"] = df_dc["report_date"] + timedelta(days=10) This is the error I got: AttributeError: 'datetime.timedelta' object has no attribute '_get_object_id' Help! Thanks. Answer 1: It seems you are using the pandas syntax for adding a column; for Spark, you need to use withColumn
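The answer is cut off after withColumn. A minimal sketch of that approach, assuming df_dc already exists and report_date is a date (or date-castable) column, uses the built-in date_add function:

    from pyspark.sql import functions as F

    # Add 10 days to report_date as a new column instead of pandas-style assignment.
    df_dc = df_dc.withColumn("report_date_10", F.date_add(F.col("report_date"), 10))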

PySpark: Union of all the dataframes in a Python dictionary

不打扰是莪最后的温柔 Posted on 2020-12-26 05:02:51
Question: I have a dictionary my_dict_of_df which contains a variable number of DataFrames each time my program runs. I want to create a new DataFrame that is a union of all these DataFrames. My DataFrames look like my_dict_of_df["df_1"], my_dict_of_df["df_2"], and so on... How do I union all these DataFrames? Answer 1: Consulted the solution given here, thanks to @pault. from functools import reduce from pyspark.sql import DataFrame def union_all(*dfs): return reduce(DataFrame.union, dfs) df1 = sqlContext
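Put together, the reduce-based helper from the answer can be applied directly to the dictionary's values. A small sketch, assuming all DataFrames in my_dict_of_df share the same schema and column order:

    from functools import reduce
    from pyspark.sql import DataFrame

    def union_all(*dfs):
        # Chain DataFrame.union pairwise across all inputs.
        return reduce(DataFrame.union, dfs)

    combined = union_all(*my_dict_of_df.values())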

Pyspark: reshape data without aggregation

泪湿孤枕 Posted on 2020-12-26 05:00:28
Question: I want to reshape my data from 4x3 to 2x2 in PySpark without aggregating. My current output is the following: columns = ['FAULTY', 'value_HIGH', 'count'] vals = [ (1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12) ] What I want is a contingency table with the second column turned into two new binary columns (value_HIGH_1, value_HIGH_0) holding the values from the count column, meaning: columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0'] vals = [ (1, 21, 141), (0, 12, 140) ] Answer 1: You can use pivot with
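The answer breaks off at "pivot with". A hedged sketch of that idea is below; first() is used only because each (FAULTY, value_HIGH) pair has exactly one row, so nothing is actually aggregated:

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [(1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12)],
        ['FAULTY', 'value_HIGH', 'count'])

    result = (df.groupBy('FAULTY')
                .pivot('value_HIGH', [1, 0])
                .agg(F.first('count'))
                .withColumnRenamed('1', 'value_HIGH_1')
                .withColumnRenamed('0', 'value_HIGH_0'))
    # result rows: (1, 21, 141) and (0, 12, 140)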

Filling Missing sales value with zero and calculate 3 month average in PySpark

我的未来我决定 Posted on 2020-12-26 04:31:33
Question: I want to add the missing values with zero sales and calculate a 3-month average in PySpark.

My input:

product specialty date      sales
A       pharma    1/3/2019  50
A       pharma    1/4/2019  60
A       pharma    1/5/2019  70
A       pharma    1/8/2019  80
A       ENT       1/8/2019  50
A       ENT       1/9/2019  65
A       ENT       1/11/2019 40

My output:

product specialty date      sales 3month_avg_sales
A       pharma    1/3/2019  50    16.67
A       pharma    1/4/2019  60    36.67
A       pharma    1/5/2019  70    60
A       pharma    1/6/2019  0     43.33
A       pharma    1/7/2019  0     23.33
A       pharma    1/8/2019  80    26.67
A       ENT       1/8/2019  50    16
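No answer is included in this excerpt. A hedged sketch of one way to produce the output above, assuming the dates are day/month/year strings, Spark 2.4+ (for the sequence function), and that the expected column always divides by 3 (which matches the first rows, e.g. 50 / 3 = 16.67):

    from pyspark.sql import functions as F, Window

    # df is the input DataFrame (product, specialty, date, sales) from the question.
    df = df.withColumn('date', F.to_date('date', 'd/M/yyyy'))

    # Build every month between each group's first and last date.
    calendar = (df.groupBy('product', 'specialty')
                  .agg(F.min('date').alias('start'), F.max('date').alias('end'))
                  .withColumn('date', F.explode(F.expr(
                      "sequence(start, end, interval 1 month)")))
                  .drop('start', 'end'))

    # Left-join the real sales back in and treat missing months as zero sales.
    filled = (calendar.join(df, ['product', 'specialty', 'date'], 'left')
                      .fillna(0, subset=['sales']))

    # Rolling sum over the current and two preceding months, always divided by 3.
    w = (Window.partitionBy('product', 'specialty')
               .orderBy('date')
               .rowsBetween(-2, 0))
    result = filled.withColumn('3month_avg_sales',
                               F.round(F.sum('sales').over(w) / 3, 2))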

How to rename my JSON generated by pyspark?

那年仲夏 Posted on 2020-12-25 05:48:05
Question: When I write my JSON file with dataframe.coalesce(1).write.format('json') in PySpark, I'm not able to change the name of the file in the partition. I'm writing my JSON like this: dataframe.coalesce(1).write.format('json').mode('overwrite').save('path') but I'm not able to change the name of the file in the partition. I want the path to look like /folder/my_name.json, where 'my_name.json' is a JSON file. Answer 1: In Spark we can't control the name of the file written to the directory. First write the data to the HDFS
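The answer is cut off after "write the data to the HDFS". A hedged sketch of the usual continuation (write to a temporary directory, then rename the single part file through Hadoop's FileSystem API) is below; the temporary directory and target name are placeholders, not from the original answer:

    # Write a single part file into a temporary directory.
    dataframe.coalesce(1).write.mode('overwrite').format('json').save('/folder/_tmp_json')

    # Rename the part-* file to the desired name using the Hadoop FileSystem API.
    sc = spark.sparkContext
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

    src_dir = hadoop.fs.Path('/folder/_tmp_json')
    part_file = [f.getPath() for f in fs.listStatus(src_dir)
                 if f.getPath().getName().startswith('part-')][0]
    fs.rename(part_file, hadoop.fs.Path('/folder/my_name.json'))

    # Optionally remove the leftover temporary directory (and its _SUCCESS file).
    fs.delete(src_dir, True)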