pyspark

Converting series from pandas to pyspark: need to use “groupby” and “size”, but pyspark yields error

不打扰是莪最后的温柔 submitted on 2021-02-17 07:03:31
Question: I am converting some code from pandas to PySpark. In pandas, let's imagine I have a mock DataFrame df and I define a certain variable the following way:

value = df.groupby(["Age", "Siblings"]).size()

The output is a pandas Series of group sizes. However, when trying to convert this to PySpark, an error comes up: AttributeError: 'GroupedData' object has no attribute 'size'. Can anyone help me solve this?

Answer 1: The equivalent of size in PySpark is count: df.groupby(["Age", "Siblings"]).count()
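
A minimal sketch of the full translation, assuming a small made-up DataFrame with the "Age" and "Siblings" columns mentioned in the question (the sample rows are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-size").getOrCreate()

# Hypothetical rows standing in for the mock pandas DataFrame from the question.
df = spark.createDataFrame(
    [(22, 1), (22, 1), (35, 0), (35, 2)],
    ["Age", "Siblings"],
)

# pandas: df.groupby(["Age", "Siblings"]).size()
# PySpark: GroupedData has no size(), but count() returns the same group sizes.
value = df.groupby(["Age", "Siblings"]).count()
value.show()

Unlike the pandas Series, the result is itself a DataFrame with a "count" column, so downstream code may need a toPandas() call or a column rename.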

Pyspark - How to get basic stats (mean, min, max) along with quantiles (25%, 50%) for numerical cols in a single dataframe

此生再无相见时 submitted on 2021-02-17 05:37:26
Question: I have a Spark DataFrame:

spark_df = spark.createDataFrame(
    [(1, 7, 'foo'),
     (2, 6, 'bar'),
     (3, 4, 'foo'),
     (4, 8, 'bar'),
     (5, 1, 'bar')],
    ['v1', 'v2', 'id']
)

Expected output:

   id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0  bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1  foo  2.000000  5.5      1        4        some-value  some-value  some-value  some-value

Until now I can get the basic stats like avg, min and max, but I am not able to get the quantiles. I know this can be achieved ...
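
One way to sketch this, reusing spark_df from the question: compute the averages, minimums and approximate quantiles in a single agg call. percentile_approx is Spark SQL's approximate-percentile aggregate (approx_percentile is an equivalent alias); the column aliases are my own choice to mirror the expected output.

import pyspark.sql.functions as F

stats_df = spark_df.groupBy("id").agg(
    F.avg("v1").alias("avg(v1)"),
    F.avg("v2").alias("avg(v2)"),
    F.min("v1").alias("min(v1)"),
    F.min("v2").alias("min(v2)"),
    # approximate 25th and 50th percentiles per group
    F.expr("percentile_approx(v1, 0.25)").alias("0.25(v1)"),
    F.expr("percentile_approx(v2, 0.25)").alias("0.25(v2)"),
    F.expr("percentile_approx(v1, 0.5)").alias("0.5(v1)"),
    F.expr("percentile_approx(v2, 0.5)").alias("0.5(v2)"),
)
stats_df.show()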

Spark - Read and Write back to same S3 location

a 夏天 submitted on 2021-02-17 02:48:10
Question: I am reading two datasets, dataset1 and dataset2, from S3 locations. I then transform them and write back to the same location that dataset2 was read from. However, I get the error message below:

An error occurred while calling o118.save.
No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet'

If I write to a new S3 location instead, e.g. s3://dataset_new_path.../, the code works fine.

my_df \
    .write.mode('overwrite') \
    .format('parquet') \
    ...
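
A sketch of one common workaround, with placeholder paths: because Spark evaluates lazily, overwriting the prefix it is still reading from deletes the source parquet files mid-job. Writing to a staging prefix first, then rewriting the final location from the staged copy, avoids reading and deleting the same files in one pass.

staging_path = 's3://<myPrefix>/_staging/dataset2'    # placeholder staging prefix
final_path = 's3://<myPrefix>/dataset2'               # placeholder original prefix

# 1) Materialise the transformed result somewhere other than the source.
my_df \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(staging_path)

# 2) Re-read the staged copy and overwrite the original location from it.
spark.read.parquet(staging_path) \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(final_path)

Checkpointing or fully caching the DataFrame before the overwrite is another common way to break the dependency on the input files.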

Pyspark connection to the Microsoft SQL server?

大憨熊 submitted on 2021-02-17 02:47:40
Question: I have a huge dataset in SQL Server, and I want to connect to SQL Server from Python and use PySpark to run the query. I've seen the JDBC driver, but I can't find a way to do it; I managed it with pyodbc, but not with Spark. Any help would be appreciated.

Answer 1: Please use the following to connect to Microsoft SQL:

def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:sqlserver://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)
    ...
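
Since the answer is cut off above, here is a hedged completion rather than the answerer's exact code: it returns the table as a DataFrame via spark.read.jdbc, uses the databaseName=... URL form that the Microsoft JDBC driver documents, and assumes the mssql-jdbc jar is already on the Spark classpath (e.g. via --jars or spark.jars). All connection details are placeholders.

def connect_to_sql(spark, jdbc_hostname, jdbc_port, database, data_table,
                   username, password):
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
        jdbc_hostname, jdbc_port, database)
    connection_properties = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    # Reads the whole table; pass a "(SELECT ...) AS t" string instead of a
    # table name to push a query down to SQL Server.
    return spark.read.jdbc(url=jdbc_url, table=data_table,
                           properties=connection_properties)

Calling it returns a DataFrame, e.g. df = connect_to_sql(spark, "myhost", 1433, "mydb", "dbo.my_table", "user", "pass"), with placeholder values.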

pyspark 'DataFrame' object has no attribute '_get_object_id'

百般思念 submitted on 2021-02-16 18:57:44
Question: I am trying to run some code, but I am getting the error: 'DataFrame' object has no attribute '_get_object_id'. The code:

items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10),
         (2, 22), (2, 20), (2, float('Nan')),
         (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])

import pyspark.sql.functions as func
means = df.groupby("id").agg(func.mean("col1"))

# The error is thrown at this line
df = df.withColumn("col1", func.when((df["col1"].isNull()), means.where(func ...
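
The error happens because a whole DataFrame (means) is being passed into a column expression, and when()/where() only accept columns. A sketch of one way around it, keeping the names from the snippet above (the mean_col1 alias and the NaN handling are my own additions): convert NaN to null so that mean() skips it, compute the per-id means, join them back, and fill the gaps with coalesce.

# Treat NaN as null so that mean() skips those rows.
df_clean = df.withColumn(
    "col1",
    func.when(func.isnan("col1"), func.lit(None)).otherwise(func.col("col1")))

means = df_clean.groupby("id").agg(func.mean("col1").alias("mean_col1"))

# Join the per-group means back and fill missing col1 values from them.
df_filled = (
    df_clean.join(means, on="id", how="left")
            .withColumn("col1", func.coalesce(func.col("col1"), func.col("mean_col1")))
            .drop("mean_col1")
)
df_filled.show()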

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

老子叫甜甜 submitted on 2021-02-16 08:43:54
Question: My input DataFrame looks like the one below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(
    data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
    schema=['name', 'High', 'Low']
)

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice| 4.3|null|
|  Bob| NaN| 897|
+-----+----+----+

Expected output if divided by 10.0:

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice|0.43|null|
|  Bob| NaN|89.7|
+-----+----+----+
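
A sketch of one way to do it, reusing df from above and taking "non-string" to mean any column whose Spark type is numeric; the 10.0 constant comes from the expected output.

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

constant = 10.0

# Divide numeric columns by the constant, pass the rest through unchanged.
new_cols = [
    (F.col(f.name) / constant).alias(f.name)
    if isinstance(f.dataType, NumericType)
    else F.col(f.name)
    for f in df.schema.fields
]
df.select(new_cols).show()

Division leaves null and NaN untouched (null / 10.0 is null, NaN / 10.0 is NaN), which matches the expected output; multiplying is the same pattern with * instead of /.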

Read fixed width file using schema from json file in pyspark

流过昼夜 submitted on 2021-02-16 05:33:52
Question: I have a fixed-width file as below:

00120181120xyz12341
00220180203abc56792
00320181203pqr25483

and a corresponding JSON file that specifies the schema:

{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

I read the schema file into a DataFrame using:

SchemaFile = spark.read \
    .format("json") \
    .option("header", "true") \
    .json('C:\Temp\schemaFile\schema.json')
SchemaFile.show()
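
A sketch of one way to finish the job: read the data file as a single text column and slice it with substring() using the From/To values collected from SchemaFile. I am assuming "From" is the 1-based start position and "To" is the field length, which is what the sample rows suggest (the 8-character date starts at position 4); the data file path is hypothetical.

from pyspark.sql import functions as F

fields = SchemaFile.collect()

raw = spark.read.text(r'C:\Temp\data\fixed_width.txt')   # hypothetical data path
parsed = raw.select([
    F.substring(F.col("value"), int(row["From"]), int(row["To"])).alias(row["Column"])
    for row in fields
])
parsed.show()

Every resulting column is a string; casting (e.g. salary to int, date to a date type) can be added per column once the slicing works.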