pyspark

Round double values and cast as integers

老子叫甜甜 submitted on 2020-11-28 01:56:39
Question: I have a data frame in PySpark like below.

import pyspark.sql.functions as func

df = sqlContext.createDataFrame(
    [(0.0, 0.2, 3.45631),
     (0.4, 1.4, 2.82945),
     (0.5, 1.9, 7.76261),
     (0.6, 0.9, 2.76790),
     (1.2, 1.0, 9.87984)],
    ["col1", "col2", "col3"])

df.show()
+----+----+-------+
|col1|col2|   col3|
+----+----+-------+
| 0.0| 0.2|3.45631|
| 0.4| 1.4|2.82945|
| 0.5| 1.9|7.76261|
| 0.6| 0.9| 2.7679|
| 1.2| 1.0|9.87984|
+----+----+-------+

# round 'col3' in a new column:
df2 = df.withColumn("col4",
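The excerpt is truncated after df.withColumn("col4",. A minimal sketch of one way the line might be completed, given the goal stated in the title; it uses pyspark.sql.functions.round and Column.cast, both standard PySpark API. The column name col4 comes from the truncated line; everything after the comma is an assumption, not the original poster's code:

from pyspark.sql.types import IntegerType

# Round col3 to the nearest whole number, then cast the double to an
# integer; Spark's round() uses HALF_UP rounding by default.
df2 = df.withColumn("col4", func.round(df["col3"]).cast(IntegerType()))
df2.show()
# First row of the expected result: col3 = 3.45631 -> col4 = 3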

Spark DAG differs with 'withColumn' vs 'select'

你。 submitted on 2020-11-27 20:59:04
Question: Context: In a recent SO post, I discovered that using withColumn may improve the DAG when dealing with stacked/chained column expressions in conjunction with distinct window specifications. However, in this example, withColumn actually makes the DAG worse, and the outcome differs from that of using select instead.

Reproducible example: First, some test data (PySpark 2.4.4 standalone):

import pandas as pd
import numpy as np

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as
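The excerpt cuts off mid-import, so the original test data and expressions are not shown. A minimal self-contained sketch of the kind of comparison the question describes, so the two plans can be inspected side by side; the data, column names, and window specifications below are all assumptions, not the original poster's code:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical test data standing in for the frame the question omits.
df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 30.0), ("b", 2, 40.0)],
    ["key", "step", "value"])

# Two distinct window specifications, as the question describes.
w1 = Window.partitionBy("key").orderBy("step")
w2 = Window.partitionBy("key")

# Variant 1: stacked withColumn calls, one per derived column.
df_wc = (df
         .withColumn("prev", F.lag("value").over(w1))
         .withColumn("total", F.sum("value").over(w2)))

# Variant 2: a single select carrying all expressions at once.
df_sel = df.select(
    "*",
    F.lag("value").over(w1).alias("prev"),
    F.sum("value").over(w2).alias("total"))

# Print both physical plans to compare the resulting DAGs.
df_wc.explain()
df_sel.explain()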