How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

Submitted by 五迷三道 on 2020-12-15 07:18:10

Question


I have a CSV like that:

COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123

I want to load it having the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV as per structure below:

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

The problem I'm facing is that whenever I load it, the numbers turn into scientific notation, and I cannot persist it back without having to specify the precision and scale of my data (I want to use whatever is already in the file, since I can't infer it). Here's what I have tried:

Loading it with DoubleType() gives me scientific notation:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])

csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
    .option("sep", ",")
    .option("header", "true")
    .schema(schema)
    .load(csv_file))

df2.show()

+-----+--------------------+
|  COL|                 VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2|    2.000000001234E8|
|TEST3|     9999.1234679123|
+-----+--------------------+

Loading it with DecimalType() requires me to specify precision and scale, otherwise I lose the decimals after the dot. However, specifying them, besides the risk of not getting the correct value (as my data might be rounded), gives me trailing zeros after the dot. For example, using StructField('VAL', DecimalType(38, 18)) I get:

[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
 Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
 Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]

Notice that in this case I end up with trailing zeros on the right side that I don't want in my new file.
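For reference, here is a minimal sketch of the read that produced the rows above, matching the StructField('VAL', DecimalType(38, 18)) mentioned earlier (schema_dec and df_dec are just illustrative names):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Same read as before, only with the VAL field declared as a decimal
schema_dec = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

df_dec = (spark.read.format("csv")
    .option("sep", ",")
    .option("header", "true")
    .schema(schema_dec)
    .load(csv_file))

df_dec.collect()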

The only way I found to address it was using a UDF, where I first apply float() to get rid of the scientific notation and then convert the result to a string to make sure it is persisted the way I want:

from pyspark.sql.functions import udf

# str(float(n)) renders the double as a plain (non-scientific) string for values in this range
to_decimal = udf(lambda n: str(float(n)))

df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))

Is there any way to achieve the same result without using the UDF trick?

Thank you!


Answer 1:


The best way I found to address it is shown below. It still uses a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the correct answer yet, because I still expect someone to come up with a solution without a UDF (or a good explanation of why that's not possible); a UDF-free sketch also follows the final output below.

1. The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
2. Load the CSV with DecimalType at its maximum precision (38) and a scale of 18:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
        .option("sep", ",")
        .option("header", "true")
        .schema(schema)
        .load(csv_file))

df2.show(truncate=False)

output:

+-----+----------------------------+
|COL  |VAL                         |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000     |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
3. When you are ready to report it (print it or save it to a new file), you normalize the decimals to drop the trailing zeros:
import decimal
import pyspark.sql.functions as F

# Decimal.normalize() strips trailing zeros, e.g. Decimal('9999.123467912300000000') -> Decimal('9999.1234679123')
normalize_decimals = F.udf(lambda dec: dec.normalize())
(df2
    .withColumn('VAL', normalize_decimals(F.col('VAL')))
    .show(truncate=False))

output:

+-----+------------------+
|COL  |VAL               |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234    |
|TEST3|9999.1234679123   |
|TEST4|123456789.01234567|
+-----+------------------+
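For readers looking for a UDF-free variant: assuming the DecimalType(38, 18) read from step 2, the trailing zeros can also be trimmed with built-in string functions before writing the CSV back out. This is only a sketch under that assumption, and the output path below is hypothetical:

import pyspark.sql.functions as F

# Cast the decimal to a string, then strip trailing zeros after the decimal
# point (and a dangling "." if the whole fractional part was zeros).
val_str = F.col("VAL").cast("string")
val_trimmed = F.regexp_replace(val_str, r"(\.\d*?)0+$", "$1")
val_trimmed = F.regexp_replace(val_trimmed, r"\.$", "")

df_out = df2.withColumn("VAL", val_trimmed)
df_out.show(truncate=False)

# Hypothetical output path
(df_out.coalesce(1)
    .write.option("header", "true")
    .mode("overwrite")
    .csv("Downloads/testf_out"))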



Answer 2:


You can do that with Spark using a SQL query (Scala example):

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkConf: SparkConf = new SparkConf(true)
    .setAppName(this.getClass.getName)
    .setMaster("local[*]")

implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")

val query = "SELECT CAST(VAL AS DECIMAL(38, 18)) AS VAL, COL FROM table" // "BigDecimal" is not a Spark SQL cast type; DECIMAL(precision, scale) is
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)
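For completeness, a rough PySpark rendering of the same SQL-cast idea might look like the sketch below (csv_file and output_path are placeholders). Note that a DECIMAL(38, 18) cast will typically carry trailing zeros into the written CSV, so the normalization step from the first answer may still be needed:

# Read everything as strings, then cast VAL via Spark SQL
df = spark.read.option("header", "true").csv(csv_file)
df.createOrReplaceTempView("table")

result = spark.sql("SELECT CAST(VAL AS DECIMAL(38, 18)) AS VAL, COL FROM table")
result.show(truncate=False)

# output_path is a placeholder
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(output_path)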


Source: https://stackoverflow.com/questions/64772851/how-to-load-big-double-numbers-in-a-pyspark-dataframe-and-persist-it-back-withou
