How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

Submitted by 五迷三道 on 2020-12-15 07:18:10

Question


I have a CSV like that:

COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123

I want to load it having the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV as per structure below:

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

The problem I'm facing is that whenever I load it, the numbers turn into scientific notation, and I cannot persist it back without having to specify the precision and scale of my data (I want to use whatever is already in the file, since I can't infer it). Here's what I have tried:

Loading it with DoubleType() gives me scientific notation:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])

csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
    .option("sep", ",")
    .option("header", "true")
    .schema(schema)
    .load(csv_file))

df2.show()

+-----+--------------------+
|  COL|                 VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2|    2.000000001234E8|
|TEST3|     9999.1234679123|
+-----+--------------------+

Loading it with DecimalType() requires me to specify precision and scale, otherwise I lose the decimals after the dot. However, specifying them, besides the risk of not getting the correct value (as my data might be rounded), gives me trailing zeros after the dot. For example, using StructField('VAL', DecimalType(38, 18)) I get:

[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
 Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
 Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]

Notice that in this case I end up with trailing zeros on the right side that I don't want in my new file.
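For reference, here is a minimal sketch of the read that produced the rows above, matching the StructField('VAL', DecimalType(38, 18)) mentioned earlier (schema_dec and df_dec are just illustrative names):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Same read as before, only with the VAL field declared as a decimal
schema_dec = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

df_dec = (spark.read.format("csv")
    .option("sep", ",")
    .option("header", "true")
    .schema(schema_dec)
    .load(csv_file))

df_dec.collect()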

The only way I found to address it was using a UDF, where I first apply float() to get rid of the scientific notation and then convert the result to a string to make sure it is persisted the way I want:

from pyspark.sql.functions import udf

# str(float(n)) renders the double as a plain (non-scientific) string for values in this range
to_decimal = udf(lambda n: str(float(n)))

df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))

Is there any way to achieve the same result without using the UDF trick?

Thank you!


Answer 1:


The best way I found to address it is shown below. It still uses a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the correct answer yet, because I still expect someone to come up with a solution without a UDF (or a good explanation of why that's not possible); a UDF-free sketch also follows the final output below.

1. The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
2. Load the CSV with DecimalType at its maximum precision (38) and a scale of 18:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
        .option("sep", ",")
        .option("header", "true")
        .schema(schema)
        .load(csv_file))

df2.show(truncate=False)

output:

+-----+----------------------------+
|COL  |VAL                         |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000     |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
3. When you are ready to report it (print it or save it to a new file), you normalize the decimals to drop the trailing zeros:
import decimal
import pyspark.sql.functions as F

# Decimal.normalize() strips trailing zeros, e.g. Decimal('9999.123467912300000000') -> Decimal('9999.1234679123')
normalize_decimals = F.udf(lambda dec: dec.normalize())
(df2
    .withColumn('VAL', normalize_decimals(F.col('VAL')))
    .show(truncate=False))

output:

+-----+------------------+
|COL  |VAL               |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234    |
|TEST3|9999.1234679123   |
|TEST4|123456789.01234567|
+-----+------------------+
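For readers looking for a UDF-free variant: assuming the DecimalType(38, 18) read from step 2, the trailing zeros can also be trimmed with built-in string functions before writing the CSV back out. This is only a sketch under that assumption, and the output path below is hypothetical:

import pyspark.sql.functions as F

# Cast the decimal to a string, then strip trailing zeros after the decimal
# point (and a dangling "." if the whole fractional part was zeros).
val_str = F.col("VAL").cast("string")
val_trimmed = F.regexp_replace(val_str, r"(\.\d*?)0+$", "$1")
val_trimmed = F.regexp_replace(val_trimmed, r"\.$", "")

df_out = df2.withColumn("VAL", val_trimmed)
df_out.show(truncate=False)

# Hypothetical output path
(df_out.coalesce(1)
    .write.option("header", "true")
    .mode("overwrite")
    .csv("Downloads/testf_out"))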



Answer 2:


You can do that with Spark using a SQL query (Scala example):

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkConf: SparkConf = new SparkConf(true)
    .setAppName(this.getClass.getName)
    .setMaster("local[*]")

implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")

val query = "SELECT CAST(VAL AS DECIMAL(38, 18)) AS VAL, COL FROM table" // "BigDecimal" is not a Spark SQL cast type; DECIMAL(precision, scale) is
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)
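For completeness, a rough PySpark rendering of the same SQL-cast idea might look like the sketch below (csv_file and output_path are placeholders). Note that a DECIMAL(38, 18) cast will typically carry trailing zeros into the written CSV, so the normalization step from the first answer may still be needed:

# Read everything as strings, then cast VAL via Spark SQL
df = spark.read.option("header", "true").csv(csv_file)
df.createOrReplaceTempView("table")

result = spark.sql("SELECT CAST(VAL AS DECIMAL(38, 18)) AS VAL, COL FROM table")
result.show(truncate=False)

# output_path is a placeholder
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(output_path)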


Source: https://stackoverflow.com/questions/64772851/how-to-load-big-double-numbers-in-a-pyspark-dataframe-and-persist-it-back-withou
