Question
I am using Spark with Python. After uploading a CSV file, I needed to parse a column in that CSV file which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column. Following are my commands in PySpark.
>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0],p[1],p[2],p[3],p[4],float(p[5].strip('"')),p[6],p[7]))
>>> test_temp.top(2)
Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'. The following is the output:
[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]
The values in my CSV file are as follows:
8.27370028700801e+21 is 8273700287008010012345
8.37670028702205e+21 is 8376700287022050054321
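For what it's worth, the rounding can be reproduced in a plain Python shell (a minimal sketch, independent of Spark), which points at the float() call itself as the place where precision is lost:
>>> original = 8273700287008010012345
>>> float(original)
8.27370028700801e+21
>>> int(float(original)) == original
False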
When I create a data frame out of it and then query it,
>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()
the test_column gives the value null for all the records.
So, how can I solve this problem of parsing big numbers in Spark? I would really appreciate your help.
Answer 1:
Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
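A quick sanity check (a plain-Python sketch, nothing Spark-specific) confirms the value falls outside that range:
MAX_LONG = 2 ** 63 - 1            # upper bound of LongType: 9223372036854775807
value = 8273700287008010012345
value > MAX_LONG
## True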
If you want to load your data into a DataFrame you'll have to use DoubleType:
from pyspark.sql.types import *
rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
Typically it is a better idea to handle this with DataFrames directly:
from pyspark.sql.functions import col
str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
If you don't want to use Double you can cast to Decimal with specified precision:
str_df.select(col("x").cast(DecimalType(38))).show(1, False)
## +----------------------+
## |x |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
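Applied to the pipeline from the question, a possible sketch (not part of the original answer) keeps every field as a string when the DataFrame is built and only then casts the 22-digit column; it assumes that column is the one named test_column and reuses testfields and testNoHeader from the question:
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StringType, StructField, StructType

# Build the DataFrame with every column as a string so nothing is rounded early
str_schema = StructType([StructField(f.name, StringType(), True) for f in testfields])
rows = testNoHeader.map(lambda k: [v.strip('"') for v in k.split(",")])
test_df = sqlContext.createDataFrame(rows, str_schema)

# Only now cast the big-number column to a wide decimal (38 digits, scale 0)
test_df = test_df.withColumn("test_column", col("test_column").cast(DecimalType(38, 0)))
test_df.select("test_column").show(2, False)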
Source: https://stackoverflow.com/questions/36349585/datatype-for-handling-big-numbers-in-pyspark