Question
I am using Spark with Python. After uploading a CSV file, I needed to parse a column in that CSV file which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column. Following are my commands in PySpark.
>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0],p[1],p[2],p[3],p[4],float(p[5].strip('"')),p[6],p[7]))
>>> test_temp.top(2)
Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'. The following is the output:
[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]
The values in my CSV file are as follows:
8.27370028700801e+21 is 8273700287008010012345
8.37670028702205e+21 is 8376700287022050054321
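For what it's worth, the rounding can be reproduced in a plain Python shell (a minimal sketch, independent of Spark), which points at the float() call itself as the place where precision is lost:
>>> original = 8273700287008010012345
>>> float(original)
8.27370028700801e+21
>>> int(float(original)) == original
False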
When I create a data frame out of it and then query it,
>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()
the test_column gives the value null for all the records.
So, how can I solve this problem of parsing big numbers in Spark? I would really appreciate your help.
Answer 1:
Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
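A quick sanity check (a plain-Python sketch, nothing Spark-specific) confirms the value falls outside that range:
MAX_LONG = 2 ** 63 - 1            # upper bound of LongType: 9223372036854775807
value = 8273700287008010012345
value > MAX_LONG
## True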
If you want to load your data into a DataFrame you'll have to use DoubleType:
from pyspark.sql.types import *
rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
Typically it is a better idea to handle this with DataFrames directly:
from pyspark.sql.functions import col
str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
If you don't want to use Double you can cast to Decimal with specified precision:
str_df.select(col("x").cast(DecimalType(38))).show(1, False)
## +----------------------+
## |x |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
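Applied to the pipeline from the question, a possible sketch (not part of the original answer) keeps every field as a string when the DataFrame is built and only then casts the 22-digit column; it assumes that column is the one named test_column and reuses testfields and testNoHeader from the question:
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StringType, StructField, StructType

# Build the DataFrame with every column as a string so nothing is rounded early
str_schema = StructType([StructField(f.name, StringType(), True) for f in testfields])
rows = testNoHeader.map(lambda k: [v.strip('"') for v in k.split(",")])
test_df = sqlContext.createDataFrame(rows, str_schema)

# Only now cast the big-number column to a wide decimal (38 digits, scale 0)
test_df = test_df.withColumn("test_column", col("test_column").cast(DecimalType(38, 0)))
test_df.select("test_column").show(2, False)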
Source: https://stackoverflow.com/questions/36349585/datatype-for-handling-big-numbers-in-pyspark