how to read a fixed character length format file in spark [duplicate]

Submitted by 岁酱吖の on 2020-01-05 08:22:52

Question


The data is as below.

[Row(_c0='ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       '),
 Row(_c0='ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    '),
 Row(_c0='AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196')]

I have defined `schema_stn` with the correct column widths etc. as per the documentation. My code for reading the file into a DataFrame with PySpark is as follows:

df.select(
    df.value.substr(1, 11).alias('id'),
    df.value.substr(13, 20).alias('LATITUDE'),
    df.value.substr(22, 30).alias('LONGITUDE'),
    df.value.substr(32, 37).alias('LATITUDE'),
    df.value.substr(39, 40).alias('LONGITUDE'),
    df.value.substr(42, 71).alias('LATITUDE'),
    df.value.substr(73, 75).alias('LONGITUDE'),
    df.value.substr(77, 79).alias('LATITUDE'),
    df.value.substr(81, 85).alias('LONGITUDE'))

df = sqlContext.read.csv("hdfs:////data/stn") 
df = (sqlContext.read.format("csv")
        .schema(schema_stn)
        .option("delimiter", " ")
        .load("hdfs:////data/stn")
        )
df.cache()
df.show(3)

I am getting the following output.

In [62]: df.show(3)
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|         ID|LATITUDE|LONGITUDE|ELEVATION|   STATE|   NAME|GSN FLAG|HCN/CRN FLAG|WMO ID|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+
|ACW00011604|    null|  17.1167|     null|-61.7833|   null|    null|        10.1|  null|
|ACW00011647|    null|  17.1333|     null|-61.7833|   null|    null|        19.2|  null|
|AE000041196|    null|   25.333|     null|    null|55.5170|    null|        null|  34.0|
+-----------+--------+---------+---------+--------+-------+--------+------------+------+

I am not able to remove these 'null' values (which correspond to the whitespace). What am I missing here, please?


Answer 1:


You need to read the file as lines of text. Otherwise the single-space delimiter splits the runs of padding spaces into many empty fields, which is where the nulls come from.

df = spark.read.text("hdfs:////data/stn") 

And then parse

df = df.select(
    # Column.substr(startPos, length): the station-file documentation gives
    # 1-based (start, end) positions, so length = end - start + 1
    df.value.substr(1, 11).alias('ID'),            # cols 1-11
    df.value.substr(13, 8).alias('LATITUDE'),      # cols 13-20
    df.value.substr(22, 9).alias('LONGITUDE'),     # cols 22-30
    df.value.substr(32, 6).alias('ELEVATION'),     # cols 32-37
    df.value.substr(39, 2).alias('STATE'),         # cols 39-40
    df.value.substr(42, 30).alias('NAME'),         # cols 42-71
    df.value.substr(73, 3).alias('GSN_FLAG'),      # cols 73-75
    df.value.substr(77, 3).alias('HCN_CRN_FLAG'),  # cols 77-79
    df.value.substr(81, 5).alias('WMO_ID'))        # cols 81-85
df.show(3)
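The same slicing logic can be checked without Spark at all. The sketch below is plain Python (the column positions come from the fixed-width spec quoted in the question; `parse_fixed` is a hypothetical helper, not part of any library) and shows the mapping between 1-based inclusive (start, end) positions, Python's 0-based half-open slices, and Spark's `Column.substr(start, length)`:

```python
# Plain-Python check of the fixed-width boundaries (no Spark needed).
# Spec positions are 1-based inclusive (start, end); Python slices are
# 0-based half-open; Spark's Column.substr takes (start, length),
# so length = end - start + 1.
line = ('ACW00011604  17.1167  -61.7833   10.1    '
        'ST JOHNS COOLIDGE FLD                       ')

FIELDS = [                 # (name, start, end), 1-based inclusive
    ('ID',         1, 11),
    ('LATITUDE',  13, 20),
    ('LONGITUDE', 22, 30),
    ('ELEVATION', 32, 37),
    ('STATE',     39, 40),
    ('NAME',      42, 71),
]

def parse_fixed(line, fields):
    """Slice each field out of one fixed-width record and strip the padding."""
    return {name: line[start - 1:end].strip() for name, start, end in fields}

row = parse_fixed(line, FIELDS)
print(row['ID'], row['LATITUDE'], row['LONGITUDE'], row['ELEVATION'])
```

Stripping the padding here is the plain-Python analogue of wrapping each `substr` in `pyspark.sql.functions.trim` before casting the numeric columns.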


Source: https://stackoverflow.com/questions/52031127/how-to-read-a-fixed-character-length-format-file-in-spark
