How do I read a fix width format text file in pandas

后端 未结 4 1970
Happy的楠姐
Happy的楠姐 2020-12-07 01:16

I just got my hands on pandas and am figuring out how I can read a file. The file is from WRDS database and is the SP500 constituents list all the way back to 1960s. I check

4条回答
  •  失恋的感觉
    2020-12-07 01:52

    user, if you need to deal with the fixed format right now, you can use something like the following:

    def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
        reader = open(filename, 'r')
        # skip first rows 
        for i in xrange(ignore_first_rows):
            reader.next()
        if first_column_is_index:
            index = slice(0, fields[1])
            fields = [slice(*x) for x  in zip(fields[1:-1], fields[2:])]
            return ((line[index], [line[x].strip() for x in fields]) for line in reader)
        else:
            fields = [slice(*x) for x  in zip(fields[:-1], fields[1:])]
            return ((i, [line[x].strip() for x in fields]) for i,line in enumerate(reader)) 
    

    Here's a test program:

    import pandas
    import numpy
    import tempfile
    
    # create a data frame
    df = pandas.DataFrame(numpy.random.randn(100, 5))
    file_ = tempfile.NamedTemporaryFile(delete=True)
    file_.write(df.to_string())
    file_.flush()
    
    # specify fields
    fields = [0, 3, 12, 22, 32, 42, 52]
    df2 = pandas.DataFrame.from_items( fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1) ).T
    
    # need to specify the datatypes, otherwise everything is a string
    df2 = pandas.DataFrame(df2, dtype=float)
    df2.index = [int(x) for x in df2.index]
    
    # check
    assert (df - df2).abs().max().max() < 1E-6
    

    This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.

提交回复
热议问题