How do I read a fix width format text file in pandas

后端 未结 4 1972
Happy的楠姐
Happy的楠姐 2020-12-07 01:16

I just got my hands on pandas and am figuring out how I can read a file. The file is from WRDS database and is the SP500 constituents list all the way back to 1960s. I check

相关标签:
4条回答
  • 2020-12-07 01:29

    What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?

    If what you do is print the whole data frame to the console, then take a look at df.to_string(), but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:

    import pandas
    import numpy 
    
    df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)] )
    df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)] )
    
    print df1   # <--- substitute by df2 to see the difference
    print
    print df1['col1']
    print
    print df1.to_string()
    
    0 讨论(0)
  • 2020-12-07 01:35

    Wes answered me in an email. Cheers.

    This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.

    I recommend following the issue on github as your e-mail is about to disappear from my inbox :)

    https://github.com/pydata/pandas/issues/920

    best, Wes

    0 讨论(0)
  • 2020-12-07 01:45

    pandas.read_fwf() was added in pandas 0.7.3 (April 2012) to handle fixed-width files.

    1. API reference

    2. An example from other question

    0 讨论(0)
  • 2020-12-07 01:52

    user, if you need to deal with the fixed format right now, you can use something like the following:

    def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
        reader = open(filename, 'r')
        # skip first rows 
        for i in xrange(ignore_first_rows):
            reader.next()
        if first_column_is_index:
            index = slice(0, fields[1])
            fields = [slice(*x) for x  in zip(fields[1:-1], fields[2:])]
            return ((line[index], [line[x].strip() for x in fields]) for line in reader)
        else:
            fields = [slice(*x) for x  in zip(fields[:-1], fields[1:])]
            return ((i, [line[x].strip() for x in fields]) for i,line in enumerate(reader)) 
    

    Here's a test program:

    import pandas
    import numpy
    import tempfile
    
    # create a data frame
    df = pandas.DataFrame(numpy.random.randn(100, 5))
    file_ = tempfile.NamedTemporaryFile(delete=True)
    file_.write(df.to_string())
    file_.flush()
    
    # specify fields
    fields = [0, 3, 12, 22, 32, 42, 52]
    df2 = pandas.DataFrame.from_items( fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1) ).T
    
    # need to specify the datatypes, otherwise everything is a string
    df2 = pandas.DataFrame(df2, dtype=float)
    df2.index = [int(x) for x in df2.index]
    
    # check
    assert (df - df2).abs().max().max() < 1E-6
    

    This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.

    0 讨论(0)
提交回复
热议问题