Using a custom object in pandas.read_csv()

Asked by 清酒与你 on 2020-12-15 14:31

I am interested in streaming a custom object into a pandas dataframe. According to the documentation, any object with a read() method can be used. However, even after implementing a read() method on my object, pandas raises an exception when I pass it to read_csv().

3 Answers
  • 2020-12-15 15:03

    One way to make a file-like object in Python 3 is by subclassing io.RawIOBase. Using Mechanical snail's iterstream, you can then convert any iterable of bytestrings into a file-like object:

    import tempfile
    import io
    import pandas as pd
    
    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """
        http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
        Lets you use an iterable (e.g. a generator) that yields bytestrings as a
        read-only input stream.
    
        The stream implements Python 3's newer I/O API (available in Python 2's io
        module).
    
        For efficiency, the stream is buffered.
        """
        class IterStream(io.RawIOBase):
            def __init__(self):
                self.leftover = None
            def readable(self):
                return True
            def readinto(self, b):
                try:
                    l = len(b)  # We're supposed to return at most this much
                    chunk = self.leftover or next(iterable)
                    output, self.leftover = chunk[:l], chunk[l:]
                    b[:len(output)] = output
                    return len(output)
                except StopIteration:
                    return 0    # indicate EOF
        return io.BufferedReader(IterStream(), buffer_size=buffer_size)
    
    
    class DataFile(object):
        def __init__(self, files):
            self.files = files
    
        def read(self):
            for file_name in self.files:
                with open(file_name, 'rb') as f:
                    for line in f:
                        yield line
    
    def make_files(num):
        filenames = []
        for i in range(num):
            with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
                f.write(b'''1,2,3\n4,5,6\n''')
                filenames.append(f.name)
        return filenames
    
    # hours = ['file1.csv', 'file2.csv', 'file3.csv']
    hours = make_files(3)
    print(hours)
    data = DataFile(hours)
    df = pd.read_csv(iterstream(data.read()), header=None)
    
    print(df)
    

    prints

       0  1  2
    0  1  2  3
    1  4  5  6
    2  1  2  3
    3  4  5  6
    4  1  2  3
    5  4  5  6
    
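    The iterstream wrapper isn't limited to file contents: any iterable of bytestrings works. As a sketch, the hypothetical rows() generator below synthesizes CSV rows on the fly and feeds them straight into read_csv, with no file involved at all:

    ```python
    import io
    import pandas as pd

    def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
        """Wrap an iterable of bytestrings as a buffered, read-only stream."""
        class IterStream(io.RawIOBase):
            def __init__(self):
                self.leftover = None
            def readable(self):
                return True
            def readinto(self, b):
                try:
                    l = len(b)  # we may return at most this many bytes
                    chunk = self.leftover or next(iterable)
                    output, self.leftover = chunk[:l], chunk[l:]
                    b[:len(output)] = output
                    return len(output)
                except StopIteration:
                    return 0  # signal EOF
        return io.BufferedReader(IterStream(), buffer_size=buffer_size)

    def rows():
        # Hypothetical data source: CSV lines generated on demand.
        for i in range(3):
            yield f'{i},{i * i}\n'.encode()

    df = pd.read_csv(iterstream(rows()), header=None)
    print(df)
    ```

    This makes the pattern useful for data that never touches disk, such as rows pulled from a network socket or produced by a computation.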
  • 2020-12-15 15:03

    The documentation mentions the read method, but pandas actually validates the argument with is_file_like (that's where the exception is thrown). That function is very simple:

    def is_file_like(obj):
        if not (hasattr(obj, 'read') or hasattr(obj, 'write')):
            return False
        if not hasattr(obj, "__iter__"):
            return False
        return True
    

    So it also needs an __iter__ method.

    But that's not the only problem. Pandas also requires the object to actually behave like a file: the read method must accept an argument for the number of bytes to read (so you can't make read a generator, because it has to be callable with a size argument and must return a string).

    So for example:

    class DataFile(object):
        def __init__(self, files):
            self.data = """a b
    1 2
    2 3
    """
            self.pos = 0
    
        def read(self, x):
            nxt = self.pos + x
            ret = self.data[self.pos:nxt]
            self.pos = nxt
            return ret
    
        def __iter__(self):
            yield from self.data.split('\n')
    

    will be recognized as valid input.
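    A runnable sketch of the class in action. Two small deviations from the snippet above, added for robustness and labeled as assumptions: the constructor takes the CSV text directly instead of an unused files list, and read accepts the io convention size=-1 meaning "read everything left":

    ```python
    import pandas as pd

    class DataFile:
        """File-like wrapper over an in-memory string: read(size) + __iter__."""
        def __init__(self, text):
            self.data = text
            self.pos = 0

        def read(self, size=-1):
            # size < 0 follows the io convention of "read the rest".
            if size is None or size < 0:
                size = len(self.data) - self.pos
            chunk = self.data[self.pos:self.pos + size]
            self.pos += size
            return chunk

        def __iter__(self):
            yield from self.data.split('\n')

    # Space-separated sample with a header row, hence sep=' '.
    df = pd.read_csv(DataFile("a b\n1 2\n2 3\n"), sep=' ')
    print(df)
    ```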

    It's harder with multiple files, though. I hoped that fileinput would provide appropriate routines, but it doesn't seem to:

    import fileinput
    
    pd.read_csv(fileinput.input([...]))
    # ValueError: Invalid file path or buffer object type: <class 'fileinput.FileInput'>
    
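    One workaround, assuming the files share a format and have no header rows, is to splice their contents into a single in-memory buffer before handing them to read_csv (concat_buffer is a hypothetical helper, not a pandas or fileinput API):

    ```python
    import io
    import tempfile
    from pathlib import Path

    import pandas as pd

    def concat_buffer(paths):
        # Join the raw text of every file into one StringIO, which
        # read_csv accepts as a regular file-like object.
        return io.StringIO(''.join(Path(p).read_text() for p in paths))

    # Demo with throwaway files (mirrors make_files from the first answer).
    paths = []
    for _ in range(2):
        with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
            f.write('1,2,3\n4,5,6\n')
            paths.append(f.name)

    df = pd.read_csv(concat_buffer(paths), header=None)
    print(df)
    ```

    Note this reads everything into memory up front, so it trades the streaming behavior of the accepted answer for simplicity.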
  • 2020-12-15 15:12

    How about this alternative approach:

    def get_merged_csv(flist, **kwargs):
        return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
    
    df = get_merged_csv(hours)
    
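    For completeness, a self-contained sketch of this approach, with throwaway temp files standing in for hours (header=None is forwarded through **kwargs because the sample files have no header row):

    ```python
    import tempfile

    import pandas as pd

    def get_merged_csv(flist, **kwargs):
        # Read each file into its own frame, then stack them,
        # renumbering the index from 0.
        return pd.concat([pd.read_csv(f, **kwargs) for f in flist],
                         ignore_index=True)

    hours = []
    for _ in range(3):
        with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
            f.write('1,2,3\n4,5,6\n')
            hours.append(f.name)

    df = get_merged_csv(hours, header=None)
    print(df)
    ```

    This sidesteps the file-like-object question entirely by letting read_csv open each path itself.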