How to handle incoming real time data with python pandas

日久生厌

Which is the most recommended/pythonic way of handling live incoming data with pandas?

Every few seconds I'm receiving a data point in the format below:

2 Answers

长情又很酷

2020-12-07 09:25
I would use HDF5/pytables as follows:
1. Keep the data as a python list "as long as possible".
2. Append your results to that list.
3. When it gets "big":
  - push to HDF5 Store using pandas io (and an appendable table).
  - clear the list.
4. Repeat.
In fact, the function I define uses a list for each "key" so that you can store multiple DataFrames to the HDF5 Store in the same process.

We define a function which you call with each row d:
```
import pandas as pd  # pd.HDFStore also needs PyTables (the `tables` package) installed

CACHE = {}
STORE = 'store.h5'   # Note: another option is to keep the actual file open

def process_row(d, key, max_len=5000, _cache=CACHE):
    """
    Append row d to the store 'key'.

    When the number of items in the key's cache reaches max_len,
    append the list of rows to the HDF5 store and clear the list.

    """
    # keep the rows for each key separate.
    lst = _cache.setdefault(key, [])
    if len(lst) >= max_len:
        store_and_clear(lst, key)
    lst.append(d)

def store_and_clear(lst, key):
    """
    Convert key's cache list to a DataFrame and append that to HDF5.
    """
    df = pd.DataFrame(lst)
    with pd.HDFStore(STORE) as store:
        store.append(key, df)
    lst.clear()
```
Note: we use the `with` statement to close the store automatically after each write. Keeping the store open may be faster, but if you do so you should flush regularly (closing flushes). Also note that a `collections.deque` might read more naturally than a list, but a list will perform slightly better here.

To use this you call as:
```
process_row({'time' :'2013-01-01 00:00:00', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0},
            key="df")
```
Note: "df" is the stored key used in the pytables store.

Once the job has finished ensure you store_and_clear the remaining cache:
```
for k, lst in CACHE.items():  # you can instead use .iteritems() in python 2
    store_and_clear(lst, k)
```
Now your complete DataFrame is available via:
```
with pd.HDFStore(STORE) as store:
    df = store["df"]                    # other keys will be store[key]
```
Some comments:
- 5000 can be adjusted, try with some smaller/larger numbers to suit your needs.
- List append is O(1), DataFrame append is O(len(df)).
- Until you're doing stats or data-munging you don't need pandas, use what's fastest.
- This code works with multiple keys (data points) coming in.
- This is very little code, and we stay in a vanilla Python list and then a pandas DataFrame.
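
To illustrate the O(1) versus O(len(df)) comment, here is a minimal sketch that builds the same DataFrame two ways. The sample rows are made up for the example, and `pd.concat` stands in for row-wise DataFrame appending (`DataFrame.append` was removed in pandas 2.0). Both routes give the same result, but the second copies every existing row on each iteration, which is why buffering in a list first is cheaper.
```python
import pandas as pd

# Made-up sample rows, shaped like the data points in the question.
rows = [
    {"time": f"2013-01-01 00:00:{i:02d}", "stock": "BLAH", "close": float(i)}
    for i in range(5)
]

# Buffered approach: cheap list appends, one DataFrame built at the end.
buffer = []
for row in rows:
    buffer.append(row)
buffered_df = pd.DataFrame(buffer)

# Row-at-a-time approach: each concat allocates a new frame and copies
# all rows accumulated so far, so the total work grows quadratically.
grown_df = pd.DataFrame([rows[0]])
for row in rows[1:]:
    grown_df = pd.concat([grown_df, pd.DataFrame([row])], ignore_index=True)
```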
Additionally, to get up-to-date reads you could define a get function which stores and clears the cache before reading, so that rows still waiting in the list are included:
```
def get_latest(key, _cache=CACHE):
    store_and_clear(_cache[key], key)
    with pd.HDFStore(STORE) as store:
        return store[key]
```
Now when you access with:
```
df = get_latest("df")
```
you'll get the latest "df" available.

Another option is slightly more involved: define a custom table in vanilla pytables, see the tutorial.

Note: You need to know the field-names to create the column descriptor.
北恋

2020-12-07 09:28

You are actually trying to solve two problems: capturing real-time data and analyzing that data. The first problem can be solved with Python logging, which is designed for this purpose. Then the other problem can be solved by reading that same log file.
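
A minimal sketch of that idea, using only the standard library. The log path, the logger name, the field order and the `log_row` helper are all assumptions made for this example, and joining values with commas only works while the values themselves contain no commas.

```python
import csv
import logging
import os
import tempfile

# Hypothetical log location and field order, chosen for this sketch.
LOG_PATH = os.path.join(tempfile.mkdtemp(), "ticks.log")
FIELDS = ["time", "stock", "high", "low", "open", "close"]

# A dedicated logger that writes bare messages, one data point per line.
logger = logging.getLogger("ticks")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep data points out of the root logger's output
handler = logging.FileHandler(LOG_PATH)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

def log_row(d):
    """Capture step: write one data point as a comma-separated line."""
    logger.info(",".join(str(d[f]) for f in FIELDS))

log_row({"time": "2013-01-01 00:00:00", "stock": "BLAH",
         "high": 4.0, "low": 3.0, "open": 2.0, "close": 1.0})
log_row({"time": "2013-01-01 00:00:05", "stock": "BLAH",
         "high": 4.5, "low": 3.5, "open": 2.5, "close": 1.5})
handler.flush()

# Analysis step: read the same log file back; all values arrive as strings.
with open(LOG_PATH, newline="") as fh:
    rows = list(csv.DictReader(fh, fieldnames=FIELDS))
```

For the analysis side, the same file could be loaded into pandas with `pd.read_csv(LOG_PATH, names=FIELDS)` instead of the `csv` module.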
