Python pandas persistent cache

人走茶凉 posted on 2019-12-23 15:48:19

Question


Is there an implementation for Python pandas that caches data on disk, so I can avoid regenerating it every time?

In particular, is there a caching method for get_yahoo_data for financial data?

Big pluses would be:

  • very few lines of code to write
  • the ability to merge newly downloaded data from the same source into the persisted series

Answer 1:


There are many ways to achieve this, but probably the easiest is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame on disk and pandas.read_pickle to read the stored DataFrame back.

An example for a pandas.DataFrame:

import pandas

# Store your DataFrame (df is assumed to be an existing DataFrame)
df.to_pickle('cached_dataframe.pkl') # will be stored in current directory

# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from current directory

The same methods also work for pandas.Series:

# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory

# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
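
To avoid recomputing the data on every run, the two calls above can be wrapped in a small load-or-build helper. This is only a minimal sketch: `cached_frame` and its `build` argument are hypothetical names introduced for illustration, not part of pandas.

```python
import os

import pandas as pd


def cached_frame(path, build):
    """Return the DataFrame pickled at `path`, building and saving it on a miss.

    `build` is a hypothetical zero-argument callable that produces the data,
    for example a wrapper around a slow download.
    """
    if os.path.exists(path):
        return pd.read_pickle(path)  # cache hit: load from disk
    df = build()                     # cache miss: compute once
    df.to_pickle(path)               # persist for the next run
    return df
```

The helper never refreshes an existing file, so delete the pickle (or add your own expiry check) when the source data changes. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.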



Answer 2:


Depending on your requirements, there are a dozen or so ways to do this in both directions: CSV, Excel, JSON, Python pickle, HDF5, and even SQL with a database.

In terms of lines of code, writing to and reading from many of these formats takes just one line in each direction. Python and pandas already keep the code about as clean as it can be, so you need not worry much about that.

I don't think a single solution fits every requirement; it really depends on the case:

  • for human readability of saved data: CSV, Excel
  • for binary serialization of Python objects: Pickle
  • for data interchange: JSON
  • for long-term storage and incremental updates: SQL
  • etc.
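
As an illustration of the human-readable option in the list above, here is a rough sketch of a CSV round trip. The column name, index name and sample values are made up for the example. CSV stores no type information, so the dates have to be parsed again on the way back in.

```python
import os
import tempfile

import pandas as pd

# Made-up sample prices indexed by date
df = pd.DataFrame(
    {"close": [10.0, 10.5, 11.0]},
    index=pd.to_datetime(["2019-12-19", "2019-12-20", "2019-12-23"]).rename("date"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "prices.csv")
    df.to_csv(csv_path)  # one line to write
    # one line to read; index_col and parse_dates restore the date index
    from_csv = pd.read_csv(csv_path, index_col="date", parse_dates=True)
```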

If you want to update stock prices daily and keep them for later use, I prefer pandas with SQL queries. This of course adds a few lines of code to set up the DB connection:

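import pandas as pd
from sqlalchemy import create_engine

# getting_daily_price() is a placeholder for your own download function
new_data = getting_daily_price()
# You can also choose other DB drivers instead of `sqlalchemy`.
# ':memory:' vanishes when the process exits; use a file URL such as
# 'sqlite:///prices.db' (example name) for a cache that survives between runs.
engine = create_engine('sqlite:///:memory:')
with engine.begin() as conn:  # commits on successful exit
    new_data.to_sql('table_name', conn, if_exists='append')  # To Write
    df = pd.read_sql_table('table_name', conn)  # To Read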
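
For the incremental case raised in the question, one possible approach is to append only the rows that are not stored yet. The sketch below uses the standard-library sqlite3 driver rather than SQLAlchemy and assumes the data has a `date` column of ISO-formatted strings. `append_new_rows` and the column names are hypothetical, chosen for illustration.

```python
import sqlite3

import pandas as pd


def append_new_rows(conn, table, new_data):
    """Append rows of `new_data` whose `date` is not yet in `table`.

    `table` is interpolated into SQL, so it must be a trusted name.
    Returns the number of rows written.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if exists:
        known = pd.read_sql_query(f'SELECT date FROM "{table}"', conn)["date"]
        fresh = new_data[~new_data["date"].isin(known)]
    else:
        fresh = new_data  # first run: nothing stored yet
    if not fresh.empty:
        fresh.to_sql(table, conn, if_exists="append", index=False)
    return len(fresh)
```

Open the connection with a file path, for example sqlite3.connect('prices.db'), so the table survives between runs. Calling the function again with an overlapping download then stores each date only once.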


Source: https://stackoverflow.com/questions/51235360/python-pandas-persistent-cache
