Question
I often want to quickly save some Python data, but I would also like to save it in a stable file format in case the data lingers for a long time. So my question is: how should I save my data?
In data science, there are three kinds of data I want to store: arbitrary Python objects, numpy arrays, and Pandas dataframes. What are the stable ways of storing these?
Answer 1:
Arbitrary Python data and code can be stored in the .pkl pickle format. While pickle files have security concerns because loading them can execute arbitrary code, if you can trust the source of a pickle file, it is a stable format.
From the Python standard library's pickle documentation:
The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
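As a minimal sketch of the "compatible pickle protocol" point in the quote above, the protocol can be pinned explicitly when writing. The record contents and the file name are made up for illustration, and protocol 4 is only one reasonable choice (it is readable by Python 3.4 and later).

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical example data; any picklable Python object works.
record = {"name": "trial-1", "scores": [0.91, 0.87], "tags": ("a", "b")}

# Write into a temporary directory so the example leaves no stray files.
path = Path(tempfile.mkdtemp()) / "record.pkl"

# Pin the protocol instead of relying on the interpreter's default,
# so older Python 3 versions can still read the file.
with path.open("wb") as f:
    pickle.dump(record, f, protocol=4)

# Only unpickle files that come from a source you trust.
with path.open("rb") as f:
    restored = pickle.load(f)

print(restored == record)
```

Unlike JSON, pickle preserves Python-specific types such as tuples, which is part of why it suits arbitrary objects.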
Most Python data built from basic types (dicts, lists, strings, numbers) can also be stored in the JSON format. I haven't used this format much myself, but dawg recommends it. Like the CSV and tab-delimited formats I recommend for Pandas, JSON is a plain-text format that is very stable.
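A short sketch of a JSON round trip with the standard library's json module may help here. The settings dictionary and file name are invented for the example. Note that JSON only covers basic types: tuples come back as lists, and dictionary keys are always strings.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example data made only of JSON-compatible types.
settings = {"learning_rate": 0.01, "layers": [64, 32], "shuffle": True, "note": None}

path = Path(tempfile.mkdtemp()) / "settings.json"

# indent and sort_keys keep the file readable and friendly to diffs.
with path.open("w", encoding="utf-8") as f:
    json.dump(settings, f, indent=2, sort_keys=True)

with path.open(encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == settings)

# Tuples do not survive the round trip; they are stored as JSON arrays.
as_list = json.loads(json.dumps((1, 2)))
```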
Numpy arrays can be stored in the .npy or .npz numpy formats. The npy format is a very simple format that stores a single numpy array. I imagine it would be easy to read this format in any language. The npz format allows the storing of multiple arrays in the same file. Adapted from the docs,
import numpy as np

x = np.arange(10)
np.save('example.npy', x)   # write a single array to disk
y = np.load('example.npy')  # read it back
If the integrity of the file being loaded is not guaranteed, be sure to pass allow_pickle=False to np.load (the default since NumPy 1.16.3) to avoid arbitrary code execution.
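For the multi-array .npz case mentioned above, a minimal sketch using np.savez might look like this. The array names features and weights, and the file name, are placeholders chosen for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

# Two hypothetical arrays to store together in one file.
features = np.arange(10)
weights = np.linspace(0.0, 1.0, 5)

path = str(Path(tempfile.mkdtemp()) / "example.npz")

# Keyword names become the keys used to retrieve each array later.
np.savez(path, features=features, weights=weights)

# allow_pickle=False refuses object arrays, so loading cannot run pickled code.
with np.load(path, allow_pickle=False) as archive:
    stored_names = sorted(archive.files)
    loaded_features = archive["features"]
    loaded_weights = archive["weights"]

print(stored_names)
```

np.savez_compressed takes the same arguments and writes a compressed archive, which can be worth it for large arrays.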
Pandas dataframes can be stored in a wide variety of formats, as I wrote in a previous answer. For small datasets, I find plain-text formats such as CSV and tab-delimited files work well for most purposes. These formats are readable in many languages, and I have had no issues working in a mixed R and Python environment where both read from the same files. The formats Pandas supports, with their reader and writer functions:
Format Type   Data Description      Reader           Writer
text          CSV                   read_csv         to_csv
text          JSON                  read_json        to_json
text          HTML                  read_html        to_html
text          Local clipboard       read_clipboard   to_clipboard
binary        MS Excel              read_excel       to_excel
binary        HDF5 Format           read_hdf         to_hdf
binary        Feather Format        read_feather     to_feather
binary        Parquet Format        read_parquet     to_parquet
binary        Msgpack               read_msgpack     to_msgpack
binary        Stata                 read_stata       to_stata
binary        SAS                   read_sas
binary        Python Pickle Format  read_pickle      to_pickle
SQL           SQL                   read_sql         to_sql
SQL           Google Big Query      read_gbq         to_gbq
When writing CSV and tab-delimited files from Pandas, I often use the index=False option to avoid saving the index, which loads as an oddly-named column by default.
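To illustrate the index=False point, here is a small sketch comparing the two behaviours. The dataframe contents and file names are invented for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical small dataset.
df = pd.DataFrame({"sample": ["a", "b", "c"], "value": [1.5, 2.0, 3.25]})

out_dir = Path(tempfile.mkdtemp())

# Default behaviour: the index is written as an unnamed first column.
df.to_csv(out_dir / "with_index.csv")
with_index = pd.read_csv(out_dir / "with_index.csv")

# index=False writes only the data columns.
df.to_csv(out_dir / "no_index.csv", index=False)
without_index = pd.read_csv(out_dir / "no_index.csv")

# The same option works for tab-delimited files via sep="\t".
df.to_csv(out_dir / "no_index.tsv", sep="\t", index=False)
from_tsv = pd.read_csv(out_dir / "no_index.tsv", sep="\t")

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index does carry meaning, an alternative is to keep it and read the file back with index_col=0.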
Source: https://stackoverflow.com/questions/63583264/what-are-the-standard-stable-file-formats-used-in-python-for-data-science