Is there any way to save and read multi-dimensional data efficiently?


Question


Introduction

  • I have a collection of data series from 1000 stations, and each station has 4 features (e.g. temperature, wind, CO2 concentration, solar radiation).

  • All features are time series with hourly resolution.

I read this data from .csv files using Pandas.

Now I need to save and organize the series together for easier re-use.

My solution

I create columns named 'sample_x, feature_y', where each column contains the time series of feature_y for sample_x.

This method works but is not efficient, because I have to create around 4000 columns with long column names; a sketch of the layout is below.
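A minimal sketch of that flat layout (column names and data are made up for illustration, with 3 stations instead of 1000):

import numpy as np
import pandas as pd

n_hours = 24                      # one day of hourly data
features = ['temp', 'wind', 'co2', 'solar']

# one column per (station, feature) pair -> 4000 columns at full scale
flat = pd.DataFrame({
    f'sample_{s}, {f}': np.random.rand(n_hours)
    for s in range(3) for f in features
})
print (flat.columns.tolist())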

My question

Is there any better way to save multi-dimensional data in Python? I want a simple solution that lets me access and handle specific data directly.

Any advice or solutions are appreciated!


Answer 1:


I think you can use a MultiIndex or a Panel, and then if necessary save the data to HDF5. (Note: Panel was deprecated in pandas 0.20 and removed in 0.25, so MultiIndex is the forward-compatible choice.)

Also, the concat function has a keys parameter, which creates a MultiIndex from a list of DataFrames.

Sample:

import pandas as pd

df1 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5]})

print (df1)
   A  B  C  D
0  1  4  7  1
1  2  5  8  3
2  3  6  9  5

df2 = df1 * 10

dfs = [df1, df2]

df3 = pd.concat(dfs, keys=['a','b'])
print (df3)
      A   B   C   D
a 0   1   4   7   1
  1   2   5   8   3
  2   3   6   9   5
b 0  10  40  70  10
  1  20  50  80  30
  2  30  60  90  50

print (df3.index)
MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

wp = pd.Panel({'a' : df1, 'b' : df2})
print (wp)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: a to b
Major_axis axis: 0 to 2
Minor_axis axis: A to D
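For the "save to HDF5" part, a minimal sketch using the df3 built above (the file name and key are arbitrary; to_hdf requires the PyTables package):

df3.to_hdf('stations.h5', key='data', mode='w')

df4 = pd.read_hdf('stations.h5', key='data')
print (df4.loc['a'])   # select one group via the first MultiIndex level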



Answer 2:


You may want to use HDF5, which was specifically designed to handle huge arrays of multi-dimensional data.
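A minimal sketch with the h5py package (the array shape, file name, and dataset name are illustrative assumptions, and the data is random):

import numpy as np
import h5py

# 10 stations here for brevity; 1000 x 8760 x 4 at full scale
data = np.random.rand(10, 8760, 4).astype('float32')

with h5py.File('readings.h5', 'w') as f:
    dset = f.create_dataset('readings', data=data, compression='gzip')
    dset.attrs['features'] = ['temp', 'wind', 'co2', 'solar']

with h5py.File('readings.h5', 'r') as f:
    station_5 = f['readings'][5]   # reads only this station's slice from disk
print (station_5.shape)            # (8760, 4)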




Answer 3:


The simplest answer may be just to create a sqlite3 database.

It sounds like you have 6 pieces of data per hour (station, timestamp, feature1..feature4) times 1000 stations, times however-many hours.

So that's 6000 data items per hour (at, say, 4 bytes each = 24 KB), times 24 hours/day times 365 days/year (× 8760), or about 200 MB per year. Depending on how far back you're going, that's not too bad for a database file. (If you're going to do more than 10 years, then yeah, go to something bigger, or maybe compress the data or break it up by year or something...)
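A minimal sketch with Python's built-in sqlite3 module (the table and column names are illustrative):

import sqlite3

conn = sqlite3.connect('stations.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        station   INTEGER,
        timestamp TEXT,
        temp      REAL,
        wind      REAL,
        co2       REAL,
        solar     REAL,
        PRIMARY KEY (station, timestamp)
    )
""")

# one hourly reading for one station
conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?, ?, ?)",
             (42, '2020-01-06 18:00', 3.1, 7.4, 412.0, 0.0))
conn.commit()

# pull back one station's time series for a single feature
rows = conn.execute("SELECT timestamp, temp FROM readings "
                    "WHERE station = ? ORDER BY timestamp", (42,)).fetchall()
print (rows)
conn.close()

Pandas can also read such a query straight into a DataFrame with pd.read_sql if you want to stay in the DataFrame world.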



Source: https://stackoverflow.com/questions/42641335/is-there-any-way-to-save-and-read-multi-dimension-data-with-efficiency
