How to store 7.3 billion rows of market data (optimized to be read)?

Backend · unresolved · 13 answers · 610 views
故里飘歌 · 2020-12-12 09:53

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
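A quick back-of-envelope check of that estimate, using only the figures stated above (14 years, one row per minute around the clock, 1000 stocks):

```python
# Back-of-envelope row count from the question's own figures:
# 14 years of 1-minute rows, around the clock, for 1000 stocks.
years = 2012 - 1998               # 14
minutes_per_year = 365 * 24 * 60  # 525,600
stocks = 1000

total_rows = years * minutes_per_year * stocks
print(total_rows)  # 7358400000, i.e. about 7.3 billion
```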

Most (99.9%) of the time I will perform only read requests.

13 answers
  •  执念已碎
    2020-12-12 10:13

    I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.

    Writing time-based numerical data once and reading it many times is the use case known as "time series". Other common time series workloads include sensor data in the Internet of Things, server monitoring statistics, application events, etc.

    This question was asked in 2012, and since then several database engines have been developing features specifically for managing time series. I've had great results with InfluxDB, which is open source, written in Go, and MIT-licensed.

    InfluxDB has been specifically optimized to store and query time series data, much more so than Cassandra, which is often touted as great for storing time series:

    Optimizing for time series involved certain tradeoffs. For example:

    Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.

    Pro: Restricting access to updates allows for increased query and write performance

    Con: Update functionality is significantly restricted

    In open sourced benchmarks,

    InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.

    Queries are also very simple. If each row holds a symbol, a timestamp, and price fields such as open and close, with InfluxDB you can store just that and then query it easily. Say, for a day's worth of AAPL data:

    SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
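    Writes are just as simple: InfluxDB ingests a plain-text "line protocol" of the form `measurement,tag=value field=value,... timestamp`. A minimal sketch of formatting a 1-minute bar that way; the measurement, tag, and field names mirror the query above, and the helper function itself is hypothetical:

```python
# Hypothetical helper: format one 1-minute bar as an InfluxDB
# line-protocol entry (measurement,tags fields timestamp_in_ns).
def to_line_protocol(symbol, open_, close, ts_ns):
    return f"market_data,symbol={symbol} open={open_},close={close} {ts_ns}"

line = to_line_protocol("AAPL", 589.1, 590.2, 1334232900000000000)
print(line)
# market_data,symbol=AAPL open=589.1,close=590.2 1334232900000000000
```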
    

    There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
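    One reason regular time series compress so well: evenly spaced timestamps delta-encode to a nearly constant stream, which any compressor then collapses. A small illustration of the idea (this is not InfluxDB's actual codec, just a demonstration of why the data is so compressible):

```python
import zlib

# 1-minute timestamps (in seconds) for one full day: perfectly regular.
timestamps = list(range(0, 24 * 60 * 60, 60))           # 1440 values
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]  # all 60

raw = b"".join(t.to_bytes(8, "little") for t in timestamps)
enc = b"".join(d.to_bytes(8, "little") for d in deltas)

# The delta stream is a single repeated value, so it compresses
# far smaller than the raw timestamp stream.
print(len(zlib.compress(raw)), len(zlib.compress(enc)))
```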
