How to store 7.3 billion rows of market data (optimized to be read)?

故里飘歌  2020-12-12 09:53

I have a dataset of 1-minute data for 1000 stocks since 1998, totalling around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.

Most (99.9%) of the time I will perform only read requests.

13 Answers
  •  無奈伤痛
    2020-12-12 10:27

    Tell us about the queries, and your hardware environment.

    I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.

    Update

    Okay, why?

    First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because

    • My experience with similar systems suggests the access will be either big sequential reads (computing some kind of time-series analysis) or very, very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which will take either lots of time or lots of space.

    • If you're doing what are effectively big runs across a lot of the data in an OLAP style, however, a column-oriented approach might be best (a small sketch of that idea appears at the end of this answer).

    • If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because

      • you can better exploit parallelism on relatively small commodity hardware.
      • you can also better implement high reliability and redundancy
      • many of those problems lend themselves naturally to the MapReduce paradigm, as sketched just below this list.
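
    To make that last point concrete, here is a minimal sketch of what a Hadoop Streaming job over this data could look like. Everything specific in it is an assumption for illustration: the CSV layout (symbol,timestamp,open,high,low,close,volume), the statistic (total traded volume per symbol), and the script name; it shows the map/reduce shape, not a concrete schema recommendation.

```python
# Sketch of a Hadoop Streaming mapper/reducer pair (assumed CSV layout:
# symbol,timestamp,open,high,low,close,volume). In a real job the two
# functions would usually live in separate scripts handed to the
# streaming jar via -mapper and -reducer.
import itertools
import sys


def mapper(lines=sys.stdin):
    """Emit "symbol<TAB>volume" for every 1-minute bar read from stdin."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        print(f"{fields[0]}\t{fields[6]}")


def reducer(lines=sys.stdin):
    """Hadoop sorts mapper output by key, so one pass per symbol suffices."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for symbol, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(volume) for _, volume in group)
        print(f"{symbol}\t{total}")


if __name__ == "__main__":
    # Local dry run that mimics the framework:
    #   python job.py map < bars.csv | sort | python job.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```

    The same shape covers the cross-comparison queries: the mapper keys each record by whatever you want to compare across (date, symbol pair, and so on), and the cluster handles the shuffle and the parallelism for you.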

    But the fact is, until we know about your workload, it's impossible to say anything definitive.
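
    Still, to give the column-oriented bullet above some shape: the idea is that a query which only needs one field should only have to read that field's bytes. The sketch below is purely illustrative, with an assumed per-symbol file naming scheme and an assumed 28-byte bar record; real column stores (and formats such as Parquet) layer compression and metadata on the same principle.

```python
# Minimal column-store sketch: one flat binary file per field per symbol,
# so a scan that only needs closing prices touches only the close column.
# File names and the bar layout are illustrative assumptions.
import numpy as np

N = 1_000_000  # illustrative number of 1-minute bars for one symbol

# Row-oriented view: one structured 28-byte record per bar.
bars = np.zeros(N, dtype=[("ts", "i8"), ("open", "f4"), ("high", "f4"),
                          ("low", "f4"), ("close", "f4"), ("volume", "i4")])

# Column-oriented write: each field lands in its own file.
for field in bars.dtype.names:
    np.ascontiguousarray(bars[field]).tofile(f"AAPL.{field}.bin")

# A time-series scan over closes now reads ~4 MB instead of ~28 MB,
# and the reads are purely sequential.
closes = np.memmap("AAPL.close.bin", dtype="f4", mode="r")
print(closes.mean())
```

    Scaled to 7.3 billion rows, that is the difference between roughly 30 GB of sequential reads for a one-column scan and roughly 200 GB if every query drags whole 28-byte rows past the disk.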
