How to store 7.3 billion rows of market data (optimized to be read)?

故里飘歌 · 2020-12-12 09:53

I have a dataset of 1-minute data for 1000 stocks since 1998, totalling around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.

Most (99.9%) of the time I will only perform read requests.

13 Answers
  • 2020-12-12 10:13

    It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
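
    As a minimal sketch of what that could look like, assuming the h5py library, one dataset per ticker, and illustrative field names (none of this comes from the question itself):

    import numpy as np
    import h5py

    # Assumed record layout for one minute bar (names and types are illustrative).
    bar_dtype = np.dtype([
        ("open", "f4"), ("high", "f4"), ("low", "f4"),
        ("close", "f4"), ("volume", "u4"),
    ])

    with h5py.File("bars.h5", "w") as f:
        # One chunked, resizable dataset per ticker (the layout is an assumption).
        ds = f.create_dataset("AAPL", shape=(0,), maxshape=(None,),
                              dtype=bar_dtype, chunks=(1440,))

        # Append one trading day of minute bars (390 minutes in a US session).
        day = np.zeros(390, dtype=bar_dtype)
        start = ds.shape[0]
        ds.resize((start + len(day),))
        ds[start:] = day

        # Reads are plain slices: minute offsets map straight to array indices.
        first_hour = ds[0:60]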

  • 2020-12-12 10:16

    Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
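
    A rough sketch of that idea, under assumed conventions (one file per ticker, a fixed 20-byte record, minutes counted from an assumed 1998-01-01 epoch with no gaps):

    import struct
    from datetime import datetime

    # Assumed fixed record: open, high, low, close as float32, volume as uint32.
    RECORD = struct.Struct("<ffffI")          # 20 bytes per minute bar
    EPOCH = datetime(1998, 1, 1)              # assumed start of every file

    def read_range(path, start, end):
        """Read all minute bars in [start, end) for the stock stored in `path`."""
        first = int((start - EPOCH).total_seconds() // 60)
        count = int((end - start).total_seconds() // 60)
        with open(path, "rb") as f:
            f.seek(first * RECORD.size)       # the position is implicit in the time
            blob = f.read(count * RECORD.size)
        return [RECORD.unpack_from(blob, i * RECORD.size)
                for i in range(len(blob) // RECORD.size)]

    # e.g. the nominal query from further down the thread:
    bars = read_range("AAPL.dat", datetime(2012, 4, 12, 12, 15),
                      datetime(2012, 4, 13, 12, 52))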

    I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).

    Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).

    Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
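
    A minimal sketch of that packing; the field names and the use of the spare four bits as a scale exponent are assumptions:

    MASK20 = (1 << 20) - 1

    def pack(price, bid, ask, scale):
        """Pack three 20-bit integers plus a 4-bit scale into one 64-bit word."""
        assert all(0 <= v <= MASK20 for v in (price, bid, ask)) and 0 <= scale < 16
        return price | (bid << 20) | (ask << 40) | (scale << 60)

    def unpack(word):
        return (word & MASK20, (word >> 20) & MASK20,
                (word >> 40) & MASK20, (word >> 60) & 0xF)

    word = pack(123456, 123455, 123457, 2)   # values scaled by 10**2, for example
    assert unpack(word) == (123456, 123455, 123457, 2)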

    You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...

    EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
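
    As a quick back-of-the-envelope check of those figures (16 bytes per record, as assumed above):

    rows = (2012 - 1998) * 365 * 24 * 60 * 1000   # ~7.36 billion minute rows
    bytes_per_row = 16                            # two packed 64-bit words, as above

    every_minute = rows * bytes_per_row / 1e9     # ~118 GB: the "~110GB" ballpark, give or take rounding
    market_hours_only = every_minute * 40 / 168   # ~28 GB for an 8h x 5d market
    print(every_minute, market_hours_only)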

    EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.

  • 2020-12-12 10:21

    Let me recommend that you take a look at Apache Solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,

    "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"

    would translate to something like:

    ?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
    

    Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).

  • 2020-12-12 10:21

    If your use case is to simply read rows without aggregation, you can use an Aerospike cluster. It's an in-memory database with file-system persistence, and it's also SSD-optimized.

    If your use case needs aggregated data, go for a MongoDB cluster with date-range sharding. You can group each year's data into its own set of shards.
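
    A hedged sketch of the date-range sharding idea with pymongo, run against the mongos router of an assumed sharded cluster (database, collection, and field names are assumptions; pinning each year's range to specific shards would additionally use zone ranges, omitted here):

    from datetime import datetime
    from pymongo import MongoClient

    # Connect to the mongos router of an assumed sharded cluster.
    client = MongoClient("mongodb://mongos.example.com:27017")

    # Enable sharding on the database, then range-shard the collection on date.
    client.admin.command("enableSharding", "market")
    client.admin.command("shardCollection", "market.bars", key={"date": 1})

    # A year's worth of bars for one symbol is then routed to a few shards only.
    bars = client.market.bars.find({
        "symbol": "AAPL",
        "date": {"$gte": datetime(2008, 1, 1), "$lt": datetime(2009, 1, 1)},
    })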

  • 2020-12-12 10:24

    First, there aren't 365 trading days in a year: after holidays and 52 weekends (104 days) you're left with roughly 250 days, times the actual hours per day the market is open, as someone else said. So your estimate of 7.3 billion rows is vastly over-calculated; it works out to only about 1.7 million rows per symbol for 14 years. Also, using the symbol as the primary key is not a good idea, since symbols change; use a numeric k_equity_id together with a symbol (char) column, since symbols can look like A or GAC-DB-B.TO. Then in your price-data table you have:

    k_equity_id k_date k_minute

    and for the EOD table (that will be viewed 1000x over the other data)

    k_equity_id k_date

    Second, don't store your by-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a PnF or line chart over a year period has zero interest in the minute-by-minute information.
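
    A minimal sketch of that split, using Python's built-in sqlite3 purely to show the two-table shape (the non-key columns and their types are assumptions):

    import sqlite3

    conn = sqlite3.connect("market.db")
    conn.executescript("""
        -- Minute-level OHLC, keyed by numeric equity id + date + minute.
        CREATE TABLE IF NOT EXISTS minute_bars (
            k_equity_id INTEGER NOT NULL,
            k_date      INTEGER NOT NULL,   -- e.g. days since an epoch
            k_minute    INTEGER NOT NULL,   -- minute of the trading day
            open REAL, high REAL, low REAL, close REAL, volume INTEGER,
            PRIMARY KEY (k_equity_id, k_date, k_minute)
        );

        -- End-of-day table, kept separate because it is read far more often.
        CREATE TABLE IF NOT EXISTS eod_bars (
            k_equity_id INTEGER NOT NULL,
            k_date      INTEGER NOT NULL,
            open REAL, high REAL, low REAL, close REAL, volume INTEGER,
            PRIMARY KEY (k_equity_id, k_date)
        );

        -- Symbols map to a stable numeric key, since tickers like GAC-DB-B.TO change.
        CREATE TABLE IF NOT EXISTS equities (
            k_equity_id INTEGER PRIMARY KEY,
            symbol      TEXT NOT NULL
        );
    """)
    conn.commit()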

  • 2020-12-12 10:27

    Tell us about the queries, and your hardware environment.

    I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.

    Update

    Okay, why?

    First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database, because:

    • My experience with similar systems suggests the access will either be big sequential reads (computing some kind of time-series analysis) or very, very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which will take either lots of time or lots of space.

    • If you're effectively doing big runs against a lot of data in an OLAP world, however, a column-oriented approach might be best.

    • If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because

      • you can better exploit parallelism on relatively small commodity hardware.
      • you can also better implement high reliability and redundancy
      • many of those problems lend themselves naturally to the MapReduce paradigm (a toy sketch follows this list).
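
    As a toy illustration of how this kind of work maps onto that paradigm (pure Python standing in for a real Hadoop job; the record shape and values are assumptions):

    from collections import defaultdict

    # map: emit (symbol, date) -> volume for each minute bar
    def map_bar(bar):
        symbol, date, minute, close, volume = bar
        yield (symbol, date), volume

    # reduce: sum volumes per (symbol, date) key
    def reduce_volumes(key, values):
        return key, sum(values)

    bars = [
        ("AAPL", "2012-04-12", 570, 601.25, 1200),   # illustrative rows
        ("AAPL", "2012-04-12", 571, 601.40, 900),
        ("GOOG", "2012-04-12", 570, 635.10, 700),
    ]

    grouped = defaultdict(list)                      # the "shuffle" step
    for bar in bars:
        for key, value in map_bar(bar):
            grouped[key].append(value)

    daily_volume = dict(reduce_volumes(k, v) for k, v in grouped.items())
    # {('AAPL', '2012-04-12'): 2100, ('GOOG', '2012-04-12'): 700}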

    But the fact is, until we know about your workload, it's impossible to say anything definitive.
