How to store 7.3 billion rows of market data (optimized to be read)?

故里飘歌 2020-12-12 09:53

I have a dataset of 1-minute data for 1000 stocks since 1998, which totals around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.

Most (99.9%) of the time

13 answers
  •  借酒劲吻你
    2020-12-12 10:05

    So databases are for situations where you have a large, complicated schema that is constantly changing. You have only one "table" with a handful of simple numeric fields. I would do it this way:

    Prepare a C/C++ struct to hold the record format:

    #include <time.h>   // timespec

    struct StockPrice
    {
        char ticker_code[2];
        double stock_price;
        timespec when;
        // etc.
    };
    

    Then calculate sizeof(StockPrice) * N, where N is the number of records. On a 64-bit system it should only be a few hundred gigabytes, which fits on a $50 HDD.
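
    As a rough check of that estimate, here is a minimal sketch of the arithmetic. It assumes only the three fields shown above (whatever "etc" stands for would add to the size) and takes the row count from the question's own formula. `Record`, `kRows` and `total_bytes` are illustrative names, not part of the original answer.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <time.h>

    // Illustrative record holding only the three fields named in the answer.
    struct Record {
        char ticker_code[2];
        double stock_price;
        timespec when;
    };

    // Row count from the question: 14 years * minutes per year * 1000 stocks.
    constexpr std::uint64_t kRows = 14ULL * 365 * 24 * 60 * 1000;

    // Bytes needed for a flat array of `rows` records, padding included.
    inline std::uint64_t total_bytes(std::uint64_t rows) {
        assert(rows <= UINT64_MAX / sizeof(Record));  // guard against overflow
        return rows * sizeof(Record);
    }
    ```

    On x86-64 Linux this struct typically comes to 32 bytes once padding is counted, so total_bytes(kRows) lands at roughly 235 GB, in line with the "few hundred gigabytes" figure. Other platforms may lay the struct out differently.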

    Then truncate a file to that size and mmap it into memory (on Linux; on Windows use CreateFileMapping):

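    // POSIX sketch; needs <fcntl.h>, <sys/mman.h>, <unistd.h>
    int file = open("my.data", O_RDWR | O_CREAT, 0644);   // a writable shared mapping needs O_RDWR
    ftruncate(file, sizeof(StockPrice) * N);
    void* p = mmap(nullptr, sizeof(StockPrice) * N, PROT_READ | PROT_WRITE, MAP_SHARED, file, 0);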
    

    Cast the mmapped pointer to StockPrice* and make one pass over your input, filling in the array. Unmap and close the file, and you have your data in one big binary array in a file that can be mmapped again later.

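    StockPrice* stocks = (StockPrice*) p;
    for (size_t i = 0; i < N; i++)
    {
        // ParseNextStock is your own parser for the raw input
        stocks[i] = ParseNextStock(stock_indata_file);
    }
    munmap(p, sizeof(StockPrice) * N);   // release the mapping
    close(file);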
    

    You can now mmap it again read-only from any program and your data will be readily available:

    int file = open("my.data", O_RDONLY);
    const StockPrice* stocks = (const StockPrice*) mmap(nullptr, sizeof(StockPrice) * N, PROT_READ, MAP_SHARED, file, 0);
    
    // do stuff with stocks;
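
    Putting the fragments above together, here is a minimal round-trip sketch with error checks, assuming a POSIX system. `Tick`, `TickView`, `write_ticks`, `map_ticks` and `unmap_ticks` are hypothetical names for illustration, and the two-field record is a stand-in for whatever layout you actually pick.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Illustrative fixed-size record (assumed layout, not the answer's struct).
    struct Tick {
        long long minute;  // assumed encoding: minutes since some epoch
        double price;
    };

    // Create `path`, size it for `n` records and fill it through a mapping.
    inline bool write_ticks(const char* path, const Tick* src, std::size_t n) {
        assert(n > 0);  // mmap rejects zero-length mappings
        const std::size_t bytes = n * sizeof(Tick);
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return false;
        if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { close(fd); return false; }
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) return false;
        Tick* out = static_cast<Tick*>(p);
        for (std::size_t i = 0; i < n; ++i) out[i] = src[i];
        return munmap(p, bytes) == 0;  // MAP_SHARED writes are carried to the file
    }

    // Read-only view of a mapped file; `data` is null on failure.
    struct TickView {
        const Tick* data;
        std::size_t count;
        std::size_t bytes;
    };

    inline TickView map_ticks(const char* path) {
        TickView view{nullptr, 0, 0};
        int fd = open(path, O_RDONLY);
        if (fd < 0) return view;
        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t bytes = static_cast<std::size_t>(st.st_size);
            void* p = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                view.data = static_cast<const Tick*>(p);
                view.count = bytes / sizeof(Tick);
                view.bytes = bytes;
            }
        }
        close(fd);
        return view;
    }

    inline void unmap_ticks(TickView v) {
        if (v.data) munmap(const_cast<Tick*>(v.data), v.bytes);
    }
    ```

    The reader takes the record count from the file size via fstat, so it does not need to know N in advance. If you need the data to be on disk at a known point, msync is the call to look at.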
    

    So now you can treat it just like an in-memory array of structs. You can build whatever index data structures your "queries" need on top of it. The kernel pages the data in from disk on demand and keeps recently used pages in its cache, so repeated access to the same region runs at memory speed.
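
    The simplest "index" is often no separate structure at all: if the array is kept sorted, a binary search over the mapped pointer finds any range. A minimal sketch, assuming the records are sorted by (ticker, time) and that tickers have been mapped to integer ids; `Bar`, `bar_less` and `find_range` are hypothetical names:

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>

    struct Bar {
        std::uint32_t ticker_id;  // assumed: tickers mapped to small integers
        std::int64_t minute;      // assumed: minutes since some epoch
        double price;
    };

    // Ordering on (ticker_id, minute), the assumed sort order of the array.
    inline bool bar_less(const Bar& a, const Bar& b) {
        if (a.ticker_id != b.ticker_id) return a.ticker_id < b.ticker_id;
        return a.minute < b.minute;
    }

    // Return [first, last) covering one ticker's bars with from <= minute < to.
    inline std::pair<const Bar*, const Bar*> find_range(
            const Bar* bars, std::size_t n,
            std::uint32_t ticker_id, std::int64_t from, std::int64_t to) {
        assert(from <= to);
        const Bar lo{ticker_id, from, 0.0};
        const Bar hi{ticker_id, to, 0.0};
        const Bar* first = std::lower_bound(bars, bars + n, lo, bar_less);
        const Bar* last = std::lower_bound(first, bars + n, hi, bar_less);
        return {first, last};
    }
    ```

    Each lookup touches only a logarithmic number of pages before reaching the range, and the range itself is then read sequentially.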

    If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.
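
    As a sketch of what "sort in that order" can look like, the two comparators below give a time-major layout (good for scanning all tickers over a stretch of dates) and a ticker-major layout (good for one ticker's history). `Row` and the function names are hypothetical, and the integer ticker id and minute fields are assumed encodings. std::sort works on a writable mapping like on any array, though for data much larger than RAM it may be slow and an external sort could be a better fit.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    struct Row {
        std::uint32_t ticker_id;  // assumed integer id per ticker
        std::int64_t minute;      // assumed: minutes since some epoch
        double price;
    };

    // For "all tickers at a given time" scans: time first, ticker second.
    inline bool by_time_then_ticker(const Row& a, const Row& b) {
        if (a.minute != b.minute) return a.minute < b.minute;
        return a.ticker_id < b.ticker_id;
    }

    // For "one ticker over a date range" scans: ticker first, time second.
    inline bool by_ticker_then_time(const Row& a, const Row& b) {
        if (a.ticker_id != b.ticker_id) return a.ticker_id < b.ticker_id;
        return a.minute < b.minute;
    }

    // Sort rows in place; `rows` may point into a writable mapping.
    inline void sort_rows(Row* rows, std::size_t n, bool time_major) {
        assert(rows != nullptr || n == 0);
        std::sort(rows, rows + n,
                  time_major ? by_time_then_ticker : by_ticker_then_time);
    }
    ```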
