Best way to store huge log data

Partitioning in PostgreSQL works great for big logs. First, create the parent table:

create table  game_history_log (
    gameid integer,
    views integer,
    plays integer,
    likes integer,
    log_date date
);

Now create the partitions. In this case one per month, at roughly 900k rows each, would be good:

create table game_history_log_201210 (
    check (log_date between '2012-10-01' and '2012-10-31')
) inherits (game_history_log);

create table game_history_log_201211 (
    check (log_date between '2012-11-01' and '2012-11-30')
) inherits (game_history_log);
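
With inheritance partitioning, a row inserted into the parent still needs to be routed to the right child. Here is a minimal sketch of an insert-routing trigger, assuming only the two partitions above exist (the function and trigger names are illustrative):

create or replace function game_history_log_insert()
returns trigger as $$
begin
    -- route the row to the child table whose check constraint it satisfies
    if new.log_date between '2012-10-01' and '2012-10-31' then
        insert into game_history_log_201210 values (new.*);
    elsif new.log_date between '2012-11-01' and '2012-11-30' then
        insert into game_history_log_201211 values (new.*);
    else
        raise exception 'no partition for log_date %', new.log_date;
    end if;
    return null;  -- the row is already stored in a child table
end;
$$ language plpgsql;

create trigger game_history_log_insert_trigger
    before insert on game_history_log
    for each row execute procedure game_history_log_insert();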

Notice the check constraints on each partition. If you try to insert into the wrong partition:

insert into game_history_log_201210 (
    gameid, views, plays, likes, log_date
) values (1, 2, 3, 4, '2012-09-30');
ERROR:  new row for relation "game_history_log_201210" violates check constraint "game_history_log_201210_log_date_check"
DETAIL:  Failing row contains (1, 2, 3, 4, 2012-09-30).

One of the advantages of partitioning is that queries only search the relevant partition, drastically and consistently reducing the amount of data scanned regardless of how many years of history there are. Here is the explain output for a search on a certain date:

explain
select *
from game_history_log
where log_date = date '2012-10-02';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..30.38 rows=9 width=20)
   ->  Append  (cost=0.00..30.38 rows=9 width=20)
         ->  Seq Scan on game_history_log  (cost=0.00..0.00 rows=1 width=20)
               Filter: (log_date = '2012-10-02'::date)
         ->  Seq Scan on game_history_log_201210 game_history_log  (cost=0.00..30.38 rows=8 width=20)
               Filter: (log_date = '2012-10-02'::date)

Notice that apart from the (empty) parent table, it only scanned the correct partition. This pruning relies on the constraint_exclusion setting being on or partition (the default). You can of course create indexes on the partitions to avoid sequential scans.
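
For example, an index on the date column of each partition (the index names are illustrative):

create index game_history_log_201210_log_date
    on game_history_log_201210 (log_date);

create index game_history_log_201211_log_date
    on game_history_log_201211 (log_date);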

See the PostgreSQL documentation on inheritance partitioning.

11M rows isn't excessive, but indexing in general, and the clustering of your primary key, will matter more (on InnoDB). I suggest (game_id, date) for the primary key so that queries for all data about a certain game hit sequential rows. Also, you may want to keep a separate table with just the current values for ranking games and so on, when only the latest figures are necessary. A sketch of both tables follows.
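
A possible layout under those assumptions; the column names follow the parent-table example above, and game_current_stats is a hypothetical name for the side table:

create table game_history_log (
    game_id  int not null,
    log_date date not null,
    views    int not null default 0,
    plays    int not null default 0,
    likes    int not null default 0,
    primary key (game_id, log_date)
) engine = InnoDB;

-- a small side table holding only the latest figures per game
create table game_current_stats (
    game_id int primary key,
    views   int not null,
    plays   int not null,
    likes   int not null
) engine = InnoDB;

Since InnoDB clusters rows by primary key, all rows for one game_id are stored contiguously, which keeps per-game range scans cheap.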

There are no performance problems in MySQL with 10kk (10 million) rows. You can simply apply partitioning by game id (requires at least version 5.5).
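
A minimal sketch of what that could look like; the partition count is arbitrary, and note that in MySQL the partitioning column must be part of the primary key:

create table game_history_log (
    gameid   int not null,
    log_date date not null,
    views    int,
    plays    int,
    likes    int,
    primary key (gameid, log_date)
) engine = InnoDB
partition by hash(gameid)
partitions 16;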

I have a MySQL DB with data like this and currently there are no problems with 980kk (980 million) rows.

Instead of keeping every row, keep recent data at high precision, middle-aged data at medium precision, and long-term data at low precision. This is the approach taken by rrdtool, which might be a better fit for this than MySQL.
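
The same tiered-precision idea can be approximated in plain SQL. A sketch, assuming a hypothetical game_history_log_monthly rollup table and a 90-day cutoff:

-- roll detail rows older than 90 days up to monthly totals
insert into game_history_log_monthly (gameid, views, plays, likes, log_month)
select gameid, sum(views), sum(plays), sum(likes),
       date_format(log_date, '%Y-%m-01')
from game_history_log
where log_date < current_date - interval 90 day
group by gameid, date_format(log_date, '%Y-%m-01');

-- then drop the high-precision detail
delete from game_history_log
where log_date < current_date - interval 90 day;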

I would advise not using a relational database at all. Statistics are the sort of thing that changes rapidly, because new data is constantly arriving. I believe something like HBase would fit better, because appending new records is faster there.
