NoSQL solution to store 20[TB] of data, as vector/array?

Submitted by 橙三吉。 on 2019-12-10 18:11:14

Question


I need to build a system to efficiently store and maintain a huge amount (20 [TB]) of data, and be able to access it in 'vector' form. My dimensions are:

(1) time (given as an integer of the form YYYYMMDDHHMMSS)

(2) field (a string of arbitrary length, representing the name of a hospital)

(3) instrumentID (an integer representing a uniqueID for the instrument)

I will need a way to be able to store data individually, meaning, something like:

STORE 23789.46 as the data for instrumentID = 5 on field = 'Nhsdg' on time = 20040713113500

Yet, I would need the following query to run FAST: give me all instruments for field 'X' on timestamp 'Y'.

To build this system, I have 60 dual-core machines (each with 1 GB of RAM and a 1.5 TB disk).

Any recommendation on a suitable NoSQL solution (ideally one that works with Python)?

NOTE: the system will first store historical data (roughly 20 [TB]). Every day I will add about 200 [MB] at most. I just need a solution that will keep scaling. My use case is just the one simple query: give me all instruments for field 'X' on timestamp 'Y'.


Answer 1:


MongoDB scales awesomely and supports many of the indexing features you'd typically find in an RDBMS such as compound key indexes. You can use a compound index on the name and time attributes in your data. Then you can retrieve all instrument readings with a particular name and date range.

[Now in the simple case where you're strictly interested in just that one basic query and nothing else, you can just combine the name and timestamp and call that your key, which would work in any key-value store...]
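The combined-key idea above can be sketched in plain Python, using a dict as a stand-in for any key-value store. The `field|time` key layout and the function names are illustrative assumptions, not part of any particular store's API:

```python
# Minimal sketch: a dict standing in for a generic key-value store.
# One key per (field, time) pair, holding every instrument reading,
# so the target query is a single lookup.
store = {}

def make_key(field, time):
    # Key layout "field|time" is an assumption; any unambiguous
    # concatenation works.
    return f"{field}|{time}"

def put(field, time, instrument_id, value):
    # Append the (instrument, value) pair under the combined key.
    store.setdefault(make_key(field, time), []).append((instrument_id, value))

def get_all_instruments(field, time):
    # The query that must be fast: all instruments for field X at time Y.
    return store.get(make_key(field, time), [])

put("Nhsdg", 20040713113500, 5, 23789.46)
put("Nhsdg", 20040713113500, 6, 101.5)
put("Other", 20040713113500, 7, 3.14)

results = get_all_instruments("Nhsdg", 20040713113500)
print(results)  # -> [(5, 23789.46), (6, 101.5)]
```

The trade-off is that this key shape supports exactly the one query; range scans over time for a field need the sorted-key approach described next.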

HBase is another excellent option. You can use a composite row key on name and date.
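Since HBase keeps rows sorted by row key, a composite `name#date` key lets one prefix scan pull back a whole field/time slice. Here is a small simulation of that behavior in plain Python with a sorted list; the `#` separator and the stop-key trick are assumptions about the key design, not HBase API calls:

```python
import bisect

def row_key(field, time):
    # Composite row key; the '#' separator is an illustrative choice.
    return f"{field}#{time}"

# HBase stores rows in lexicographic row-key order; a sorted list
# simulates that ordering.
keys = sorted(row_key(f, t) for f, t in [
    ("Nhsdg", 20040713113500),
    ("Nhsdg", 20040713120000),
    ("Other", 20040713113500),
])

def prefix_scan(keys, prefix):
    # Equivalent of a scan with start=prefix and stop=prefix + '\xff':
    # returns every row key beginning with the prefix.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

# All rows for one field and one exact timestamp:
exact = prefix_scan(keys, "Nhsdg#20040713113500")
# All rows for one field across all times:
by_field = prefix_scan(keys, "Nhsdg#")
print(exact, by_field)
```

Because keys sort by field first and time second, both the exact query and a per-field time-range scan stay contiguous on disk, which is what makes the composite row key a good fit here.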

As others have mentioned, you can definitely use a relational database. MySQL and PostgreSQL can certainly handle the load, and table partitioning might be desirable in this scenario as well, since you're dealing with time ranges. You can use bulk loading (and disabling indexes during loading) to decrease insertion time.
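The relational approach can be sketched end to end with SQLite (standing in for MySQL/PostgreSQL, since it ships with Python); the table and column names are illustrative. Note the order of operations: bulk-load first, build the compound index afterwards, mirroring the advice to disable indexes during loading:

```python
import sqlite3

# In-memory database as a stand-in for a real MySQL/PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        field         TEXT    NOT NULL,  -- hospital name
        time          INTEGER NOT NULL,  -- YYYYMMDDHHMMSS
        instrument_id INTEGER NOT NULL,
        value         REAL    NOT NULL
    )
""")

rows = [
    ("Nhsdg", 20040713113500, 5, 23789.46),
    ("Nhsdg", 20040713113500, 6, 101.5),
    ("Other", 20040713113500, 5, 3.14),
]
# Bulk-load with no index in place, then build the compound
# (field, time) index once the historical data is in.
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_field_time ON readings (field, time)")

# The one query that must be fast:
cur = conn.execute(
    "SELECT instrument_id, value FROM readings WHERE field = ? AND time = ?",
    ("Nhsdg", 20040713113500),
)
results = cur.fetchall()
print(results)  # -> [(5, 23789.46), (6, 101.5)]
```

On a real server you would additionally partition `readings` by time range, so each daily 200 MB load only touches the newest partition.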



来源:https://stackoverflow.com/questions/5560394/nosql-solution-to-store-20tb-of-data-as-vector-array
