Do Google BigQuery / Amazon Redshift use a column-based relational database or a NoSQL database?

Anonymous (unverified), submitted 2019-12-03 02:26:02

Question:

I'm still not clear on the difference between a column-based relational database and a column-based NoSQL database.

Google BigQuery supports SQL-like queries, so how can it be NoSQL?

The column-based relational databases I know of are InfoBright, Vertica and Sybase IQ.

The column-based NoSQL databases I know of are Cassandra and HBase.

The following article about Redshift starts by saying "NoSQL" but ends with PostgreSQL (which is relational) being used: http://nosqlguide.com/column-store/intro-to-amazon-redshift-a-columnar-nosql-database/

Answer 1:

A few things to clarify here, mostly about Google BigQuery.

BigQuery is a hybrid system: it stores data in columns, but it reaches into the NoSQL world with additional features such as the record type and nested fields. You can also have a 2 MB STRING column in which you store a raw document, such as a JSON document. See the other data formats and limits that apply. You can also write User Defined Functions in JavaScript; for example, you can paste in a JavaScript NLP library.

Now that you have all these capabilities to store data, you can use the JSON functions, for example, to query a document stored in one of the columns. This means BigQuery can be used as schemaless storage: you did not define the structure of the JSON document for that column, you just stored it as JSON.

A basic example: read the reason key from the meta column, which holds a JSON document, and use the contains construct to count how many users have the word "unsubscribed" under that key:

    SELECT
      SUM(IF(JSON_EXTRACT_SCALAR(meta, '$.reason') CONTAINS 'unsubscribed', 1, 0))
    FROM ...
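
To make the schemaless idea concrete, here is a small Python sketch of what the query above computes. It is only an illustration, not BigQuery code: the sample rows and the helper name count_unsubscribed are made up, and a plain substring check stands in for the contains construct.

```python
import json

# Hypothetical rows: each one carries a free-form JSON document in its "meta" column.
rows = [
    {"user_id": 1, "meta": '{"reason": "user unsubscribed from list"}'},
    {"user_id": 2, "meta": '{"reason": "bounced"}'},
    {"user_id": 3, "meta": '{"plan": "free"}'},  # no "reason" key at all
    {"user_id": 4, "meta": '{"reason": "unsubscribed"}'},
]


def count_unsubscribed(rows):
    """Count rows whose meta.reason contains the word 'unsubscribed'."""
    total = 0
    for row in rows:
        doc = json.loads(row["meta"])   # parse the document at query time
        reason = doc.get("reason")      # a missing key simply yields None
        if isinstance(reason, str) and "unsubscribed" in reason:
            total += 1
    return total


print(count_unsubscribed(rows))
```

Note that no structure was declared for meta up front; each document is parsed only when it is queried, which is what lets one column hold differently shaped documents.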

On the other hand, you have table wildcard querying, which you need if your rows are spread across many tables. Table wildcard functions are a cost-effective way to query data from a specific set of tables. When you use a table wildcard function, BigQuery only accesses and charges you for the tables that match the wildcard. It is therefore advisable to store data in tables with the same structure, split into one table per time frame, e.g. daily or monthly tables.
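
As a rough illustration of why per-day tables keep queries cheap, the Python sketch below models a dataset as a dictionary of daily tables and only touches the tables whose date suffix falls inside the requested range. The table names (events_YYYYMMDD), the data, and the function tables_in_range are all hypothetical; in BigQuery the wildcard functions do this selection for you.

```python
from datetime import date, timedelta

# Hypothetical dataset: one table per day, named events_YYYYMMDD.
dataset = {
    "events_20140324": [{"user_id": 1}],
    "events_20140325": [{"user_id": 1}, {"user_id": 2}],
    "events_20140326": [{"user_id": 3}],
    "events_20140327": [{"user_id": 2}, {"user_id": 4}, {"user_id": 5}],
    "events_20140328": [{"user_id": 6}],
}


def tables_in_range(prefix, start, end):
    """Return the names of existing daily tables between start and end, inclusive."""
    names = []
    day = start
    while day <= end:
        name = prefix + day.strftime("%Y%m%d")
        if name in dataset:
            names.append(name)
        day += timedelta(days=1)
    return names


# Only the three matching tables are read; the other two are never touched.
matched = tables_in_range("events_", date(2014, 3, 25), date(2014, 3, 27))
rows_scanned = sum(len(dataset[name]) for name in matched)
print(matched)
print(rows_scanned)
```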

We should not forget that BigQuery is append-only by design, so you cannot update old records; there is no UPDATE language construct (Update: there are now DML constructs for some update/delete operations). Instead, you need to append a new record, and your queries must be written so that they always work with the latest version of your data. If your system is event driven, then this is very simple, because each event is simply appended to BQ. But if a user updates their profile, you need to store the profile again; you cannot update the old row. You need a version/date column that tells you which version is the most recent, and your queries must first obtain the most recent version of your rows and only then deal with the logic.

You can use something like OVER/PARTITION BY on that field and keep only the most recent value, i.e. seqnum=1.

This returns, from profile, the last email for each user_id, where "last" is defined by the most recent entry according to the timestamp column.

    SELECT email
    FROM
      (SELECT email,
              ROW_NUMBER() OVER (PARTITION BY user_id
                                 ORDER BY timestamp DESC) seqnum
       FROM [profile]
      )
    WHERE seqnum=1
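
If you want to try the "append, then pick the latest row" pattern without a BigQuery account, the following Python sketch reproduces it with the standard library's sqlite3 module. It is an approximation under stated assumptions: SQLite stands in for BigQuery, the bundled SQLite must be version 3.25 or newer for window functions, and the ts column and the sample rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (user_id INTEGER, email TEXT, ts INTEGER)")

# Append-only: a profile change is a new row, the old row is never updated.
conn.executemany(
    "INSERT INTO profile VALUES (?, ?, ?)",
    [
        (1, "old@example.com", 100),
        (2, "bob@example.com", 110),
        (1, "new@example.com", 200),  # user 1 changed their email later
    ],
)

# Number each user's rows from newest to oldest and keep only the newest one.
latest = conn.execute(
    """
    SELECT user_id, email
    FROM (SELECT user_id, email,
                 ROW_NUMBER() OVER (PARTITION BY user_id
                                    ORDER BY ts DESC) AS seqnum
          FROM profile)
    WHERE seqnum = 1
    ORDER BY user_id
    """
).fetchall()
print(latest)
```

The older row for user 1 is still stored; it is simply filtered out at query time, which is the price of an append-only design.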


Answer 2:

First, remember that NoSQL is commonly read as an abbreviation of "Not Only SQL", so there is no contradiction in a system having both a SQL interface and some NoSQL features. Having said that, both Redshift and BigQuery have their foundations in column-based databases. Redshift is based on ParAccel, which is a classic column-based RDBMS targeted at data warehousing, and BigQuery is based on Google's internal column-based data processing technology called "Dremel".


