DISTRIBUTED BY notices in Greenplum

Anonymous (unverified), submitted 2019-12-03 10:24:21

Question:

Say I run the following query on psql:

    select a.c1, b.c2 into temp_table from db.A as a inner join db.B as b on a.x = b.x limit 10;

I get the following message:

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'c1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

  1. What is a DISTRIBUTED BY column?
  2. Where is temp_table stored? Is it stored on my client or on the server?

Answer 1:

  1. DISTRIBUTED BY is how Greenplum determines which segment will store each row: the value of the distribution column is hashed, and the hash picks the segment. Because Greenplum is an MPP database, most production deployments have multiple segment servers. You usually want the distribution column to be the column you join on, so that matching rows already sit on the same segment.

  2. temp_table is a table that will be created for you on the Greenplum cluster. If you haven't set search_path to something else, it will be in the public schema.
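As a minimal sketch of the first point, the query from the question can be written with CREATE TABLE ... AS and an explicit DISTRIBUTED BY clause, which avoids the NOTICE. It assumes the db.A and db.B tables from the question. Choosing c1 as the key only mirrors the default Greenplum reported and is not a recommendation. The second statement uses the gp_segment_id system column to show how many rows each segment holds.

```sql
-- Same result set as in the question, but the distribution key is stated
-- explicitly instead of being picked by Greenplum.
-- Assumption: c1 is used only because the NOTICE reported it as the default.
CREATE TABLE temp_table AS
SELECT a.c1, b.c2
FROM db.A AS a
INNER JOIN db.B AS b ON a.x = b.x
LIMIT 10
DISTRIBUTED BY (c1);

-- Count rows per segment; very uneven counts indicate skew.
SELECT gp_segment_id, count(*) AS row_count
FROM temp_table
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

If no column makes a good key, DISTRIBUTED RANDOMLY can be used in place of DISTRIBUTED BY (c1). It spreads rows evenly but gives up co-located joins.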



Answer 2:

For your first question, the DISTRIBUTED BY clause tells the database server how to distribute a table's rows across the segments. (Create Table Documentation)

I did see one thing right away that could be wrong with the syntax of your join clause: you wrote on a.x = s.x, but no table is aliased as s. Maybe your problem is as simple as changing this to on a.x = b.x?

As for where the temp table is stored, I believe it is generally stored on the database server. This would be a question for your DBA, as it is a setup item when installing the database. You can always dump your data to a file on your computer and reload it later if you want to save your results (without printing).
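A minimal sketch of dumping the results to a local file and reloading them, assuming you are working in psql as in the question. psql's \copy meta-command reads and writes files on the client machine, whereas the server-side COPY command uses files on the database server. The file name temp_table.csv is a hypothetical choice.

```sql
-- Run inside psql. Writes the table to a CSV file on the client machine.
-- The file name is a hypothetical example.
\copy temp_table to 'temp_table.csv' csv header

-- Later: load the file back into an existing table with the same columns.
\copy temp_table from 'temp_table.csv' csv header
```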



Answer 3:

As far as I know, a temp table is stored in memory. It is faster when there is little data, and in that case a temp table is recommended. On the other hand, because a temp table is held in memory, a large amount of data will consume a lot of memory. In that case it is better to use a regular table with a DISTRIBUTED BY clause, so that the data is spread across your cluster.

In addition, a temp table is stored in a special schema, so you don't need to specify a schema name when creating it. It exists only in the current connection; after you close the connection, PostgreSQL drops the table automatically.
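To make the temp table point concrete, here is a minimal sketch. Note that the query in the question uses plain into temp_table, which creates a regular table that merely happens to be named temp_table. A session-scoped temporary table needs the TEMP keyword. The sketch assumes the db.A and db.B tables from the question, and the name tmp_result is hypothetical.

```sql
-- Creates a temporary table that is visible only in the current session
-- and is dropped automatically when the session ends.
-- tmp_result is a hypothetical name; db.A and db.B come from the question.
CREATE TEMP TABLE tmp_result AS
SELECT a.c1, b.c2
FROM db.A AS a
INNER JOIN db.B AS b ON a.x = b.x
LIMIT 10
DISTRIBUTED BY (c1);

-- Shows which schema each table landed in. Temporary tables appear under
-- a pg_temp_* schema rather than public.
SELECT schemaname, tablename
FROM pg_tables
WHERE tablename IN ('temp_table', 'tmp_result');
```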


