greenplum

Greenplum, Pivotal HD + Spark, or HAWQ for TBs of Structured Data?

Submitted by 狂风中的少年 on 2019-12-04 13:35:35
I have TBs of structured data in a Greenplum DB. I need to run what is essentially a MapReduce job on my data. I found myself reimplementing at least the features of MapReduce just so that this data would fit in memory (in a streaming fashion). Then I decided to look elsewhere for a more complete solution. I looked at Pivotal HD + Spark because I am using Scala, and the Spark benchmarks are a wow-factor. But I believe the datastore behind this, HDFS, is going to be less efficient than Greenplum. (NOTE the "I believe". I would be happy to know I am wrong, but please give some evidence.) So to keep…
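One concrete way to bridge the two systems the question weighs: Greenplum 4.x/5.x ships a gphdfs external-table protocol that can write table data straight to HDFS, where Spark can then read it. A minimal sketch, assuming gphdfs is configured on the segments; the namenode host hdfs-nn, the export path, and the facts table are hypothetical:

-- writable external table pushing Greenplum data to HDFS for Spark to consume
CREATE WRITABLE EXTERNAL TABLE export_facts (LIKE facts)
LOCATION ('gphdfs://hdfs-nn:8020/data/facts_export')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO export_facts SELECT * FROM facts;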

DISTRIBUTED BY notices in Greenplum

Submitted anonymously (unverified) on 2019-12-03 10:24:21
Question: Say I run the following query in psql:

select a.c1, b.c2 into temp_table
from db.A as a inner join db.B as b on a.x = b.x
limit 10;

I get the following message:

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'c1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

What is a DISTRIBUTED BY column? Where is temp_table stored? Is it stored on my…
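For reference, the NOTICE disappears once the distribution key is stated explicitly. A minimal sketch using CREATE TABLE AS, which (unlike the SELECT ... INTO form) accepts a DISTRIBUTED BY clause; whether c1 is the right key depends on its cardinality and skew:

-- same query, but with an explicit distribution key: rows are hashed
-- across segments on c1 instead of Greenplum picking a column for you
CREATE TABLE temp_table AS
SELECT a.c1, b.c2
FROM db.A AS a
INNER JOIN db.B AS b ON a.x = b.x
LIMIT 10
DISTRIBUTED BY (c1);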

Greenplum hangs forever when doing any search or insert actions with psql on CentOS 7

Submitted anonymously (unverified) on 2019-12-03 01:44:01
Question: Greenplum version is 5.3.0, on CentOS 7. As the title says. The following is the result of gplogfilter:

SELECT pg_catalog.quote_ident(n.nspname) || '.'
FROM pg_catalog.pg_namespace n
WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,7)='test_vb'
  AND (SELECT pg_catalog.count(*)
       FROM pg_catalog.pg_namespace
       WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,7)
           = substring('test_vb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1
UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname)
FROM pg_catalog…

How should I deal with my UNIQUE constraints during my data migration from Postgres 9.4 to Greenplum

Submitted anonymously (unverified) on 2019-12-03 00:44:02
Question: When I execute the following SQL (contained in a .sql file generated by pg_dump from Postgres 9.4) in Greenplum:

CREATE TABLE "public"."trm_concept" (
    "pid"            int8 NOT NULL,
    "code"           varchar(100) NOT NULL,
    "codesystem_pid" int8,
    "display"        varchar(400),
    "index_status"   int8,
    CONSTRAINT "trm_concept_pkey" PRIMARY KEY ("pid"),
    CONSTRAINT "idx_concept_cs_code" UNIQUE ("codesystem_pid", "code")
);

I got this error:

ERROR: Greenplum Database does not allow having both PRIMARY KEY and UNIQUE constraints

Why doesn't Greenplum allow this? I really…
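A common workaround, sketched under the assumption that the loading pipeline can police the second constraint itself: keep the PRIMARY KEY and drop the extra UNIQUE constraint, enforcing (codesystem_pid, code) uniqueness at load time instead:

-- migration-friendly version of the pg_dump DDL: one constraint only
CREATE TABLE "public"."trm_concept" (
    "pid"            int8 NOT NULL,
    "code"           varchar(100) NOT NULL,
    "codesystem_pid" int8,
    "display"        varchar(400),
    "index_status"   int8,
    CONSTRAINT "trm_concept_pkey" PRIMARY KEY ("pid")
);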

MPP - GreenPlum Database Installation and Basic Usage

Submitted anonymously (unverified) on 2019-12-03 00:39:02
1. Cluster introduction

The architecture diagram is as follows.

2. Server changes (all hosts)

2.1 Configure hosts:
vi /etc/hosts
192.168.0.93 gpdb-1 mdw
192.168.0.94 gpdb-2 sdw1
192.168.0.95 gpdb-3 sdw2

2.2 Create the user and group
2.2.1 Create the group, with group id 530:
groupadd -g 530 gpadmin
2.2.2 Create the user, assign it to the gpadmin group, and set its home directory:
useradd -g 530 -u 530 -d /home/gpadmin -s /bin/bash gpadmin
2.2.3 Grant ownership of /home/gpadmin:
chown -R gpadmin:gpadmin /home/gpadmin
2.2.4 Set the password:
passwd gpadmin

2.3 Disable the firewall
2.3.1 Stop the default firewall:
systemctl stop firewalld
2.3.2 Stop iptables:
systemctl stop iptables

2.4 Edit the network file:
vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<the corresponding hostname>

2.5 Modify system files
2.5.1 Edit the kernel configuration:
vi /etc/sysctl.conf
kernel.shmmax = 5000000000

Greenplum installation fails with "Failed Update port number to 40000"

Submitted anonymously (unverified) on 2019-12-03 00:22:01
While installing Greenplum, I ran into the "Failed Update port number to 40000" error.

Details:
OS: CentOS 6.5
GP version: 4.3.8

During initialization, the log showed the following:
20180605:11:37:53:010114 gpcreateseg.sh:gp-s0011:gpadmin-[FATAL][3]:-Failed Update port number to 40000

Fix:
yum -y install ed

gpcreateseg.sh uses the ed line editor to rewrite the port number in each segment's postgresql.conf, so initialization fails when ed is missing from the host.

Loading and Unloading Data in Greenplum

Submitted anonymously (unverified) on 2019-12-03 00:18:01
Loading and unloading data

GP loading overview

About external tables
WEB: accesses dynamic data sources (such as web services or OS commands/scripts)

About gpload
2) Requires a load-specification control file defined in YAML format

About COPY
2) Has no parallel load/unload mechanism

Defining external tables
Overview: when creating an external table definition, the file format and file location must be specified; three protocols are available for accessing external table data sources: gpfdist, gpfdists, and gphdfs.

gpfdist
5) Wildcards or C-style patterns can be used to match multiple files

gpfdists
1) gpfdists is the secure version of gpfdist; it enables encrypted communication and ensures secure authentication between the files and GP

file
4) pg_max_external_files determines how many external files are allowed per external table

gphdfs
4) For writes, each GP segment instance writes only the data that instance contains

External file formats
3) Custom formats apply to gphdfs

Error data in external tables
To isolate bad rows while still loading correctly formatted records, use single-row error handling when defining the external table.

External table backup and restore
In a backup or restore operation, only the definition of an external table or WEB external table is backed up or restored.

Using the GP parallel file server (gpfdist)
b) Start gpfdist in the background (sending log and error output to a log file)
c) To run multiple gpfdist services on the same ETL host, give each one its own directory and port; see the sketch below.
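A minimal sketch tying these pieces together: two gpfdist instances on one ETL host (different directories and ports), a readable external table spanning both, and single-row error handling (GP 5 syntax). The host name etl1, the ports, paths, and columns are all hypothetical:

-- on the ETL host, started beforehand:
--   gpfdist -d /data/load1 -p 8081 -l /tmp/gpfdist1.log &
--   gpfdist -d /data/load2 -p 8082 -l /tmp/gpfdist2.log &
CREATE EXTERNAL TABLE ext_sales (
    id  int,
    amt numeric
)
LOCATION ('gpfdist://etl1:8081/sales_*.csv',
          'gpfdist://etl1:8082/sales_*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;  -- isolate bad rows instead of aborting the load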

20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

Submitted by 折月煮酒 on 2019-12-03 00:04:50
Question: I'd like to use your wisdom for picking the right solution for a data-warehouse system. Here are some details to better understand the problem:

Data is organized in a star schema structure with one BIG fact table and ~15 dimensions.
20B fact rows per month
10 dimensions with hundreds of rows (somewhat hierarchical)
5 dimensions with thousands of rows
2 dimensions with ~200K rows
2 big dimensions with 50M-100M rows

Two typical queries run against this DB. Top members in dimq:

select top X dimq, count(id)…
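If Greenplum were chosen, a hedged sketch of how such a fact table is commonly laid out there: append-only columnar storage, hash-distributed on a high-cardinality key, and range-partitioned by month so monthly loads and date-filtered queries prune partitions. All names and types are hypothetical:

-- 20B rows/month fact table: columnar, compressed, one partition per month
CREATE TABLE fact (
    id   bigint,
    dimq int,
    ts   date
)
WITH (appendonly=true, orientation=column, compresstype=zlib)
DISTRIBUTED BY (id)
PARTITION BY RANGE (ts)
(
    START (date '2019-01-01') INCLUSIVE
    END   (date '2020-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);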

Setting up Greenplum user and password access:

Submitted anonymously (unverified) on 2019-12-02 23:55:01
1. Create the GP user:
create user tableau with nosuperuser nocreatedb password 'tableau';

2. Grant read permission on a table:
create table test( id integer );
GRANT select on table test to tableau;

3. Edit the configuration file:
vim /extsdd1/gpadmin/data/master/gpseg-1/pg_hba.conf
Add the following two lines:
host all gpadmin 0.0.0.0/0 trust
host all tableau 0.0.0.0/0 md5

Source: 博客园 Author: xmanman Link: https://www.cnblogs.com/zhangwensi/p/11413146.html
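After editing pg_hba.conf, reload it (for example with gpstop -u) and sanity-check the role from SQL; has_table_privilege is standard Postgres/Greenplum, and the names follow the steps above:

-- confirm the role exists with the intended attributes
SELECT rolname, rolsuper, rolcreatedb FROM pg_roles WHERE rolname = 'tableau';
-- confirm the grant took effect
SELECT has_table_privilege('tableau', 'test', 'SELECT');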

Greenplum Database Cluster

Submitted anonymously (unverified) on 2019-12-02 23:47:01
Preferred operating system
Red Hat Enterprise Linux (RHEL) is the preferred operating system. The latest supported major version should be used, currently RHEL 6. The system version I am using: CentOS 7.6.

File system
XFS is the best-practice file system for Greenplum database data directories. XFS should be mounted with the following options: rw,noatime,inode64

Port configuration
ip_local_port_range should be set so that it does not conflict with the Greenplum database port range. For example:
net.ipv4.ip_local_port_range = 3000 65535
PORT_BASE=2000
MIRROR_PORT_BASE=2100
REPLICATION_PORT_BASE=2200
MIRROR_REPLICATION_PORT_BASE=2300