date: 2020-01-11 15:51:39
A while ago my advisor asked me to look into the GreenPlum database. I installed it and tried it out, but it felt no different from any other database, so I let the matter drop. Now I am going back through GPDB's features and trying to actually use them.
In fact, the features listed on the official product page are exactly what I need to understand.
Features
First, what are GPDB's features, and where do I find them? The product's marketing page certainly lists them.
The Headline Feature
GPDB's defining feature is MPP, i.e. Massively Parallel Processing. The two most prominent items on the homepage are:
- Massively Parallel
- Analytics
Then, scrolling down the page, two more prominent features are:
- Power at scale: High performance on petabyte-scale data volumes.
- True Flexibility: Deploy anywhere.
Main Features
Scrolling further down, the homepage spells out GPDB's main features:
- MPP Architecture
- Petabyte-Scale Loading: loading speed scales with each additional node, exceeding 10 TB per hour per rack (roughly 2.8 GB/s).
- Innovative Query Optimization: the industry's first cost-based query optimizer for big-data workloads.
- Polymorphic Data Storage: full control over the configuration of table and partition storage, execution, and compression.
- Integrated In-Database Analytics: a library provided by Apache MADlib for scalable in-database analytics, extending Greenplum's SQL capabilities through user-defined functions.
- Federated Data Access: query external data sources with the Greenplum optimizer and query processing engine, including Hadoop, cloud storage, ORC, AVRO, Parquet, and other polyglot data stores.
Unfortunately, none of the icons on the homepage are clickable, so the only way to experience these features is to explore them myself.
TODO
- Query data from Hadoop.
GPDB vs. Hadoop + Hive
Source: Greenplum介绍, a report given by Alibaba on 2011-02-17, when GPDB was at version 4.x.
Compared with Hadoop + Hive, GPDB's query performance is better, but it supports far fewer cluster nodes: at most about 1,000 segments, whereas Hive can scale to tens of thousands of nodes.
Data Loading
There are three main loading methods:
- the SQL INSERT statement: inefficient for bulk loads; suitable for small datasets.
- the COPY command: lets you customize the format of the text file so it can be parsed into columns and rows. Faster than INSERT, but not a parallel process (see the sketch after this list).
- gpfdist and gpload: efficiently move external data into tables with fast, parallel loading. An administrator can define a single-row error isolation mode so that loading continues for the correctly formatted rows. gpload requires a YAML-formatted control file written in advance that describes the source data location, format, required transformations, participating hosts, database destinations, and so on, which lets you run complex loading tasks (a sample control file follows this list).
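For the COPY path, a minimal sketch (the expenses table and the file path are hypothetical; the syntax is the standard PostgreSQL COPY that Greenplum inherits):
COPY expenses FROM '/data/expenses.txt' DELIMITER '|' NULL ' ';
For gpload, a minimal control-file sketch, assuming a hypothetical ETL host etl1 serving pipe-delimited text files into that same expenses table (the key layout follows the gpload YAML scheme):
VERSION: 1.0.0.1
DATABASE: db_kg
USER: gpadmin
HOST: mdw            # master host; hypothetical
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1    # host where gpload starts gpfdist; hypothetical
         PORT: 8081
         FILE:
           - /data/*.txt
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
   OUTPUT:
    - TABLE: expenses
    - MODE: INSERT
It would then be run with gpload -f <control-file>.yml.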
Below, only the use of gpfdist and gpload is demonstrated:
gpfdist
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
[-S] [-w time] [-v | -V] [-s] [-m max_length]
[--ssl certificate_path [--sslclean wait_time] ]
[-c config.yml]
Examples:
Use the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe (|) and NULL is a space (' '). Access to the external table is in single-row error isolation mode: if the error count on a segment is greater than five (the SEGMENT REJECT LIMIT value), the entire external table operation fails and no rows are processed.
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text )
LOCATION ('gpfdist://etlhost-1:8081/*.txt',
'gpfdist://etlhost-2:8082/*.txt')
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS SEGMENT REJECT LIMIT 5;
To create the readable ext_expenses table from CSV-formatted text files:
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text )
LOCATION ('gpfdist://etlhost-1:8081/*.txt',
'gpfdist://etlhost-2:8082/*.txt')
FORMAT 'CSV' ( DELIMITER ',' )
LOG ERRORS SEGMENT REJECT LIMIT 5;
Below is my own experiment:
- Create a data directory ~/Datasets/baike and move the dataset baike_triples.txt into it. The dataset comes from CN-DBpedia and contains 9 million+ encyclopedia entities and 67 million+ triples, including 1.1 million+ mention2entity records, 4 million+ abstracts, 19.8 million+ tags, and 41 million+ infobox entries, about 4.0 GB in total. Sample data (one tab-separated triple per line):
"1+8"时代广场	中文名	"1+8"时代广场
"1+8"时代广场	地点	咸宁大道与银泉大道交叉口
"1+8"时代广场	实质	城市综合体项目
"1+8"时代广场	总建面	约11.28万方
"1.4"河南兰考火灾事故	中文名	"1.4"河南兰考火灾事故
"1.4"河南兰考火灾事故	地点	河南<a>兰考县</a>城关镇
"1.4"河南兰考火灾事故	时间	2013年1月4日
- Start gpfdist in the background:
gpfdist -d ~/Datasets/baike -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist              # look up the process ID
30693 pts/8    00:00:00 gpfdist   # i.e. the PID is 30693
Option notes:
- -d directory: the directory from which gpfdist serves files for readable external tables, or creates output files for writable external tables. Defaults to the current directory.
- -p http_port: the HTTP port on which gpfdist serves files. Defaults to 8080.
View the log:
gt@vm1:~$ more /tmp/gpfdist.log
2020-01-13 02:49:38 30693 INFO Before opening listening sockets - following listening sockets are available:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO Opening listening socket succeeded
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
Serving HTTP on port 8081, directory /home/gt/Datasets/baike
- As gpadmin, open a psql session and create the tables: ext_baike, an external table for reading the data (with LOG ERRORS capturing badly formatted rows internally), and tb_baike, an internal table to store it:
psql -h localhost -d db_kg   # connect to database db_kg
-- create the external table
CREATE EXTERNAL TABLE ext_baike (
head text, rel text, tail text)
LOCATION ('gpfdist://vm1:8081/baike_triples.txt')
FORMAT 'TEXT' (DELIMITER E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 50000;
-- create the internal storage table
CREATE TABLE tb_baike (
id SERIAL PRIMARY KEY, head text, rel text, tail text);
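Since the external table declares LOG ERRORS without a separate error table, rejected rows go to an internal error log. They can be inspected later with the built-in gp_read_error_log function; a quick sketch (the LIMIT is just to keep the output small):
-- inspect rows rejected while reading ext_baike (raw line, error message, etc.)
SELECT * FROM gp_read_error_log('ext_baike') LIMIT 10;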
Detailed syntax for creating external tables: CREATE EXTERNAL TABLE.
Once the external table is created, you can read from it directly, for example:
db_kg=# select * from ext_baike limit 10;
head | rel | tail
-----------------+----------+---------------
*裔* | 中文名 | *裔*
*裔* | 作者 | Amarantine
*裔* | 小说进度 | 暂停
*裔* | 连载网站 | 晋江文学城
*西方犯罪学概论 | BaiduTAG | 书籍
*西方犯罪学概论 | ISBN | 9787811399967
*西方犯罪学概论 | 作者 | 李明琪 编
*西方犯罪学概论 | 出版时间 | 2010-4
*西方犯罪学概论 | 定价 | 22.00元
*西方犯罪学概论 | 页数 | 305
(10 rows)
- Load the external table's data into the internal table:
INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
The run eventually failed because the VM ran out of disk space. The log shows:
more greenplum/data/data1/primary/gpseg0/pg_log/gpdb-2020-01-13_000000.csv
2020-01-13 04:10:14.623480 UTC,,,p24832,th930718592,,,,0,,,seg0,,,,,"PANIC","53100","could not write to file ""pg_xlog/xlogtemp.24832"": No space left on device",,,,,,,0,,"xlog.c"
Start: 15:33:30; abnormal end: 16:03.
Errors:
db_kg=# INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
ERROR: gpfdist error: unknown meta type 108 (url_curl.c:1635) (seg0 slice1 127.0.1.1:6000 pid=15880) (url_curl.c:1635)
CONTEXT: External table ext_baike, file gpfdist://vm1:8081/baike_triples.txt
A retry with LIMIT 10000 (start 16:10:00, end 17:12:42) also failed:
ERROR: interconnect encountered a network error, please check your network (seg3 slice1 192.168.56.6:6001 pid=11612)
DETAIL: Failed to send packet (seq 1) to 127.0.1.1:56414 (pid 15913 cid -1) after 3562 retries in 3600 seconds
Cost-based Query Optimizer
When the master receives a SQL statement, it parses it into an execution-plan DAG. The parts of the DAG that need no data exchange are grouped into slices; joins, aggregates, and sorts all involve redistributing data between slices, and a motion operation carries out that redistribution. The slices are then dispatched to the segments involved. Source: GreenPlum:基于PostgreSQL的分布式关系型数据库
See About Greenplum Query Processing:
- slice: To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.
- motion: A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion. For example, a targeted query plan does not require data to move across the interconnect.
- A redistribute motion moves tuples between the segments to complete the join.
- A gather motion is when the segments send results back up to the master for presentation to the client. Because a query plan is always sliced wherever a motion occurs, this plan also has an implicit slice at the very top of the plan (slice 3).
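To see the slices and motions of a concrete plan, EXPLAIN a query that joins and aggregates. The plan below is a hand-abbreviated sketch (the self-join on tb_baike is only an illustration), not verbatim Greenplum output:
EXPLAIN SELECT h.head, count(*)
FROM tb_baike h JOIN tb_baike t ON h.tail = t.head
GROUP BY h.head;
-- Gather Motion 4:1 (slice3)                  -- segments send final results to the master
--   -> HashAggregate                          -- per-group counts
--        -> Redistribute Motion 4:4 (slice2)  -- tuples re-hashed on the join key
--             -> Hash Join                    -- runs independently on each segment (slice1)
--                  -> Seq Scan on tb_baike h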
Federated Data Access
MADlib Extension
Adding the madlib extension
Download MADlib 1.16+8 for RHEL 7 from Pivotal Network.
Install it with gppkg:
This failed with an error:
gt@vm1:~/madlib-1.16+8-gp6-rhel7-x86_64$ gppkg -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg
20200113:10:01:15:018921 gppkg:vm1:gt-[INFO]:-Starting gppkg with args: -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg
20200113:10:01:15:018921 gppkg:vm1:gt-[CRITICAL]:-gppkg failed. (Reason='__init__() takes exactly 17 arguments (16 given)') exiting...
In the end I reinstalled the operating system with CentOS 7. Pivotal Network states which systems madlib can be installed on: only binary packages for Redhat 6.x and Redhat 7.x are provided, so the Ubuntu 18.04 I had installed earlier could not use them. Redhat Enterprise Linux is beyond my budget, so I went with a free option, Fedora or CentOS. I installed CentOS 7:
[gpadmin@vm1 ~]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
The Greenplum I installed is version 6.3, so the file to download is madlib-1.16+9-gp6-rhel7-x86_64.tar.gz. After extraction:
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ ls -la
total 3048
drwxr-xr-x. 2 gpadmin gpadmin 147 Jan 14 16:51 .
drwx------. 7 gpadmin gpadmin 4096 Jan 14 16:51 ..
-rw-r--r--. 1 gpadmin gpadmin 2904455 Jan 10 07:31 madlib-1.16+9-gp6-rhel7-x86_64.gppkg
-rw-r--r--. 1 gpadmin gpadmin 135530 Jan 10 07:31 open_source_license_MADlib_1.16_GA.txt
-rw-r--r--. 1 gpadmin gpadmin 61836 Jan 10 07:31 ReleaseNotes.txt
The madlib-1.16+9-gp6-rhel7-x86_64.gppkg file inside must be installed with Greenplum's gppkg tool. Usage:
${GPHOME}/bin/gppkg [-i <package> | -u <package> | -r <name-version> | -c]
[-d <master_data_directory>] [-a] [-v]
Install:
${GPHOME}/bin/gppkg -i madlib-1.16+9-gp6-rhel7-x86_64.gppkg
Then install the MADlib objects into a database:
$GPHOME/madlib/bin/madpack install -s madlib -p greenplum -c gpadmin@vm1:5432/testdb
The installation log follows (m4 must be installed beforehand: yum install m4):
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ $GPHOME/madlib/bin/madpack install -p greenplum
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
madpack.py: INFO : *** Installing MADlib ***
madpack.py: INFO : MADlib tools version = 1.16 (/usr/local/greenplum-db-6.3.0/madlib/Versions/1.16/bin/../madpack/madpack.py)
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
madpack.py: INFO : Testing PL/Python environment...
madpack.py: INFO : > PL/Python environment OK (version: 2.7.12)
madpack.py: INFO : > Preparing objects for the following modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
madpack.py: INFO : > - crf
madpack.py: INFO : > - elastic_net
madpack.py: INFO : > - linalg
madpack.py: INFO : > - pmml
madpack.py: INFO : > - prob
madpack.py: INFO : > - sketch
madpack.py: INFO : > - svec
madpack.py: INFO : > - svm
madpack.py: INFO : > - tsa
madpack.py: INFO : > - stemmer
madpack.py: INFO : > - conjugate_gradient
madpack.py: INFO : > - knn
madpack.py: INFO : > - lda
madpack.py: INFO : > - stats
madpack.py: INFO : > - svec_util
madpack.py: INFO : > - utilities
madpack.py: INFO : > - assoc_rules
madpack.py: INFO : > - convex
madpack.py: INFO : > - deep_learning
madpack.py: INFO : > - glm
madpack.py: INFO : > - graph
madpack.py: INFO : > - linear_systems
madpack.py: INFO : > - recursive_partitioning
madpack.py: INFO : > - regress
madpack.py: INFO : > - sample
madpack.py: INFO : > - summary
madpack.py: INFO : > - kmeans
madpack.py: INFO : > - pca
madpack.py: INFO : > - validation
madpack.py: INFO : Installing MADlib:
madpack.py: INFO : > Created madlib schema
madpack.py: INFO : > Created madlib.MigrationHistory table
madpack.py: INFO : > Wrote version info in MigrationHistory table
madpack.py: INFO : MADlib 1.16 installed successfully in madlib schema.
Using madlib
An error (in database db_kg):
ERROR: schema "madlib" does not exist
This is because no database was specified when installing madlib with madpack, so using madlib in db_kg fails. Check with madpack install-check:
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum -c gpadmin@vm1:5432/db_madlib_demo
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
madpack.py: INFO : MADlib is not installed in the schema madlib. Install-check stopped.
This confirms that madlib is not installed in database db_kg. Looking back at the madpack install log above, the database it picked by default was gpadmin, i.e. the database named after the user:
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
Running install-check without specifying a database confirms that MADlib is indeed installed in that default database:
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum
madpack.py: INFO : Detected Greenplum DB version 6.3.0.
TEST CASE RESULT|Module: array_ops|array_ops.ic.sql_in|PASS|Time: 203 milliseconds
TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 1102 milliseconds
TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 931 milliseconds
TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 927 milliseconds
TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 1041 milliseconds
TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 274 milliseconds
TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 4158 milliseconds
TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 2050 milliseconds
TEST CASE RESULT|Module: pmml|pmml.ic.sql_in|PASS|Time: 2597 milliseconds
TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 67 milliseconds
TEST CASE RESULT|Module: svm|svm.ic.sql_in|PASS|Time: 1479 milliseconds
TEST CASE RESULT|Module: tsa|arima.ic.sql_in|PASS|Time: 2058 milliseconds
TEST CASE RESULT|Module: stemmer|porter_stemmer.ic.sql_in|PASS|Time: 107 milliseconds
TEST CASE RESULT|Module: conjugate_gradient|conj_grad.ic.sql_in|PASS|Time: 555 milliseconds
TEST CASE RESULT|Module: knn|knn.ic.sql_in|PASS|Time: 574 milliseconds
TEST CASE RESULT|Module: lda|lda.ic.sql_in|PASS|Time: 641 milliseconds
TEST CASE RESULT|Module: stats|anova_test.ic.sql_in|PASS|Time: 147 milliseconds
TEST CASE RESULT|Module: stats|chi2_test.ic.sql_in|PASS|Time: 174 milliseconds
TEST CASE RESULT|Module: stats|correlation.ic.sql_in|PASS|Time: 426 milliseconds
TEST CASE RESULT|Module: stats|cox_prop_hazards.ic.sql_in|PASS|Time: 586 milliseconds
TEST CASE RESULT|Module: stats|f_test.ic.sql_in|PASS|Time: 129 milliseconds
TEST CASE RESULT|Module: stats|ks_test.ic.sql_in|PASS|Time: 138 milliseconds
TEST CASE RESULT|Module: stats|mw_test.ic.sql_in|PASS|Time: 118 milliseconds
TEST CASE RESULT|Module: stats|pred_metrics.ic.sql_in|PASS|Time: 779 milliseconds
TEST CASE RESULT|Module: stats|robust_and_clustered_variance_coxph.ic.sql_in|PASS|Time: 864 milliseconds
TEST CASE RESULT|Module: stats|t_test.ic.sql_in|PASS|Time: 145 milliseconds
TEST CASE RESULT|Module: stats|wsr_test.ic.sql_in|PASS|Time: 149 milliseconds
TEST CASE RESULT|Module: utilities|encode_categorical.ic.sql_in|PASS|Time: 425 milliseconds
TEST CASE RESULT|Module: utilities|minibatch_preprocessing.ic.sql_in|PASS|Time: 534 milliseconds
TEST CASE RESULT|Module: utilities|path.ic.sql_in|PASS|Time: 387 milliseconds
TEST CASE RESULT|Module: utilities|pivot.ic.sql_in|PASS|Time: 287 milliseconds
TEST CASE RESULT|Module: utilities|sessionize.ic.sql_in|PASS|Time: 245 milliseconds
TEST CASE RESULT|Module: utilities|text_utilities.ic.sql_in|PASS|Time: 334 milliseconds
TEST CASE RESULT|Module: utilities|transform_vec_cols.ic.sql_in|PASS|Time: 427 milliseconds
TEST CASE RESULT|Module: utilities|utilities.ic.sql_in|PASS|Time: 311 milliseconds
TEST CASE RESULT|Module: assoc_rules|assoc_rules.ic.sql_in|PASS|Time: 1328 milliseconds
TEST CASE RESULT|Module: convex|lmf.ic.sql_in|PASS|Time: 409 milliseconds
TEST CASE RESULT|Module: convex|mlp.ic.sql_in|PASS|Time: 2032 milliseconds
TEST CASE RESULT|Module: deep_learning|keras_model_arch_table.ic.sql_in|PASS|Time: 498 milliseconds
TEST CASE RESULT|Module: glm|glm.ic.sql_in|PASS|Time: 4514 milliseconds
TEST CASE RESULT|Module: graph|graph.ic.sql_in|PASS|Time: 4471 milliseconds
TEST CASE RESULT|Module: linear_systems|dense_linear_sytems.ic.sql_in|PASS|Time: 313 milliseconds
TEST CASE RESULT|Module: linear_systems|sparse_linear_sytems.ic.sql_in|PASS|Time: 376 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|decision_tree.ic.sql_in|PASS|Time: 807 milliseconds
TEST CASE RESULT|Module: recursive_partitioning|random_forest.ic.sql_in|PASS|Time: 649 milliseconds
TEST CASE RESULT|Module: regress|clustered.ic.sql_in|PASS|Time: 549 milliseconds
TEST CASE RESULT|Module: regress|linear.ic.sql_in|PASS|Time: 104 milliseconds
TEST CASE RESULT|Module: regress|logistic.ic.sql_in|PASS|Time: 720 milliseconds
TEST CASE RESULT|Module: regress|marginal.ic.sql_in|PASS|Time: 1230 milliseconds
TEST CASE RESULT|Module: regress|multilogistic.ic.sql_in|PASS|Time: 1052 milliseconds
TEST CASE RESULT|Module: regress|robust.ic.sql_in|PASS|Time: 498 milliseconds
TEST CASE RESULT|Module: sample|balance_sample.ic.sql_in|PASS|Time: 389 milliseconds
TEST CASE RESULT|Module: sample|sample.ic.sql_in|PASS|Time: 79 milliseconds
TEST CASE RESULT|Module: sample|stratified_sample.ic.sql_in|PASS|Time: 252 milliseconds
TEST CASE RESULT|Module: sample|train_test_split.ic.sql_in|PASS|Time: 506 milliseconds
TEST CASE RESULT|Module: summary|summary.ic.sql_in|PASS|Time: 457 milliseconds
TEST CASE RESULT|Module: kmeans|kmeans.ic.sql_in|PASS|Time: 2581 milliseconds
TEST CASE RESULT|Module: pca|pca.ic.sql_in|PASS|Time: 4804 milliseconds
TEST CASE RESULT|Module: pca|pca_project.ic.sql_in|PASS|Time: 1948 milliseconds
TEST CASE RESULT|Module: validation|cross_validation.ic.sql_in|PASS|Time: 756 milliseconds
So the fix is simply to install madlib into db_kg as well, as shown below.
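Following the same connection-string pattern as the install-check invocation above, the command should be:
$GPHOME/madlib/bin/madpack install -p greenplum -c gpadmin@vm1:5432/db_kg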
To check from inside a database whether madlib is installed:
[gpadmin@vm1 ~]$ psql -d db_madlib_demo
psql (9.4.24)
Type "help" for help.
db_madlib_demo=# \dn madlib
List of schemas
Name | Owner
--------+---------
madlib | gpadmin
(1 row)
The actual content starts here…
MADlib usage is covered in detail in the MADlib Documentation. There is far too much to cover, so I will pick out just a few cases to study.
Array Operations
Array operations, copied from the Array Operations documentation:
Operation | Description |
---|---|
array_add() | Adds two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
sum() | Aggregate, sums vector element-wisely. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sub() | Subtracts two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_mult() | Element-wise product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_div() | Element-wise division of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_dot() | Dot-product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_contains() | Checks whether one array contains the other. This function returns TRUE if each non-zero element in the right array equals to the element with the same index in the left array. |
array_max() | This function finds the maximum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_max_index() | This function finds the maximum value and corresponding index in the array. NULLs are ignored. Return type is array in format [max, index], and its element type is the same as the input type. |
array_min() | This function finds the minimum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_min_index() | This function finds the minimum value and corresponding index in the array. NULLs are ignored. Return type is array in format [min, index], and its element type is the same as the input type. |
array_sum() | This function finds the sum of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_sum_big() | This function finds the sum of the values in the array. NULLs are ignored. Return type is always FLOAT8 regardless of input. This function is meant to replace array_sum() in cases when a sum may overflow the element type. |
array_abs_sum() | This function finds the sum of abs of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_abs() | This function takes an array as the input and finds abs of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_mean() | This function finds the mean of the values in the array. NULLs are ignored. |
array_stddev() | This function finds the standard deviation of the values in the array. NULLs are ignored. |
array_of_float() | This function creates an array of set size (the argument value) of FLOAT8, initializing the values to 0.0. |
array_of_bigint() | This function creates an array of set size (the argument value) of BIGINT, initializing the values to 0. |
array_fill() | This function sets every value in the array to some desired value (provided as the argument). |
array_filter() | This function takes an array as the input and keep only elements that satisfy the operator on specified scalar. It requires that the array is 1-D and all the values are NON-NULL. Return type is the same as the input type. By default, this function removes all zeros. |
array_scalar_mult() | This function takes an array as the input and executes element-wise multiplication by the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_scalar_add() | This function takes an array as the input and executes element-wise addition of the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sqrt() | This function takes an array as the input and finds square root of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_pow() | This function takes an array and a float8 as the input and finds power of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_square() | This function takes an array as the input and finds square of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
normalize() | This function normalizes an array as sum of squares to be 1. It requires that the array is 1-D and all the values are NON-NULL. |
array_unnest_2d_to_1d() | This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arrays that correspond to rows of the input array as well as an ID column with values corresponding to row positions occupied by those 1-D arrays within the 2-D array. |
Examples:
- Create a table and insert data:
CREATE TABLE array_tbl (
    id integer NOT NULL PRIMARY KEY,
    array1 integer[],
    array2 integer[]
);
INSERT INTO array_tbl VALUES
    ( 1, '{1,2,3,4,5,6,7,8,9}', '{9,8,7,6,5,4,3,2,1}' ),
    ( 2, '{1,1,0,1,1,2,3,99,8}', '{0,0,0,-5,4,1,1,7,6}' );
- Use the functions:
db_madlib_demo=# SELECT id,
    madlib.array_sum(array1),
    madlib.array_sub(array2, array1),
    madlib.array_max(array1),
    madlib.array_min(array1),
    madlib.array_mean(array1),
    madlib.normalize(array1)
FROM array_tbl GROUP BY id;
 id | array_sum |          array_sub          | array_max | array_min |    array_mean    | normalize
----+-----------+-----------------------------+-----------+-----------+------------------+-----------
  1 |        45 | {8,6,4,2,0,-2,-4,-6,-8}     |         9 |         1 |                5 | {0.0592348877759092,0.118469775551818,0.177704663327728,0.236939551103637,0.296174438879546,0.355409326655455,0.414644214431365,0.473879102207274,0.533113989983183}
  2 |       116 | {-1,-1,0,-6,3,-1,-2,-92,-2} |        99 |         0 | 12.8888888888889 | {0.0100595273380576,0.0100595273380576,0,0.0100595273380576,0.0100595273380576,0.0201190546761152,0.0301785820141728,0.995893206467704,0.0804762187044609}
(2 rows)
Low-Rank Matrix Factorization
This module implements factorization of an incomplete matrix using a low-rank approximation.
Mathematically, the model seeks matrices $U$ and $V$ (also referred to as factors) that, for any given incomplete matrix $A$, minimize
$$\lVert A - UV^{T} \rVert_{2} \quad \text{s.t.} \quad \operatorname{rank}(UV^{T}) \le r,$$
where $\lVert \cdot \rVert_{2}$ denotes the Frobenius norm. Let $A$ be an $m \times n$ matrix; then $U$ is $m \times r$ and $V$ is $n \times r$ in dimension, with $1 \le r \ll \min(m, n)$. This model is not intended to do the full decomposition, or to be used as part of an inverse procedure. It has been widely used in recommendation systems (e.g., Netflix [2]) and feature selection (e.g., image processing [3]).
Function Syntax
lmf_igd_run( rel_output, # TEXT. The name of the table to receive the output.
rel_source, # TEXT. The name of the table containing the input data.{sparse}
col_row, # TEXT. The name of the column containing the row number.
col_column, # TEXT. The name of the column containing the column number.
col_value, # DOUBLE PRECISION. The value at (row, col).
row_dim, # INTEGER, default: "SELECT max(col_row) FROM rel_source".
column_dim, # INTEGER, default: "SELECT max(col_col) FROM rel_source".
max_rank, # INTEGER, default: 20. The rank of desired approximation.
stepsize, # DOUBLE PRECISION, default: 0.01. Hyper-parameter that decides how aggressive the gradient steps are.
scale_factor, # DOUBLE PRECISION, default: 0.1. Hyper-parameter that decides scale of initial factors.
num_iterations, # INTEGER, default: 10. Maximum number of iterations to perform regardless of convergence.
tolerance # DOUBLE PRECISION, default: 0.0001. Acceptable level of error in convergence.
)
Examples
- Create a table and insert data:
CREATE TABLE lmf_data (row INT, col INT, val FLOAT8, PRIMARY KEY (row, col));
INSERT INTO lmf_data VALUES (1, 1, 5.0), (3, 100, 1.0), (999, 10000, 2.0);
- Execute the function:
db_madlib_demo=# SELECT madlib.lmf_igd_run('lmf_model', 'lmf_data', 'row', 'col', 'val', 999,10000, 3, 0.1, 2, 10, 1e-9 );
NOTICE: Matrix lmf_data to be factorized: 999 x 10000
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT: SQL statement "
CREATE TABLE lmf_model (
id SERIAL,
matrix_u DOUBLE PRECISION[],
matrix_v DOUBLE PRECISION[],
rmse DOUBLE PRECISION)"
PL/pgSQL function madlib.lmf_igd_run(character varying,regclass,character varying,character varying,character varying,integer,integer,integer,double precision,double precision,integer,double precision) line 47 at EXECUTE statement
NOTICE:
Finished low-rank matrix factorization using incremental gradient
DETAIL:
* table : lmf_data (row, col, val)
Results:
* RMSE = 3.61661832699015e-06
Output:
* view : SELECT * FROM lmf_model WHERE id = 1
lmf_igd_run
-------------
1
(1 row)
- Check the result:
db_madlib_demo=# SELECT array_dims(matrix_u) AS u_dims, array_dims(matrix_v) AS v_dims
FROM lmf_model WHERE id = 1;
    u_dims    |     v_dims
--------------+----------------
 [1:999][1:3] | [1:10000][1:3]
(1 row)
- Query the result values:
db_madlib_demo=# SELECT matrix_u[2:2][1:3] AS row_2_features FROM lmf_model WHERE id = 1;
                     row_2_features
---------------------------------------------------------
 {{1.97037281095982,0.312463999725878,1.06016968935728}}
(1 row)
- Make a prediction for a missing entry (row=2, col=7654):
db_madlib_demo=# SELECT madlib.array_dot( matrix_u[2:2][1:3], matrix_v[7654:7654][1:3] ) AS row_2_col_7654
FROM lmf_model WHERE id = 1;
  row_2_col_7654
------------------
 2.37682774869935
(1 row)
Singular Value Decomposition
# SVD Function for Dense Matrices
svd( source_table, # TEXT. Source table name (dense matrix).
output_table_prefix, # TEXT. Prefix for output tables.
row_id, # TEXT. ID for each row.
k, # INTEGER. Number of singular values to compute.
n_iterations, # INTEGER. Number of iterations to run.
result_summary_table # TEXT. The name of the table to store the result summary.
);
# SVD Function for Sparse Matrices
svd_sparse( source_table,
output_table_prefix,
row_id,
col_id,
value,
row_dim,
col_dim,
k,
n_iterations,
result_summary_table
);
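As a quick illustration of the dense variant, here is a minimal sketch modeled on the MADlib documentation's example; the mat table with its row_id/row_vec layout and the output names are assumptions:
-- a small 3x3 dense matrix, one row per tuple
CREATE TABLE mat (row_id INTEGER, row_vec DOUBLE PRECISION[]);
INSERT INTO mat VALUES
    (1, '{1, 2, 3}'),
    (2, '{4, 5, 6}'),
    (3, '{7, 8, 9}');

SELECT madlib.svd('mat',           -- source_table
                  'svd_out',       -- output_table_prefix
                  'row_id',        -- row_id
                  2,               -- k: number of singular values
                  NULL,            -- n_iterations: use the default
                  'svd_summary');  -- result_summary_table

-- singular values land in svd_out_s; U and V in svd_out_u / svd_out_v
SELECT * FROM svd_out_s;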
Neural Network
Classification Training Function
mlp_classification(
source_table,
output_table,
independent_varname,
dependent_varname,
hidden_layer_sizes,
optimizer_params,
activation,
weights,
warm_start,
verbose,
grouping_col
)
Regression Training Function
mlp_regression(
source_table,
output_table,
independent_varname,
dependent_varname,
hidden_layer_sizes,
optimizer_params,
activation,
weights,
warm_start,
verbose,
grouping_col
)
Optimizer Parameters
'learning_rate_init = <value>,
learning_rate_policy = <value>,
gamma = <value>,
power = <value>,
iterations_per_step = <value>,
n_iterations = <value>,
n_tries = <value>,
lambda = <value>,
tolerance = <value>,
batch_size = <value>,
n_epochs = <value>,
momentum = <value>,
nesterov = <value>'
Prediction Function
mlp_predict(
model_table,
data_table,
id_col_name,
output_table,
pred_type
)
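To tie the three pieces together, a minimal training-plus-prediction sketch; the iris_data table and its columns are assumptions in the style of the MADlib examples, and the optimizer string shows how the parameters above are passed:
-- train a classifier with one hidden layer of 5 units
SELECT madlib.mlp_classification(
    'iris_data',      -- source_table: rows of (id, attributes, class_text); assumed
    'mlp_model',      -- output_table
    'attributes',     -- independent_varname: a DOUBLE PRECISION[] feature vector
    'class_text',     -- dependent_varname
    ARRAY[5],         -- hidden_layer_sizes
    'learning_rate_init=0.003, n_iterations=500, tolerance=0',  -- optimizer_params
    'tanh'            -- activation
);

-- predict class labels for the same rows
SELECT madlib.mlp_predict(
    'mlp_model',      -- model_table
    'iris_data',      -- data_table
    'id',             -- id_col_name
    'mlp_prediction', -- output_table
    'response'        -- pred_type: return predicted class labels
);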
…
There are many more functions that I have not tried one by one; I will look them up when I need them.
All in all, this feels like an edge use case for a database. GreenPlum's core selling point is analytics, but MADlib is an Apache project, so can it really count as a Greenplum selling point? Doing data analysis inside the database also feels slightly off to me. On top of that, Greenplum itself is not all that stable and still has quite a few bugs; the bulk load earlier failed (gpfdist). That said, the tool is convenient: the data gets processed right where it is stored. In short, it is good to know this exists.
Source: CSDN
Author: HTMK_GT_MK
Link: https://blog.csdn.net/HTMK_GT_MK/article/details/103984356