01
物化视图的使用场景
02
物化视图的分类
同步物化视图需要与基表的数据保持强一致性。 异步物化视图与基表的数据保持最终一致性,可能会有一定的延迟。它通常用于对数据时效性要求不高的场景,一般使用 T+1 或小时级别的数据来构建物化视图。如果时效性要求高,则考虑使用同步物化视图。
对于异步物化视图,可以使用单表或多表。 对于同步物化视图,只能使用单表。
对于异步物化视图
对于同步物化视图
这或许是一个对你有用的开源项目,data-warehouse-learning 项目是一套基于 MySQL + Kafka + Hadoop + Hive + Dolphinscheduler + Doris + Seatunnel + Paimon + Hudi + Iceberg + Flink + Dinky + DataRT + SuperSet 实现的实时离线数仓(数据湖)系统,以大家最熟悉的电商业务为切入点,详细讲述并实现了数据产生、同步、数据建模、数仓(数据湖)建设、数据服务、BI报表展示等数据全链路处理流程。
https://gitee.com/wzylzjtn/data-warehouse-learning
https://github.com/Mrkuhuo/data-warehouse-learning
https://bigdatacircle.top/
项目演示:
03
同步物化视图
加速耗时的聚合运算 查询需要匹配不同的前缀索引 通过预先过滤减少需要扫描的数据量 通过预先完成复杂的表达式计算来加速查询
同步物化视图只支持针对单个表的 SELECT 语句,支持 WHERE、GROUP BY、ORDER BY 等子句,但不支持 JOIN、HAVING、LIMIT 子句和 LATERAL VIEW。 与异步物化视图不同,不能直接查询同步物化视图。 SELECT 列表中,不能包含自增列,不能包含常量,不能有重复表达式,也不支持窗口函数。 如果 SELECT 列表包含聚合函数,则聚合函数必须是根表达式(不支持 sum(a) + 1,支持 sum(a + 1)),且聚合函数之后不能有其他非聚合函数表达式(例如,SELECT x, sum(a) 可以,而 SELECT sum(a), x 不行)。 如果删除语句的条件列在物化视图中存在,则不能进行删除操作。如果确实需要删除数据,则需要先将物化视图删除,然后才能删除数据。 单表上过多的物化视图会影响导入的效率:导入数据时,物化视图和 Base 表的数据是同步更新的。如果一张表的物化视图表过多,可能会导致导入速度变慢,这就像单次导入需要同时导入多张表的数据一样。 物化视图针对 Unique Key 数据模型时,只能改变列的顺序,不能起到聚合的作用。因此,在 Unique Key 模型上不能通过创建物化视图的方式对数据进行粗粒度的聚合操作。
-- 创建一个 test_db
create database test_db;
use test_db;
-- 创建表
create table sales_records
(
record_id int,
seller_id int,
store_id int,
sale_date date,
sale_amt bigint
)
distributed by hash(record_id)
properties("replication_num" = "1");
-- 插入数据
insert into sales_records values(1,1,1,'2020-02-02',1);
create materialized view store_amt as
select store_id, sum(sale_amt) from sales_records group by store_id;
show alter table materialized view from test_db;
+--------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+----------+------+----------+---------+
| JobId | TableName | CreateTime | FinishTime | BaseIndexName | RollupIndexName | RollupId | TransactionId | State | Msg | Progress | Timeout |
+--------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+----------+------+----------+---------+
| 494349 | sales_records | 2020-07-30 20:04:56 | 2020-07-30 20:04:57 | sales_records | store_amt | 494350 | 133107 | FINISHED | | NULL | 2592000 |
+--------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+----------+------+----------+---------+
cancel alter table materialized view from test_db.sales_records;
desc sales_records all;
+---------------+---------------+---------------------+--------+--------------+------+-------+---------+-------+---------+------------+-------------+
| IndexName | IndexKeysType | Field | Type | InternalType | Null | Key | Default | Extra | Visible | DefineExpr | WhereClause |
+---------------+---------------+---------------------+--------+--------------+------+-------+---------+-------+---------+------------+-------------+
| sales_records | DUP_KEYS | record_id | INT | INT | Yes | true | NULL | | true | | |
| | | seller_id | INT | INT | Yes | true | NULL | | true | | |
| | | store_id | INT | INT | Yes | true | NULL | | true | | |
| | | sale_date | DATE | DATEV2 | Yes | false | NULL | NONE | true | | |
| | | sale_amt | BIGINT | BIGINT | Yes | false | NULL | NONE | true | | |
| | | | | | | | | | | | |
| store_amt | AGG_KEYS | mv_store_id | INT | INT | Yes | true | NULL | | true | `store_id` | |
| | | mva_SUM__`sale_amt` | BIGINT | BIGINT | Yes | false | NULL | SUM | true | `sale_amt` | |
+---------------+---------------+---------------------+--------+--------------+------+-------+---------+-------+---------+------------+-------------+
show create materialized view store_amt on sales_records;
+---------------+-----------+------------------------------------------------------------------------------------------------------------+
| TableName | ViewName | CreateStmt |
+---------------+-----------+------------------------------------------------------------------------------------------------------------+
| sales_records | store_amt | create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id |
+---------------+-----------+------------------------------------------------------------------------------------------------------------+
select store_id, sum(sale_amt) from sales_records group by store_id;
explain select store_id, sum(sale_amt) from sales_records group by store_id;
+-----------------------------------------------------------------------------------------------+
| Explain String(Nereids Planner) |
+-----------------------------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS: |
| store_id[#11] |
| sum(sale_amt)[#12] |
| PARTITION: HASH_PARTITIONED: mv_store_id[#7] |
| |
| HAS_COLO_PLAN_NODE: false |
| |
| VRESULT SINK |
| MYSQL_PROTOCAL |
| |
| 3:VAGGREGATE (merge finalize)(145) |
| | output: sum(partial_sum(mva_SUM__sale_amt)[#8])[#10] |
| | group by: mv_store_id[#7] |
| | sortByGroupKey:false |
| | cardinality=1 |
| | final projections: mv_store_id[#9], sum(mva_SUM__sale_amt)[#10] |
| | final project output tuple id: 4 |
| | distribute expr lists: mv_store_id[#7] |
| | |
| 2:VEXCHANGE |
| offset: 0 |
| distribute expr lists: |
| |
| PLAN FRAGMENT 1 |
| |
| PARTITION: HASH_PARTITIONED: record_id[#2] |
| |
| HAS_COLO_PLAN_NODE: false |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 02 |
| HASH_PARTITIONED: mv_store_id[#7] |
| |
| 1:VAGGREGATE (update serialize)(139) |
| | STREAMING |
| | output: partial_sum(mva_SUM__sale_amt[#1])[#8] |
| | group by: mv_store_id[#0] |
| | sortByGroupKey:false |
| | cardinality=1 |
| | distribute expr lists: |
| | |
| 0:VOlapScanNode(136) |
| TABLE: test_db.sales_records(store_amt), PREAGGREGATION: ON |
| partitions=1/1 (sales_records) |
| tablets=10/10, tabletList=494505,494507,494509 ... |
| cardinality=1, avgRowSize=0.0, numNodes=1 |
| pushAggOp=NONE |
| |
| |
| Statistics |
| planed with unknown column statistics |
+-----------------------------------------------------------------------------------------------+
drop materialized view store_amt on sales_records;
04
异步物化视图
CREATE TABLE `t1` (
`user_id` LARGEINT NOT NULL,
`o_date` DATE NOT NULL,
`num` SMALLINT NOT NULL
) ENGINE=OLAP
COMMENT 'OLAP'
PARTITION BY RANGE(`o_date`)
(
PARTITION p20170101 VALUES [('2017-01-01'), ('2017-01-02')),
PARTITION p20170102 VALUES [('2017-01-02'), ('2017-01-03')),
PARTITION p20170201 VALUES [('2017-02-01'), ('2017-02-02'))
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 2
PROPERTIES ('replication_num' = '1') ;
CREATE TABLE `t2` (
`user_id` LARGEINT NOT NULL,
`age` SMALLINT NOT NULL
) ENGINE=OLAP
PARTITION BY LIST(`age`)
(
PARTITION `p1` VALUES IN ('1'),
PARTITION `p2` VALUES IN ('2')
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 2
PROPERTIES ('replication_num' = '1') ;
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by(`order_date`)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT t1.o_date as order_date, t1.user_id as user_id, t1.num, t2.age FROM t1 join t2 on t1.user_id=t2.user_id;
[('2017-01-01'), ('2017-01-02')) [('2017-01-02'), ('2017-01-03')) [('2017-02-01'), ('2017-02-02'))
CREATE MATERIALIZED VIEW mv2
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by(`age`)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT t1.o_date as order_date, t1.user_id as user_id, t1.num, t2.age FROM t1 join t2 on t1.user_id=t2.user_id;
('1') ('2')
CREATE TABLE hive1 (
`k1` int)
PARTITIONED BY (
`year` int,
`region` string)
STORED AS ORC;
alter table hive1 add if not exists
partition(year=2020,region="bj")
partition(year=2020,region="sh")
partition(year=2021,region="bj")
partition(year=2021,region="sh")
partition(year=2022,region="bj")
partition(year=2022,region="sh")
CREATE MATERIALIZED VIEW mv_hive
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by(`year`)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
AS
SELECT k1,year,region FROM hive1;
CREATE MATERIALIZED VIEW mv_hive2
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by(`region`)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
AS
SELECT k1,year,region FROM hive1;
CREATE TABLE t1 (
`k1` INT,
`k2` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`k1`)
COMMENT 'OLAP'
PARTITION BY range(`k2`)
(
PARTITION p26 VALUES [("2024-03-26"),("2024-03-27")),
PARTITION p27 VALUES [("2024-03-27"),("2024-03-28")),
PARTITION p28 VALUES [("2024-03-28"),("2024-03-29"))
)
DISTRIBUTED BY HASH(`k1`) BUCKETS 2
PROPERTIES (
'replication_num' = '1'
);
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by(`k2`)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1',
'partition_sync_limit'='1',
'partition_sync_time_unit'='DAY'
)
AS
SELECT * FROM t1;
List 分区
CREATE TABLE `t1` (
`k1` INT NOT NULL,
`k2` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`k1`)
COMMENT 'OLAP'
PARTITION BY list(`k2`)
(
PARTITION p_20200101 VALUES IN ("2020-01-01"),
PARTITION p_20200102 VALUES IN ("2020-01-02"),
PARTITION p_20200201 VALUES IN ("2020-02-01")
)
DISTRIBUTED BY HASH(`k1`) BUCKETS 2
PROPERTIES ('replication_num' = '1') ;
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by (date_trunc(`k2`,'month'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT * FROM t1;
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by (date_trunc(`k2`,'year'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT * FROM t1;
Range 分区
CREATE TABLE `t1` (
`k1` LARGEINT NOT NULL,
`k2` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`k1`)
COMMENT 'OLAP'
PARTITION BY range(`k2`)
(
PARTITION p_20200101 VALUES [("2020-01-01"),("2020-01-02")),
PARTITION p_20200102 VALUES [("2020-01-02"),("2020-01-03")),
PARTITION p_20200201 VALUES [("2020-02-01"),("2020-02-02"))
)
DISTRIBUTED BY HASH(`k1`) BUCKETS 2
PROPERTIES ('replication_num' = '1') ;
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by (date_trunc(`k2`,'month'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT * FROM t1;
CREATE MATERIALIZED VIEW mv1
BUILD DEFERRED REFRESH AUTO ON MANUAL
partition by (date_trunc(`k2`,'year'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
'replication_num' = '1'
)
AS
SELECT * FROM t1;
REFRESH MATERIALIZED VIEW mvName AUTO;
REFRESH MATERIALIZED VIEW mvName COMPLETE;
REFRESH MATERIALIZED VIEW mvName partitions(partitionName1,partitionName2);
CREATE MATERIALIZED VIEW mv1
REFRESH COMPLETE ON SCHEDULE EVERY 10 hour
partition by(`xxx`)
AS
select ...;
CREATE MATERIALIZED VIEW mv1
REFRESH AUTO ON SCHEDULE EVERY 10 hour
partition by(`xxx`)
AS
select ...;
4.2.3 自动触发
CREATE MATERIALIZED VIEW mv1
REFRESH ON COMMIT
partition by(`xxx`)
AS
select ... from t1;
05
进交流群群添加作者
推荐阅读系列文章
建议收藏 | Dinky系列总结篇 建议收藏 | Flink系列总结篇 建议收藏 | Flink CDC 系列总结篇 建议收藏 | Doris实战文章汇总
如果喜欢 请点个在看分享给身边的朋友