PostgreSQL Internals之路 Part-IV 第17章 关于统计

文摘   科技   2024-11-06 08:28   北京  


微信群:数据库Hacker,  已超200多人,无法通过扫描直接加入。需要入群的朋友,请直接微信联系我(个人微信:_iihero),标上您的全名_数据库Hacker作为备注。欢迎入群。

17.1、基础统计

基本的关系级统计[1],存储在系统目录中的表pg_class里,包含以下数据:

  • 关系中元组的数量 (reltuples)
  • 关系的大小:页数(relpages)
  • VM中标记的页数(relallvisible)

这里列出表flights中的上述值:

=> SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'flights';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
214867 |2624 |2624
(1 row)

如果查询没有强加任何过滤条件,reltuples的值将用作基数的预估值。

=> EXPLAIN SELECT * FROM flights;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

在表分析过程中收集统计信息,包括手动和自动[2]. 此外,由于基本统计数据至关重要,因此在其他一些操作(vacuum full ,cluster, create index 和 reindex)中也要计算这些数据,并在vacuum过程中进行细化处理。

出于分析目的,对300× 100个default_statistics_target随机行进行抽样。建立特定精度的统计所需的样本量对分析数据量的依赖性较低,因此不考虑表的大小[3]

抽样的行是从相同数量的随机页面(300 × default_statistics_target)中挑选的。显然,如果表本身更小,可能会读取更少的页面,并且选择更少的行进行分析。

在大型表中,统计信息收集不包括所有行,因此估计值可能与实际值偏离。这是完全正常的: 如果数据在变化,统计数据不可能总是准确的。精度达到一个数量级通常就足以选择一个适当的计划。

让我们创建一个禁用autovacuum的flights表的副本,这样我们就可以控制自动分析的开始时间:

=> CREATE TABLE flights_copy(LIKE flights)
WITH (autovacuum_enabled = false);

这个时候新表是没有统计信息的:
=> SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
−1 |0 |0
(1 row)

reltuples=-1这个值用于区分表未进行分析 以及 一张没有任何行的空表。

在表创建后,很可能会将一些行插入到表中。因此,由于不知道事情的当前状态,计划器假设表格包含10页:

=> EXPLAIN SELECT * FROM flights_copy;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights_copy (cost=0.00..14.10 rows=410 width=170)
(1 row)

行数是根据单行的大小估计的,在计划中显示为宽度。行宽度通常是在分析期间计算的平均值,但是由于还没有收集统计数据,因此这里它只是基于列数据类型的近似值。

现在我们将表flights中的数据拷出,并执行分析:

=> INSERT INTO flights_copy SELECT * FROM flights;
INSERT 0 214867
=> ANALYZE flights_copy;

收集到的统计会反映出真实的行数(表大小足够小,让分析器去收集所有数据的统计信息):

=> SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages | relallvisible
−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−−−−
214867 |2624 |0
(1 row)

relallvisible的值用于估计index-only扫描的成本。该值由vacuum操作来更新:

=> VACUUM flights_copy;
=> SELECT relallvisible FROM pg_class WHERE relname = 'flights_copy';
relallvisible
−−−−−−−−−−−−−−−
2624
(1 row)

现在我们在没有更新统计的情况下,将行数加倍,再在查询计划里检查基数:

=> INSERT INTO flights_copy SELECT * FROM flights;
=> SELECT count(*) FROM flights_copy;
count
−−−−−−−−
429734
(1 row)
=> EXPLAIN SELECT * FROM flights_copy;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights_copy (cost=0.00..9545.34 rows=429734 width=63)
(1 row)

尽管pg_class中的数据已经过时,估计值仍然得到准确的结果:

=> SELECT reltuples, relpages
FROM pg_class WHERE relname = 'flights_copy';
reltuples | relpages
−−−−−−−−−−−+−−−−−−−−−−
214867 | 2624
(1 row)

问题是,如果计划器看到relpages和实际文件大小之间存在差距,它可以调整reltuples值,以提高估计的准确性[4]由于文件大小是relpages的两倍,所以假设数据密度保持不变,计划器会调整估计的行数:

=> SELECT reltuples *
(pg_relation_size('flights_copy') / 8192) / relpages AS tuples
FROM pg_class WHERE relname = 'flights_copy';
tuples
−−−−−−−−
429734
(1 row)

当然,这样的调整可能并不总是有效(例如,如果我们删除一些行,估计值将保持不变),但是在某些情况下,它允许计划器坚持下去,直到重大更改触发下一个分析运行。

17.2、NULL值

NULL值在关系数据库中仍然扮演着重要的角色,这是理论家们不赞成[5]的: 它们提供了一种方便的方式来反映一个值是未知的或不存在的事实。

然而,一个特殊值需要特殊的处理。除了理论上的不一致之外,还有许多实际的挑战需要考虑。常规布尔逻辑被三值逻辑所取代,因此NOT INT行为不可预料。不清楚NULL值是否应该被视为大于或小于常规值(因此使用NULLS FIRST 和NULLS LAST子句进行排序)。聚合函数是否必须考虑NULL值,这一点并不十分明显。严格地说,NULL值根本不是值,所以计划器需要额外的信息来处理它们。

除了在关系级别收集最简单的基本统计信息外,分析器还收集关系中每列的统计信息。这些数据存储在系统编目[6]的pg_statistic表中,但是您也可以通过pg_stats视图访问它,该视图以更方便的格式提供这些信息。

NULL值的分数属于列级统计; 在分析期间计算,它显示为null_frac属性的值。

例如,当搜索尚未起飞的航班时,我们可以依赖它们未定义的起飞时间:

=> EXPLAIN SELECT * FROM flights WHERE actual_departure IS NULL;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights (cost=0.00..4772.67 rows=16702 width=63)
Filter: (actual_departure IS NULL)
(2 rows)

为了估计结果,计划器将总行数乘以NULL值的分数:

=> SELECT round(reltuples * s.null_frac) AS rows
FROM pg_class
JOIN pg_stats s ON s.tablename = relname
WHERE s.tablename = 'flights'
AND s.attname = 'actual_departure';
rows
−−−−−−−
16702
(1 row)

下边是真实的行数:

=> SELECT count(*) FROM flights WHERE actual_departure IS NULL;
count
−−−−−−−
16348
(1 row)

17.3、不同值(distinct values)

pg_stats视图的n_distinct字段显示列中不同值的数量。

如果n_distinct为负值,则其绝对值表示列中不同值的百分比,而不是它们的实际计数。例如,−1表示所有列值都是唯一的,−3表示每个值平均出现在三行中。如果不同值的估计数量超过总行数的10%,则分析器使用分数; 在这种情况下,进一步的数据更新不太可能改变这个比率。

如果期望均匀的数据分布,则使用不同值的个数来代替。例如,在估计“column = expression”条件的基数时,计划器假设,如果在计划阶段[7]该列的确切值未知,那么该表达式可以以相同的概率取任意列值。

=> EXPLAIN SELECT *
FROM flights
WHERE departure_airport = (
SELECT airport_code FROM airports WHERE city = 'Saint Petersburg'
);
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights (cost=30.56..5340.40 rows=2066 width=63)
Filter: (departure_airport = $0)
InitPlan 1 (returns $0)
−> Seq Scan on airports_data ml (cost=0.00..30.56 rows=1 wi...
Filter: ((city −>> lang()) = 'Saint Petersburg'::text)
(5 rows)

这里InitPlan节点只执行一次,计算值在主计划中使用。

=> SELECT round(reltuples / s.n_distinct) AS rows
FROM pg_class
JOIN pg_stats s ON s.tablename = relname
WHERE s.tablename = 'flights'
AND s.attname = 'departure_airport';
rows
−−−−−−
2066
(1 row)

如果不同值的估计数量不正确(因为已分析的行数有限),则可以在列级别覆盖它:

ALTER TABLE ...
ALTER COLUMN ...
SET (n_distinct = ...);

如果所有数据总是均匀分布,那么这个信息(加上最小值和最大值)就足够了。然而,对于非均匀分布(这在实践中更为常见),这种估计是不准确的:

=> SELECT min(cnt), round(avg(cnt)) avg, max(cnt)
FROM (
SELECT departure_airport, count(*) cnt
FROM flights
GROUP BY departure_airport
) t;
min | avg |max
−−−−−+−−−−−−+−−−−−−−
113 | 2066 | 20875
(1 row)

17.4、最常见值(Most common values)

如果数据分布不均匀,则根据最常见值(MCV)及其频率的统计信息对估计进行微调。pg_stats视图分别在most_common_vals和most_common_freqs字段中显示这些数组。

以下是不同类型飞机的统计数据示例:

=> SELECT most_common_vals AS mcv,
left(most_common_freqs::text,60) || '...' AS mcf
FROM pg_stats
WHERE tablename = 'flights' AND attname = 'aircraft_code' \gx
−[ RECORD 1 ]−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
mcv | {CN1,CR2,SU9,321,733,763,319,773}
mcf | {0.27886668,0.27266666,0.26176667,0.057166666,0.037666667,0....

要估计“column = value”条件的选择性,只需在most_common_vals数组中找到该值,并从具有相同索引的most_common_freqs数组元素中获取其频率[8]:

=> EXPLAIN SELECT * FROM flights WHERE aircraft_code = '733';
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on flights (cost=0.00..5309.84 rows=8093 width=63)
Filter: (aircraft_code = '733'::bpchar)
(2 rows)

=> SELECT round(reltuples * s.most_common_freqs[
array_position((s.most_common_vals::text::text[]),'733')
])
FROM pg_class
JOIN pg_stats s ON s.tablename = relname
WHERE s.tablename = 'flights'
AND s.attname = 'aircraft_code';
round
−−−−−−−
8093
(1 row)

很明显这样的估计值与真实值是相当接近的:

=> SELECT count(*) FROM flights WHERE aircraft_code = '733';
count
−−−−−−−
8263
(1 row)

该MCV列表还用于估计不等式条件的选择性。例如,像“column < value”这样的条件要求分析器在most_common_vals中搜索小于目标值的所有值,并将most_common_freqs中列出的相应频率相加。

当不同的值不太多时,M统计最有效。数组的最大大小由default_statistics_target参数定义,该参数还限制了为了进行分析而随机抽样的行数。

在某些情况下,增加默认参数值是有意义的,这样可以扩展列表并提高估计的准确性。您可以在列级别执行此操作:

ALTER TABLE ...
ALTER COLUMN ...
SET STATISTICS ...;

采样大小也会增长,但只是针对指定的表。

由于数组存储的是实际值,因此可能会占用相当大的空间。为了控制pg_statistic的大小并避免给计划器加载无用的工作,大于1kB的值将被排除在分析和统计之外。但是,由于如此大的值很可能是唯一的,因此它们可能不会被放入most_common_vals中。

17.5、直方图(Histogram)

数据库杂记
数据库技术专家,PostgreSQL ACE,SAP HANA,Sybase ASE/ASA,Oracle,MySQL,SQLite各类数据库, SAP BTP云计算技术, 以及陈式太极拳教学倾情分享。出版过三本技术图书,武术6段。
 最新文章