PG备份出来的数据库表文件比原数据库还大,为啥？

文摘科技 2024-10-05 12:41 北京

我是【Sean】, 欢迎大家长按关注并加星公众号：数据库杂记。有好资源相送，同时为你提供及时更新。同时建了一个群: 数据库Hacker。可以联系我微信邀请进群。已关注的朋友，发送0、1到7，都有好资源相送。

快速扫码入群：

个人微信：_iihero
CSDN: iihero
墨天轮：https://www.modb.pro/u/16258 (Sean)

pgfans: iihero

前言

关于PostgreSQL的备份，前几天，碰到有人问起这个问题。

奇怪吗？

于是有人就说，是不是备份用的全备，把pg_wal里头的大量的WAL文件也算在内了。提问的同学说，不是。他用的就是简单的pg_dump。

这问题有点儿意思。

重现与分析

那么，有没有这种可能呢？既然有人提起这个现象，那大概率是有的。

为了把问题简化，我们就用一张表来模拟。

建表及填充数据

1、不带toast数据

postgres=# create database foo;
CREATE DATABASE

postgres=# \c foo
You are now connected to database "foo" as user "postgres".
foo=# create table test(id int primary key, col2 text);
CREATE TABLE
foo=# insert into test select n, repeat('test ', 200) || n from generate_series(1, 10000) as n;
INSERT 0 10000
foo=# checkpoint;
CHECKPOINT

大意是插入10000条记录到表TEST当中。第二列，差不多每次都是长度1K的样子。

好，接着我们看一看数据库以及相应的表大小：

foo=# select pg_size_pretty(pg_database_size('foo'));
 pg_size_pretty
----------------
 20 MB
(1 row)

foo=# select pg_total_relation_size('test');
 pg_total_relation_size
------------------------
               11993088
(1 row)

我们再将数据库foo导出看看到底有多大？

[08:57:01-postgres@centos1:/var/lib/pgsql]$ pg_dump -d foo -f /iihero/tmp/foo.dat
[08:57:27-postgres@centos1:/var/lib/pgsql]$ ls -la /iihero/tmp/foo.dat
-rw-r--r-- 1 postgres postgres 10098818 Oct  5 08:57 /iihero/tmp/foo.dat

导出来的文件比test表的总大小还略小一点。再考查一下toast表：

foo=# select oid, relname, reltoastrelid, (select relname from pg_class where oid=a.reltoastrelid) as toastname from pg_class a where relname in ('test');
  oid  | relname | reltoastrelid |   toastname
-------+---------+---------------+----------------
 34169 | test    |         34172 | pg_toast_34169
(1 row)

foo=# select count(*) from pg_toast.pg_toast_34169;
 count
-------
     0
(1 row)

foo=# select pg_total_relation_size('pg_toast.pg_toast_34169');
 pg_total_relation_size
------------------------
                   8192
(1 row)

2、使用带有toast数据的表来看看：

foo=# drop table test;
DROP TABLE
foo=# select pg_size_pretty(pg_database_size('foo'));
 pg_size_pretty
----------------
 8809 kB
(1 row)

CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
    SELECT string_agg(substring('abcdefghijiklmnopqrstuvwxyz', round(random() * 25)::integer, 1), '') FROM generate_series(1, $1);
$$ language sql;

 create table test(id int primary key, col2 text);
 insert into test select n, random_string(1024) from generate_series(1, 200000) as n;
INSERT 0 200000

foo=# select pg_size_pretty(pg_database_size('foo'));
 pg_size_pretty
----------------
 236 MB
(1 row)
[09:12:19-postgres@centos1:/var/lib/pgsql]$ pg_dump -d foo -f /iihero/tmp/foo.dat
[09:13:42-postgres@centos1:/var/lib/pgsql]$  ls -lah /iihero/tmp/foo.dat
-rw-r--r-- 1 postgres postgres 193M Oct  5 09:13 /iihero/tmp/foo.dat

 pg_dump -d foo -F c -f /iihero/tmp/foo.dat
  ls -lah /iihero/tmp/foo.dat
-rw-r--r-- 1 postgres postgres 122M Oct  5 09:20 /iihero/tmp/foo.dat

从上边来看，常规的这些值，似乎备份出来的文件，哪怕是针对单个表，依然大小是小于原库的大小的。

3、使用toast数据，并且会出现压缩

下边的情形，就以一条记录为例：

create table test(id int , col2 text);

foo=# insert into test values(1, random_string(20));
INSERT 0 1

foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1
foo=# update test set col2 = col2 || col2;
UPDATE 1

-- 执行update，来个15次
foo=# select length(col2) from test;
 length
--------
 622592
(1 row)

再看看相关大小：

foo=# select oid, relname, reltoastrelid, (select relname from pg_class where oid=a.reltoastrelid) as toastname from pg_class a where relname in ('test');
  oid  | relname | reltoastrelid |   toastname
-------+---------+---------------+----------------
 34210 | test    |         34213 | pg_toast_34210
(1 row)

foo=# select ctid, chunk_id, chunk_seq, length(chunk_data) from pg_toast.pg_toast_34210;
 ctid  | chunk_id | chunk_seq | length
-------+----------+-----------+--------
 (0,3) |    34216 |         0 |   1996
 (0,4) |    34216 |         1 |   1996
 (1,1) |    34216 |         2 |   1996
 (1,2) |    34216 |         3 |   1166
(4 rows)

这意味着622592的总长，压缩至1996 + 1996 + 1996 + 1166个字节。

foo=# select pg_total_relation_size('test');
 pg_total_relation_size
------------------------
                  65536
(1 row)

[09:36:43-postgres@centos1:/iihero/tmp]$ pg_dump -d foo -f /iihero/tmp/foo.sql
[12:29:04-postgres@centos1:/iihero/tmp]$ ls -lrt /iihero/tmp/foo.sql
-rw-r--r-- 1 postgres postgres 623834 Oct  5 12:29 /iihero/tmp/foo.sql

foo=# select pg_relation_filepath('test');
 pg_relation_filepath
----------------------
 base/34168/34210
(1 row)

foo=# \! ls -lrt $PGDATA/base/34168/34210
-rw------- 1 postgres postgres 8192 Oct  5 12:28 /var/lib/pgsql/14/data/base/34168/34210
foo=# select pg_relation_filepath('pg_toast.pg_toast_34210');
 pg_relation_filepath
----------------------
 base/34168/34213
(1 row)

foo=# \! ls -lrt $PGDATA/base/34168/34213
-rw------- 1 postgres postgres 16384 Oct  5 12:28 /var/lib/pgsql/14/data/base/34168/34213

从上边，我们能清楚的看到，导出的文件，大小为: 623834B.

而数据库中原表的文件大小分别为：test:　8192B， pg_toast_34210: 16384B.

导出的文件大小明显大于表数据文件的大小。

什么原因：因为toast数据被高度压缩了。

这是单个表是这样的，如果这样的表多了，组成源库的所有的表文件的大小与导出的文件的大小的差距就会进一步拉大。最终就有可能导出的文件比源库大小还要大。

往期导读:
1. PostgreSQL中配置单双向SSL连接详解
2. 提升PSQL使用技巧：PostgreSQL中PSQL使用技巧汇集(1)
3. 提升PSQL使用技巧：PostgreSQL中PSQL使用技巧汇集(2)
4. PostgreSQL SQL的基础使用及技巧
5. PostgreSQL开发技术基础:过程与函数
6. PostgreSQL中vacuum 物理文件truncate发生的条件
7. PostgreSQL中表的年龄与Vacuum的实验探索：Vacuum有大用
8. PostgreSQL利用分区表来弥补AutoVacuum的不足
9. 也聊聊PostgreSQL中的空间膨胀与AutoVacuum
10. 正确理解SAP BTP中hyperscaler PG中的IOPS (AWS篇)

http://mp.weixin.qq.com/s?__biz=MzAxODk3Nzg5Nw==&mid=2247487709&idx=1&sn=4fa268c119dc217ef56429a164308702

数据库杂记

数据库技术专家，PostgreSQL ACE,SAP HANA,Sybase ASE/ASA,Oracle,MySQL,SQLite各类数据库, SAP BTP云计算技术, 以及陈式太极拳教学倾情分享。出版过三本技术图书，武术6段。

最新文章

PostgreSQL Internals之路 Part-IV 第20章索引扫描

DuckDB是真火！PolarDB都跑来蹭热度了

PostgreSQL Internals之路 Part-IV 第19章索引访问方法

“丁奇MySQL Seed” 群答疑: 最快地恢复表结构的方案

回忆我的外公外婆

PostgreSQL Internals之路 Part-IV 第18章表访问方法

能载舟也能覆舟,也聊聊商业(数据库)软件中的license

PostgreSQL Internals之路 Part-IV 第17章关于统计

PostgreSQL Internals之路 Part-IV 第16章查询执行各阶段

PostgreSQL Internals之路 Part-III 第15章内存结构上的锁详细介绍

PostgreSQL17为事件触发器添加LOGIN事件

如何优化Postgres表布局以获得最大的效率

太极拳给我带来了什么

实锤！SAP大中华区总裁黄陈宏去职，“在中国，为中国”谁来扛旗？

对想学习PG数据库的朋友的一些建议

PostgreSQL Internals之路 Part-III 第14章其它各类锁

PostgreSQL中有许多个timeout，逐个弄懂它们的含义和使用

压测也不难，使用HyBench对PostgreSQL进行简单压测尝尝鲜

强烈推荐！！！阿里顶级Java开源项目Top20

一文简单搞懂PostgreSQL中OOM几个重要的相关内核参数

PostgreSQL Internals之路 Part-III 第13章行级锁

10月5日WTT中国大满贯国乒双打单打全收老将新将尽显风采