手动执行VACUUM,无法清理表中的dead tuple

1、问题描述

在业务并不繁忙的情况下发现服务器负载较高,执行SQL的速度明显缓慢。检查数据库TOP SQL,发现与表test有关,查看该表的统计值,live_tuple只有91966条,dead tuple为702346条。
 
highgo=# select n_live_tup,n_dead_tup from pg_stat_all_tables where relname = 'test';
n_live_tup | n_dead_tup
------------+------------
91966 | 702346
(1 row)

手动对该表执行VACUUM同样无法回收dead tuple,报错如下:

highgo=# vacuum(verbose,analyze) test;
INFO: 00000: vacuuming "public.test"
INFO: 00000: index "index_test_content_id" now contains 146076 row versions in 698 pages
DETAIL: 0 index row versions were removed.
175 index pages have been deleted, 126 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: 00000: index "test_pkey" now contains 146076 row versions in 568 pages
DETAIL: 0 index row versions were removed.
77 index pages have been deleted, 68 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: 00000: "test": found 0 removable, 789485 nonremovable row versions in 10178 out of 10276 pages
DETAIL: 702346 dead row versions cannot be removed yet.
There were 59707 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.04s/0.15u sec elapsed 0.20 sec.
INFO: 00000: vacuuming "pg_toast.pg_toast_16737"
INFO: 00000: index "pg_toast_16737_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.13 sec.
INFO: 00000: "pg_toast_16737": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.59 sec.
INFO: 00000: analyzing "public.test"
INFO: 00000: "test": scanned 10276 of 10276 pages, containing 90959 live rows and 702346 dead rows; 30000 rows in sample, 90959 estimated total row

2、问题原因

当我们开启一个事务后,为了实现多版本并发控制,数据库会阻止vacuum清理未结束的长事务之后产生的所有dead tuple,即使与该事务无关的表也是如此。对于未结束的长事务而言,之后产生的所有dead tuple对于该事务都是可见的,所以即使手动执行vacuum也无法进行清理。

  • Live transactions performing a write operation in any table will prevent vacuuming dead rows
    generated by commited transactions that started after first live transaction in any other table.

3、解决方案

检查数据库中存在的长事务:

highgo=#select        
usename,
datname,
state,
waiting,
xact_start,
now(),query,pid,
now()-xact_start as time
from
pg_stat_activity
where
now()-xact_start>interval '5 sec'
and query !~ '^COPY' and state<>'idle'
order by xact_start;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------
usename | highgo
datname | highgo
state | idle in transaction
waiting | f
xact_start | 2018-07-02 11:09:22.607884+08
now | 2018-07-04 16:59:43.212717+08
query | select xxxxxxxx
pid | 17924
time | 2 day 05:50:26.909971

检查该SQL执行未结束的原因,如需尽快恢复前台应用,可以在确认安全的情况下使用如下命令,将该长事务终止。

SELECT pg_terminate_backend(17924);

再次对该表手动VACUUM:

highgo=# vacuum(verbose,analyze) test;
INFO: 00000: vacuuming "public.test"
INFO: 00000: scanned index "index_test_content_id" to remove 62482 row versions
DETAIL: CPU 0.01s/0.09u sec elapsed 0.26 sec
INFO: 00000: scanned index "test_pkey" to remove 62482 row versions
DETAIL: CPU 0.01s/0.08u sec elapsed 0.19 sec
INFO: 00000: "test": removed 62482 row versions in 11041 pages
DETAIL: CPU 0.11s/1.30u sec elapsed 1.52 sec
INFO: 00000: index "index_test_content_id" now contains 91106 row versions in 698 pages
DETAIL: 62482 index row versions were removed.
346 index pages have been deleted, 152 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: 00000: index "test_pkey" now contains 91106 row versions in 568 pages
DETAIL: 62482 index row versions were removed.
244 index pages have been deleted, 50 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: 00000: "test": found 781909 removable, 87305 nonremovable row versions in 11179 out of 11215 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 777839 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.32s/2.97u sec elapsed 3.96 sec.
INFO: 00000: vacuuming "pg_toast.pg_toast_16737"
INFO: 00000: index "pg_toast_16737_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: 00000: "pg_toast_16737": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: 00000: analyzing "public.test"
INFO: 00000: "test": scanned 11215 of 11215 pages, containing 91106 live rows and 0 dead rows; 30000 rows in sample, 91106 estimated total rows
VACUUM

测试问题SQL,确保该问题不会再次发生。