将 "数据与人" 设为 "星标⭐"
第一时间收到文章更新
背景
基础环境:
操作系统:DB:RetHet 7.8 存储:SSD 内存:16 G CPU核数:16CORE 数据库环境:5.7.12 事务隔离级别:RR
问题现象:
应用反馈 出现了大量的数据库连接报错, 之后恢复正常。MySQL日志显示数据库似乎崩溃了。
排查原因
07:42:44 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
key_buffer_size=8388608
read_buffer_size=16777216
max_used_connections=29
max_threads=5000
thread_count=32
connection_count=22
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
Hope that is ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f607c0072c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f6141b36e80 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
/usr/sbin/mysqld[0x770ccf]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f607c015ad0): select * from test where time>='2016-07-29 00:00:00' and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000
Connection ID (thread ID): 138760
Status: NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
首先是第一部分的信息:
mysqld got signal 11 ;
Resource temporarily unavailable
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
这一段计算了当前配置下, 大概换算了一下, 结合内存使用率的报警信息, 推测是内存耗尽造成的;
加上BP的9.6G, 有11G了, 算上MySQL本身占用的一部分内存, 确实达到了比较高的程度;
但是看了一下kernel和message, 都没有发现系统出现OOM的日志, 应该不是系统kill的;
再看看堆栈相关的信息, 在里面记录了MySQL crash时的状态;
stack_bottom = 7f6141b36e80 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
/usr/sbin/mysqld[0x770ccf]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
可以推断出当时MySQL是正在执行查询, 这些查询中有join, 也有subquery, 且查询的表包含了分区表。
同时还有分区表的使用, 看了一下当时候分区表的大小,发现当天的数据超过了BP的大小, 且用到分区表的查询走的全表扫描, 并且还有order by, 会用到sort的buffer, 且由于全表扫描的数据很多, 这个buffer有可能是需要申请满的。
综合这些信息, 基本确认是内存耗尽造成了MySQL crash。
那么根据堆栈的信息尝试还原crash时的场景:
在内存占用率很高的情况下, MySQL thread在执行较大表的查询时, 无法再申请到足够的内存(sort的buffer, join的buffer等), 因此发生了crash;
处理方式:
事情并不简单
select * from test where time>='2016-07-29 00:00:00' and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000
在后来重现的时候, 两次crash的语句中, 记录的是同样的语句, (而且堆栈的输出信息也是完全一样) , 仅仅只是时间不一样。
select * from test where time>='2016-08-09 00:00:00' and time<='2016-08-09 23:59:59' and tag in (2,3,6) order by id desc limit 2000
在被拉起来的备库上跑了一下这个语句, 结果MySQL马上就crash了。
那么明显就是这个语句的问题了, order by desc + limit, 看上去并没什么问题, 看看explain的结果
filtered在100%的情况下, rows只有1还是挺奇怪的, 一整张表只有一行数据, 但是还有这种查询一整天的语句;
表结构:关键的部分。
分区表的分区有问题。
考虑到用那条语句可以必现这个crash, 且输出的堆栈信息和之前完全一致,所以确定是这个分区表的分区缺失的前提下, 触发那个查询语句的时导致了MySQL的Crash;
处理方式:
虽然最后还是找到了问题所在, 但是最开始的时候还是被buffer和内存使用率的现象误导了。