问题描述
在自建的MySQL或者是使用RDS MySQL时,我们可能会遇到CPU 100%的问题,如何去troubleshooting
分析解决
对于数据CPU 100%的问题来说,一般都是慢SQL致的,我们可以从如下方面来排查:
1. 查看当前数据库正在运行的语句
SELECT
trx_mysql_thread_id,
trx_id,
trx_state,
trx_started,
trx_query
FROM
INFORMATION_SCHEMA.INNODB_TRX\G
或者是
show full processlist
cpu消耗过大通常情况下都是有慢sql造成的,这里的慢sql包括全表扫描,扫描数据量过大,内存排序,磁盘排序,锁争用等待,等待磁盘io等。
重点关注SQL语句的执行状态:show processlist后出现大量的语句,Copying to tmp table,Copying to tmp table on disk,Sorting result,locked,Using filesort,都是有潜在有性能问题的SQL, 一般出现在结果集巨大,SQL不够优化导致大量的临时表导致的
2. 查看锁等待
查看锁情况:
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
TIMESTAMPDIFF( SECOND, r.trx_wait_started, CURRENT_TIMESTAMP ) AS wait_time,
r.trx_query AS waiting_query,
l.lock_table AS waiting_table_lock,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
SUBSTRING( p.HOST, 1, INSTR( p.HOST, ':' ) - 1 ) AS blocking_host,
SUBSTRING( p.HOST, INSTR( p.HOST, ':' ) + 1 ) AS blocking_port,
IF
( p.command = "Sleep", p.time, 0 ) AS idle_in_trx,
b.trx_query AS blocking_query
FROM
INFORMATION_SCHEMA.INNODB_LOCK_WAITS AS w
INNER JOIN INFORMATION_SCHEMA.INNODB_TRX AS b ON b.trx_id = w.blocking_trx_id
INNER JOIN INFORMATION_SCHEMA.INNODB_TRX AS r ON r.trx_id = w.requesting_trx_id
INNER JOIN INFORMATION_SCHEMA.INNODB_LOCKS AS l ON w.requested_lock_id = l.lock_id
LEFT JOIN INFORMATION_SCHEMA.PROCESSLIST AS p ON p.id = b.trx_mysql_thread_id
ORDER BY
wait_time DESC \G
3. 查看slow log
如果没有配置的话,需要进行如下配置
还可以查看下慢日志,但是您当前没有开启,需要您先手动配置下: 需要配置的参数如下:
low_query_log:要创建慢速查询日志,请设置为 1。默认值为 0。
long_query_time:要防止在慢速查询日志中记录快速运行的查询,请指定需要记录的最短查询执行时间值,以秒为单位。默认值为 10 秒;最小值为 0。将 long_query_time 设置为 0.1 可防止记录任何运行时间少于 100 毫秒的查询。
log_queries_not_using_indexes=1:要将所有不使用索引的查询记录到慢速查询日志,请设置为 1。默认值为 0。将记录不使用索引的查询,即使它们的执行时间小于 long_query_time 参数的值。
log_output=FILE
然后在RDS点击,修改,将日志导出到cloudwatch。
请查阅AWS RDS Logs
4. 可以及时将show engine innodb status打印出来
show engine innodb status的输出非常强大,其中包含了
信号量等待信息
有关最新外键和死锁错误的数据信息
事务的锁等待信息
活跃事务持有的表锁、记录锁信息
等等...
5. Top SQL耗时分析
SELECT
EVENT_ID,
TRUNCATE ( TIMER_WAIT / 1000000000000, 6 ) AS Duration,
SQL_TEXT
FROM
PERFORMANCE_SCHEMA.events_statements_history_long
ORDER BY
Duration DESC
LIMIT 5;
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EVENT_ID | Duration | SQL_TEXT |
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1436 | 0.033460 | SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT FROM performance_schema.events_statements_history_long order by Duration desc limit 5 |
| 4323480 | 0.003146 | update node set role = 'main', heartbeat = current_timestamp where id = 1960 |
| 4445223 | 0.002929 | update node set role = 'main', heartbeat = current_timestamp where id = 1960 |
| 4689042 | 0.002895 | update node set role = 'main', heartbeat = current_timestamp where id = 1960 |
| 4380520 | 0.002753 | update node set role = 'main', heartbeat = current_timestamp where id = 1960 |
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
根据对应的top5耗时长的sql的event_id查看该sql具体耗时在哪个地方
SELECT
event_name AS Stage,
TRUNCATE ( TIMER_WAIT / 1000000000000, 6 ) AS Duration
FROM
PERFORMANCE_SCHEMA.events_stages_history_long
WHERE
NESTING_EVENT_ID = 1436;
+--------------------------------+----------+
| Stage | Duration |
+--------------------------------+----------+
| stage/sql/starting | 0.000097 |
| stage/sql/checking permissions | 0.000014 |
| stage/sql/Opening tables | 0.000019 |
| stage/sql/init | 0.000026 |
| stage/sql/System lock | 0.000009 |
| stage/sql/optimizing | 0.000006 |
| stage/sql/statistics | 0.000014 |
| stage/sql/preparing | 0.000012 |
| stage/sql/Sorting result | 0.000007 |
| stage/sql/executing | 0.000005 |
| stage/sql/Sending data | 0.000009 |
| stage/sql/Creating sort index | 0.028948 |
| stage/sql/end | 0.000005 |
| stage/sql/query end | 0.000006 |
| stage/sql/closing tables | 0.000008 |
| stage/sql/freeing items | 0.000019 |
| stage/sql/cleaning up | 0.000001 |
+--------------------------------+----------+
6.打开RDS 增强监控
Amazon RDS 为数据库实例运行的操作系统 (OS) 实时提供指标 请查阅AWS 增强监控
7. 打开performance insight
Performance Insights 在现有 Amazon RDS 监控功能的基础上进行了扩展,以便通过示意图展示您的数据库的性能并帮助您分析影响性能的任何问题。利用 Performance Insights 控制面板,您可以可视化数据库负载并按等待状态、SQL 语句、主机或用户来筛选负载。
请查询PI官方文档