Overview
Here I briefly describe a Prometheus hang I ran into, although my investigation of it was never completed. Even though the troubleshooting is incomplete, it can serve as a pointer: storage problems can affect how Prometheus functions.
Problem Phenomenon
Figure 1: Prometheus process state
However, no service is available and port 9090 is not accessible.
Figure 2: Prometheus API inaccessible
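The symptom above (process present, port 9090 not answering) can be narrowed down with a small probe. This is only a sketch: the function name `check_prometheus`, the default host, and the 3-second timeouts are assumptions for illustration, and it presumes `bash`, `timeout` and `curl` are installed. `/-/healthy` is Prometheus's health-check endpoint.

```shell
# Sketch: tell "nothing accepts connections" apart from
# "the port is open but HTTP does not answer".
# Host, port and timeouts are illustrative assumptions.
check_prometheus() {
    local host="${1:-127.0.0.1}" port="${2:-9090}"

    # Step 1: can a TCP connection be opened at all?
    # Uses bash's /dev/tcp pseudo-device, so no extra tools are needed.
    if ! timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "tcp-unreachable"
        return 1
    fi

    # Step 2: does the HTTP layer answer on the health endpoint?
    if curl -fsS --max-time 3 "http://${host}:${port}/-/healthy" >/dev/null 2>&1; then
        echo "healthy"
        return 0
    fi

    echo "tcp-open-http-unresponsive"
    return 2
}
```

Calling `check_prometheus 127.0.0.1 9090` on the affected host would show which of the three cases applies. An open port with no HTTP answer suggests the process is alive but stuck, rather than not listening at all.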
Troubleshooting Process
Check whether network access exists
First, check the internal state of the process to see whether it is completely stuck or whether only the interface serving the web UI is unresponsive.
[root@liqiang.io]# lsof -p 18244
It hung and did not respond, which gave me a bad feeling. Suspecting a system-level problem, I ran top first to take a look:
[root@liqiang.io]# top
I found a lot of processes using a lot of CPU.
Figure 3: Processes are heavily CPU-intensive
First, check whether the Prometheus cgroup is pinned to the same CPU cores that those busy processes are using.
[root@liqiang.io]# cat /proc/18244/cgroup
... ...
10:cpuset:/xxx/app
... ...
[root@liqiang.io]# cat /etc/cgconfig.conf | grep -A 3 app
group xxx/app {
cpuset {
cpuset.cpus = "4,5";
cpuset.mems = "0-1";
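The config file only shows what was requested. As a cross-check, the cores a running process may use can be read from `/proc`. The helper name `allowed_cpus` is an assumption for illustration; the `Cpus_allowed_list` field of `/proc/<pid>/status` is a standard Linux field.

```shell
# Sketch: print the CPU list a process is currently allowed to run on.
# Defaults to the current shell when no PID is given.
allowed_cpus() {
    local pid="${1:-$$}"
    # Cpus_allowed_list is the human-readable form of the affinity mask
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/${pid}/status"
}
```

For the process in this post, `allowed_cpus 18244` would be expected to agree with the `cpuset.cpus` value above if the cgroup is applied as configured. `taskset -cp 18244` from util-linux reports similar information.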
Then I looked at the usage of the corresponding CPU cores (4 and 5) in top, and the situation was not that bad.
Figure 4: Per-core CPU usage in top (press 1)
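The per-core view in top can also be approximated from `/proc/stat`, which makes it easy to separate busy time from I/O wait on just the cores of interest. This is a rough sketch: the function name `core_usage` and the sampling interval are assumptions, and the percentages are rounded.

```shell
# Sketch: print "<busy%> <iowait%>" for one CPU core over a short interval.
# /proc/stat columns after the label: user nice system idle iowait irq softirq steal ...
core_usage() {
    local cpu="$1" interval="${2:-1}"
    local a b
    a=$(grep "^cpu${cpu} " /proc/stat) || return 1   # first sample
    sleep "$interval"
    b=$(grep "^cpu${cpu} " /proc/stat) || return 1   # second sample
    printf '%s\n%s\n' "$a" "$b" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i     # user through steal
            tot[NR] = total; idle[NR] = $5; iow[NR] = $6
        }
        END {
            dt = tot[2] - tot[1]
            if (dt <= 0) { print "0 0"; exit }
            d_idle = idle[2] - idle[1]
            d_iow  = iow[2]  - iow[1]
            printf "%.0f %.0f\n", 100 * (dt - d_idle - d_iow) / dt, 100 * d_iow / dt
        }'
}
```

Running `core_usage 4` and `core_usage 5` would cover the two cores from the cpuset above. A low busy figure alongside a high iowait figure would fit a storage problem better than CPU starvation.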
Check whether there are system calls
Hmm!? The process doesn’t exist anymore.
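Had the process still been alive, a quick look at its scheduler state would have been a reasonable step before tracing system calls. The sketch below uses a hypothetical helper name, `proc_state`, and reads the standard `State:` field from `/proc/<pid>/status`.

```shell
# Sketch: report the one-letter scheduler state of a process,
# or "gone" if it no longer exists (as happened here).
proc_state() {
    local pid="$1"
    if [ ! -r "/proc/${pid}/status" ]; then
        echo "gone"
        return 1
    fi
    # The line looks like "State:  S (sleeping)"; print only the letter
    awk '/^State:/ {print $2}' "/proc/${pid}/status"
}
```

A state of `D` (uninterruptible sleep) usually means the task is blocked inside the kernel, commonly on I/O, which would fit a storage problem. For a process that is still running, `strace -p <pid>` run as root can then show which system call it is sitting in.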
Ending
A colleague from the test team had restarted my service...
Figure: Service was restarted
Figure: Storage service log exception
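So the investigation ended there, and unfortunately my troubleshooting was never completed. Still, the exceptions in the storage service log suggest that storage is the direction to look in.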