Overview

Here I briefly describe a Prometheus hang I ran into. I did not get to investigate it to the end, so this is not a complete troubleshooting walkthrough, but it can serve as a pointer that storage problems can affect how Prometheus functions.

Symptoms

Figure 1: Prometheus process state
https://cdn.pyer.dev/blog-cdn/2020/12/23/06/06/28/11f276a28f75.jpg

The process shows up as running, but it provides no service: port 9090 is not accessible.

Figure 2: Prometheus API inaccessible
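The symptom can also be confirmed from a shell with a bounded TCP probe, so that the check itself cannot hang. This is only a minimal sketch: the helper name `port_open` is hypothetical, and it assumes `bash` (for its `/dev/tcp` redirection) and coreutils `timeout` are available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within a few seconds, fail otherwise (refused, filtered, or timed out).
port_open() {
    local host="$1" port="$2" wait="${3:-2}"
    # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>;
    # timeout bounds the attempt so a silent peer cannot block the shell.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For the case in this post, the call would look like `port_open 127.0.0.1 9090 && echo reachable || echo unreachable`. If the port is open but requests still hang, Prometheus also exposes the `/-/healthy` and `/-/ready` HTTP endpoints, which can be probed with any HTTP client.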

Troubleshooting

Check whether network access exists

First, check the internal state of the process to see whether it is completely stuck or whether only the interface serving the web API is not responding.

[root@liqiang.io]# lsof -p 18244

It hung and did not respond, which gave me a bad feeling. Suspecting a system-level problem, I ran top first to take a look:

[root@liqiang.io]# top

Many processes were consuming a lot of CPU.

Figure 3: Processes are heavily CPU-intensive
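Besides CPU usage, the scheduler state of the stuck process is worth a glance. The following is a minimal sketch, assuming a Linux `/proc` filesystem; the helper name `proc_state` is hypothetical and was not part of the original investigation. A process in state `D` is in uninterruptible sleep, usually waiting on I/O, which would point toward storage rather than CPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the one-letter state of a process
# (R running, S sleeping, D uninterruptible sleep, Z zombie, T stopped, ...).
proc_state() {
    local pid="$1" key value _rest
    # The status file disappears together with the process.
    [ -r "/proc/$pid/status" ] || return 1
    while read -r key value _rest; do
        # The line looks like: "State:  S (sleeping)"
        if [ "$key" = "State:" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done < "/proc/$pid/status"
    return 1
}
```

With the PID from this post the call would be `proc_state 18244`. Reading `/proc` directly avoids tools such as `lsof`, which may themselves block on a stuck process.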

First, confirm whether the Prometheus cgroup is pinned to the same CPU cores that those busy processes are using.

[root@liqiang.io]# cat /proc/18244/cgroup 
... ...
10:cpuset:/xxx/app
... ...
[root@liqiang.io]# cat /etc/cgconfig.conf | grep -A 3 app
group xxx/app {
    cpuset {
        cpuset.cpus = "4,5";
        cpuset.mems = "0-1";
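To compare a cpuset value with the per-core view in top, the list syntax (`4,5`, `0-1`) has to be read as individual core numbers. The sketch below expands such a list; the function name `expand_cpus` is hypothetical, and it only handles plain comma-separated numbers and ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a cpuset list such as "4,5" or "0-1,8"
# into one core number per line.
expand_cpus() {
    local list="$1" part start end i
    local IFS=','
    for part in $list; do
        case "$part" in
            *-*)
                # A range "a-b": print every core from a to b inclusive.
                start="${part%-*}"
                end="${part#*-}"
                i="$start"
                while [ "$i" -le "$end" ]; do
                    printf '%s\n' "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single core number.
                printf '%s\n' "$part"
                ;;
        esac
    done
}
```

The effective affinity of a running process can be cross-checked against the configuration with `grep Cpus_allowed_list /proc/18244/status`, whose value uses the same list syntax.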

Then look at the usage of the corresponding CPU cores (4 and 5) in top: the situation there is not that bad.

Figure 4: Per-core CPU usage in top (press 1)

Check whether there are system calls

Hmm!? The process doesn’t exist anymore.
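Because the target can vanish between steps, as happened here, a small guard before attaching a tracer avoids confusing errors. This is a sketch under assumptions: the post does not say which tool was going to be used, so `strace` is my assumption, and the helper names `pid_alive` and `trace_if_alive` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a process with this PID currently exists.
pid_alive() {
    [ -n "$1" ] && [ -d "/proc/$1" ]
}

# Hypothetical helper: attach strace for at most 10 seconds, but only if
# the process is still there. strace is an assumed tool, not named in the post.
trace_if_alive() {
    local pid="$1"
    if pid_alive "$pid"; then
        # -f follows threads and child processes, -p attaches to a running PID.
        timeout 10 strace -f -p "$pid"
    else
        echo "process $pid no longer exists" >&2
        return 1
    fi
}
```

For this incident the call would have been `trace_if_alive 18244`, which would have reported right away that the process was gone.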

Ending

A colleague from the test team had restarted my service...

Figure 5: Service was restarted
Figure 6: Storage service log exception

So that is where it ends. Unfortunately, my troubleshooting was left incomplete.