PIDs limit — How to Change K8S POD’s Limit
In this article I describe how to change a Pod's pids.max value, i.e. the limit on the number of PIDs (and therefore on the number of processes and threads) that can exist in the Pod.
We will go through the motivation, which may be similar to yours, and take a brief look at the cgroups structure created for Pods.
Motivation (MongoDB’s Mongos Processes)
I came to this issue because of a sharded MongoDB cluster whose Mongos Pods hit the limit on the number of processes they could create.
Mongos serves each connection with its own thread, and on Linux every thread consumes a PID, so once you reach the PID limit, new connections fail!
So we had to be able to change the PID limit for the Mongos Pods.
In the rest of the article we will go through the steps to make this change.
CGroups and PIDs Limit
Containers and Pods in Kubernetes are isolated via namespaces and cgroups. Cgroups enforce the resource quota given to each process and, in the Kubernetes case, to each container and Pod.
Today there are two versions of the cgroups API: v1 and v2.
The transition is still under way; for example, EKS 1.27 still uses cgroups v1.
Every process belongs to a specific cgroup, which defines its limits and other resource details.
A new cgroup is created by creating a new directory under the cgroups filesystem, as described in the man page:
A cgroup filesystem initially contains a single root cgroup, '/',
which all processes belong to.
A new cgroup is created by creating a directory in the cgroup filesystem
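To see which pids cgroup a process belongs to, you can read /proc/<pid>/cgroup. Below is a minimal sketch for the cgroups v1 layout, where each line has the form hierarchy-ID:controllers:path. The function name pids_cgroup_of and the CGROUP_FILE override are my own, added only so the function can be tried on a sample file.

```shell
#!/bin/sh
# Print the pids-controller cgroup path of a process (cgroups v1 layout).
# Each line of /proc/<pid>/cgroup looks like: hierarchy-ID:controllers:path
# CGROUP_FILE is a hypothetical override for trying this on a sample file.
pids_cgroup_of() {
    pid="${1:-self}"
    file="${CGROUP_FILE:-/proc/$pid/cgroup}"
    # The controller list is comma separated, so match "pids" as a whole entry,
    # then strip the first two colon-separated fields and print the path.
    awk -F: '$2 ~ /(^|,)pids(,|$)/ { sub(/^[^:]*:[^:]*:/, ""); print }' "$file"
}
```

Calling pids_cgroup_of with no argument inspects the current shell; passing a PID inspects that process. Inside a Pod the printed path should sit somewhere under kubepods.slice.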
Kubernetes PODs CGroups
Kubernetes creates the needed cgroups via the Kubelet.
In EKS the Kubelet does this by creating a sub-cgroup (a new directory) named kubepods.slice (as seen on EKS v1.26/7).
The default value for the max PIDs is defined at /sys/fs/cgroup/pids/kubepods.slice/pids.max
Let’s inspect the cgroups structure on our node instance (EKS v1.26).
Under the path /sys/fs/cgroup/pids/kubepods.slice/ we find the sub-cgroups of the Pods
(why the path starts with /host will become clear later, as we mount the node's host path into our Pod at /host).
Looking at a specific Pod sub-cgroup, we see the containers' cgroups (cri-containerd- ).
Looking at a specific container sub-cgroup (in the Pod), we see its cgroup files.
Inspecting the pids.max of the different sub-cgroups:
Note that for the Pod we get pids.max=32768.
Hence our Pod’s limit on the number of processes is 32768.
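The inspection above can be sketched as a small helper that walks a cgroup directory and prints every pids.max it finds. The function name list_pids_max is hypothetical, and the default path assumes the host filesystem is mounted at /host, as it is later in this article.

```shell
#!/bin/sh
# Print "<pids.max value> <cgroup directory>" for a cgroup and all of its sub-cgroups.
# The default root assumes the node's filesystem is mounted at /host (an assumption
# taken from this article's setup); pass another directory to inspect elsewhere.
list_pids_max() {
    root="${1:-/host/sys/fs/cgroup/pids/kubepods.slice}"
    find "$root" -type f -name pids.max | sort | while read -r f; do
        printf '%s %s\n' "$(cat "$f")" "$(dirname "$f")"
    done
}
```

Each output line pairs a limit with the cgroup it applies to, so you can compare the kubepods.slice value with the values of the Pod and container sub-cgroups.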
Implementing the Change in Kubernetes
Now we know that in order to change the max PIDs limit for Pods on a specific node, we need to change the following:
- The system default for the PID limit, hence change /proc/sys/kernel/pid_max
- The Pods sub-cgroup default limit for pids.max, hence change /sys/fs/cgroup/pids/kubepods.slice/pids.max
Making both of these changes gives us the needed change in the PID limit.
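The two changes listed above can be sketched as a shell function. This is a hypothetical sketch and not the actual pids-script.sh; the function name set_pids_limits, the /host default and the 65536 value are illustrative assumptions. It raises the system-wide pid_max only when it is lower than the requested limit, so it never shrinks a larger existing value.

```shell
#!/bin/sh
# Hypothetical sketch of the two writes; not the article's actual pids-script.sh.
# $1: where the node's root filesystem is mounted (assumed /host).
# $2: the new limit (65536 is an illustrative value, not a recommendation).
set_pids_limits() {
    host_root="${1:-/host}"
    new_limit="${2:-65536}"
    pid_max="$host_root/proc/sys/kernel/pid_max"
    pods_max="$host_root/sys/fs/cgroup/pids/kubepods.slice/pids.max"

    # Fail early if either file is missing or not writable.
    for f in "$pid_max" "$pods_max"; do
        if [ ! -w "$f" ]; then
            echo "cannot write $f" >&2
            return 1
        fi
    done

    # Raise the system-wide limit first, but never lower it.
    current=$(cat "$pid_max")
    if [ "$current" -lt "$new_limit" ]; then
        echo "$new_limit" > "$pid_max"
    fi

    # Then set the limit of the cgroup that holds all Pods.
    echo "$new_limit" > "$pods_max"
}
```

A script built on this would end with a call such as set_pids_limits /host 65536, run from a container that has the host filesystem mounted.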
Note that these paths are available only from the host root and not from the container's filesystem, because the Pod and the host have different mount namespaces.
Let’s see how we use this info to change the pids.max values.
Enforcing the Change (via a POD)
We will use a Pod containing a script that enforces the change.
The Pod consists of two containers: a main container and an initContainer.
The initContainer runs a script, pids-script.sh (we will go over it later).
This initContainer mounts the host path /, meaning it can see the whole host filesystem we referenced above!
It is not a privileged container.
Why is the initContainer not privileged?
The initContainer needs to be able to write to the specific host paths of the pids cgroup, but it does not need to share any namespaces or hold any special system call capabilities.
As you will see, we drop all the Linux capabilities from the Pod.
The host path mount is what lets us change /proc/sys/kernel/pid_max, which is not available in the container filesystem, as well as the sub-cgroups of the Pods.
Pod Structure
The Pod is composed of two containers: an initContainer, which runs the script, and a container that keeps the Pod alive (hence preventing ongoing restarts of the Pod).
The initContainer runs for a few milliseconds, which is the time it takes to run the script, and then completes.
The Pod's main container is based on the Kubernetes pause image, which does nothing but keep the Pod alive.
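To make the structure concrete, here is a sketch of such a Pod, wrapped in a shell function that prints the manifest. Every name in it is an assumption of mine: the Pod name, the busybox image, the pause image tag, and the pids-script ConfigMap that is supposed to hold pids-script.sh. Whether the writes succeed with all capabilities dropped may depend on your node and container runtime, so treat it as a starting point rather than a tested manifest.

```shell
#!/bin/sh
# Print a Pod manifest sketch. All names (pids-limit, set-pids-limit, the images,
# the pids-script ConfigMap) are hypothetical; adjust them to your cluster.
render_pod_manifest() {
cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pids-limit
spec:
  initContainers:
    # Runs the script once, then completes.
    - name: set-pids-limit
      image: busybox:1.36
      command: ["sh", "/scripts/pids-script.sh"]
      securityContext:
        privileged: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: host-root
          mountPath: /host
        - name: script
          mountPath: /scripts
  containers:
    # Does nothing; only keeps the Pod alive so it is not restarted.
    - name: pause
      image: registry.k8s.io/pause:3.9
  volumes:
    - name: host-root
      hostPath:
        path: /
    - name: script
      configMap:
        name: pids-script
EOF
}
```

You could pipe the output of render_pod_manifest into kubectl apply -f - to try it. Note that only the initContainer mounts the host path; the pause container gets no volume mounts at all.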
Pause Container
The pause container is a container that exists in each Pod; it acts like a template or parent container from which all the new containers in the Pod inherit their namespaces.
The pause container starts, then goes to “sleep”.
We use pause as a trick to keep the Pod alive, doing nothing, without restarting endlessly, since our script completes in less than 1 second.
Security Aspect
Since we use a HostPath volume (which opens a path for malicious actors), we carry some risk.
Because only the initContainer uses the HostPath, and the initContainer is short-lived, the risk is greatly reduced.
We are exposed through the HostPath only for a very short time: from the creation of the initContainer until pids-script.sh finishes running.
The Pod actually needs to run only once per node, and every new node should have the Pod run on it, hence this should be part of a Kubernetes DaemonSet.
Summary
We have seen that we can change the PIDs limit for Kubernetes Pods,
and do so in a reasonably safe manner, as the initContainer is very short-lived and runs once.
I guess that if you use applications which for some reason need many processes, you will have to make some change to the Pod’s limits, and this solution would help.
FYI, if you are not using a managed Kubernetes, you might be able to do this via the Kubelet config, defining the limits of pids.max.
On the other hand, with the approach shown here you can change only specific nodes and not all of the cluster nodes (watch out for fork bombs).
Lastly, to close the story:
in my specific case I deployed the Pod as a DaemonSet, adding affinity and taints to reach the proper nodes of the Mongos and change their limit, hence allowing a higher load of connections.
I’ll be happy to get your feedback on the solution, or collaborate if you have other ideas!
