Hey Anatole! Thanks a lot. I had Cursor try to set up the environment for Kind again; it also tried K3s. But it's still failing unfortunately, though at a different point. It's really too bad, because the Cursor Cloud setup is now much better than GitHub workflows, with direct access and a GUI. But GitHub workflows come with Kind support (it's actually installed out of the box) and the Docker daemon is also a bit easier to use there. Anyway, here are its new investigations:
Kubernetes-in-Docker: Deep Technical Analysis
Investigation of the Cursor Cloud VM incompatibility with Kind/K3D/K3s
1. Environment Fingerprint
Kernel: 6.12.58+ (Firecracker micro-VM)
PID 1: /pod-daemon (custom Cursor init, not systemd)
Cmdline: systemd.unified_cgroup_hierarchy=1
Root FS: /dev/vda (virtio), overlay NOT supported as upperdir
cgroup: v2 unified (nsdelegate, memory_recursiveprot)
iptables: legacy mode (nftables modules absent)
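The fingerprint above can be collected with a short read-only probe (a sketch; the values quoted in this report came from the Cursor Cloud VM, other machines will differ):

```shell
#!/bin/sh
# Read-only environment fingerprint probe (sketch). Paths are the standard
# Linux ones; nothing here modifies the system.
echo "kernel:  $(uname -r)"
echo "pid1:    $(tr '\0' ' ' < /proc/1/cmdline 2>/dev/null || echo unknown)"
echo "cmdline: $(cat /proc/cmdline 2>/dev/null || echo unknown)"
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "cgroup:  v2, controllers: $(cat /sys/fs/cgroup/cgroup.controllers)"
  echo "         delegated:       $(cat /sys/fs/cgroup/cgroup.subtree_control)"
else
  echo "cgroup:  v1 or not mounted"
fi
if command -v iptables >/dev/null 2>&1; then
  echo "iptables: $(iptables --version 2>/dev/null || echo unavailable)"
else
  echo "iptables: not installed"
fi
```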
2. Failure Cascade — Three Layers
Running Kubernetes in Docker hits three independent failure layers. Each must be resolved for pods to run.
Layer 1: cgroup subtree_control is frozen (blocks K3s agent, Kind)
Observation:
/sys/fs/cgroup/cgroup.controllers: cpuset cpu io memory hugetlb pids
/sys/fs/cgroup/cgroup.subtree_control: cpuset cpu pids
Six controllers exist, but only three are delegated. The memory, io, and hugetlb controllers are stuck.
Root cause: PID 1 (/pod-daemon) resides directly in the root cgroup. cgroup v2’s “no internal processes” rule prevents modifying subtree_control while any process lives directly in the cgroup. PID 1 cannot be moved (it’s the cgroup-namespace init):
# Reproducible:
echo 1 > /sys/fs/cgroup/child/cgroup.procs # → ENOTSUP
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control # → ENOTSUP
Impact:
- K3s agent:
level=fatal msg="failed to find memory cgroup (v2)" — K3s checks cgroup.controllers of its container’s cgroup, finds only cpuset cpu pids, aborts.
- Kind: systemd inside node containers needs memory controller for process management.
Who sets this: The Firecracker host writes subtree_control for the VM’s cgroup before spawning /pod-daemon. The VM can never change it afterward.
Layer 2: Child cgroups are domain invalid (blocks runc container creation)
Observation:
mkdir /sys/fs/cgroup/test_child
cat /sys/fs/cgroup/test_child/cgroup.type # → domain invalid
echo $PID > /sys/fs/cgroup/test_child/cgroup.procs # → ENOTSUP
echo "+cpu" > /sys/fs/cgroup/test_child/cgroup.subtree_control # → ENOTSUP
New child cgroups start as domain invalid and are completely non-functional. Neither cgroup.procs (domain operation) nor cgroup.subtree_control works. The only way to make a child cgroup usable is to set it to threaded mode:
echo "threaded" > /sys/fs/cgroup/test_child/cgroup.type # succeeds
echo $PID > /sys/fs/cgroup/test_child/cgroup.threads # succeeds
This is why Docker’s cgroupfs driver creates all container cgroups as threaded — it’s the only functional path. But threaded cgroups cannot run Kubernetes components.
Impact on runc (container creation):
When K3s kubelet schedules a pod, runc tries to create a domain cgroup for the container:
runc create failed: unable to apply cgroup configuration:
cannot enter cgroupv2 "/sys/fs/cgroup/k8s.io" with domain controllers
-- it is in an invalid state: unknown
This is the final blocker — even when the K3s control plane and kubelet are running, no pod containers can actually be created.
Impact on Kind: systemd inside Kind node containers needs domain-type cgroups with cgroup.procs support. In a threaded cgroup, cgroup.procs returns ENOTSUP:
Failed to create /init.scope control group: Structure needs cleaning
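The Kind failure reduces to a tiny reproducer (a sketch; `repro_threaded` is a hypothetical scratch cgroup name, and the script needs root plus a writable cgroup v2 mount, reporting "skipped" otherwise):

```shell
#!/bin/sh
# Reproducer sketch for the threaded-cgroup behaviour systemd hits inside a
# Kind node. Needs root and a writable cgroup v2 mount; no-op elsewhere.
CG=/sys/fs/cgroup/repro_threaded   # hypothetical scratch cgroup
repro="skipped: need root and a writable cgroup v2 mount"
if [ "$(id -u)" = 0 ] && mkdir "$CG" 2>/dev/null; then
  echo threaded > "$CG/cgroup.type" 2>/dev/null || true
  # In a threaded cgroup, process-level attachment via cgroup.procs is the
  # operation that fails for systemd; thread attachment uses cgroup.threads.
  if echo $$ > "$CG/cgroup.procs" 2>/dev/null; then
    repro="cgroup.procs accepted a process (domain cgroups work here)"
  else
    repro="cgroup.procs write rejected (what systemd sees inside Kind)"
  fi
  rmdir "$CG" 2>/dev/null || true
fi
echo "$repro"
```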
Layer 3: Missing kernel modules (blocks CNI networking)
Observation:
sudo iptables -t nat -A OUTPUT -j ACCEPT -m comment --comment "test"
# → Warning: Extension comment revision 0 not supported, missing kernel module?
# → iptables: No chain/target/match by that name.
The xt_comment iptables module is missing from the kernel. This breaks every CNI plugin that uses iptables NAT rules with comments (bridge, flannel, calico all use this).
Additionally:
vxlan kernel module: absent (flannel VXLAN backend fails)
overlay filesystem: present in /proc/filesystems but the root filesystem doesn’t support it as upperdir (likely tmpfs or similar)
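These observations can be re-checked with a read-only probe (a sketch; `modprobe -n` does a dry-run resolution only and loads nothing):

```shell
#!/bin/sh
# Probe the networking/storage prerequisites named above (sketch).
if grep -qw overlay /proc/filesystems 2>/dev/null; then
  echo "overlay: listed in /proc/filesystems"
else
  echo "overlay: absent"
fi
for mod in xt_comment vxlan; do
  if grep -qw "$mod" /proc/modules 2>/dev/null; then
    echo "$mod: loaded"
  elif command -v modprobe >/dev/null 2>&1 && modprobe -n "$mod" 2>/dev/null; then
    echo "$mod: loadable"
  else
    echo "$mod: missing (or module loading disabled)"
  fi
done
```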
3. Approaches Tested
| # | Approach | Result | Failure Point |
|---|----------|--------|---------------|
| 1 | Kind v0.22.0 (default) | Failed | systemd can't create /init.scope (Layer 2) |
| 2 | Kind v0.27.0 (latest) | Failed | Same Layer 2 failure |
| 3 | K3D (K3s in Docker) | Failed | K3s fatal: no memory cgroup (Layer 1+2) |
| 4 | Mount cgroup v1 (mount -t cgroup -o memory) | Failed | dmesg: cgroup: Unknown subsys name 'memory' — v1 memory disabled at kernel level |
| 5 | Mount fresh cgroup v2 | Failed | Same hierarchy, same frozen subtree_control |
| 6 | unshare --cgroup (new namespace) | Failed | Same hierarchy visible, same limitations |
| 7 | Docker cgroupns=private | Failed | Container sees threaded cgroup, memory absent |
| 8 | Docker cgroupns=host | Failed | Container sees host root, still can't modify subtree_control |
| 9 | K3s in Docker with kubelet bypass flags | Failed | failed to evacuate root cgroup (threaded cgroup.procs ENOTSUP) |
| 10 | K3s direct on host, --disable-agent | Works | Control plane works: API server, scheduler, controller-manager all run |
| 11 | K3s direct on host, full (with agent) | Partial | Kubelet starts, node registers as Ready, but runc can't create pod containers (Layer 2) |
| 12 | Fake cgroup via tmpfs | Not viable | Would need to fake the entire cgroup v2 API surface |
Best result achieved:
$ sudo k3s server --snapshotter=native --flannel-backend=none \
--disable=traefik,servicelb,metrics-server \
--kubelet-arg=cgroups-per-qos=false \
--kubelet-arg=enforce-node-allocatable="" \
--write-kubeconfig-mode=644
$ kubectl get nodes
NAME     STATUS   ROLES                  AGE   VERSION
cursor   Ready    control-plane,master   33s   v1.31.5+k3s1
$ kubectl get pods -A
NAMESPACE     NAME          READY   STATUS              RESTARTS   AGE
kube-system   coredns-...   0/1     ContainerCreating   0          28s
Node is Ready. Pods are scheduled but runc can’t create containers because of the domain invalid cgroup issue (Layer 2).
4. What The Host Must Change
Change 1 (P0): Delegate all controllers in subtree_control
Before spawning /pod-daemon, the host must write:
echo "+cpuset +cpu +io +memory +hugetlb +pids" \
> /sys/fs/cgroup/<vm-cgroup>/cgroup.subtree_control
Why: This is the single configuration line that determines which controllers child cgroups can use. It must be set before any process is placed in the cgroup (before PID 1 starts), because once PID 1 exists in the root, the “no internal processes” rule prevents modification.
Implications:
- Memory accounting overhead: ~1-2% additional memory usage per cgroup (kernel tracks RSS, cache, swap per cgroup)
- IO accounting overhead: minimal, kernel already tracks block IO
- No security downside: these controllers only enable resource visibility and limits, they don’t grant capabilities
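A small check script makes the "one line" verifiable from inside the VM after the fix (a sketch; `missing_controllers` is a helper invented here, not part of any tool):

```shell
#!/bin/sh
# Verify Change 1 from inside the VM (sketch): the root subtree_control
# should list all six controllers once the host delegates them.
missing_controllers() {  # $1 = wanted list, $2 = present list
  out=""
  for c in $1; do
    case " $2 " in *" $c "*) ;; *) out="$out$c ";; esac
  done
  printf '%s' "$out"
}
want="cpuset cpu io memory hugetlb pids"
have=$(cat /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null)
miss=$(missing_controllers "$want" "$have")
if [ -z "$miss" ]; then
  echo "delegation OK: $have"
else
  echo "still missing: $miss"
fi
```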
Change 2 (P0): Ensure domain-type child cgroups work
The current behavior where child cgroups are domain invalid and can’t accept processes needs investigation. This may be caused by:
- Insufficient controller delegation: if subtree_control doesn't delegate memory, the kernel may refuse domain operations on children. Fixing Change 1 may automatically fix this.
- Root cgroup type domain threaded: this type allows both domain and threaded children, but in practice only threaded works. The host may need to either:
  - Create the VM's cgroup in pure domain mode (not domain threaded)
  - Or pre-create a child cgroup in domain mode and set it as the cgroup namespace root
Verification after fix:
mkdir /sys/fs/cgroup/test
cat /sys/fs/cgroup/test/cgroup.type # Should be: domain (or domain invalid → domain after accepting proc)
echo $$ > /sys/fs/cgroup/test/cgroup.procs # Should succeed
rmdir /sys/fs/cgroup/test
Change 3 (P1): Enable xt_comment and VXLAN kernel modules
The kernel needs these modules for Kubernetes networking:
xt_comment: required by every major CNI plugin for iptables rule management
vxlan: required by Flannel VXLAN backend (the most common K8s network)
Either:
- Compile these into the kernel (CONFIG_NETFILTER_XT_MATCH_COMMENT=y, CONFIG_VXLAN=y)
- Or load them as modules if the kernel supports module loading (currently nomodule is in the kernel cmdline)
Workaround if modules can’t be added: Use CNI plugins that don’t need iptables comments (e.g., ptp without ipMasq), or use host-gw networking instead of VXLAN. This is fragile.
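As a concrete sketch of that workaround (file names and the subnet are illustrative, not from the report): a minimal ptp CNI config with ipMasq disabled avoids the NAT rules that need xt_comment, at the cost of requiring routable pod addresses:

```shell
#!/bin/sh
# Write a minimal ptp CNI config with masquerading disabled (sketch).
# Demo path /tmp/cni-demo stands in for /etc/cni/net.d; the subnet is
# illustrative.
conf='{
  "cniVersion": "0.4.0",
  "name": "ptp-nomasq",
  "type": "ptp",
  "ipMasq": false,
  "ipam": { "type": "host-local", "subnet": "10.42.0.0/24" }
}'
mkdir -p /tmp/cni-demo
printf '%s\n' "$conf" > /tmp/cni-demo/10-ptp.conf
echo "wrote /tmp/cni-demo/10-ptp.conf"
```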
Change 4 (P1): Enable overlay2 filesystem support
dmesg: overlay: filesystem on /upper not supported as upperdir
The root filesystem doesn’t support overlay mounts. This forces Docker and containerd to use fuse-overlayfs (userspace, slow) or native (copies all layers, very slow and disk-heavy).
Fix: Use ext4 or xfs for the VM’s root filesystem with d_type support enabled. The kernel already has CONFIG_OVERLAY_FS compiled in.
Change 5 (P2): Consider systemd as PID 1
With systemd as PID 1:
- Docker can use the systemd cgroup driver (cleaner cgroup management)
- cgroup delegation works via Delegate=yes in systemd unit files
- The domain threaded vs threaded issue may not arise
This is a significant architectural change and may not be desired. It’s not strictly necessary if Changes 1-2 are implemented.
5. What Would Work TODAY (Without Host Changes)
K3s control plane only (no pods)
k3s server --disable-agent --write-kubeconfig-mode=644
This gives you a working Kubernetes API server for testing manifests, RBAC, CRDs, etc. No pods will actually run.
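For example, server-side dry-runs exercise validation and admission without a single node (a sketch assuming K3s's default kubeconfig path; it degrades to a message when kubectl is absent):

```shell
#!/bin/sh
# Exercise the control-plane-only API server (sketch). Assumes the default
# K3s kubeconfig path; prints a message instead when kubectl is missing.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
if command -v kubectl >/dev/null 2>&1; then
  # Validate an object server-side without persisting or scheduling anything.
  kubectl create configmap demo --from-literal=key=value \
    --dry-run=server -o yaml || echo "API server not reachable"
else
  echo "kubectl not installed; nothing to exercise"
fi
```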
Docker containers directly (current approach)
Docker itself works because it uses threaded cgroups. All services that can run as Docker containers (which is everything in Big Boom Phase 1) work perfectly.
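This claim is easy to cross-check from a shell (a sketch; it starts a throwaway alpine container and reads its cgroup.type via /proc, skipping cleanly when Docker is unavailable):

```shell
#!/bin/sh
# Read the cgroup.type of a live Docker container (sketch) to see which
# cgroup mode the runtime is using. Skips cleanly without Docker.
result="skipped: docker not available"
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  cid=$(docker run -d --rm alpine sleep 10)
  pid=$(docker inspect -f '{{.State.Pid}}' "$cid")
  # cgroup v2 entries in /proc/<pid>/cgroup look like "0::/path".
  cgrel=$(awk -F: '$1=="0"{print $3}' /proc/"$pid"/cgroup)
  result="cgroup.type=$(cat /sys/fs/cgroup"$cgrel"/cgroup.type 2>/dev/null)"
  docker rm -f "$cid" >/dev/null 2>&1 || true
fi
echo "$result"
```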
6. Summary
| Failure | Layer | Fix Owner | Fix Complexity | Impact |
|---------|-------|-----------|----------------|--------|
| memory controller not delegated | cgroup subtree_control | Host provisioning | Low (one-line change) | Unblocks K3s memory check |
| domain invalid cgroups unusable | cgroup hierarchy | Host provisioning | Low-Medium (may be automatic with the fix above) | Unblocks runc container creation |
| xt_comment module missing | Kernel config | Kernel build | Medium | Unblocks CNI networking |
| overlay2 unsupported | Filesystem | VM image | Medium | Enables fast container storage |
| vxlan module missing | Kernel config | Kernel build | Medium | Enables Flannel VXLAN |
| systemd not PID 1 | VM architecture | VM image | High | Cleaner cgroup management |
Minimal viable fix: Changes 1 + 3 (delegate all controllers + enable xt_comment module). This would likely enable K3s running directly on the host with basic networking.
Full fix: Changes 1 + 2 + 3 + 4 would enable both Kind and K3D with full functionality.