
Driver for HPC Environments #1393


Problem Statement

Current OpenShell compute drivers (Docker, Podman, MicroVM, Kubernetes) can be difficult to integrate into large-scale HPC environments (e.g., 20k+ node Slurm/LSF deployments) due to a few practical constraints:

Infrastructure Policies: Enterprise IT guidelines frequently restrict the deployment of new daemons or orchestrators across compute nodes.

NFS Compatibility: Rootless Podman relies on fuse-overlayfs and user-namespace UID mappings. These can conflict with standard NFS setups and often require global /etc/subuid provisioning across the cluster.

Runtime Overhead: MicroVMs introduce virtualization overhead that may not be ideal for high-throughput, latency-sensitive batch jobs.

It would be highly beneficial to explore a daemonless compute driver that enables lightweight deployments while naturally aligning with existing NFS permissions and scheduler workflows.

Proposed Design

We propose the implementation of a new daemonless compute driver that leverages HPC-native or lightweight sandboxing primitives.

Possible Underlying Technologies:

Apptainer: As a widely adopted tool in HPC, it handles NFS and unprivileged execution well without requiring cluster-wide infrastructure changes. Implementing an Apptainer driver would require exploring how smoothly it can integrate with OpenShell's Landlock/seccomp policy engine while minimizing container-image and OCI overhead; a rough driver-side launch sketch follows these two options.

Bubblewrap (bwrap): A low-level primitive that could serve as a lightweight, host-process wrapper. It avoids OCI overhead and allows for direct policy hook-ins, though it would require building custom lifecycle management and offers a lighter isolation boundary.
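
To make the Apptainer path concrete, here is a minimal sketch of how a daemonless driver could shell out to the CLI with std::process::Command. The flags used (instance start, --network none, --bind, --env, docker:// pulls) are the ones exercised in the investigation below; the SandboxSpec struct, paths, image, environment variable, and instance name are illustrative assumptions, not an existing OpenShell API.

```rust
use std::process::Command;

/// Hypothetical, minimal launch parameters; a real driver spec would carry much more.
struct SandboxSpec<'a> {
    instance_name: &'a str, // apptainer instance name used for later list/stop calls
    image: &'a str,         // OCI reference, pulled on the fly via docker://
    workspace: &'a str,     // host directory, bind-mounted read-write at /sandbox
    supervisor: &'a str,    // host path of the supervisor binary, bind-mounted read-only
}

/// Start a long-lived sandbox with `apptainer instance start`; no daemon is involved,
/// and the instance can later be inspected with `instance list` and torn down with
/// `instance stop <name>`.
fn start_apptainer_instance(spec: &SandboxSpec) -> std::io::Result<()> {
    let status = Command::new("apptainer")
        .args(["instance", "start"])
        .args(["--network", "none"]) // netns has no connectivity; egress goes via the proxy socket
        .arg("--bind")
        .arg(format!("{}:/sandbox", spec.workspace))
        .arg("--bind")
        .arg(format!("{}:/opt/openshell/supervisor:ro", spec.supervisor))
        .arg("--env")
        .arg("OPENSHELL_SANDBOX=1") // placeholder for the usual OPENSHELL_* variables
        .arg(format!("docker://{}", spec.image))
        .arg(spec.instance_name)
        .status()?;
    if !status.success() {
        return Err(std::io::Error::other(format!(
            "apptainer instance start failed: {status}"
        )));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    start_apptainer_instance(&SandboxSpec {
        instance_name: "openshell-demo",
        image: "python:3.11-slim",
        workspace: "/tmp/openshell-demo/workspace",
        supervisor: "/opt/openshell/bin/supervisor",
    })
}
```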

Alternatives Considered

Rootless Podman: Currently challenging to deploy in these specific environments without cluster-wide infrastructure modifications, largely due to NFS storage graph limitations and host-level /etc/subuid dependencies.

Agent Investigation

Investigation: Apptainer and bubblewrap as unprivileged-host backends

Our agents evaluated both runtimes named in the issue on a representative locked-down host (Rocky 8.9, NFSv3 $HOME, no /etc/subuid, cgroup v1, no sudo). This is the same host on which rootless Podman v5.8.2 fails before launch (NFS xattr errors, missing subuid mappings, no cgroup limits on v1).

TL;DR: both runtimes are viable backends on this host class with the same network model — unshare-net plus a host-side forward-proxy UNIX socket bind-mounted in. Apptainer keeps the driver thin (built-in OCI pull + instance lifecycle + --nv); bubblewrap keeps the host footprint tiny but the driver has to supply OCI/lifecycle/networking glue. Both share the "no multi-uid images without /etc/subuid" and "no cgroup limits on v1" limitations.

Comparison at a glance

| Capability | Apptainer | Bubblewrap |
|---|---|---|
| OCI image pull | Built-in (docker://) | Driver-supplied (skopeo/umoci, or delegate to Apptainer) |
| Long-lived sandbox lifecycle | Built-in (instance start / instance list) | Driver-supplied (PID + state files, polling) |
| unshare-net + bind-mounted forward-proxy socket | Validated | Validated |
| Workspace + supervisor bind mounts | Built-in flags | Built-in flags |
| Single-uid mode on hosts without /etc/subuid | Default behavior | --unshare-user --uid 0 --gid 0 |
| Capability dropping / seccomp BPF hooks | Limited surface | First-class (--cap-drop, seccomp BPF) |
| Resource limits (CPU/mem) | None directly; cgroup v2 best-effort | None directly; cgroup v2 best-effort via systemd-run |
| GPU passthrough | --nv (built-in) | Driver-supplied (--dev-bind /dev/nvidia* + library mounts) |
| Host footprint | ~250 MB relocatable user install | Few hundred KB, often pre-installed |
| Driver surface area | Thin (translate spec → CLI) | Thick (rootfs cache, lifecycle, networking, limits) |

Network model (validated end-to-end under bwrap)

The sandbox is started with its network namespace unshared; a host-side OpenShell forward proxy is bind-mounted in over a UNIX socket, and HTTPS_PROXY is injected so every outbound call flows through gateway policy. From the supervisor inside the bwrap sandbox:

[PASS] direct TCP to 1.1.1.1:443:                 blocked: [Errno 101] Network is unreachable
[PASS] CONNECT example.com:443     (expect 200):  HTTP/1.1 200 OK
[PASS] CONNECT forbidden.example.net:443 (expect 403):  HTTP/1.1 403 Forbidden

The same shape is what an Apptainer driver would use (--network none + bind-mounted socket); no bridge, no published ports, no admin policy needed.
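
For reference, here is a sketch of how a driver could assemble that bwrap invocation. The flags mirror the probe rows in the bubblewrap table below; the rootfs, socket, and workspace paths, the proxy port, and the supervisor path are assumptions for illustration only.

```rust
use std::process::Command;

/// Hypothetical inputs; in a real driver these would come from the sandbox spec.
struct BwrapInputs<'a> {
    rootfs: &'a str,       // staged OCI rootfs on the host (mount points pre-created)
    proxy_socket: &'a str, // host-side UNIX socket of the OpenShell forward proxy
    workspace: &'a str,    // host workspace directory
}

fn bwrap_command(inputs: &BwrapInputs) -> Command {
    let mut cmd = Command::new("bwrap");
    cmd
        // Single-uid user namespace: works without /etc/subuid; only uid 0 exists inside.
        .args(["--unshare-user", "--uid", "0", "--gid", "0"])
        .args(["--unshare-pid", "--unshare-ipc", "--unshare-uts"])
        // No connectivity in the sandbox netns; all egress goes via the proxy socket.
        .arg("--unshare-net")
        // Rootfs first, then mounts layered on top; destinations inside the read-only
        // rootfs must already exist (see the REQUIRED row in the bubblewrap table).
        .args(["--ro-bind", inputs.rootfs, "/"])
        .args(["--proc", "/proc"])
        .args(["--dev", "/dev"])
        .args(["--tmpfs", "/tmp"])
        .args(["--tmpfs", "/run"])
        .args(["--bind", inputs.workspace, "/sandbox"])
        .args(["--ro-bind", inputs.proxy_socket, "/run/openshell/proxy.sock"])
        .args(["--cap-drop", "ALL"])
        // Point ordinary HTTP(S) clients at the in-sandbox loopback bridge that
        // relays to /run/openshell/proxy.sock.
        .args(["--setenv", "HTTPS_PROXY", "http://127.0.0.1:3128"])
        .args(["--setenv", "HTTP_PROXY", "http://127.0.0.1:3128"])
        .arg("/opt/openshell/supervisor");
    cmd
}

fn main() -> std::io::Result<()> {
    let status = bwrap_command(&BwrapInputs {
        rootfs: "/tmp/openshell-rootfs",
        proxy_socket: "/run/user/1000/openshell-proxy.sock",
        workspace: "/tmp/openshell-workspace",
    })
    .status()?;
    println!("sandbox exited: {status}");
    Ok(())
}
```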

Shared design that lets either runtime work here

  1. Connectivity: unshare-net + host-side forward proxy over a bind-mounted UNIX socket.
  2. Storage and supervisor delivery via bind mounts, replacing libpod named volumes / image_volumes.
  3. Lifecycle: apptainer instance start + polling (Apptainer), or PID + state-file tracking with inotify/polling (bwrap).
  4. CPU/memory limits as best-effort: only enforce on unified cgroup v2 (e.g. systemd-run --user --scope); otherwise document and skip. A wrapping sketch follows this list.
  5. No multi-uid images without /etc/subuid — same blocker as rootless Podman, both runtimes affected. Documented limitation.
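
On item 4: one way a driver could apply limits only where they can actually work is to detect a unified cgroup v2 hierarchy and, if present, wrap the runtime argv in a transient systemd user scope. CPUQuota and MemoryMax are standard systemd properties; the detection heuristic, values, and function names below are illustrative, not an existing driver API.

```rust
use std::path::Path;
use std::process::Command;

/// Heuristic: a unified cgroup v2 hierarchy exposes cgroup.controllers at the root.
fn cgroup_v2_unified() -> bool {
    Path::new("/sys/fs/cgroup/cgroup.controllers").exists()
}

/// Wrap an already-built runtime argv (apptainer or bwrap) in
/// `systemd-run --user --scope` so CPU/memory limits apply to the whole process tree.
/// On cgroup v1 hosts the limits are skipped and the runtime runs directly.
fn with_best_effort_limits(runtime_argv: &[&str], cpu_quota_pct: u32, mem_max: &str) -> Command {
    if cgroup_v2_unified() {
        let mut cmd = Command::new("systemd-run");
        cmd.args(["--user", "--scope", "--collect"])
            .arg("-p")
            .arg(format!("CPUQuota={cpu_quota_pct}%"))
            .arg("-p")
            .arg(format!("MemoryMax={mem_max}"))
            .args(runtime_argv);
        cmd
    } else {
        let mut cmd = Command::new(runtime_argv[0]);
        cmd.args(&runtime_argv[1..]);
        cmd
    }
}

fn main() -> std::io::Result<()> {
    // Trivial bwrap invocation standing in for the full sandbox argv.
    let argv = ["bwrap", "--ro-bind", "/", "/", "true"];
    let status = with_best_effort_limits(&argv, 200, "2G").status()?;
    println!("exited: {status}");
    Ok(())
}
```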

Suggested next steps

  • Unblock end-to-end testing on the target host class: the v0.0.42 release binaries require glibc ≥ 2.39 / GLIBCXX_3.4.31, which excludes RHEL 8 / Rocky 8 / SLES 15 (glibc 2.28). An x86_64-unknown-linux-musl artifact matching the existing aarch64 musl one would unblock testing of any HPC driver.
  • Apptainer driver spike: apptainer instance start with the real OpenShell supervisor image and the gateway's forward proxy bind-mounted as a UNIX socket, plus the small in-sandbox loopback→socket bridge for HTTPS_PROXY=http://127.0.0.1:N clients (a bridge sketch follows this list).
  • Bubblewrap driver spike: promote the captured bwrap argv into a real spec builder (mirroring openshell-driver-podman::build_container_spec) and wire CreateSandbox/StopSandbox/WatchSandboxes against PID + state files.
  • Re-run capability probes on representative hosts (HPC login node, workstation, admin-enabled cluster) before committing to driver API guarantees.
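
The Apptainer spike bullet mentions a small in-sandbox loopback→socket bridge so that unmodified HTTPS_PROXY=http://127.0.0.1:N clients can reach the bind-mounted proxy socket. A minimal sketch of such a bridge follows; the socket path and port are assumptions, and a production version would want timeouts and graceful shutdown.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};
use std::os::unix::net::UnixStream;
use std::thread;

// Assumed locations; the driver would inject both via the sandbox spec.
const PROXY_SOCKET: &str = "/run/openshell/proxy.sock";
const LISTEN_ADDR: &str = "127.0.0.1:3128"; // matches HTTPS_PROXY=http://127.0.0.1:3128

/// Copy bytes in both directions between the TCP client and the UNIX socket,
/// shutting down each write side when the opposite read side finishes.
fn bridge(client: TcpStream) -> io::Result<()> {
    let upstream = UnixStream::connect(PROXY_SOCKET)?;
    let mut client_rd = client.try_clone()?;
    let mut client_wr = client;
    let mut up_rd = upstream.try_clone()?;
    let mut up_wr = upstream;

    let to_proxy = thread::spawn(move || {
        let _ = io::copy(&mut client_rd, &mut up_wr);
        let _ = up_wr.shutdown(std::net::Shutdown::Write);
    });
    let _ = io::copy(&mut up_rd, &mut client_wr);
    let _ = client_wr.shutdown(std::net::Shutdown::Write);
    let _ = to_proxy.join();
    Ok(())
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind(LISTEN_ADDR)?;
    for conn in listener.incoming() {
        match conn {
            Ok(client) => {
                thread::spawn(move || {
                    if let Err(e) = bridge(client) {
                        eprintln!("bridge error: {e}");
                    }
                });
            }
            Err(e) => eprintln!("accept error: {e}"),
        }
    }
    Ok(())
}
```
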
Per-runtime probe results (full tables)

Apptainer

| Area | Result |
|---|---|
| OCI docker:// pull + exec | PASS |
| Bind RW → /sandbox workspace | PASS |
| Bind RO → supervisor binary path | PASS |
| --env injection (OPENSHELL_* parity) | PASS |
| Bind RO → /run/secrets | PASS |
| --fakeroot → uid 0 inside | PASS in root-mapped userns mode (single-uid map); subuid-backed fakeroot fails (no /etc/subuid entry) |
| apptainer instance start (long-lived sandbox) | PASS |
| --network none | PASS |
| Bind RO of a host UNIX socket into sandbox | PASS |
| --mount type=tmpfs for /run/netns | FAIL; workaround: bind a host dir → /run/netns (PASS) |
| --cpus cgroup limit | FAIL; requires unified cgroup v2 |
| Bridge --net + reach default gateway | FAIL, and not needed under the chosen network model |

Bubblewrap

| Area | Result |
|---|---|
| OCI rootfs staging (apptainer build --sandbox in the spike; skopeo+umoci also viable) | PASS; python:3.11-slim extracted in ~2.5 s |
| --unshare-{user,pid,ipc,uts} + single-uid map | PASS |
| Bind RW workspace, RO supervisor, --proc /proc --dev /dev --tmpfs /tmp --tmpfs /run | PASS |
| --cap-drop ALL + --setenv OPENSHELL_* | PASS |
| --unshare-net (direct TCP unreachable) | PASS |
| Bind RO host UNIX socket + allowlisted CONNECT (expect 200) | PASS |
| Bind RO host UNIX socket + denied CONNECT (expect 403) | PASS |
| Pre-existing mount points in rootfs before --ro-bind <rootfs> / | REQUIRED; the driver must pre-create them |
| OCI image pull / lifecycle / event stream / port publish | N/A; driver supplies these |
| CPU/memory cgroup limits | FAIL on cgroup v1; on v2, wrap bwrap in systemd-run --user --scope |

Environment: Rocky 8.9, x86_64, kernel 6.1, glibc 2.28, $HOME on NFSv3, no /etc/subuid entry, cgroup v1, no sudo. Apptainer 1.4.5 via the unprivileged installer; bubblewrap 0.4.0 system package.

Full per-runtime detail, artifact paths, and the Podman-fails-here context: experiments/apptainer-smoke/GITHUB-ISSUE-AGENT-INVESTIGATION.md, with companion artifacts in experiments/apptainer-smoke/ and experiments/bubblewrap-spike/.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
