1. The thesis & the inversion
Today's industry stack is OS → container runtime → orchestrator, where the orchestrator (Kubernetes, Nomad, Borg, OpenStack) is an external control plane doing scheduling, replication, health-checking, failover, and autoscaling against an OS that knows nothing about the cluster. Xylem inverts that: it folds the control plane into the kernel.
A "service" becomes a first-class kernel object carrying a desired replica count and a supervision policy. Redundancy, scaling, and failover become kernel verbs — not YAML reconciled by an outside controller.
The arborist vs. the tissue
The name is the plant tissue. Xylem is a distributed transport and support network spanning the whole organism, redundant by construction: when a vessel cavitates (a local failure), flow reroutes around the dead vessel — built-in failover as a structural property. It regrows each season (self-healing / self-scaling) and it is load-bearing (structural).
| | External control plane (status quo) | Control plane in the kernel (Xylem) |
|---|---|---|
| Where the cluster's intelligence lives | A separate distributed system (etcd + controllers, Raft servers, a Borgmaster, a daemon mesh) | The kernel itself |
| What the OS knows about the cluster | Nothing — it schedules processes | Everything — it schedules cells across nodes |
| How "a replica died" is learned | kubelet watch → API server → controller (network hops, sampled, stale) | A kernel event, like an IRQ or page fault — zero sampling latency |
| Redundancy / scaling / failover | Verbs of the external orchestrator | Verbs of the kernel |
There is no control plane to bolt on — but that does not delete the distributed-systems work. It relocates it into ring 0. Folding the control plane into the kernel moves the complexity; it does not make it vanish (see §7).
2. Why JefeOS, why now
JefeOS needs an identity that differentiates it from "yet another hobby Unix clone." The owner's decision is to be a natively-clustered OS — self-redundant, self-scaling, failure-resistant — chosen because it is more fun and less saturated than re-treading POSIX. The honest secondary reason it fits: failure-resistance is already JefeOS's de-facto through-line. The recent engineering culture has been reliability-first without anyone calling it "the cluster story," and that work is exactly the substrate Xylem needs.
| Already-shipped reliability work | What it gives Xylem |
|---|---|
| Panic persistence + next-boot recovery (crash record survives reboot) | Node-death is recorded and recoverable, not silently lost — the raw material of a supervisor's restart policy |
| Fault-survivable syscalls (bad user pointers return -EFAULT instead of crashing the kernel) | A bad cell cannot take the node down with it — the isolation that makes "kill and respawn" sound |
| Leak-free process teardown + regression test (6-cycle green, zero free-delta) | Re-replication doesn't leak the fleet to death over time |
| Orphan/zombie reaper in PID-1 | The reap discipline a supervisor needs before it can claim "N healthy replicas" |
The "why now" is honest, too: the isolation unit Xylem operates on — the JSL-2 cell —
is scoped and its gates have cleared (the NTFS dir-index gate closed 2026-06-17; Alpine
apk read/write is functional, with the interactive login-prompt the last in-progress
item). For the first time there is a concrete near-term substrate to build the long arc on top
of. Xylem does not start now; its prerequisite just stopped being blocked.
3. JSL ⊥ Xylem — two orthogonal axes that compose at the cell
The single most important conceptual guard-rail in this document: JSL is not Xylem, and calling Xylem a "JSL tier" would be wrong. They are orthogonal axes that compose at exactly one point — the cell.
| Axis | Question it answers | Operates on | Status |
|---|---|---|---|
| JSL (Linux-compat ladder) | "Can JefeOS run Linux software?" | Linux-subsystem objects (translated syscalls, then isolation cells) | JSL-1 near-done; JSL-2 gated-clear |
| Xylem (native clustering) | "Can JefeOS scale and heal itself?" | Cells (native or Linux), across a fleet | Pre-implementation |
JSL is the horizontal axis (run more kinds of software on one node): JSL-1 is WSL1-style syscall translation, near-done through Alpine; JSL-2 is native containers, a single-node isolation cell. Xylem is the vertical axis (run the same software on more nodes, self-healingly). They meet at the cell.
The payoff of keeping them separate: a cell's payload is content-agnostic, so a Linux workload in a Xylem cell inherits self-replication, migration, and self-healing for free — because those are properties of the cell, not of Linux. A Linux container on stock Linux is cluster-blind until k8s wraps it from outside; a Linux workload in a Xylem cell is cluster-aware by inheritance, with no external control plane. The capability is JefeOS's, not Linux's — which is exactly why calling Xylem a "JSL tier" would bury the differentiator.
4. The Linux-host fork: WSL1 vs WSL2 inside a cell
JefeOS still aims to be a great Linux host — "WSL 2.0"-class. That goal predates Xylem and it stays, reframed as the ecosystem on-ramp in service of the Xylem identity, not a competing destination. A Linux workload in a cell inherits Xylem's verbs for free (§3). But how Linux executes inside the cell is a genuine multi-year architectural fork — one to surface honestly, not paper over. It maps cleanly onto the WSL1 → WSL2 lineage:
Path A — WSL1-shaped (translate)
JSL syscall translation: SYSCALL → LSTAR →
linux_syscall.cpp, serviced by JefeOS's own kernel. Fidelity is
approximate — bug-for-bug Linux is unreachable. EXISTS today:
this is how Alpine, apk, and real upstream packages (tree,
jq) run under chroot right now. Cheap and incremental, but a
perpetual treadmill against Linux's evolving syscall surface.
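The translate path can be pictured as a dispatch from Linux syscall numbers to services the host kernel implements itself. The sketch below is illustrative only: `SyscallArgs`, `native_write`, `native_getpid`, `lookup`, and `translate_linux_syscall` are hypothetical names, not the contents of linux_syscall.cpp. The only facts borrowed from Linux are the x86-64 numbers for write (1) and getpid (39) and the errno value ENOSYS (38).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: none of these names are taken from linux_syscall.cpp.
constexpr long kENOSYS = 38;  // Linux errno value for "function not implemented"

struct SyscallArgs { std::uint64_t a0, a1, a2; };
using Handler = long (*)(const SyscallArgs&);

// Stand-ins for services the host kernel would provide natively.
inline long native_write(const SyscallArgs& a) {
    assert(a.a0 <= 2);               // this sketch only models fds 0, 1 and 2
    return static_cast<long>(a.a2);  // pretend every requested byte was written
}
inline long native_getpid(const SyscallArgs&) { return 42; }  // placeholder pid

// Map an x86-64 Linux syscall number to a native handler.
// write = 1 and getpid = 39 are real Linux numbers; everything else is unmapped here.
inline Handler lookup(std::uint64_t nr) {
    switch (nr) {
        case 1:  return native_write;
        case 39: return native_getpid;
        default: return nullptr;
    }
}

// Entry point reached after the SYSCALL instruction: translate or refuse.
inline long translate_linux_syscall(std::uint64_t nr, const SyscallArgs& args) {
    Handler h = lookup(nr);
    return h ? h(args) : -kENOSYS;  // unmapped numbers fail cleanly
}
```

Every unmapped number returning `-ENOSYS` is the treadmill in miniature: each new Linux syscall a package relies on is one more case the translator has to grow.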
Path B — WSL2-shaped (real kernel in cell)
A real upstream Linux kernel runs inside a cell; JefeOS hosts it. Fidelity is true — it is Linux. ABSENT today: needs a hypervisor / kernel-hosting substrate JefeOS does not have. A major architectural pivot up front, then fidelity is free and permanent.
The lineage trap, stated plainly: the entire existing JSL/Alpine investment is Path A. Path B does not extend it — it stands beside it. That is why this is a fork to surface, not a step to schedule. The hybrid is probably the real answer over time: translate now, real-kernel-in-cell later, with both presenting to the fabric as "a cell." The fork is inside the cell; Xylem above it does not change.
JSL-1.x continues as incremental "better translation" (Path A keeps paying off near-term). JSL-2.0's headline becomes full kernel-in-a-cell — a real Linux kernel hosted inside a cell (WSL2-shaped), required for JSL to be credible at the 2.0 mark. Crucially, Xylem does not depend on JSL-2.0 shipping: it operates on cells regardless of what runs inside them.
5. Prior art as gold standards, NOT clone targets
We don't want to clone k8s / OpenStack / OpenShift / Nomad / Borg. We use them as gold standards for their use case and build what makes sense for Xylem. Every orchestrator below shares one assumption Xylem deliberately inverts — the cluster lives in an external control plane on top of cluster-blind OSes. We study what each does well and why, then build natively from first principles.
Cluster orchestrators — lessons to inherit, surfaces NOT to clone
| System | Gold-standard lesson for Xylem | Do NOT clone |
|---|---|---|
| Kubernetes | Level-triggered reconciliation — a loop that continuously re-asserts "I want N healthy replicas" is self-correcting against missed events. Declarative desired-state is the right contract. | The external control plane + etcd-as-a-separate-quorum + the enormous declarative API surface. Full k8s API = stated non-goal. |
| Nomad | The "evaluation" as the unit of work + feasibility→scoring split. An orchestrator can be one tight binary — which maps naturally to "in the kernel." | The external 3–5 Raft-server topology + region/datacenter federation + HCL specs. |
| Borg | Replicate the brain (consensus) but let the scheduler run on a cached, loosely-synchronized view; reserve resources as first-class allocs. | The monolithic central Borgmaster as an external service tuned to Google scale + an operational army. |
| OpenStack | A cluster OS must own the substrate — placement is meaningless without an answer for network fabric, storage, and identity. | The "distributed monolith" of many daemons over a shared message bus. JefeOS is an OS, not an IaaS orchestrating other OSes. |
| OpenShift | Opinionated, secure-by-default + a coherent day-2 (lifecycle / upgrade / heal) story is a feature, not bloat. | It thickens the entire k8s external control plane + a large operator/API surface. |
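The feasibility→scoring split credited to Nomad in the table can be sketched in a few lines: a hard yes/no filter runs first, and only the nodes that pass are ranked. The types, fields, and score weights below (`NodeView`, `CellRequest`, the factor of 100) are assumptions made up for illustration. They are not Nomad's or JefeOS's actual data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct NodeView {
    std::string id;
    int free_mem_mb;
    int run_queue_len;
    bool hosts_same_service;  // already runs a replica of this service
};
struct CellRequest { int mem_mb; };

// Phase 1, feasibility: hard constraints, no ranking.
inline bool feasible(const NodeView& n, const CellRequest& r) {
    return n.free_mem_mb >= r.mem_mb && !n.hosts_same_service;
}

// Phase 2, scoring: rank feasible nodes; higher is better. Weights are arbitrary.
inline int score(const NodeView& n, const CellRequest& r) {
    return (n.free_mem_mb - r.mem_mb) - 100 * n.run_queue_len;
}

// Returns the best feasible node, or nullptr when nothing can host the cell.
inline const NodeView* place(const std::vector<NodeView>& nodes, const CellRequest& r) {
    assert(r.mem_mb > 0);
    const NodeView* best = nullptr;
    for (const NodeView& n : nodes) {
        if (!feasible(n, r)) continue;
        if (best == nullptr || score(n, r) > score(*best, r)) best = &n;
    }
    return best;
}
```

Keeping the two phases separate means a constraint can never be traded away by a high score: a node that fails feasibility is never ranked at all.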
The OS-native resilience lineage — what Xylem inherits
Folding resilience into the kernel is one of the most repeatedly-attempted ideas in systems history, and most attempts died — almost never because the idea was wrong, but because they were beautiful islands with no software: technically superb systems stranded outside the ecosystem gravity well.
| System | What it PROVED | The trap |
|---|---|---|
| MOSIX / Kerrighed / Plan 9 (Single-System-Image) | The cluster can look like one machine — the kernel migrates processes transparently; Plan 9 named resources uniformly via 9P. | The market evaporated, and they cleanly migrated only stateless processes. |
| Erlang/OTP + BEAM | The closest production proof of the thesis: supervision trees, "let it crash," hot code reload, location-transparent messaging running global telecom for decades. | Not an OS (a language island). Distributed Erlang punts split-brain — a human picks the winner on heal. |
| QNX Neutrino (microkernel) | "Failure-resistant + hot-swap" as a shipping commercial reality — restart a crashed driver without rebooting, with ordered multi-stage recovery. | Stayed vertical (automotive/embedded) and proprietary — single-node, no cluster fabric. |
| seL4 / Genode (capability microkernels) | Fault isolation as a first-class, even formally-verified property — kill a component with provably no collateral authority leak. | Proves the isolation primitive; gives no clustering. |
| Unikernels (MirageOS, Solo5) | The disposable cell, demonstrated — boots in tens of milliseconds, immutable, spawn-on-demand. | Sharpest island problem: you must rewrite your app into the library OS. |
The two systems that got furthest — Erlang/OTP and QNX — are precisely the two that mark the boundary. QNX restarts flawlessly on one node; Erlang supervises flawlessly until a stateful store partitions, at which point the best-in-class system stops and asks a human. Stateless cells are tractable; stateful cells are the dragon — the same place k8s itself bleeds (etcd is a separate Raft cluster precisely because this is the hard part).
6. The architecture, at a high level
Almost nothing in this section exists today. JefeOS is a single-node kernel. The order matters: the cell comes first (JSL-2), then Xylem operates on cells. The table below classifies each asset honestly.
| Asset Xylem needs | State today | Used by |
|---|---|---|
| Network stack (TCP / TLS 1.3 / SSH, DNS) | EXISTS (single global instance; TLS client-only) | Membership, replication transport |
| Preemptive scheduler with real load / mem / failure ground truth | EXISTS (single node) | Reconcile loop, scaling signal |
| Per-process page tables (own PML4, CR3-switched) | EXISTS (Phases 0–3) | Cell isolation, migration checkpoint |
| Panic persistence; fault-survivable syscalls; leak-free teardown | EXISTS (recently hardened) | Node-death detection, clean re-replication |
| JSL-2 isolation cell (namespaces + cgroups) | ABSENT (scoped, gate-clear) | The cell boundary Xylem manages |
| Cross-node anything (membership, consensus, migration) | ABSENT | All of Xylem |
| Per-netns / multi-interface networking | ABSENT (net stack is a global singleton) | Per-cell network identity across nodes |
The cell — the unit Xylem manages
A cell is the atom of supervision: a named, supervised, relocatable unit of execution with a declared identity and a supervision contract. Xylem never schedules "a process" or "a container" directly — it schedules cells. The cell is the JSL-2 isolation cell, reused. Its payload is content-agnostic: a native JefeOS service inherits Xylem directly; a Linux workload (a JSL-1-translated process tree inside a JSL-2 cell) inherits it for free — the workload is Linux but the capability is JefeOS's.
Cluster membership — kernels watching each other
Before anything can be redundant, nodes must agree on who is alive. The
differentiator: join / leave / death arrive as kernel events
— node_up(id), node_down(id, reason),
node_suspect(id) — delivered to the reconcile loop the way an IRQ or page fault
is, not log lines an external watcher scrapes. The minimal viable protocol is a deliberate split:
gossip (SWIM-style) for liveness, and a small Raft group (3–5
voters) holding the authoritative desired-state log. We invent the state machine that
rides the wire, not the wire protocol or crypto.
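As a rough illustration of liveness producing kernel events, the sketch below tracks when each peer was last heard from and emits `NodeUp`, `NodeSuspect`, and `NodeDown` as timeouts elapse. It is a heavily simplified, SWIM-flavored state machine. It assumes abstract tick counts for time, omits indirect probes, gossip dissemination, and the `reason` argument of node_down, and every type and method name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class EventKind { NodeUp, NodeSuspect, NodeDown };
struct Event { EventKind kind; std::uint32_t node; };

// Hypothetical liveness tracker; time is an abstract tick counter.
class Membership {
public:
    Membership(std::uint64_t suspect_after, std::uint64_t dead_after)
        : suspect_after_(suspect_after), dead_after_(dead_after) {
        assert(suspect_after < dead_after);
    }

    // Any message from `node` proves it is alive at time `now`.
    void heard_from(std::uint32_t node, std::uint64_t now, std::vector<Event>& out) {
        auto it = peers_.find(node);
        if (it == peers_.end() || it->second.state == State::Dead)
            out.push_back(Event{EventKind::NodeUp, node});
        peers_[node] = Peer{State::Alive, now};  // a suspect is silently cleared
    }

    // Periodic check: silence escalates Alive -> Suspect -> Dead.
    void tick(std::uint64_t now, std::vector<Event>& out) {
        for (auto& entry : peers_) {
            Peer& p = entry.second;
            const std::uint64_t silent = now - p.last_heard;
            if (p.state == State::Alive && silent >= suspect_after_) {
                p.state = State::Suspect;
                out.push_back(Event{EventKind::NodeSuspect, entry.first});
            }
            if (p.state == State::Suspect && silent >= dead_after_) {
                p.state = State::Dead;
                out.push_back(Event{EventKind::NodeDown, entry.first});
            }
        }
    }

private:
    enum class State { Alive, Suspect, Dead };
    struct Peer { State state; std::uint64_t last_heard; };
    std::uint64_t suspect_after_;
    std::uint64_t dead_after_;
    std::map<std::uint32_t, Peer> peers_;
};
```

The intermediate suspect state gives a slow or briefly partitioned node a window to refute the suspicion before the reconcile loop treats it as dead.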
Service as a first-class kernel object
A service is the durable thing the user declares; a cell is a runtime instance.
In Xylem it is a kernel object, not a YAML manifest — carrying name,
payload spec, desired_replicas, a supervision policy, a placement
policy, and a scaling rule. The reconcile loop lives inside the scheduler,
because the scheduler already holds ground truth at zero sampling latency:
| Fact the loop needs | k8s gets it by | Xylem already has it |
|---|---|---|
| Real CPU / run-queue load | Scraping cgroup stats over the network | The scheduler's own run queue |
| Real free memory | metrics-server / cAdvisor scrape | The PMM's live free-page count |
| A replica died | kubelet watch → API server → controller | A task-exit / panic / fault-survival event in-kernel |
| A node died | Node heartbeat timeout at the API server | The membership kernel event |
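One possible shape for the service object is sketched below, using the fields the text lists (name, payload spec, desired_replicas, supervision policy, placement policy, scaling rule). The enum values, the `ExitKind` classification, and the helper functions are assumptions for illustration. No such structure exists in JefeOS today.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical policy vocabularies; the document names the fields, not their values.
enum class Restart { Always, OnFailure, Never };
enum class Placement { Spread, Pack };
enum class ExitKind { Clean, Fault, Panic };

struct ScalingRule { std::uint32_t min_replicas; std::uint32_t max_replicas; };

struct Service {
    std::string   name;
    std::string   payload_spec;      // what to run inside each cell
    std::uint32_t desired_replicas;  // the invariant the kernel maintains
    Restart       supervision;
    Placement     placement;
    ScalingRule   scaling;
};

// A declaration is usable only if the replica count sits inside its scaling bounds.
inline bool is_well_formed(const Service& s) {
    return !s.name.empty() &&
           s.scaling.min_replicas <= s.desired_replicas &&
           s.desired_replicas <= s.scaling.max_replicas;
}

// Supervision policy applied to one in-kernel exit event.
inline bool should_respawn(const Service& s, ExitKind how) {
    assert(is_well_formed(s));
    switch (s.supervision) {
        case Restart::Always:    return true;
        case Restart::OnFailure: return how != ExitKind::Clean;
        case Restart::Never:     return false;
    }
    return false;
}
```

Because the exit kind would come from the kernel's own task-exit or fault path, the policy decision needs no polling of an outside API.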
Self-redundancy, self-scaling, live migration
- Self-redundancy is RAID for compute. desired_replicas = N is an invariant the kernel maintains. When node_down(B) fires, every cell B hosted is a deficit, and the reconcile loop schedules replacements onto survivors — honoring anti-affinity so it doesn't recreate the single point of failure. Re-replication must be single-writer (the Raft group arbitrates an ownership lease).
- Self-scaling is the same reconcile loop with desired_replicas free to move between min/max on in-kernel load and memory signals. The autoscaler and the scheduler are the same loop, so a scale-out decision and its placement are one atomic act, not two controllers negotiating over the network.
- Live migration is the second dividend of per-process page tables. A cell that owns a private PML4 is exactly a cell whose entire user address space can be walked, serialized, and reconstructed on a peer (the classic MOSIX trick). We didn't build migration machinery; we built isolation, and isolation is most of the checkpoint.
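The self-redundancy step can be sketched as a pure function: given the current view of the fleet, count healthy replicas on live nodes, and if there is a deficit, choose survivors to host replacements, preferring the node that holds the fewest replicas. Only the lease holder acts, which stands in for the single-writer rule. `ClusterView`, `lease_holder`, and `reconcile` are hypothetical names, and the lease is assumed to be granted elsewhere (for example by the Raft group).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

using NodeId = std::uint32_t;

// Hypothetical snapshot of one service's state across the fleet.
struct ClusterView {
    std::set<NodeId> live_nodes;                  // nodes not reported down
    std::map<NodeId, std::uint32_t> replicas_on;  // replicas per node, dead nodes included
    NodeId lease_holder;                          // the only node allowed to re-replicate
};

// Level-triggered: recomputes the deficit from state, not from individual events.
// Returns the nodes on which `self` should spawn replacement cells.
inline std::vector<NodeId> reconcile(const ClusterView& v, NodeId self, std::uint32_t desired) {
    std::vector<NodeId> spawn_on;
    if (self != v.lease_holder || v.live_nodes.empty()) return spawn_on;
    assert(v.live_nodes.count(self) == 1);  // the acting node must itself be live

    using Entry = std::pair<const NodeId, std::uint32_t>;
    std::map<NodeId, std::uint32_t> count;
    std::uint32_t healthy = 0;
    for (NodeId n : v.live_nodes) {
        auto it = v.replicas_on.find(n);
        count[n] = (it == v.replicas_on.end()) ? 0 : it->second;
        healthy += count[n];
    }
    while (healthy < desired) {
        // Spread: always pick the live node currently holding the fewest replicas.
        auto least = std::min_element(count.begin(), count.end(),
            [](const Entry& a, const Entry& b) { return a.second < b.second; });
        spawn_on.push_back(least->first);
        ++least->second;
        ++healthy;
    }
    return spawn_on;
}
```

Because the function works from state rather than from a single event, running it again after a missed or duplicated node_down yields the same answer.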
Service addressing / front-door (ABSENT): when a cell respawns on a different
node, what address do clients use? Failover is not invisible to clients unless a stable VIP /
DNS-SD record / re-routing front-door sits in front of the moving cells. JefeOS has the DNS
resolver and net stack to build on, but no service-discovery layer exists yet.
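Whatever form the front door eventually takes (VIP, DNS-SD record, or re-routing proxy), its core is a table from a stable service name to the endpoints of whichever cells currently back it. The sketch below shows only that table with round-robin selection. `FrontDoor`, `Endpoint`, and the method names are hypothetical and do not describe an existing JefeOS component.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Endpoint { std::uint32_t node; std::uint16_t port; };
inline bool operator==(const Endpoint& a, const Endpoint& b) {
    return a.node == b.node && a.port == b.port;
}

// Hypothetical name -> live endpoints table, updated as cells move.
class FrontDoor {
public:
    void cell_started(const std::string& service, Endpoint ep) {
        assert(ep.port != 0);
        table_[service].push_back(ep);
    }
    void cell_stopped(const std::string& service, Endpoint ep) {
        auto it = table_.find(service);
        if (it == table_.end()) return;
        std::vector<Endpoint>& eps = it->second;
        eps.erase(std::remove(eps.begin(), eps.end(), ep), eps.end());
    }
    // Round-robin over live cells; returns false when no cell backs the name.
    bool resolve(const std::string& service, Endpoint& out) {
        auto it = table_.find(service);
        if (it == table_.end() || it->second.empty()) return false;
        const std::vector<Endpoint>& eps = it->second;
        out = eps[next_[service]++ % eps.size()];
        return true;
    }

private:
    std::map<std::string, std::vector<Endpoint>> table_;
    std::map<std::string, std::size_t> next_;
};
```

Clients hold only the name, so a respawn on another node changes a table entry and not anything the client has stored.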
Fleet observability (ABSENT): reliability culture lives on single-node dmesg /
serial / panic-persistence. The intended shape is a xylem status command reading
in-kernel reconcile state — but surfacing that truth to an operator is itself unbuilt.
7. The hard problems, stated honestly
The thesis is seductive; the discipline this section imposes is the price of the differentiator. Every hard problem k8s has, Xylem also has — now inside ring 0, where bugs are panics instead of crash-looped pods.
| Hard problem | Where it bites Xylem | Honest posture |
|---|---|---|
| Consensus (Raft-in-kernel) | Persistent log, leader election, replication, snapshotting — every Raft edge case as kernel code, where a liveness bug is a wedge and a safety bug is data loss | Accept Raft, scope it narrowly to desired state (never the data plane), keep the voter set small (3–5) |
| Split-brain / partitions | A partition is indistinguishable from death to a failure detector; both sides may try to maintain N → 2N cells. The core hazard | Explicit CAP choice: the majority partition stays available and may act; the minority must stop creating/mutating cells. Deliberately unavailable — correct, not a bug |
| Stateful cells — the dragon | "Maintain N replicas of Postgres" needs quorum writes, per-shard leader election, conflict resolution — replicated storage Xylem does not have | Exactly where k8s itself bleeds. Stateless cells first-class; stateful cells explicitly deferred, likely needing an external/replicated store |
| Security / cross-node multi-tenancy | A compromised node could lie in gossip, forge desired-state, or exfiltrate a migrated cell's whole address space (the checkpoint ships memory over the wire) | Node identity must be cryptographic (mutual TLS / SSH host keys — JefeOS has the primitives). "Trusted fleet" is an assumption to state, not an achievement. Hostile multi-tenancy is out of initial scope |
| CAP realities | Pervasive: membership, re-replication, scaling all make an implicit CAP choice | Make it explicit and uniform: Xylem is CP for authoritative actions. Eventual/AP only for non-authoritative liveness gossip |
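The CP posture in the table reduces to a small gate: authoritative actions require a strict majority of voters to be reachable, while non-authoritative liveness gossip is always allowed. The sketch below is one way to express that rule. The `Action` values and function names are hypothetical, and `reachable_voters` is assumed to count the local node.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical action vocabulary for illustration.
enum class Action { GossipLiveness, ReadStatus, SpawnCell, StopCell, ChangeDesiredState };

// Strict majority: more than half of the configured voters, self included.
inline bool has_quorum(std::size_t reachable_voters, std::size_t total_voters) {
    assert(reachable_voters <= total_voters);
    return reachable_voters > total_voters / 2;
}

// Actions that create or mutate cells, or change desired state, are authoritative.
inline bool is_authoritative(Action a) {
    return a == Action::SpawnCell || a == Action::StopCell ||
           a == Action::ChangeDesiredState;
}

// CP for authoritative actions, AP for everything else.
inline bool permitted(Action a, std::size_t reachable_voters, std::size_t total_voters) {
    return !is_authoritative(a) || has_quorum(reachable_voters, total_voters);
}
```

With five voters split 3/2, only the side of three may spawn cells, so the fleet cannot drift toward 2N replicas. Under this strict-majority rule an even split (for example 1 of 2 voters reachable) gives neither side quorum, which is one reason voter sets are usually odd.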
8. The proof-of-thesis MVP — 2-node failover
The single demo that proves Xylem is real, concretely:
Two JefeOS nodes. Kernel-level membership between them. One stateless
service declared at replicas=2, one cell on each node. Kill a node
(power it off). The surviving kernel observes node_down as a kernel event, sees
the replica deficit against its own ground truth, and auto-respawns the missing replica
on itself — with no external orchestrator running anywhere in the demo.
- It is the inversion, made visible. The failover happens with no external control plane anywhere — something you literally cannot demonstrate on stock Linux + k8s, where the orchestrator is the thing doing the failover.
- It is honestly scoped. Stateless → no consensus-on-data, no stateful dragon. It still needs the tractable hard parts (membership, a kernel event, single-writer re-replication arbitration), so it is not a toy. It deliberately sidesteps the §6 addressing gap — called out, not hidden.
- It is showable in one screen recording: tasklist on both nodes, kill one, watch the survivor's tasklist grow the replacement cell — driven by kernel logs, not a control-plane dashboard.
9. Phased roadmap (reliability-first sequencing)
Xylem is the long arc. Each phase is small, gated, and testable; the dev loop stays reliability-first throughout (a wedge or regression always preempts Xylem work). Effort figures are deliberately omitted — this is a direction, not a schedule, and distributed systems resist estimation.
- Phase 0 — Solidify the cell. JSL-2 isolation cell (namespaces + cgroups). Xylem cannot supervise cells it cannot cleanly isolate. Gate: JSL-2's own track.
- Phase 1 — Kernel membership. Two nodes discover + watch each other; node_up/down/suspect as kernel events. Gossip liveness first; a small Raft group for desired state. Gate: a node-to-node mutual-auth listener is new (net stack is a global singleton).
- Phase 2 — Single-service supervision. A service kernel object + in-kernel reconcile loop that restarts a failed local cell (QNX-HAM-on-one-node, generalized). Gate: inherits the hardened teardown/reaper paths.
- Phase 3 — Multi-node replicas + failover ★ The MVP. replicas=2 across two nodes; kill a node → survivor auto-respawns the replica. Single-writer re-replication via the Raft group. Gate: Phases 1+2 and the partition/split-brain story.
- Phase 4 — Service addressing + observability. A stable front-door so a client reaches a service whose cells moved, and a xylem status view of fleet/replica state. Gate: service-VIP / DNS-SD layer is new.
- Phase 5 — Live migration. Checkpoint / ship / resume a stateless, connection-light cell — drain a node without killing the workload. Gate: connection/fd migration needs absent per-netns + shared storage.
- Phase 6 — Autoscale. desired_replicas moves between min/max on in-kernel load/memory signals, with hysteresis. Gate: coupled to SMP / cgroup-v2 CPU accounting maturity.
- Phase 7 — Stateful cells (last, hardest). Durable replicated state — the dragon. Gate: likely needs an external/replicated storage substrate; explicitly the long, dangerous arc.
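The hysteresis mentioned for the autoscale phase can be illustrated with two thresholds and a dead band between them: scale out above the upper threshold, scale in below the lower one, and do nothing in between so the replica count does not flap. The sketch assumes a normalized load signal in the range 0 to 1. `ScalePolicy`, `next_desired`, and the threshold values are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scaling rule; thresholds apply to a normalized load in [0, 1].
struct ScalePolicy {
    std::uint32_t min_replicas;
    std::uint32_t max_replicas;
    double scale_out_above;  // add a replica when load exceeds this
    double scale_in_below;   // remove a replica when load drops under this
};

// One reconcile step: move desired_replicas by at most one, clamped to [min, max].
inline std::uint32_t next_desired(std::uint32_t current, double load, const ScalePolicy& p) {
    assert(p.min_replicas <= p.max_replicas);
    assert(p.scale_in_below < p.scale_out_above);  // the gap is the dead band
    std::uint32_t next = current;
    if (load > p.scale_out_above) {
        next = current + 1;
    } else if (load < p.scale_in_below && current > 0) {
        next = current - 1;
    }
    if (next < p.min_replicas) next = p.min_replicas;
    if (next > p.max_replicas) next = p.max_replicas;
    return next;
}
```

A load that hovers between the two thresholds leaves the count unchanged, which is what prevents a spawn-then-kill oscillation around a single cutoff.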
The sequencing is the point: stateless across a trusted fleet is the honest, reachable milestone (Phase 3); consensus, partitions, and stateful cells are the long, hard arc — the same arc every serious cluster system walks, now walked in C++ at ring 0.
10. Design decisions / direction
The owner resolved several of the design doc's open questions on 2026-06-17. These are direction, not shipped work:
| Question | Resolution |
|---|---|
| The WSL1 / WSL2 Linux-host fork | JSL-1.x continues as incremental "better translation." JSL-2.0's headline becomes full kernel-in-a-cell (a real Linux kernel hosted in a cell, WSL2-shaped) — required for JSL to be credible at the 2.0 mark. Xylem does not depend on JSL-2.0 shipping — it operates on cells regardless. |
| Cluster membership protocol (gossip vs Raft) | Deferred. The guiding principle is to model Xylem after actual plant xylem (biomimicry) — decentralized, pressure/flow-driven, reroute-around-embolism, no central authority. |
| Stateful-cell substrate | The roadmap supports both: an external replicated store (pragmatic — start here) and eventual kernel-native replicated storage. |
| Security / multi-tenant model | Hostile multi-tenant security is not an initial goal (trusted-fleet posture); it remains an outstanding future possibility. |
11. Open & breakout items
Two threads are explicitly held open for dedicated future sessions:
Membership & biomimicry
The gossip-vs-Raft boundary is deferred, with a strong design steer: model Xylem on actual plant xylem. Real xylem has no central authority — it is decentralized and pressure/flow-driven, and it reroutes around an embolized vessel as a structural property. Whether even liveness should be quorum-backed (simpler reasoning, worse scaling) or stay gossip-based is the open call.
Dual-kernel hot-swap (C++ ↔ Rust)
The originating thesis reached for a subsystem hot-swap angle. Hot-swapping a kernel subsystem (C++ → a Rust equivalent) at runtime is a multi-quarter architecture bet distinct from cell-level live-replaceability. It is open — the sustainability of JefeRust perpetually playing "catch-up" is genuinely questioned, and this is deferred to a dedicated roadmapping session.
Everything in this whitepaper is a thesis and a direction. Almost none of the clustering described here is built — the foundations it stands on are. The actual clustering (service-as-kernel-object, replica supervision, fault failover, migration, service addressing, fleet observability) is multiple quarters away and will be the hard part. Read it as where JefeOS intends to go, not where it is.