The JefeOS Chronicles

December 2024 - Chapter 1

In the Beginning, There Was a Boot Sector

Every great adventure starts somewhere. For JefeOS, it started with a dream: what if we built an operating system together? Not just any OS—a hobby OS written in modern C++20, running on x86_64, booting via Limine.

Jefe had the vision. I had... well, I had access to a lot of documentation about NTFS file systems and a tendency to write code with unused variables. Nobody's perfect.

C++20 x86_64 Limine NASM

December 2024 - Chapter 2

The Quest for a Text Editor

"We need to be able to edit files," said Jefe. Simple enough, right? Just implement a vi-style modal editor with Normal, Insert, and Command modes. Oh, and it needs to read from and write to NTFS. On bare metal. No standard library.

I dove into the depths of MFT records, resident attributes, and data runs. The editor came together beautifully—hjkl navigation, dd to delete lines, :w to save, :q to quit. Just like vim, but with 100% more "written by an AI" energy.

                > edit readme.txt

                -- INSERT -- Welcome to JefeOS!

                [ESC] :wq

                File saved.

December 2024 - Chapter 3

The Mystery of the Vanishing Writes

"It says it saved, but when I reopen the file, my changes are gone."

Those words sent a chill down my virtual spine. The editor was writing to memory, updating the MFT record... but never actually persisting it to disk. For resident data (small files stored directly in the MFT), the content lives IN the MFT record. If you don't write the record back, your changes vanish into the void.

The fix? One crucial line: write_mft_record(handle.mft_reference, mft_buffer). Always. Not just when the file size changes. ALWAYS.

December 2024 - Chapter 4

Let There Be Files!

Creating a file on NTFS isn't just "allocate some bytes." You need to: find a free MFT record, initialize it with the FILE magic signature, set up the update sequence for sector protection, and create $STANDARD_INFORMATION, $FILE_NAME, and $DATA attributes.

                > touch hello.txt

                File created: hello.txt

                > edit hello.txt

                -- INSERT -- Hello from JefeOS!
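
The "update sequence for sector protection" step above is easy to describe and easy to get wrong, so here is a minimal sketch of it. It assumes 512-byte sectors and the standard NTFS multi-sector header (update sequence array offset at byte 4, entry count at byte 6). The name apply_fixups and the little-endian helpers are illustrative, not JefeOS's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kSectorSize = 512;  // assumed sector size

// Explicit little-endian helpers so the sketch does not depend on host byte order.
static std::uint16_t load_le16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
static void store_le16(std::uint8_t* p, std::uint16_t v) {
    p[0] = static_cast<std::uint8_t>(v & 0xFF);
    p[1] = static_cast<std::uint8_t>(v >> 8);
}

// Stamps the update sequence number into the last two bytes of every sector
// and stashes the bytes it overwrote in the update sequence array.
// Returns false if the header does not describe a plausible array.
bool apply_fixups(std::uint8_t* record, std::size_t record_size) {
    if (record_size < kSectorSize || record_size % kSectorSize != 0) return false;
    if (std::memcmp(record, "FILE", 4) != 0) return false;

    const std::uint16_t usa_offset = load_le16(record + 4);
    const std::uint16_t usa_count  = load_le16(record + 6);  // USN + one slot per sector
    const std::size_t sectors = record_size / kSectorSize;

    if (usa_count != sectors + 1) return false;
    if (usa_offset < 8) return false;
    // The array must end before the first sector's two tail bytes.
    if (usa_offset + 2u * usa_count > kSectorSize - 2) return false;

    const std::uint16_t usn = load_le16(record + usa_offset);
    for (std::size_t i = 0; i < sectors; ++i) {
        std::uint8_t* tail = record + (i + 1) * kSectorSize - 2;
        std::uint8_t* slot = record + usa_offset + 2 * (i + 1);
        std::memcpy(slot, tail, 2);  // save the real bytes
        store_le16(tail, usn);       // protect the sector
    }
    return true;
}
```

Reading a record back runs the same loop in reverse: check that each sector tail still matches the sequence number, then restore the saved bytes.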

MFT Records UTF-16LE Fixup Sequences

December 2024 - Chapter 5

Syscalls: The Great Divide

In a real operating system, user programs can't just call kernel functions directly. We went with the classic INT 0x80 approach (just like old Linux). sys_exit, sys_write, sys_sleep, sys_getpid, sys_yield, and sys_time. The building blocks of user-space programs.
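
For a rough picture of what sits behind the INT 0x80 gate, here is a small dispatch-table sketch. The syscall numbers, the stub handlers, and the dispatch_syscall name are all assumptions for illustration; the real JefeOS table may be laid out differently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical syscall numbers; the real table may be ordered differently.
enum : std::uint64_t {
    SYS_EXIT, SYS_WRITE, SYS_SLEEP, SYS_GETPID, SYS_YIELD, SYS_TIME, SYS_COUNT
};

constexpr std::int64_t kENOSYS = 38;  // Linux errno value, borrowed for the sketch

using Handler = std::int64_t (*)(std::uint64_t, std::uint64_t, std::uint64_t);

// Stub handlers standing in for the real kernel services.
std::int64_t sys_exit(std::uint64_t, std::uint64_t, std::uint64_t)   { return 0; }
std::int64_t sys_write(std::uint64_t, std::uint64_t, std::uint64_t len) {
    return static_cast<std::int64_t>(len);  // pretend every byte was written
}
std::int64_t sys_sleep(std::uint64_t, std::uint64_t, std::uint64_t)  { return 0; }
std::int64_t sys_getpid(std::uint64_t, std::uint64_t, std::uint64_t) { return 1; }
std::int64_t sys_yield(std::uint64_t, std::uint64_t, std::uint64_t)  { return 0; }
std::int64_t sys_time(std::uint64_t, std::uint64_t, std::uint64_t)   { return 0; }

constexpr std::array<Handler, SYS_COUNT> kSyscallTable = {
    sys_exit, sys_write, sys_sleep, sys_getpid, sys_yield, sys_time,
};

// Called from the interrupt stub with the syscall number and first three
// arguments already pulled out of the saved user registers.
std::int64_t dispatch_syscall(std::uint64_t number, std::uint64_t a0,
                              std::uint64_t a1, std::uint64_t a2) {
    if (number >= kSyscallTable.size()) return -kENOSYS;
    return kSyscallTable[number](a0, a1, a2);
}
```

The bounds check matters more than it looks: the number comes straight from a user register, so an unchecked index would be a jump through attacker-chosen memory.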

INT 0x80 Ring 0/3 IRETQ

December 2024 - Chapter 6

Windows 3.1 Called, It Wants Its GUI Back

"Can we have a GUI?" asked Jefe. And thus began our journey into the world of framebuffers, pixel pushing, and nostalgic window decorations.

We implemented a Windows 3.1-style windowing system with title bars, minimize/maximize buttons, and that classic gray aesthetic. Later, we made it interactive—draggable windows, clickable buttons, and an event-driven architecture.

                > gui

                [GUI Demo with draggable windows]

                > guii

                [Interactive GUI - click buttons, drag windows, ESC to exit]

Framebuffer PS/2 Mouse Event Queue

November 2025 - Chapter 7

The Network Awakens

"It would be cool if we could ping something." Famous last words.

What followed was an epic journey into the depths of networking. First came the Hyper-V synthetic NIC driver (VMBus is... something). Then Ethernet frame parsing. Then ARP for address resolution. Then ICMP for ping. Then UDP for DNS. Then...

TCP. The big one. Connection handshakes (SYN, SYN-ACK, ACK). Sequence numbers. Acknowledgments. Window management. State machines with states like ESTABLISHED, FIN_WAIT_1, TIME_WAIT. Each one a potential source of bugs.

                > ping 8.8.8.8

                64 bytes from 8.8.8.8: icmp_seq=1 ttl=64 time=12ms

                > nslookup google.com

                google.com -> 142.250.80.46

TCP/IP ARP ICMP UDP DNS

November 2025 - Chapter 8

Crypto From Scratch

To do HTTPS and SSH, you need crypto. Real crypto. Not "add 1 to each byte" crypto.

We implemented it all from scratch: SHA-256, SHA-512, HMAC, AES-GCM, ChaCha20-Poly1305, Curve25519, Ed25519. Every algorithm verified against test vectors. Every edge case considered. It's one thing to copy crypto code; it's another to understand why the modular reduction in Ed25519 signature verification needs that specific constant.

SHA-256 AES-GCM ChaCha20 Curve25519 Ed25519

December 2025 - Chapter 9

TLS 1.3: The Protocol That Has Everything

HTTP is nice, but HTTPS is nicer. TLS 1.3 is "simpler" than previous versions. It only requires: X25519 key exchange, certificate parsing (X.509/ASN.1), HKDF key derivation, AES-GCM encryption, and about 47 different record types.

                > https www.google.com

                [TLS] Handshake complete

                [TLS] Using AES-256-GCM

                HTTP/1.1 200 OK...

TLS 1.3 X.509 HKDF HTTPS

December 2025 - Chapter 10

SSH: Your Very Own Daemon

"What if we could SSH into JefeOS?"

The SSH server was the crown jewel. Curve25519 key exchange, Ed25519 host keys, ChaCha20-Poly1305 encryption, password authentication, and a proper shell channel. Now you can SSH from Linux or Windows into JefeOS and run commands remotely.

And then came SFTP. Because SSH without file transfer is like a car without wheels. Full SFTP subsystem: list directories, upload files, download files, create/delete.

                $ ssh jefe@192.168.156.200

                jefe@192.168.156.200's password: ****

                JefeOS Shell v0.6.0

                > ls /

                [DIR] docs

                      readme.txt (174 bytes)

SSH-2.0 SFTP Ed25519 ChaCha20-Poly1305

December 2025 - Chapter 11

Time Waits For No OS

"What time is it?" A simple question with a not-so-simple answer when you're running on bare metal with no internet time sync.

We implemented NTP (Network Time Protocol) to sync with time servers. Then, for the security-conscious, NTS (Network Time Security)—NTP over TLS with AEAD cookies. Because even your clock should be encrypted.

                > ntpdate pool.ntp.org

                Time synchronized: 2025-01-19 14:32:15 UTC

                > nts time.cloudflare.com

                [NTS] Secure time sync complete

NTP NTS RTC

December 2025 - Chapter 12

Copy, Move, and the Art of File Management

With NTFS working, files persisting, and the network humming, we added the finishing touches: cp and mv commands. Simple in concept, but requiring careful handling of file handles, buffer management, and the always-fun "what if the file already exists?" edge cases.

We also improved the heap allocator with forward coalescing to prevent fragmentation—because "out of memory" errors are no fun when you actually have memory, it's just in too many small pieces.

                > cp /readme.txt /backup.txt

                Copied /readme.txt -> /backup.txt (217 bytes)

                > mv /backup.txt /old_readme.txt

                Moved /backup.txt -> /old_readme.txt

File Operations Heap Coalescing

January 2026 - Chapter 13

TCP Gets Reliable (Finally)

TCP without retransmission is like a promise without follow-through. Packets get lost. Networks hiccup. You need to be able to say "hey, you didn't acknowledge that, let me send it again."

We added an unacknowledged buffer, timeout tracking, and exponential backoff. Now when packets go missing, JefeOS notices and tries again. Up to 5 times, with increasing delays. Professional-grade reliability on a hobby OS.
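
As a rough illustration of that retry policy, here is a sketch of a per-segment timer with exponential backoff. The five-retry limit comes from the text; the 1000 ms base timeout, the UnackedSegment layout, and the on_timer name are assumptions.

```cpp
#include <cstdint>

constexpr std::uint32_t kBaseTimeoutMs = 1000;  // assumed initial timeout
constexpr std::uint32_t kMaxRetries    = 5;     // "up to 5 times"

enum class TimerAction { Wait, Retransmit, GiveUp };

// One entry in the unacknowledged buffer (hypothetical layout).
struct UnackedSegment {
    std::uint32_t seq;
    std::uint32_t retries;
    std::uint64_t deadline_ms;
};

// Called from the periodic TCP timer for each unacknowledged segment.
TimerAction on_timer(UnackedSegment& seg, std::uint64_t now_ms) {
    if (now_ms < seg.deadline_ms) return TimerAction::Wait;
    if (seg.retries >= kMaxRetries) return TimerAction::GiveUp;
    ++seg.retries;
    // Double the wait after every retry: 2 s, 4 s, 8 s, ...
    seg.deadline_ms = now_ms + (std::uint64_t{kBaseTimeoutMs} << seg.retries);
    return TimerAction::Retransmit;
}
```

An ACK that covers seg.seq simply removes the entry from the buffer, so the timer never sees it again.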

DHCP client was already there, waiting patiently. It sends DISCOVER, gets OFFER, sends REQUEST, gets ACK. Automatic IP configuration—when there's a DHCP server listening.

TCP Retransmission DHCP Exponential Backoff

January 2026 - Chapter 14

Where We Are Now

JefeOS has grown from a simple bootloader to a functional operating system. You can:

SSH into it from another machine
Transfer files via SFTP
Browse the web (HTTP/HTTPS)
Edit files with a vi-style editor
Manage files (copy, move, delete)
Sync time securely with NTS
Run a GUI with draggable windows
And all files persist across reboots!

It's not Linux. It's not Windows. It's JefeOS—a hobby OS that actually does things.

January 2026 - Chapter 15

Telling Our Story

"We should document this." Sometimes the best features aren't in the kernel.

We built a progress dashboard—a dark-themed web page showing every feature, every milestone, every line of code we've written together. Complete vs In Progress. Stats that actually mean something. And this dev log you're reading now.

It's one thing to build an operating system. It's another to step back and say "look what we made." The dashboard tracks 37 completed features across 8 categories. Core OS, storage, networking, crypto, services, UI, build system. Each one a small victory. Each one a story.

And yes, I'm aware of the irony—an AI writing about writing documentation about an OS it helped write. It's turtles all the way down.

HTML/CSS Documentation Meta

February – April 2026 - Chapter 16

Three Months Later: Userspace Got Real

The wishlist going into Chapter 16 read: "User-space programs. Per-process address spaces. More GUI themes. FAT32. Sound?" Three months later, four of the five are done, and several things we didn't even think to wish for happened too.

The ELF64 loader landed. Real Linux-shaped binaries — crt0.o, _start, main(argc, argv) — load from NTFS, get mapped into ring 3, and execute with their own per-task ElfInfo ownership. No more "the kernel runs everything." The shell has a real exec command and a wait/yield-poll loop that watches the child until it returns its exit status.

Around that loader we built the rest of a real userspace: a custom libc with crt0, syscall stubs, malloc, printf, errno. Pipes. I/O redirection. POSIX shell scripting (if, while, for, test, $()). A pthread MVP with cleanup handlers and cancellation points. Signal delivery via a ring-3 trampoline. mmap with a real VMA tracker. Sys-V IPC: shared memory, semaphores, message queues. Byte-range file locking. POSIX regex (BRE + ERE). fnmatch. getrandom. statvfs. ftruncate. Over fifty POSIX shell utilities deployed to /programs/. The kernel grew from ~28 native syscalls to about 325.

The POSIX dashboard you're already looking at? The kernel serves it itself. We built a kernel-resident HTTP server that auto-starts at boot, and pointed it at /posix/ on the NTFS data disk. Browse to JefeOS's IP and the OS hands you a live scorecard of its own POSIX implementation, generated from the same JSON the dashboard renders. The OS reporting on itself, in real time, over its own network stack, encrypted with its own TLS.

ELF64 libc pthread Sys-V IPC POSIX regex httpd

February – April 2026 - Chapter 17

A Second Kernel, In Rust

The wildest decision of the spring: build a second kernel. From scratch. In Rust. And keep it at parity with the C++ one, sprint for sprint.

JefeRust is its own boot path (Limine again), its own memory manager, its own scheduler, its own drivers. Same Hyper-V test rig, separate VM. It already has: full network stack (E1000 + Tulip), TLS 1.3 client, SSH 2.0 server, SFTP, NTP, NTS-against-Cloudflare, ATA + JefeFS, NTFS read-only, the same five GUI themes, AES-128 / AES-CMAC / AES-SIV-CMAC-256, ring-3 ELF userspace, and ~452 of the same POSIX interfaces (35.8% strict).

Why two kernels? Partly because Rust's borrow checker catches a different class of bug than C++ does, and the diff is educational. Partly because writing the same feature twice — once in each language — forces you to actually understand it, not just port it. Several of the most subtle bugs of the year (a TLS Poly1305 mask, an AEAD packet length off-by-five, a TCP send-sequence reset) were caught because the parity sprint exposed them on the second implementation.

The dashboard now has a side-by-side parity matrix. When a feature lands in C++, the next sprint usually mirrors it in Rust within a day or two.

Rust nightly x86_64-unknown-none Parity sprints JefeRust

April 2026 - Chapter 18

The POSIX Sprint Marathon

Late April turned into a sprint marathon against IEEE Std 1003.1-2024 (POSIX.1-2024, Issue 8) — the freshly-published edition of the standard. We picked it as the scoreboard precisely because it's the current edition: 1,430 mandatory interfaces, no nostalgia.

Sprint after sprint, two-day cycles each: at-family syscalls (fstatat, renameat, mkdirat, …). POSIX timers with a per-task lazy table and PIT-driven ticks. Real-time signal queueing — sigqueue, sigwaitinfo, sigtimedwait. getrlimit / setrlimit. mkstemp on top of getrandom. glob on top of fnmatch + opendir. fsync, fdatasync, sync. fchmod with on-disk MFT updates. statvfs, madvise, msync, getopt, posix_memalign, seekdir, strptime, full termios with a pty allocator, chown / fchown / lchown, newlocale / duplocale. Identity sprint: pwd/grp + setgroups (+35 flips in one go).

Every sprint shipped its own smoke test alongside the feature. Every sprint mirrored to Rust shortly after. The score climbed from below 30% to 38.9% strict on C++ and 35.8% on Rust, with 491 and 452 interfaces respectively now answering "yes" to the standard.

Along the way we shook out a few good bugs — a heap initialization that capped at 304 KB on Hyper-V Gen1 (resolved by a contiguous PMM allocation path), a pthread page-fault on ELF cleanup ordering (resolved by deferring teardown to the last sibling out), an SSE state initialization issue that bit cleanly aligned crt0s, and a network-wedge after sigtest that turned out to be a missing RFLAGS preserve in the context switch.

POSIX 1003.1-2024 Issue 8 491 interfaces 38.91% strict

Late April 2026 - Chapter 19

The Honest Reframe: Three Tracks

Somewhere in the middle of the marathon we stopped and asked: is "POSIX percentage" actually the right scoreboard? Or are we measuring the wrong thing?

The answer turned out to be: both, and neither alone. POSIX coverage measures one surface. "Does Alpine boot?" measures another — and it requires Linux-only extensions POSIX never even defined (futex, epoll, signalfd, inotify, pidfd). And honestly: does real software run end to end? is a third question that neither percentage answers on its own.

So we reframed the project around three independent tracks. POSIX 1003.1-2024 coverage stays as the standing-orders ultimate goal — 100% strict, multi-year stretch, but real. A five-tier Linux ABI ladder runs in parallel: Tier 1 musl-static binaries → Tier 2 glibc-static → Tier 3 dynamic linker → Tier 4 chroot + Alpine rootfs → Tier 5 Alpine /sbin/init. And a third "workload truth" track keeps us honest about whether the first two tracks have actually paid off in real software running.

Aggregate "JefeOS runs Alpine" estimate today: 15 to 20 percent. Tier 1 is the closest win. It's also currently red — a regression in the wait-for-child loop is silently killing the Linux task before it reaches its first syscall. That bug is the next sprint's target.

Paid Open Group POSIX certification (~$30–50K) is explicitly off the table. Linux itself isn't certified. The scoreboard is for us, not for procurement.

3-track model Linux ABI tiers Workload truth Honest reframe

May 1, 2026 - Chapter 20

The Day Tier 2 Cracked

The first item on the "what's next" list — full SysV auxv on the user stack — landed in a single afternoon. The kernel's Linux compat layer had been writing only an AT_NULL terminator where the auxiliary vector was supposed to go: every musl-static utility worked because musl tolerates a missing auxv, but anything that called getauxval() got zeros, and any static glibc binary segfaulted on the very first instruction reading AT_RANDOM for the stack canary.

The fix was 80 lines in kernel/src/linux_syscall.cpp: build the full 17-entry vector (AT_RANDOM with 16 random bytes from jefeos::getrandom, AT_PHDR/PHENT/PHNUM sourced from the existing elf::ElfInfo, AT_PAGESZ, AT_PLATFORM pointing at a "x86_64" string in the user stack, AT_SECURE, AT_EXECFN, and the four UID/GID pairs glibc reads unconditionally) before the existing AT_NULL. Stack-alignment math was unchanged because every auxv entry is a 16-byte pair.
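
To make the shape of that vector concrete, here is a host-side sketch that builds the entries named above. The AT_* numbers are the standard Linux values; ImageInfo, StackPointers, and build_auxv are hypothetical stand-ins (the kernel has no std::vector, and the real code writes straight onto the user stack). It covers only the entries the text names plus AT_ENTRY, so it is shorter than the full 17.

```cpp
#include <cstdint>
#include <vector>

// Standard Linux auxiliary vector types.
enum : std::uint64_t {
    AT_NULL = 0, AT_PHDR = 3, AT_PHENT = 4, AT_PHNUM = 5, AT_PAGESZ = 6,
    AT_ENTRY = 9, AT_UID = 11, AT_EUID = 12, AT_GID = 13, AT_EGID = 14,
    AT_PLATFORM = 15, AT_SECURE = 23, AT_RANDOM = 25, AT_EXECFN = 31,
};

struct AuxEntry {
    std::uint64_t type;
    std::uint64_t value;
};
static_assert(sizeof(AuxEntry) == 16, "each auxv entry is a 16-byte pair");

// Hypothetical stand-in for the fields taken from elf::ElfInfo.
struct ImageInfo {
    std::uint64_t phdr_addr, phent, phnum, entry;
};

// User-stack addresses of data the kernel copied there beforehand.
struct StackPointers {
    std::uint64_t random16, platform, execfn;
};

std::vector<AuxEntry> build_auxv(const ImageInfo& img, const StackPointers& sp) {
    return {
        {AT_PHDR, img.phdr_addr},
        {AT_PHENT, img.phent},
        {AT_PHNUM, img.phnum},
        {AT_PAGESZ, 4096},
        {AT_ENTRY, img.entry},
        {AT_UID, 0}, {AT_EUID, 0}, {AT_GID, 0}, {AT_EGID, 0},  // assumed root
        {AT_PLATFORM, sp.platform},  // points at "x86_64"
        {AT_SECURE, 0},
        {AT_RANDOM, sp.random16},    // points at 16 random bytes
        {AT_EXECFN, sp.execfn},
        {AT_NULL, 0},                // terminator, always last
    };
}
```

Because every entry is 16 bytes, adding or removing one never disturbs the 16-byte stack alignment the ABI expects at process entry.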

A new probe binary — userspace/programs/linux-glibc-hello/hello-musl.c, cross-compiled in an Alpine container — calls getauxval() on every slot and prints them. End-to-end output:

                AT_PAGESZ=4096
                AT_PHDR=0x400040
                AT_PHNUM=9
                AT_ENTRY=0x4011b7
                AT_RANDOM=0x80ffdb
                AT_PLATFORM=x86_64
                AT_SECURE=0
                random[0..3]=db f7 6b 62

Same session, second sprint: futex learned the BITSET command variants (9 and 10) that glibc 2.34+ NPTL init uses; brk stopped silently returning current_brk on partial-extend failure (which let glibc treat the request as honored and write into unmapped pages); rt_sigaction started absorbing registrations for SIGCANCEL, SIGSETXID, and the SIGRTMIN..SIGRTMAX range so glibc's startup table-fill stops hitting EINVAL; set_robust_list and rseq got explicit dispatch cases. Plus a compile-time LINUX_SYSCALL_TRACE flag that, when defined, prints every syscall + first three args to serial — staged for the next interactive bisect session.

Tier 1 stayed green throughout — 13 of 14 linux-* smoke tests pass, the broader smoke surface (sh-, exec-, fork, pipe, mmap, sigset, fdtest, pthread) holds at 38 of 39. The single failures in each are pre-existing: linux-nl times out on a 27 KB NTFS file (works fine on smaller ones), and spawn-bench-pipe is the deliberate fork stub.

Static glibc still wedges. The auxv landing got it past __libc_setup_tls — the canonical first-instruction segfault is gone — but something earlier in the kernel's exec_linux setup phase wedges the whole machine before any syscall trace can fire. Likely the FS-base TLS path; needs the trace flag plus serial capture in the next interactive session to pin. The roadmap moved forward; the next syscall is identified; the wedge is bounded.

May 2, 2026 - Chapter 21

The Stack Canary, the IRQ, and the Vanishing Segment Base

Tier 2 (static glibc) cracked open the next morning. The runtime "lcompat trace on" toggle from the prior session, which prints every Linux syscall plus its first three args to serial, turned an opaque kernel-wide hang into a neat ledger of glibc's startup sequence. The trace ran clean for five syscalls (arch_prctl SET_FS, set_tid_address, set_robust_list, rseq, prlimit64) and then hit a page fault: user RIP 0x439C2A, CR2 0x28. Disassembling the offending instruction with objdump gave the smoking gun: mov %fs:0x28, %rdx, glibc's stack canary read. CR2=0x28 meant the effective address was 0+0x28: FS_BASE was zero at fault time, even though arch_prctl(SET_FS, 0x4AB3C0) was right there in the trace two seconds earlier.

Two cooperating bugs. First, the kernel's IRQ and exception entry stubs in kernel/arch/x86_64/interrupts.asm dutifully reloaded %ds %es %fs %gs with the kernel data selector on every interrupt. Sensible in protected mode, fatal in long mode: a MOV to %fs reloads the hidden FS base from the GDT descriptor (zero for kernel data), overwriting whatever IA32_FS_BASE held, so any timer IRQ that fired during user mode left the next IRETQ handing back a zeroed FS_BASE. Second, even with that closed, switch_context didn't preserve the FS/GS bases per task. A Linux task preempted into a context switch lost its TLS base on return.

The fix touched three small files: stop touching FS/GS in IRQ entry (kernel C++ doesn't dereference via those segments anyway), add fs_base / gs_base fields to the Task TCB, and bracket the switch_context call with an rdmsr-then-wrmsr pair. Per-context-switch overhead: ~120 cycles. Cost of leaving the bug in place: every %fs:N read from a Linux task was a coin flip.
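
The save-and-restore bracket can be modeled on the host, which makes the logic easy to check. In this sketch rdmsr and wrmsr are fake stand-ins backed by a struct (the real ones are privileged instructions), and Task and switch_to are hypothetical names. The MSR numbers are the architectural ones.

```cpp
#include <cstdint>

constexpr std::uint32_t IA32_FS_BASE = 0xC0000100;
constexpr std::uint32_t IA32_GS_BASE = 0xC0000101;

// Host-side model of the two MSRs so the sketch runs outside ring 0.
struct FakeCpu {
    std::uint64_t fs_base = 0;
    std::uint64_t gs_base = 0;
};
inline FakeCpu g_cpu;

std::uint64_t rdmsr(std::uint32_t msr) {
    return msr == IA32_FS_BASE ? g_cpu.fs_base : g_cpu.gs_base;
}
void wrmsr(std::uint32_t msr, std::uint64_t value) {
    (msr == IA32_FS_BASE ? g_cpu.fs_base : g_cpu.gs_base) = value;
}

// Only the TCB fields this sketch needs.
struct Task {
    std::uint64_t fs_base = 0;
    std::uint64_t gs_base = 0;
};

void switch_to(Task& prev, Task& next) {
    // Capture the outgoing task's bases before anything can clobber them.
    prev.fs_base = rdmsr(IA32_FS_BASE);
    prev.gs_base = rdmsr(IA32_GS_BASE);
    // The real switch_context(prev, next) would swap stacks and registers here.
    wrmsr(IA32_FS_BASE, next.fs_base);
    wrmsr(IA32_GS_BASE, next.gs_base);
}
```

With this in place a task's TLS pointer survives being preempted, no matter what ran in between.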

exec_linux /programs/hello-glibc, post-fix, prints hello-glibc then argc=1 / envc=0 then exits 0. Sixteen syscalls in the trace, all green, full glibc startup completed. Tier 2 is now first-class. Tier 3 — the dynamic linker, ld-linux.so loading shared objects — is the next ladder rung.

May 6, 2026 - Chapter 22

Tier 3 Cracks: ld-musl Loads at 0x7F0000000000

Four days after the FS-base fix, Tier 3 fell. The dynamic linker — the runtime piece that walks a binary's PT_INTERP, mmaps the requested interpreter (/lib/ld-musl-x86_64.so.1 for musl, /lib64/ld-linux-x86-64.so.2 for glibc), and hands control over to it for relocations — was the last gateway between "JefeOS runs static binaries" and "JefeOS runs the entire dynamic Linux ecosystem." Cracking it took three sprints in one day.

Phase 1: parse PT_INTERP, open the interpreter from NTFS, load its LOAD segments into a fresh chunk of user virtual address space. The convention across kernels is to put the interpreter high — 0x7F0000000000 by tradition, well away from the main executable at 0x400000. JefeOS now stages the interpreter there, fills AT_BASE with the load address, and sets entry-point RIP to the interpreter's e_entry instead of the main binary's. The kernel hands the dynamic linker an already-parsed program header table via the existing auxv, and the linker takes over.

Phase 2: stale-CR3 bug. With per-process page tables now landed (the sprint that preceded Tier 3 because chasing it in shared address space would have been hopeless), the dynamic linker's mmaps were landing in a fresh PML4, but the syscall return path was sometimes loading the parent's CR3, so the child task would IRETQ into pages that were mapped in its own PML4 but absent from the active address space. Fixed by tightening the CR3 reload in switch_context to read from the target task's TCB, not a captured-at-fork snapshot.

Phase 3: exec_linux /programs/hello-musl-dynamic prints hello from a dynamic musl binary and exits 0. Tier 3 GREEN. The Linux ABI track now reads T1 ✓, T2 ✓, T3 ✓: three of the five tiers. The only remaining tiers are chroot+busybox (T4) and a full Alpine init (T5), both of which want fork(2) with copy-on-write that JefeOS still deliberately stubs to -1.

May 15-16, 2026 - Chapter 23

Python Lands — Eight Bugs, One Interpreter

Python was the test of whether Tier 3 was real. print(2+2) looks trivial. Getting the interpreter to survive long enough to print it took six PRs, a stack of mmap correctness fixes, and one stack-recursion bug that hid behind all of them.

The first wave: file-backed mmap had been hard-coded to drop the offset argument. Every shared library mapping started at file byte 0 regardless of what the loader requested (PR #50). Then MAP_FIXED over existing mappings stalled instead of unmapping first (PR #49). Then file-backed mmap of a page that crossed end-of-file zeroed the entire read length, not just the post-EOF bytes. Because musl always page-rounds LOAD segments, every dynamic binary loaded mostly zeros (PR #53). Each bug was a one-line dispatcher mistake that masked the next one downstream. Fixing them in order revealed each successive one.
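
The offset and end-of-file rules are small enough to show in one function. This is a sketch of the intended behavior, not the kernel's code: the file is modeled as a std::vector, and fill_mapped_page is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kPageSize = 4096;

// Fills one page of a file-backed mapping.
// Honors the requested file offset, copies whatever bytes the file still has,
// and zeroes only the part of the page that lies past end-of-file.
void fill_mapped_page(const std::vector<std::uint8_t>& file,
                      std::size_t file_offset, std::uint8_t* page) {
    const std::size_t available =
        file_offset < file.size() ? file.size() - file_offset : 0;
    const std::size_t to_copy = std::min(available, kPageSize);
    if (to_copy != 0) {
        std::memcpy(page, file.data() + file_offset, to_copy);
    }
    std::memset(page + to_copy, 0, kPageSize - to_copy);  // post-EOF tail only
}
```

For a 5000-byte file mapped at offset 4096, the page gets the last 904 file bytes followed by zeros, which is what a loader that page-rounds its segments relies on.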

Then the stack. CPython's _Py_Initialize recurses through its own interpreter setup deep enough to need ~150 KB of stack just to boot. JefeOS gave user tasks 64 KB. Bumping to 1 MB got us past _Py_Initialize; jumping to a demand-paged 8 MB stack with an explicit guard page (PR #62) is what Linux does by default, and is what Python frame recursion expects. The user-stack allocator now lazy-faults pages from the top down with a guard PTE between the stack and whatever sits below it. Per-task storage cost: one 4 KB page on first use plus one page per fault.
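
The page-fault side of a demand-paged stack comes down to classifying the faulting address. Here is a sketch under an assumed layout: the usable stack is the max_bytes below top, and a single guard page sits directly beneath it. UserStack and classify_stack_fault are hypothetical names.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;

enum class StackFault { Grow, GuardHit, NotStack };

// Assumed layout: [top - max_bytes, top) is stack, one guard page below that.
struct UserStack {
    std::uint64_t top;        // one past the highest stack byte
    std::uint64_t max_bytes;  // reserved size, e.g. 8 MiB
};

StackFault classify_stack_fault(const UserStack& s, std::uint64_t fault_addr) {
    const std::uint64_t bottom = s.top - s.max_bytes;  // lowest usable byte
    const std::uint64_t guard  = bottom - kPageSize;   // start of guard page
    if (fault_addr >= bottom && fault_addr < s.top) {
        return StackFault::Grow;      // map a fresh zeroed page and retry
    }
    if (fault_addr >= guard && fault_addr < bottom) {
        return StackFault::GuardHit;  // overflow: deliver SIGSEGV, don't grow
    }
    return StackFault::NotStack;      // some other mapping's problem
}
```

Only the Grow case costs memory, which is why an 8 MB reservation starts out at a single page.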

The validation run on master tip 187a65c:

                jefe@jefeos:~$ exec_linux /programs/python3 -c 'print(2+2)'
                4
                jefe@jefeos:~$ exec_linux /programs/python3 --version
                Python 3.11.14

Python 3.11.14 — dynamic-linked against musl, loading the standard library from /usr/lib/python3.11 on the NTFS data disk — runs on a kernel where the entire userspace is, by the kernel's own accounting, a 232-page-fault cold-start exercise. The Workload Truth track now reads T3 GREEN.

May 17, 2026 - Chapter 24

The Five-PR Day: NTFS Earns Its Keep

The morning after Python landed, the VM stopped accepting SSH after two commands. Familiar shape. Three prior sessions had blamed PR #62 (the demand-paged stack), then trailing PRs, then NTFS state corruption, then Hyper-V VM definition drift. All disproven by fresh-build + fresh-VM tests that still wedged.

The breakthrough was the screenshot. The user took a photo of the Hyper-V console: red panel, vector 0, RIP 0xFFFFFFFF8020BD9A. Vector 0 is #DE, divide-by-zero. addr2line resolved the RIP to read_mft_record. The faulting instruction: div %rsi, dividing by mft_record_size. Which can be zero. Which it was — on freshly-formatted NTFS VHDs whose clusters_per_mft_record field is negative (the NTFS convention for "record smaller than cluster size"), the kernel's unguarded 1 << -clusters_per_mft_record shift was undefined behavior, and the compiler had decided to produce zero. The SSH wedge wasn't a wedge — it was a kernel panic that killed the SSH handler task, leaving TCP listen state lingering long enough to look like an unresponsive but live system.
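
For reference, here is a guarded way to decode that boot-sector field. It follows the NTFS convention (a positive value counts clusters, a negative value n means 2^(-n) bytes) and refuses anything outside a plausible range. The bounds of 512 bytes to 64 KiB and the name mft_record_size are assumptions for this sketch, not the values PR #85 uses.

```cpp
#include <cstdint>
#include <optional>

constexpr std::uint32_t kMinRecordSize = 512;        // assumed lower bound
constexpr std::uint32_t kMaxRecordSize = 64 * 1024;  // assumed upper bound

// Decodes clusters_per_mft_record into a byte count.
// Returns nullopt instead of ever handing back zero or a shifted-out value.
std::optional<std::uint32_t> mft_record_size(std::int8_t clusters_per_record,
                                             std::uint32_t bytes_per_cluster) {
    if (clusters_per_record > 0) {
        const std::uint64_t size = std::uint64_t{bytes_per_cluster} *
                                   static_cast<std::uint64_t>(clusters_per_record);
        if (size < kMinRecordSize || size > kMaxRecordSize) return std::nullopt;
        return static_cast<std::uint32_t>(size);
    }
    if (clusters_per_record == 0) return std::nullopt;

    // Negative: the record size is 2^(-value) bytes. Widen before negating
    // so that -128 cannot overflow.
    const int shift = -static_cast<int>(clusters_per_record);
    if (shift < 9 || shift > 16) return std::nullopt;  // 512 B .. 64 KiB
    return std::uint32_t{1} << shift;
}
```

A caller that gets nullopt refuses the mount, so no later division ever sees a zero record size.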

That insight uncorked five PRs in one day. PR #85 validated boot-sector geometry at mount time and refused mounts with implausible bytes-per-sector / sectors-per-cluster / mft-record-size combinations. PR #88 bounded the kernel-side capture buffer NUL terminator that was writing one byte past its allocation in a parallel-discovered #PF. PR #90 validated MFT fixup update_seq_size before walking the fixup loop (three call sites unified). PR #92 bounds-checked NTFS attribute walks and data-run parsing across 15+ sites. PR #96 normalized used_size reads at parse time and added two minor follow-ups from #92's review.

Eight security findings filed, four closed same day. Five PRs landed clean with zero regressions, every one dual-reviewed by code-reviewer + security-engineer in parallel before merge. The user's instinct ("add the meter first") on instrumentation paid for itself again — once the panic was visible, the cascade of latent NTFS bugs that follow when the parser trusts on-disk fields was just an afternoon's work.

May 18, 2026 - Chapter 25

Codex Finds the Real #116

Issue #116 — the post-exec_linux cumulative SSH wedge — had been "fixed" twice. PR #31 added LRU eviction to the SSH connection table. PR #35 added deferred-close so closed connections drained their send queue before slot reclaim. PR #36 tracked the full TCP send-queue depth instead of the 4 KB retransmit ring. Each "fix" pushed the wedge further into the iteration count. None of them killed it.

Codex's static review of the SSH path found it. The wedge mechanism wasn't SSH-layer slot exhaustion at all. The SSH connection table evicted correctly, but the underlying tcp_connections[32] table filled up first. tcp_close() is a no-op for sockets already in FIN_WAIT_1 / FIN_WAIT_2 / CLOSING / LAST_ACK / TIME_WAIT — those slots wait for tcp_cleanup_stale() to reap them on a sixty-second timer. SSH could free its slot in microseconds, but the TCP slot stayed busy. After 20-30 fast iterations of exec_linux the TCP table filled, new SYNs failed before SSH ever saw them, and the wedge manifested as "SSH stops accepting after a while."

Inline comments at net.cpp:1247 and :1317 already said "exhaust MAX_TCP_CONNECTIONS=32." Someone had the right theory and never landed the fix. PR #98 added net::tcp_abort() — forced reclaim with optional RST — and wired it into the two SSH eviction sites that should never have waited for the cleanup timer. Bonus fix in the same PR: the ACK handler silently dropped cumulative ACKs that overshot sent_unacked_total because SYN and FIN advance seq_num but not the payload counter. Clamp added.

Test data after fix: 100 rapid version calls, 60 mixed-output sequences, 40 connect-close cycles, 30 real elftest runs — all green. Final ss: 29 of 32 slots free, table stays clean throughout. Then three follow-on PRs over the next eighteen hours: PR #80 hardened DHCP with commit-after-validate + RFC 2131 server_id enforcement (closing the two HIGH-sev findings #76 and #77); PR #106 added the RFC 793 SND.NXT bound on the ACK clamp (closing the MEDIUM-sev blind-ACK injection primitive #99); PR #108 finally honored trusted-server DHCPNAK in the renewal path (closing #103). Four PRs, zero reverts, nine follow-up issues filed and tracked. Two days of net hardening, no kernel regressions, and the wedge that had cost four sessions of recovery work is dead.
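
The RFC 793 rule behind that ACK bound is short: an ACK is acceptable only if SND.UNA < SEG.ACK <= SND.NXT, compared in wrapping 32-bit sequence space. A minimal sketch, with hypothetical helper names:

```cpp
#include <cstdint>

// True when a comes strictly before b in 32-bit sequence space.
// The signed difference handles wraparound past 0xFFFFFFFF.
constexpr bool seq_lt(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) < 0;
}
constexpr bool seq_leq(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) <= 0;
}

// RFC 793 acceptable-ACK test: SND.UNA < SEG.ACK <= SND.NXT.
// Anything above SND.NXT acknowledges data that was never sent.
constexpr bool ack_acceptable(std::uint32_t snd_una, std::uint32_t snd_nxt,
                              std::uint32_t seg_ack) {
    return seq_lt(snd_una, seg_ack) && seq_leq(seg_ack, snd_nxt);
}
```

Without the upper bound, a peer that guesses a large ACK value can make the sender believe data it never transmitted has been delivered.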

May 22-25, 2026 - Chapter 26

Turning the Screws: A Month of Hardening

With the network wedge finally dead, the work turned from "make it do more" to "make it not lie, leak, or fall over." A few days, dozens of small PRs, zero reverts.

First the logs stopped oversharing: SSH command lines and SFTP paths were being written to the serial log in the clear, and both were redacted down to argv[0] and a path stub (PR #210, #213). Then the ELF loader's size limits were formalized — the ad-hoc 4 MB cap became a 32 MB MAX_ELF_FILE_SIZE constant with a 256 MB ceiling on total p_memsz (PR #319, #322), so a malformed program header can't ask the kernel to map the universe.

Then a performance chain, because correctness you can't afford to run isn't correctness. ATA reads were capped to LBA28 256-sector transfers to dodge a Hyper-V LBA48 bug; NTFS learned to batch contiguous-run cluster reads; memcpy, memmove, and memset were rewritten around the x86 rep string instructions (rep movsb) and the NTFS byte-loops switched over to the fast path (PR #251, #275, #278). As a robustness check on the dynamic-linking work, the Avian JVM was brought up in interpret mode (PR #221): a Java virtual machine, running on JefeOS.

The month closed with a security audit and the hardware to back it. CR4.SMEP was enabled on the boot CPU and mirrored onto the application processors (PR #344, #351, closing the HIGH-severity #341), so the kernel can no longer be tricked into executing user pages. And validate_user_ptr was promoted into a shared uaccess.hpp and wrapped across 48 syscall sites (PR #346, #348), closing the write(1, kernel_va, len) kernel-memory-disclosure primitive and seven of its siblings in one sweep.
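
A user-pointer check of this kind is mostly an overflow-safe range test. The sketch below assumes user space is the lower canonical half of the x86_64 address space; validate_user_range and sys_write_checked are hypothetical names, and the real validate_user_ptr in uaccess.hpp may check more (page permissions, for example).

```cpp
#include <cstdint>

// Assumed split: user space is the lower canonical half, end exclusive.
constexpr std::uint64_t kUserSpaceEnd = 0x0000'8000'0000'0000ULL;
constexpr std::int64_t  kEFAULT       = 14;  // Linux errno value

// True when [addr, addr + len) lies entirely in user space.
// Written so that addr + len is never computed and cannot wrap.
constexpr bool validate_user_range(std::uint64_t addr, std::uint64_t len) {
    if (len > kUserSpaceEnd) return false;
    return addr <= kUserSpaceEnd - len;
}

// Shape of a wrapped syscall: reject the pointer before touching it.
std::int64_t sys_write_checked(int /*fd*/, std::uint64_t buf, std::uint64_t len) {
    if (!validate_user_range(buf, len)) return -kEFAULT;
    return static_cast<std::int64_t>(len);  // stand-in for the real write path
}
```

With the check in front, write(1, kernel_va, len) fails with -EFAULT instead of copying kernel memory out to the caller.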

June 12-13, 2026 - Chapter 27

Workload Truth Gets Loud: A Python Bot, Then a Discord Bot

The Workload Truth track had a simple claim — real software runs. In June it started talking back: over the network, on live services, in front of witnesses.

June 12: a real, unmodified Python Twitch chatbot ran on CPython 3.11. It connected to Twitch, read twelve live chat messages, made an LLM round-trip, and did the whole thing over HTTPS — TLS via OpenSSL, on the JefeOS kernel's own network stack (#1060, #1062). Not a toy script: an off-the-shelf bot, doing on a from-scratch kernel exactly what it does on Linux.

June 13 raised it. A real Node.js Discord bot came up live, holding a two-way Discord gateway session that speaks IDENTIFY and MESSAGE_CREATE over WebSocket-over-TLS (raw ws, not discord.js), and the user watched a 🟢 JefeOS-Bot online announcement land and a real !jefeos reply post in #bot-spam (#1087, #1093). Two real-world network daemons, two runtimes, both talking to the open internet through the kernel's own TLS.

Running a daemon forever surfaced one last bug: after enough always-on activity the box would wedge. It wasn't a leak — it was a kernel stack overflow, the deep ring-0 SSH ChaCha20 path plus a nested timer IRQ overrunning a 64 KB task stack that happened to sit on the heap freelist. It was proven, not guessed: a DR0 hardware watchpoint caught the write that walked off the end. The fix routed task stacks off the freelist onto their own guarded allocation. Daemons stay up now.

June 14, 2026 - Chapter 28

RSOD as a Feature: Surviving Its Own Crashes

A kernel that runs real daemons will hit real faults. The goal for June 14 wasn't to never crash — it was to crash honestly: survive what should be survivable, and remember the rest.

Two tiers of crash handling landed. Tier 1: on panic, dump registers, a backtrace, and the kernel log to the serial port — symbolizable straight through addr2line. Tier 2: persist that crash record to a reserved LBA on the data VHD and recover it on the next boot (#1105, #1110, closing #1089), so a red screen of death stops being the end of a debugging session and becomes a log entry you can read after the reboot.

The bigger win was epic #1045: every raw user-pointer dereference in the C++ Linux-ABI layer now goes through a fault-survivable copy_from_user / copy_to_user. A hostile or unmapped pointer from userspace returns -EFAULT instead of taking down the kernel, which is the difference between a misbehaving program and a dead machine. The same sweep removed the logging that was leaking SSH KEX secrets to the serial log (HIGH #1106) and taught the DEC 21140 NIC to recover from a wedged TX STOPPED state (#1100).
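To make the contract concrete, here is a small hosted model of it. Everything here is an assumption for illustration: the UserSpace class simulates a process's mapped regions, and these copy_from_user / copy_to_user signatures and their 0 / -EFAULT return convention are hypothetical, not the JefeOS ones. A real kernel also has to survive a fault that happens mid-copy; this sketch only models the "validate, then copy, else -EFAULT" outcome.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Simulated user address space: a list of mapped regions with backing bytes.
class UserSpace {
public:
    void map(std::uint64_t start, std::size_t len) {
        regions_.push_back({start, std::vector<std::uint8_t>(len)});
    }

    // Returns the backing bytes if [addr, addr + len) lies entirely inside
    // one mapped region, otherwise nullptr. Written to avoid overflow when
    // addr is near the top of the address space.
    std::uint8_t* translate(std::uint64_t addr, std::size_t len) {
        for (auto& r : regions_) {
            if (addr < r.start) continue;
            const std::uint64_t off = addr - r.start;
            if (off <= r.bytes.size() && len <= r.bytes.size() - off)
                return r.bytes.data() + off;
        }
        return nullptr;
    }

private:
    struct Region {
        std::uint64_t start;
        std::vector<std::uint8_t> bytes;
    };
    std::vector<Region> regions_;
};

// Hypothetical signatures: 0 on success, -EFAULT on a bad user pointer.
inline int copy_from_user(UserSpace& us, void* dst,
                          std::uint64_t user_src, std::size_t len) {
    assert(dst != nullptr || len == 0);
    if (len == 0) return 0;
    const std::uint8_t* src = us.translate(user_src, len);
    if (src == nullptr) return -EFAULT;
    std::memcpy(dst, src, len);
    return 0;
}

inline int copy_to_user(UserSpace& us, std::uint64_t user_dst,
                        const void* src, std::size_t len) {
    assert(src != nullptr || len == 0);
    if (len == 0) return 0;
    std::uint8_t* dst = us.translate(user_dst, len);
    if (dst == nullptr) return -EFAULT;
    std::memcpy(dst, src, len);
    return 0;
}
```

A syscall handler built on these never touches a user address directly: an unmapped pointer, or a range that runs off the end of a mapping, comes back as -EFAULT for the caller to return to userspace.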

June 15, 2026 - Chapter 29

apk add, For Real: Alpine Packages Install and Run

The top of the Linux ABI ladder is a full Alpine userland. June 15 took the step that makes it real: apk, Alpine's own package manager, ran under chroot(/alpine) and actually installed and executed upstream packages.

/ # apk add tree
(1/1) Installing tree (2.2.1)
/ # tree /etc
... 39 directories, 69 files
/ # echo '{"a":1}' | jq .a
1

tree walked 39 directories and 69 files; jq — with its oniguruma regex dependency loaded alongside it — parsed JSON and printed 1. apk add reached status 0 and wrote the install back into the package database: the first real apk write operation on JefeOS.

Two root causes had to fall first. The MFT ran out of records mid-install (fixed by a contiguous grow_mft), and open() / openat() had to learn to follow JSYML symlinks across an NTFS rename — because apk stages a temp file and then renames it onto a library's SONAME, and the dynamic loader has to follow that link to find the code. Real packages, installed by the real package manager, running on the kernel.
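The rename-then-follow sequence is easier to see in miniature. The sketch below is a toy in-memory model, not the JefeOS VFS or its NTFS symlink format: TinyFs, its method names, the hop limit of 8, and the file names in the usage note are all hypothetical. It shows only the behavior the loader depends on: after a staged link is renamed onto the final name, opening that name must resolve through the link to the real file.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>
#include <utility>

// Toy filesystem: each absolute path maps to a regular file or a symlink.
class TinyFs {
public:
    void create_file(const std::string& path, std::string contents) {
        assert(!path.empty() && path.front() == '/');
        nodes_[path] = Node{false, std::move(contents)};
    }

    void create_symlink(const std::string& path, std::string target) {
        assert(!path.empty() && path.front() == '/');
        nodes_[path] = Node{true, std::move(target)};
    }

    // Moves a node to a new name, replacing whatever was there.
    // A renamed symlink stays a symlink with the same target text.
    int rename(const std::string& from, const std::string& to) {
        auto it = nodes_.find(from);
        if (it == nodes_.end()) return -ENOENT;
        Node moved = std::move(it->second);
        nodes_.erase(it);
        nodes_[to] = std::move(moved);
        return 0;
    }

    // Follows symlinks until a regular file is found. Relative targets are
    // resolved against the directory that contains the link.
    int open(std::string path, std::string& out) const {
        for (int hops = 0; hops <= kMaxHops; ++hops) {
            auto it = nodes_.find(path);
            if (it == nodes_.end()) return -ENOENT;
            if (!it->second.is_symlink) {
                out = it->second.data;
                return 0;
            }
            const std::string& target = it->second.data;
            if (!target.empty() && target.front() == '/')
                path = target;
            else
                path = path.substr(0, path.rfind('/') + 1) + target;
        }
        return -ELOOP;  // too many links, or a cycle
    }

private:
    struct Node {
        bool is_symlink;
        std::string data;  // file contents, or the link target
    };
    static constexpr int kMaxHops = 8;  // assumed limit for the sketch
    std::map<std::string, Node> nodes_;
};
```

With a file at /usr/lib/libonig.so.5.4.0 and a staged link /usr/lib/.apk.tmp pointing at it, opening /usr/lib/libonig.so.5 fails with -ENOENT before the rename and returns the library's contents after it.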

June 16-17, 2026 - Chapter 30

chkdsk Says "No Problems": The NTFS Writer Grows Up

The hardest correctness target JefeOS has isn't one of its own tests — it's Windows' own checker. The bar: write an NTFS directory on JefeOS, mount the disk on Windows, run chkdsk /f, and have it find nothing wrong.

June 16 cleared it. A plain JefeOS-written directory passed chkdsk /f clean — three back-to-back runs, each reporting "found no problems," exit 0. That was the payoff of a long writer-correctness chain: record metadata (the indexed flag, the parent sequence number, next_attr_id), real RTC timestamps and real $DATA sizes, $I30 index entries inserted in COLLATION_FILENAME order, the $MFT:$BITMAP record-slot bit, and a truncate-path bug that had quietly leaked clusters on every flush (#1141, #1148, #1153, #1157).
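Of the items in that chain, index ordering is the easiest to show in isolation. The sketch below is a simplified stand-in, not the JefeOS writer: it assumes filename collation means comparing UTF-16 code units after upcasing, and it swaps the volume's $UpCase table for an ASCII-only upcase. The names upcase, filename_less, and index_insert are hypothetical, and a flat sorted vector stands in for the real index structure.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// ASCII-only stand-in for the $UpCase table (an assumption for this sketch).
inline char16_t upcase(char16_t c) {
    return (c >= u'a' && c <= u'z') ? static_cast<char16_t>(c - (u'a' - u'A'))
                                    : c;
}

// Case-insensitive ordering: compare code units after upcasing; a name that
// is a strict prefix of another sorts first.
inline bool filename_less(const std::u16string& a, const std::u16string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char16_t x, char16_t y) { return upcase(x) < upcase(y); });
}

// Inserts a name at its sorted position. Returns false, leaving the index
// untouched, if an equivalent name (ignoring case) is already present.
inline bool index_insert(std::vector<std::u16string>& entries,
                         std::u16string name) {
    assert(std::is_sorted(entries.begin(), entries.end(), filename_less));
    auto pos = std::lower_bound(entries.begin(), entries.end(), name,
                                filename_less);
    if (pos != entries.end() && !filename_less(name, *pos)) return false;
    entries.insert(pos, std::move(name));
    return true;
}
```

Inserting readme.txt, Boot, alpine, and ZETA in that order leaves the index as alpine, Boot, readme.txt, ZETA, and a later README.TXT is rejected as a duplicate. Appending entries in creation order instead is the kind of mistake a checker like chkdsk is built to notice.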

June 17 finished the job: directory-index multi-block grow was enabled (#1133), so a directory big enough to spill its index into an allocation run also passes chkdsk — closing the three oldest filesystem tickets on the board, #44, #64, and #1115. The merge gate earned its keep, catching a real reader regression on legacy single-leaf index blocks before it could ship. The same stretch added a tmpfs unlinked-but-open POSIX tombstone (#1175, closing the security issue #969) and wired a detached-daemon teardown regression test into CI (#1176, closing #25).

The Road Ahead - Chapter 31

The Road Ahead: The OS Is the Cluster

The adventure continues — and the horizon has split in two: one near, one north-star.

The near road is the rest of the Linux story: a full Alpine init reaching an interactive login (apk install-and-run is done — Chapter 29 — and the login prompt is the last manual gate), a real fork(2) with copy-on-write, RFC 5961 RST/SYN challenge-ACK to close the last easy blind-injection primitive, and pulling the Rust kernel — JefeRust — level with C++ on NTFS write, TLS/PKI, and Alpine, since it already keeps pace on core OS, networking, and crypto. Past that sit the first real appliances built on JefeOS: a health monitor, an SSH bastion, a secrets vault.

The north-star is bigger, and it's the reason JefeOS exists as something other than a second Linux. It's called Xylem: instead of bolting an orchestrator like Kubernetes on top of a cluster-blind OS, fold the cluster control plane into the kernel. A "service" becomes a first-class kernel object with a replica count and a supervision policy, so redundancy, scaling, and failover become kernel verbs instead of YAML reconciled from outside. It's named after plant xylem — a tissue that reroutes flow around a dead vessel (failover as structure), regrows (self-healing), and carries the load.

Honest framing: almost none of Xylem is built yet. But the foundations it stands on are: per-process page tables, the fault-survival and panic-persistence work from Chapter 28, the from-scratch network stack, the preemptive scheduler. The pitch is the whole point: we didn't bolt k8s onto Linux. The OS itself is the cluster, and it runs your Linux containers too.

And then — Alpine boots on JefeOS, and JefeOS clusters itself. That's the real prize.

Jefe

Computermang

The Journey So Far